Advanced PDF Carver

The PyFlag PDF Carver is based on the work of DFRWS_Scudette, described in DFRWS_PDF.

There are a number of limitations with the current implementation:

  • We only support the Deflate method for object verification. This means that we cannot test for fragmentation occurring within objects encoded in any other way. We hope to add support for LZW, CCITT and JPEG encodings in future.

  • We only implement disambiguation of points, which is suitable for first-order fragmentation only. This will fail when the file is too heavily fragmented.
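The Deflate check in the first limitation can be illustrated with Python's standard zlib module. This is a minimal sketch rather than the carver's actual code: it assumes the candidate stream bytes have already been located and extracted from the image, and the function name inflates_cleanly is hypothetical. A stream reassembled across a wrong fragment boundary will normally either raise an error or never reach the end-of-stream marker.

```python
import zlib


def inflates_cleanly(data: bytes) -> bool:
    """Return True if `data` decompresses as one complete zlib stream.

    Hypothetical helper for illustration only; not part of PyFlag.
    """
    d = zlib.decompressobj()
    try:
        d.decompress(data)
    except zlib.error:
        # Corrupt header, invalid block, or checksum mismatch.
        return False
    # eof is True only once the end-of-stream marker has been reached,
    # so a truncated stream is rejected even though no error was raised.
    return d.eof
```

A check like this says nothing about streams stored with other filters, which is why objects encoded with LZW, CCITT or JPEG cannot currently be tested.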

Example

The following is an example of how to use the PDF carver. First obtain the latest development version of PyFlag:

$ darcs get http://www.pyflag.net/pyflag/
$ cd pyflag
$ ./configure
$ make
$ sudo make install

The advanced carvers are only available as standalone tools at this stage and are not incorporated into the GUI. PyFlag uses a test-driven development model, so we can test the carvers by running make in the carver directory:

$ cd src/pyflag/Carvers/
$ make

This will generate the carvers and test them against a test set (which may be downloaded from the PyFlag site).

We now use the carvers to solve the DFRWS 2007 challenge. Carving for PDFs requires going through a number of steps. The first step is to index the image for PDF artifacts. The following will create the index file pdf_test.idx:

$ python pdf_carver.py  -c -i pdf_test.idx dfrws-2007-challenge.img 
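To give a feel for what indexing PDF artifacts involves, the sketch below scans a buffer for a handful of PDF keywords and records where each one occurs. It is only an illustration under stated assumptions: the marker set, the function name index_pdf_artifacts and the in-memory list it returns are all hypothetical, and the real layout of pdf_test.idx is not described here. A real image would also need to be read in chunks rather than held in memory.

```python
import re

# Assumed marker set for illustration; the real index may record other
# artifacts. "startxref" is listed before "xref" so the longer keyword wins.
PDF_MARKERS = re.compile(rb"%PDF-|startxref|xref|endobj|%%EOF")


def index_pdf_artifacts(image: bytes):
    """Return a list of (offset, marker) pairs found in `image`."""
    return [(m.start(), m.group()) for m in PDF_MARKERS.finditer(image)]
```

Each hit is an offset into the image, which is the kind of information the later steps need in order to guess where each XREF table and its objects lie.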

Next we need to create the map files. These map files are the initial guess for the mapping functions of each PDF found in the image. Each map file corresponds to a single XREF table, and the carver tries to coalesce related XREF tables into the same mapping function.

$ python pdf_carver.py  -m -i pdf_test.idx dfrws-2007-challenge.img
$ ls *.map
103228588.map            148733892.map            300968051.map            38267254.map
103537905-103228588.map  236510367.map            304512789.map            38315692-38267254.map
103537905.map            300187251.map            304516723-304512789.map  38315692.map
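A mapping function of this kind relates offsets in the carved file to offsets in the image. As a rough illustration, assuming a map can be reduced to a sorted list of (file offset, image offset) points with contiguous data between consecutive points, a lookup might be sketched as follows. The in-memory representation and the name image_offset are hypothetical; the on-disk format of the .map files is not described here.

```python
import bisect


def image_offset(points, file_offset):
    """Translate a file offset into an image offset.

    `points` is a sorted list of (file_offset, image_offset) pairs.
    Data is assumed to run contiguously from each point until the next
    one, so every new point marks a discontinuity.
    """
    starts = [p[0] for p in points]
    # Find the last point at or before the requested file offset.
    i = bisect.bisect_right(starts, file_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first mapped point")
    f0, i0 = points[i]
    return i0 + (file_offset - f0)
```

With points such as [(0, 4096), (2048, 8192)], the first 2048 bytes of the file are read from image offset 4096 onward, and everything after that is read from image offset 8192 onward.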

It is possible to view each of these maps graphically (you need to have gnuplot installed for this). For example:

$ python pdf_carver.py  -p -M 146306832-146288243.map

We can see the number and nature of the discontinuities, although their exact positions are inaccurate because we have not yet tested this file. For PDF files this map is usually sufficient to open the file with a PDF viewer despite the errors, because most of the PDF objects are present and the general structure of the file is correct. If for some reason we are unable to verify this file, we can just extract it:

$ python pdf_carver.py -M 146306832-146288243.map -e output.pdf dfrws-2007-challenge.img

To work out exactly where the discontinuities are, we need to force the file. In this example we also ask for the output map to be saved in "output.map":

$ python pdf_carver.py -M 146306832-146288243.map -f output.pdf -F output.map dfrws-2007-challenge.img

Ambiguous point found at offset 2048: forward=146290176 vs reverse=146292224...
Checking until 2982
Found a hit at 2048
Verifying complete file:
No errors
Extracting into file output.pdf
Saving map in output.map

As can be seen, the discontinuity was moved from its original position to a more accurate position.
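The search for an exact discontinuity can be pictured as trying candidate split positions until one verifies. The sketch below is an assumption about the general idea, not the carver's actual algorithm: it supposes a single gap of known size (first-order fragmentation), candidate positions on 512-byte sector boundaries, and a caller-supplied verify function (for example a Deflate check). The names find_discontinuity and SECTOR are hypothetical.

```python
SECTOR = 512  # assumed sector size


def find_discontinuity(image, start, gap, length, verify):
    """Return the first sector-aligned split that verifies, else None.

    Bytes before the split are read contiguously from `start`; bytes
    from the split onward are read `gap` bytes further into the image.
    `verify` is any callable that accepts the reassembled bytes and
    returns True when they look consistent.
    """
    for split in range(0, length + 1, SECTOR):
        head = image[start:start + split]
        tail = image[start + split + gap:start + gap + length]
        if verify(head + tail):
            return split
    return None
```

Because only one gap is considered, a sketch like this shares the limitation noted earlier: it cannot recover a region that is broken in more than one place.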