Advanced PDF Carver
The PyFlag PDF Carver is based on the work of DFRWS_Scudette, described in DFRWS_PDF.
There are a number of limitations with the current implementation:
- We only support the Deflate method for object verification. This means that we are unable to test for fragmentation occurring within objects encoded in any other way. We hope to add support for LZW, CCITT and JPEG encodings in the future.
- We only implement disambiguation of points, which is suitable for first-order fragmentation only. This will fail when the file is too heavily fragmented.
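To illustrate why Deflate lends itself to verification, the following is a minimal sketch and not the PyFlag code. It assumes that a compressed object is checked simply by trying to inflate it with Python's zlib module; the helper name stream_is_intact and the synthetic data are made up for this example.

```python
import zlib

def stream_is_intact(data: bytes) -> bool:
    """Return True if data inflates cleanly as a zlib (Deflate) stream."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        # Corrupt codes, a truncated stream or a checksum mismatch all end up here.
        return False

# Synthetic stand-in for a compressed PDF object (illustrative only).
payload = b"".join(
    b"%d 0 obj << /Length %d >> endobj\n" % (i, i * 7) for i in range(300)
)
stream = zlib.compress(payload)

# Simulate fragmentation: foreign bytes sit in the middle of the stream on
# disk, and we read back the same number of bytes assuming it is contiguous.
cut = len(stream) // 2
foreign = b"\xaa" * 32
fragmented = (stream[:cut] + foreign + stream[cut:])[:len(stream)]

print(stream_is_intact(stream))
print(stream_is_intact(fragmented))
```

A failed inflate only says that something inside the object is wrong. Locating the exact break still requires trying alternative mappings.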
Example
The following is an example of how to use the PDF carver. First obtain the latest development version of PyFlag:
$ darcs get http://www.pyflag.net/pyflag/
$ cd pyflag
$ ./configure
$ make
$ sudo make install
The advanced carvers are only available as standalone tools at this stage and are not incorporated into the GUI. PyFlag uses a test-driven development model, so we can test the carvers by calling make in the carver directory:
$ cd src/pyflag/Carvers/
$ make
This will build the carvers and test them against a test set (which may be downloaded from the PyFlag site).
We now use the carvers to solve the DFRWS 2007 challenge. Carving for PDFs requires going through a number of steps. The first step is to index the image for PDF artifacts. The following will create the index file pdf_test.idx:
$ python pdf_carver.py -c -i pdf_test.idx dfrws-2007-challenge.img
Next we need to create the map files. These map files are the initial guess for the mapping functions of each PDF found in the image. Each map file corresponds to a single XREF table, and the carver tries to coalesce related XREF tables into the same mapping function.
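A mapping function can be pictured as a short list of points, each pairing a file offset with an image offset, with contiguous data assumed between points. The sketch below is only an illustration of that idea. The in-memory layout, the function name image_offset and the sample offsets are assumptions, and this is not the on-disk format of PyFlag's .map files.

```python
from bisect import bisect_right

def image_offset(points, file_offset):
    """Translate a file offset into an image offset.

    points is a list of (file_offset, image_offset) pairs sorted by file
    offset. Data between two points is assumed to be contiguous.
    """
    keys = [f for f, _ in points]
    k = bisect_right(keys, file_offset) - 1  # last point at or before the offset
    if k < 0:
        raise ValueError("offset precedes the first mapped point")
    f, i = points[k]
    return i + (file_offset - f)

# Hypothetical map: the file starts at image offset 10240, and at file
# offset 2048 the data jumps 4096 bytes further into the image.
example_map = [(0, 10240), (2048, 16384)]

for off in (100, 2047, 2048, 3000):
    print(off, image_offset(example_map, off))
```

Each extra point in such a list corresponds to one discontinuity, that is, one place where the file is fragmented.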
$ python pdf_carver.py -m -i pdf_test.idx dfrws-2007-challenge.img
$ ls *.map
103228588.map            148733892.map  300968051.map            38267254.map
103537905-103228588.map  236510367.map  304512789.map            38315692-38267254.map
103537905.map            300187251.map  304516723-304512789.map  38315692.map
Each of these maps can be viewed graphically (you need to have gnuplot installed for this), for example:
$ python pdf_carver.py -p -M 146306832-146288243.map
We can see the number and nature of the discontinuities, although their exact positions are inaccurate because we have not yet tested this file. For PDF files this map is usually sufficient to open the file with a PDF viewer despite the errors, because most of the PDF objects are present and the general structure of the file is correct. If for some reason we are unable to verify the file, we can simply extract it:
$ python pdf_carver.py -M 146306832-146288243.map -e output.pdf dfrws-2007-challenge.img
To work out exactly where the discontinuities are, we need to force the file with the -f option. In this example we also use -F to save the resulting map in output.map:
$ python pdf_carver.py -M 146306832-146288243.map -f output.pdf -F output.map dfrws-2007-challenge.img
Ambiguous point found at offset 2048: forward=146290176 vs reverse=146292224...
Checking until 2982
Found a hit at 2048
Verifying complete file: No errors
Extracting into file output.pdf
Saving map in output.map
As can be seen, the discontinuity was moved from its original position to a more accurate one.
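One plausible reading of the "forward" and "reverse" values in the output above is that they are two predictions for the same file offset: one extrapolated from the known point before it and one extrapolated backwards from the known point after it. The sketch below works through that arithmetic. The two neighbouring points are hypothetical, chosen only so that the result reproduces the numbers in the log, and the function name is not part of PyFlag.

```python
def predictions(prev_point, next_point, file_offset):
    """Return (forward, reverse) image offsets predicted for file_offset.

    forward extrapolates from the known point before the offset,
    reverse extrapolates backwards from the known point after it.
    """
    prev_file, prev_image = prev_point
    next_file, next_image = next_point
    forward = prev_image + (file_offset - prev_file)
    reverse = next_image - (next_file - file_offset)
    return forward, reverse

# Hypothetical neighbouring map points (file offset, image offset).
prev_point = (0, 146288128)
next_point = (4096, 146294272)

forward, reverse = predictions(prev_point, next_point, 2048)
ambiguous = forward != reverse  # the two sides disagree, so data is missing between them
gap = reverse - forward         # size of the foreign data between the two points

print(forward, reverse, ambiguous, gap)
```

Under this reading, the two predictions disagree by 2048 bytes, so a discontinuity of that size lies somewhere between the two known points, and forcing the file tests candidate positions until the objects verify.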
