
0xkyle/pdf_object_hashing

 
 


pdf-tools

PDF tools and libraries

Starting with a generic Python library. I wanted this to help streamline creating rules and identifying what is unusual about a PDF.

pdf_obj_hash.py

Command line tool to generate the PDF object hash of a given PDF. Also supports scanning an entire directory.

usage: pdf_obj_hash.py [-h] [-f FILE] [-d DIR] [--ftrace] [--debug] [--time-trace] [--print-hash-string] [--hunt-string HUNT_STRING] [--print-info]

Generate a PDF Object Hash of the provided file or files.

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  file to parse
  -d DIR, --dir DIR     directory to scan for PDFs
  --ftrace              DEBUG: prints functions as they're called
  --debug               DEBUG: prints mid-function debug info
  --time-trace          DEBUG: time the individual regex scans in the run.
  --print-hash-string   print the hash string instead of the obj hash
  --hunt-string HUNT_STRING
                        hunt for a complete or partial hash string ("Catalog|Producer|Pages|Page|None|Length")
  --print-info          kinda debug, print object and object number
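
The `--hunt-string` option above takes a complete or partial hash string. How the tool matches it is not documented here, so the following is only a minimal sketch that assumes a plain substring match against each file's hash string. The `hunt` function and the sample file names are hypothetical and are not part of pdf_obj_hash.py.

```python
def hunt(hash_strings, needle):
    """Return the names whose hash string contains `needle`.

    Assumption: a "partial" hunt string is treated as a plain substring.
    """
    return sorted(name for name, hs in hash_strings.items() if needle in hs)


# Hypothetical hash strings, in the format shown in the help text.
known = {
    "invoice.pdf": "Catalog|Producer|Pages|Page|None|Length",
    "report.pdf": "Catalog|Pages|Page|Font|None",
    "lure.pdf": "Catalog|Producer|Pages|Page|None|Length",
}

complete = hunt(known, "Catalog|Producer|Pages|Page|None|Length")
partial = hunt(known, "Pages|Page|Font")
```

A complete hunt string finds files with an identical structure, while a partial one finds files that share a run of object types.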


What is a PDF Object Hash?

A PDF Object Hash is a way to identify similarities between PDFs without relying on the content of the document. Object hashing captures the structure, or skeleton, of the document instead. Think of it as similar to an imphash or a JA3 hash. We extract the type of each object, join the types into a hash string (for example "Catalog|Producer|Pages|Page|None|Length"), and hash that string to produce the object hash. This lets us quickly cluster similar documents and helps identify overlaps between otherwise disparate files.
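
To make the idea concrete, here is a simplified sketch of the technique, not the code in pdf_obj_hash.py. It makes several assumptions: objects are found with a regex over `N G obj ... endobj`, only the first `/Type` entry of each object is used, objects without a type become "None", and MD5 is the digest. The real tool evidently records more than `/Type` (its hash strings include entries such as "Producer" and "Length"), and all function names here are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Body of each "N G obj ... endobj" block; DOTALL lets bodies span lines.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)endobj", re.DOTALL)
# First "/Type /Name" entry inside an object body.
TYPE_RE = re.compile(rb"/Type\s*/([A-Za-z0-9]+)")


def hash_string(pdf_bytes):
    """Join the object types, in file order, with '|'."""
    types = []
    for body in OBJ_RE.findall(pdf_bytes):
        match = TYPE_RE.search(body)
        types.append(match.group(1).decode("ascii") if match else "None")
    return "|".join(types)


def object_hash(pdf_bytes):
    """Digest of the hash string (MD5 is an assumption of this sketch)."""
    return hashlib.md5(hash_string(pdf_bytes).encode("ascii")).hexdigest()


def cluster(docs):
    """Group document names by object hash."""
    groups = defaultdict(list)
    for name, data in docs.items():
        groups[object_hash(data)].append(name)
    return dict(groups)


def make_pdf(text):
    """Build a tiny PDF-like byte string with the given stream content."""
    return (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
        b"3 0 obj\n<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>\nendobj\n"
        b"4 0 obj\n<< /Length " + str(len(text)).encode("ascii")
        + b" >>\nstream\n" + text + b"\nendstream\nendobj\n"
    )


doc_a = make_pdf(b"hello")
doc_b = make_pdf(b"completely different content")
print(hash_string(doc_a))
print(object_hash(doc_a) == object_hash(doc_b))
```

The two sample documents differ in content but share the same skeleton, so they produce the same hash string and fall into the same cluster.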

Recent updates to pdf_obj_hash.py and pdf_lib.py change how we parse objects, which should make parsing more accurate. In testing, this gives better results on "weird" PDFs (such as those with invalid xref entries).

pdf_lib.py

This is a Python library for analyzing PDFs.
