pdf-tools

PDF tools and libraries

Starting with a generic python library. I wanted this to help streamline creating rules and identifying what is weird about a PDF.

pdf_obj_hash.py

Command line tool to generate the PDF object hash of a given PDF. Also supports scanning an entire directory.

usage: pdf_obj_hash.py [-h] [-f FILE] [-d DIR] [--ftrace] [--debug] [--time-trace] [--print-hash-string] [--hunt-string HUNT_STRING] [--print-info]

Generate a PDF Object Hash of the provided file or files.

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  file to parse
  -d DIR, --dir DIR     directory to scan for PDFs
  --ftrace              DEBUG: prints functions as they're called
  --debug               DEBUG: prints mid-function debug info
  --time-trace          DEBUG: time the individual regex scans in the run.
  --print-hash-string   print the hash string instead of the obj hash
  --hunt-string HUNT_STRING
                        hunt for a complete or partial hash string ("Catalog|Producer|Pages|Page|None|Length")
  --print-info          kinda debug, print object and object number

What is a PDF Object Hash?

PDF Object Hash is a way to identifying similarities between PDFs without relying on the content of the document. With object hashing we can identify the structure or skeleton of the document. Think of this as similar to an imphash or a ja3 hash. We extract out the object type and hash those to generate the hash. This allows us to quickly cluster similar documents and helps with identifying overlaps in disparate files.

Recent updates to pdf_obj_hash.py and pdf_lib.py change how we parse objects, which should allow for better and more accurate parsing. This should (and in testing does) give us better results when dealing with "weird" pdfs (such as invalid xref entries).

pdf_lib.py

This is a python library for analyzing PDFs.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
pdf_object_hashing		pdf_object_hashing
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pdf-tools

pdf_obj_hash.py

What is a PDF Object Hash?

pdf_lib.py

About

Uh oh!

Releases

Packages

Languages

License

0xkyle/pdf_object_hashing

Folders and files

Latest commit

History

Repository files navigation

pdf-tools

pdf_obj_hash.py

What is a PDF Object Hash?

pdf_lib.py

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages