PDF‑Outline‑Challenge

Extract a clean heading outline (Title → H1/H2/H3) from any PDF, fully offline, on CPU‑only hardware.

Encoder – Donut‑base INT8 (ONNX)
Classifier head – small MLP (models/donut_head.pkl)
Clustering – rule‑based, layout‑agnostic (pdf_outline/cluster.py)
Renderer – PyMuPDF (fitz) + Pillow, so no external OCR engine required
Fully offline at runtime – all models & wheels shipped in repo

1 Quick Start (Offline Docker Build)

Pre‑download all Python wheels into wheelhouse/ (one‑time, already done):

mkdir -p wheelhouse
pip download \
  --only-binary=:all: \
  --no-deps \
  -d wheelhouse \
  -r requirements.txt \
  --extra-index-url https://download.pytorch.org/whl/cpu

Build the Docker image (uses wheelhouse/ for offline install):

docker build --platform linux/amd64 -t pdfoutline.challenge .

Run against your PDFs:

docker run --rm \
  -v "$(pwd)/sample_dataset/pdfs:/app/input:ro" \
  -v "$(pwd)/sample_dataset/outputs:/app/output" \
  --network none \
  pdfoutline.challenge

Your JSON outlines will appear in sample_dataset/outputs/ in ≈2 min (<1 GB image).

2 Repository Layout

pdf_outline_project/ ├── Dockerfile # offline, CPU‑only container spec (uses wheelhouse/) ├── process_pdfs.py # entry‑point: loops /app/input → /app/output ├── requirements.txt # runtime deps ├── README.md # ← you are here │ ├── wheelhouse/ # pre‑downloaded wheels for pip install --no-index │ ├── models/ # bundled INT8 Donut & head │ ├── donut_base_int8/int8/ # encoder+decoder .onnx │ └── donut_head.pkl │ ├── pdf_outline/ # installable package │ ├── init.py # exposes CLI entrypoint │ ├── cli.py # thin wrapper around run() │ ├── render.py # PDF → RGB images │ ├── donut_infer.py # ONNXRuntime inference │ ├── extract_lines.py # text‑line detection │ ├── classify.py # small MLP head │ └── cluster.py # robust heading‑level assignment │ └── sample_dataset/ # demo + schema ├── pdfs/ # input PDFs ├── outputs/ # output JSONs └── schema/ └── output_schema.json

3 How It Works

Stage	Component	Description
1. Render pages	`pdf_outline.render` (PyMuPDF)	Renders each page at DPI 120 (configurable)
2. Encode pages	Donut‑base INT8 ONNX	Produces one 1024‑D CLS token embedding per page
3. Classify lines	MLP head (`donut_head.pkl`)	Yields a heading‑probability for each detected text line
4. Cluster heads	`pdf_outline.cluster.assign_levels`	Font-size ranking + numbering rules + repeat filtering ⇒ Title/H1…
5. Dump JSON	`process_pdfs.py`	Writes `<pdf_name>.json` matching `output_schema.json`

All steps run CPU, offline, on amd64 with ≤16 GB RAM.

4 Key Commands

Purpose	Command
Build image	`docker build --platform linux/amd64 -t pdfoutline.challenge .`
Run pipeline	`docker run --rm -v $(pwd)/sample_dataset/pdfs:/app/input:ro -v $(pwd)/sample_dataset/outputs:/app/output --network none pdfoutline.challenge`
Local dev	`python -m pdf_outline.cli <file.pdf> --dpi 120`

5 Troubleshooting

Symptom	Fix
`ConvInteger` node not implemented	Ensure you're using the INT8 ONNX files under `models/donut_base_int8/int8/`
Out-of-memory (>16 GB)	Lower DPI (`--dpi 80`) or reduce batch size (in `donut_infer.py`)
No JSON outputs	Verify you mounted `/app/output` correctly; inspect container logs for per‑PDF errors

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
build/lib/pdf_outline		build/lib/pdf_outline
models		models
pdf_outline		pdf_outline
sample_dataset		sample_dataset
scripts		scripts
wheelhouse		wheelhouse
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
process_pdfs.py		process_pdfs.py
pyproject.toml		pyproject.toml
quick_tests.py		quick_tests.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PDF‑Outline‑Challenge

1 Quick Start (Offline Docker Build)

2 Repository Layout

3 How It Works

4 Key Commands

5 Troubleshooting

About

Uh oh!

Releases

Packages

Languages

invst-git/PDF-Structure-Extraction

Folders and files

Latest commit

History

Repository files navigation

PDF‑Outline‑Challenge

1 Quick Start (Offline Docker Build)

2 Repository Layout

3 How It Works

4 Key Commands

5 Troubleshooting

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages