feat: Add PDF comparison support by houfu · Pull Request #83 · houfu/redlines

houfu · 2026-01-09T06:02:00Z

Summary

Adds a PDFFile document class for comparing text-based PDF files
Uses the pdfplumber library for text extraction (MIT-licensed; it exposes word-level bounding boxes, which leaves room for future chunk comparison features)
PDF support is an optional dependency: pip install redlines[pdf]
The CLI detects .pdf files by extension and extracts their text automatically
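
The extension-based detection mentioned in the summary could look roughly like the sketch below. It is illustrative only: the helper names is_pdf and pick_loader and the "pdf" / "text" labels are hypothetical and are not taken from the redlines CLI.

```python
from pathlib import Path


def is_pdf(path: str) -> bool:
    """Return True when the file name ends in .pdf (case-insensitive)."""
    return Path(path).suffix.lower() == ".pdf"


def pick_loader(path: str) -> str:
    """Name the loader a CLI could choose for this path.

    The labels "pdf" and "text" are hypothetical placeholders.
    """
    return "pdf" if is_pdf(path) else "text"
```

Checking only the suffix keeps detection cheap, at the cost of trusting the file name rather than the file contents.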

Implementation Notes

This is a text-dump implementation: it extracts plain text only, using page.extract_text().

✓ Works well for simple PDFs (invoices, reports, single-column documents)
⚠️ Limited for complex legal contracts (multi-column layouts, tables, headers/footers are flattened)
Future enhancements, including layout-aware extraction, are tracked in #84 (Enhanced PDF support: Layout-aware extraction and structure preservation)
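
For readers unfamiliar with pdfplumber, a text dump of this kind can be sketched as follows. This is a minimal sketch, not the code in this PR: dump_pages and dump_pdf are hypothetical names, and the only pdfplumber calls assumed are pdfplumber.open(), pdf.pages, and page.extract_text(). The import is deferred so that the module still loads when the optional dependency is missing.

```python
from typing import Iterable, List, Optional, Protocol


class HasExtractText(Protocol):
    """Anything with an extract_text() method, such as a pdfplumber page."""

    def extract_text(self) -> Optional[str]: ...


def dump_pages(pages: Iterable[HasExtractText]) -> List[str]:
    """Return the plain text of each page; pages without text yield ''."""
    return [(page.extract_text() or "") for page in pages]


def dump_pdf(path: str) -> List[str]:
    """Open a PDF with pdfplumber and dump the text of every page."""
    try:
        import pdfplumber  # optional dependency, imported lazily
    except ImportError as exc:
        raise ImportError(
            "PDF support needs pdfplumber: pip install redlines[pdf]"
        ) from exc
    with pdfplumber.open(path) as pdf:
        return dump_pages(pdf.pages)
```

Because each page is reduced to a single string, columns, tables, headers, and footers all end up in one flat stream of text, which is the limitation noted above.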

Usage

from redlines import Redlines
from redlines.pdf import PDFFile

source = PDFFile("contract_v1.pdf")
test = PDFFile("contract_v2.pdf")

redline = Redlines(source, test)
print(redline.output_markdown)

# CLI usage - auto-detects PDF files
redlines json doc_v1.pdf doc_v2.pdf --pretty

Features

PDFFile.text - Extracted text from all pages
PDFFile.pages - List of PDFPage objects with page_number and text
PDFFile.page_count - Number of pages
preserve_pages=True option to insert [Page N] markers in text
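
The effect of preserve_pages can be illustrated with a small stand-alone model. This is a sketch under assumptions: PageSketch and join_pages are hypothetical names, and the marker placement (on its own line before each page's text) and the newline separator are guesses, not the behaviour confirmed by this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSketch:
    """Hypothetical stand-in for a page record with a number and its text."""

    page_number: int
    text: str


def join_pages(pages: List[PageSketch], preserve_pages: bool = False) -> str:
    """Join page texts, optionally prefixing each page with a [Page N] marker."""
    parts: List[str] = []
    for page in pages:
        if preserve_pages:
            # Assumed placement: the marker sits on its own line before the page text.
            parts.append(f"[Page {page.page_number}]")
        parts.append(page.text)
    return "\n".join(parts)
```

With markers in the text, a diff can show which page a change falls on; the trade-off is that a change which shifts page breaks may also show up as moved markers.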

Test plan

Run pytest tests/test_pdf.py -v - 9 tests pass
Run full test suite pytest tests/ - 123 pass, 12 skipped (nupunkt)
Run mypy redlines/ - no issues
Test CLI with PDF files manually

Closes #1

🤖 Generated with Claude Code

Add PDFFile document class for comparing text-based PDF files. Uses pdfplumber for text extraction with support for page-level access.
- New PDFFile class extending Document with text, pages, and page_count properties
- Optional preserve_pages parameter to insert [Page N] markers
- CLI auto-detects .pdf files and extracts text automatically
- PDF support is an optional dependency: pip install redlines[pdf]
Closes #1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add PDF optional dependency to installation sections in README, AGENT_GUIDE, and __init__.py
- Add Pattern 1b for comparing PDF files in AGENT_GUIDE
- Add PDF to API documentation list in __init__.py
- Add PDF example to CLI docstring
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add comprehensive module-level docstring with usage examples
- Expand PDFPage dataclass docstring with example
- Enhance PDFFile class docstring with features, limitations, and multiple examples
- Add detailed docstrings to all properties (text, pages, page_count)
- Add docstring to _extract_text method
- Follow project style with Sphinx-style parameter documentation
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Update module docstring: emphasize text dump implementation
- Update class docstring: note limitations for complex documents
- Brief and factual about multi-column, table, header/footer handling
Related: #84 (future enhancements)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

houfu and others added 3 commits January 9, 2026 14:01

houfu force-pushed the feature/pdf-support branch from e3decf2 to 6dee1fd Compare January 9, 2026 06:44

houfu mentioned this pull request Jan 9, 2026

Enhanced PDF support: Layout-aware extraction and structure preservation #84

Open

houfu merged commit 9eb5b8f into main Jan 9, 2026
10 checks passed

houfu merged 4 commits into main from feature/pdf-support
