feat: Add PDF comparison support #83

Merged
houfu merged 4 commits into main from feature/pdf-support on Jan 9, 2026

houfu commented Jan 9, 2026

Summary

  • Adds PDFFile document class for comparing text-based PDF files
  • Uses pdfplumber library for text extraction (MIT license, supports word-level bounding boxes for future chunk comparison features)
  • PDF support is an optional dependency: pip install redlines[pdf]
  • CLI auto-detects .pdf files and handles them automatically
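
The optional-dependency behaviour can be sketched as a guarded import. This is a minimal illustration only, not the code in this PR: the helper name `require_module` and the wording of the error message are hypothetical.

```python
import importlib


def require_module(name: str, extra: str):
    """Import an optional module, or raise with an install hint.

    Hypothetical helper for illustration; not part of redlines.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Point the user at the extra that provides the missing package.
        raise ImportError(
            f"{name} is required for this feature. "
            f"Install it with: pip install redlines[{extra}]"
        ) from exc
```

Under this sketch, a PDF code path would call `require_module("pdfplumber", "pdf")` before touching any PDF, so users without the extra get an actionable message rather than a bare import failure.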

Implementation Notes

This is a text-dump implementation: it extracts plain text only, using page.extract_text().
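
A text dump of this kind might look roughly like the sketch below. It assumes pages are joined with a newline and that a page with no extractable text contributes an empty string. Both are assumptions, and the function names `dump_text` and `dump_pdf_text` are hypothetical rather than taken from this PR.

```python
from typing import Iterable, Protocol


class SupportsExtractText(Protocol):
    """Anything with a pdfplumber-style extract_text() method."""

    def extract_text(self) -> "str | None": ...


def dump_text(pages: Iterable[SupportsExtractText]) -> str:
    """Join the plain text of each page; layout is not preserved."""
    # Treat a missing result as an empty page (assumption of this sketch).
    return "\n".join(page.extract_text() or "" for page in pages)


def dump_pdf_text(path: str) -> str:
    """Open a PDF with pdfplumber and dump its text page by page."""
    # Imported lazily because pdfplumber is an optional dependency.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return dump_text(pdf.pages)
```

Keeping `dump_text` independent of pdfplumber means the joining logic can be exercised with stand-in page objects, without a real PDF on disk.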

Usage

from redlines import Redlines
from redlines.pdf import PDFFile

source = PDFFile("contract_v1.pdf")
test = PDFFile("contract_v2.pdf")

redline = Redlines(source, test)
print(redline.output_markdown)

# CLI usage - auto-detects PDF files
redlines json doc_v1.pdf doc_v2.pdf --pretty
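
How the CLI decides that an argument is a PDF is not spelled out here. One plausible approach, shown as a sketch, is a case-insensitive check on the file extension. The function name `looks_like_pdf` is hypothetical, and the actual CLI may detect PDFs differently.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True when the file name ends in .pdf (case-insensitive)."""
    # Only the final suffix is checked; file contents are not inspected.
    return Path(path).suffix.lower() == ".pdf"
```

An extension check is cheap and predictable, but it will misclassify a PDF saved under another extension; sniffing the file header would be the stricter alternative.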

Features

  • PDFFile.text - Extracted text from all pages
  • PDFFile.pages - List of PDFPage objects with page_number and text
  • PDFFile.page_count - Number of pages
  • preserve_pages=True option to insert [Page N] markers in text
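
The relationship between pages, text, and the [Page N] markers can be sketched as follows. The `PDFPage` dataclass here is a stand-in with the two fields listed above, and `join_pages` is a hypothetical helper. Placing each marker on its own line before the page text, joining with newlines, and numbering pages from 1 are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PDFPage:
    """Stand-in for the page object: a page number and its text."""

    page_number: int  # 1-based in this sketch (an assumption)
    text: str


def join_pages(pages: List[PDFPage], preserve_pages: bool = False) -> str:
    """Concatenate page text, optionally prefixing each page with a marker."""
    parts = []
    for page in pages:
        if preserve_pages:
            # Marker format follows the "[Page N]" description above.
            parts.append(f"[Page {page.page_number}]")
        parts.append(page.text)
    return "\n".join(parts)
```

With markers enabled, a change that moves text across a page boundary shows up in the diff next to the marker, which is why the option is off by default in this sketch.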

Test plan

  • Run pytest tests/test_pdf.py -v - 9 tests pass
  • Run full test suite pytest tests/ - 123 pass, 12 skipped (nupunkt)
  • Run mypy redlines/ - no issues
  • Test CLI with PDF files manually

Closes #1

🤖 Generated with Claude Code

houfu and others added 3 commits January 9, 2026 14:01
Add PDFFile document class for comparing text-based PDF files.
Uses pdfplumber for text extraction with support for page-level access.

- New PDFFile class extending Document with text, pages, and page_count properties
- Optional preserve_pages parameter to insert [Page N] markers
- CLI auto-detects .pdf files and extracts text automatically
- PDF support is an optional dependency: pip install redlines[pdf]

Closes #1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add PDF optional dependency to installation sections in README, AGENT_GUIDE, and __init__.py
- Add Pattern 1b for comparing PDF files in AGENT_GUIDE
- Add PDF to API documentation list in __init__.py
- Add PDF example to CLI docstring

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add comprehensive module-level docstring with usage examples
- Expand PDFPage dataclass docstring with example
- Enhance PDFFile class docstring with features, limitations, and multiple examples
- Add detailed docstrings to all properties (text, pages, page_count)
- Add docstring to _extract_text method
- Follow project style with Sphinx-style parameter documentation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Update module docstring: emphasize text dump implementation
- Update class docstring: note limitations for complex documents
- Brief and factual about multi-column, table, header/footer handling

Related: #84 (future enhancements)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@houfu houfu merged commit 9eb5b8f into main Jan 9, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

Read two PDFs. Compare. Redline.
