feat: Add RichToken dataclass and extend Chunk for format-aware comparison by houfu · Pull Request #86 · houfu/redlines

houfu · 2026-02-07T08:31:12Z

RichToken is a frozen, hashable token that carries both text and formatting
metadata (as a sorted tuple of key/value pairs). This enables SequenceMatcher
to detect differences in formatting — e.g. "Hello" bold ≠ "Hello" normal.

Chunk gains an optional rich_tokens field so processors can attach rich data
alongside the plain-text tokens used by existing rendering code.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
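To illustrate the idea above, here is a minimal sketch of a frozen, hashable token whose formatting is stored as a sorted tuple of key/value pairs. The field names and the from_dict helper are assumptions made for illustration; they are not taken from the PR's actual RichToken definition.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class RichToken:
    """Illustrative stand-in: text plus formatting as a sorted tuple of pairs."""

    text: str
    formatting: tuple[tuple[str, str], ...] = ()

    @classmethod
    def from_dict(cls, text: str, fmt: dict[str, str] | None = None) -> RichToken:
        # Sorting the items makes equality and hashing independent of dict order.
        return cls(text, tuple(sorted((fmt or {}).items())))


bold_hello = RichToken.from_dict("Hello", {"bold": "true"})
plain_hello = RichToken.from_dict("Hello")
world = RichToken.from_dict("world")

# frozen=True (with the default eq=True) gives the dataclass a __hash__,
# which SequenceMatcher requires of the elements it compares.
matcher = SequenceMatcher(None, [plain_hello, world], [bold_hello, world], autojunk=False)
opcodes = matcher.get_opcodes()
print(opcodes)  # the first opcode is a "replace": same text, different formatting
```

Because the formatting is part of the token's identity, two tokens with identical text but different formatting compare unequal, so the diff reports them as a replacement rather than a match.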


docx_parser.py (low-level XML parsing using zipfile + lxml):
- Extracts paragraph and run structure from word/document.xml
- Captures character formatting (bold, italic, underline, strikethrough,
  font, size, color, highlight, vert-align) and paragraph formatting
  (style, alignment, indentation, spacing, numbering)
- Reads track-change wrappers as-is (accepted state)
- Ignores headers, footers, comments, footnotes

docx.py (high-level document and processor classes):
- DocxFile(Document): loads a .docx, exposes .text (plain) and
  .rich_tokens (flat list of RichToken with merged paragraph + run props)
- DocxProcessor(RedlinesProcessor): word-level diff that compares
  RichTokens so both text and formatting changes are detected

Requires lxml (optional dependency: pip install redlines[docx]).

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
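The parsing approach described above can be sketched roughly as follows. This is not the PR's docx_parser.py: it swaps lxml for the standard library's xml.etree.ElementTree so the example has no third-party dependency, it reads only the bold flag, and the helper names make_docx and read_runs are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>
      <w:r><w:t xml:space="preserve"> world</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def make_docx(document_xml: str) -> bytes:
    """Build an in-memory zip holding only word/document.xml.

    A real .docx has more parts (content types, relationships); this
    stripped-down archive is enough to exercise the reader below.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def read_runs(docx_bytes: bytes) -> list[tuple[str, bool]]:
    """Return (text, is_bold) for each run, in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    runs = []
    for para in root.iter(f"{{{W}}}p"):
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            bold_el = run.find("w:rPr/w:b", NS)
            # <w:b/> with no w:val means "on"; w:val="0" or "false" turns it off.
            bold = bold_el is not None and bold_el.get(f"{{{W}}}val", "true") not in ("0", "false")
            runs.append((text, bold))
    return runs


runs = read_runs(make_docx(DOCUMENT_XML))
print(runs)
```

A fuller parser would also merge paragraph-level properties into each run and handle the other formatting attributes listed above.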

- Redlines.__init__ auto-detects DocxFile inputs and selects DocxProcessor
- output_json() enriched: when rich_tokens are present, tokens include
  formatting dicts, and replace changes include text_changed flag,
  source/test formatting, and a formatting_changes summary
- __init__.py exports new DOCX classes
- pyproject.toml adds optional docx = ["lxml>=4.9.0"] dependency
- 41 new tests covering parser, RichToken, DocxFile, DocxProcessor,
  Redlines integration, and edge cases (all pass alongside 112 existing)

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
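As a rough illustration of how a text_changed flag could be derived for replace changes, the sketch below diffs two token lists and compares the joined text on each side of every replace opcode. The token representation, the summarize_changes function, and the dictionary keys other than text_changed are assumptions; the actual output_json() structure is not shown in this PR description.

```python
from difflib import SequenceMatcher


def tok(text, **fmt):
    """Hypothetical token: (text, sorted tuple of formatting pairs)."""
    return (text, tuple(sorted(fmt.items())))


def summarize_changes(source, test):
    """List non-equal opcodes; flag replaces whose text actually differs."""
    changes = []
    matcher = SequenceMatcher(None, source, test, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        src_text = "".join(t[0] for t in source[i1:i2])
        tst_text = "".join(t[0] for t in test[j1:j2])
        entry = {"type": tag, "source_text": src_text, "test_text": tst_text}
        if tag == "replace":
            # False means a formatting-only change: same words, new styling.
            entry["text_changed"] = src_text != tst_text
        changes.append(entry)
    return changes


source = [tok("Party "), tok("hereby "), tok("agrees "), tok("in "),
          tok("good ", italic=True), tok("faith", italic=True)]
test = [tok("Party "), tok("consents "), tok("in "),
        tok("good ", bold=True, italic=True), tok("faith", bold=True, italic=True)]

changes = summarize_changes(source, test)
for change in changes:
    print(change)
```

The first replace covers a wording change, the second a formatting-only change, which is the distinction the flag is meant to expose to consumers of the JSON.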

- Fix variable shadowing in _extract_run_properties: use distinct names
  (toggle, u_val, font_val, etc.) instead of reusing `val` across scopes
  where mypy narrows to incompatible types
- Add type: ignore[attr-defined] on lxml imports (C extension without stubs)
- Add type: ignore[assignment] on lxml .get() calls that return Any
- Fix pyproject.toml mypy override to cover both "lxml" and "lxml.*"

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

Add tests/documents/DocxFile/ with source.docx and test.docx fixtures
representing a short contract with deliberate differences:
- Text change: "hereby agrees" → "consents"
- Formatting-only: "in good faith" italic → bold+italic
- Paragraph style: Normal → Heading2
- Alignment: center → left, color removed
- New paragraph appended

13 new tests in TestDocxFileComparison verify each change type is detected,
formatting_changes summaries are correct, JSON structure is valid, and
token coverage spans the full document.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

The existing `# type: ignore[no-untyped-call]` comments didn't cover the
`untyped-decorator` error code that newer mypy versions emit for
`@cli.command()`. Add both codes so mypy passes cleanly.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

Formatting-only changes (same text, different formatting) now render
distinctly from text changes in all output formats:

Markdown (default style):
- Formatting-only: blue text + superscript annotation,
  e.g. <span style='color:blue;...'>text</span><sup>[+bold]</sup>
- Text+formatting: standard del/ins + annotation
- All markdown styles (red-green, none, red, ghfm, bbcode, streamlit,
  custom_css) include fmt/fmt_note tag pairs

Rich (terminal):
- Formatting-only: bold blue text + blue annotation
- Text+formatting: red strikethrough + green + blue annotation

The _describe_formatting_changes() helper produces compact labels:
"+bold", "-italic", "style: Normal→Heading2", "alignment: center→left",
"-color: FF0000"

Also adds 8 new tests for markdown/rich output behavior on the fixture files.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
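The label format listed above could be produced by something like the following sketch. It is a guess at the behavior based only on the example labels, not the PR's _describe_formatting_changes() helper: the function name, the dict-based inputs, the use of True for toggles, and the "+key: value" form for newly added valued properties are all assumptions.

```python
def describe_formatting_changes(source: dict, test: dict) -> list[str]:
    """Compare two formatting dicts and return compact change labels.

    Assumed conventions: toggles such as bold are stored as True, and
    valued properties such as style or color are stored as strings.
    """
    labels = []
    for key in sorted(set(source) | set(test)):
        old, new = source.get(key), test.get(key)
        if old == new:
            continue
        if old is None:
            # Property added: "+bold" for toggles, "+color: FF0000" for values.
            labels.append(f"+{key}" if new is True else f"+{key}: {new}")
        elif new is None:
            # Property removed: "-italic" for toggles, "-color: FF0000" for values.
            labels.append(f"-{key}" if old is True else f"-{key}: {old}")
        else:
            labels.append(f"{key}: {old}→{new}")
    return labels


italic_to_bold_italic = describe_formatting_changes(
    {"italic": True}, {"italic": True, "bold": True}
)
heading_change = describe_formatting_changes(
    {"style": "Normal", "alignment": "center", "color": "FF0000"},
    {"style": "Heading2", "alignment": "left"},
)
print(italic_to_bold_italic)
print(heading_change)
```

Sorting the keys keeps the label order stable across runs, which matters when the labels end up in rendered output that tests compare against.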

Follow the same pattern as ins_class/del_class: the custom_css markdown
style now reads fmt_class and fmt_note_class from RedlinesOptions,
defaulting to 'redline-formatting' and 'redline-formatting-note'.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

claude added 8 commits February 7, 2026 08:29

houfu wants to merge 8 commits into main from claude/docx-redlines-feature-KAzrS
