feat: Add RichToken dataclass and extend Chunk for format-aware comparison #86
Open
…rison RichToken is a frozen, hashable token that carries both text and formatting metadata (as a sorted tuple of key/value pairs). This enables SequenceMatcher to detect differences in formatting — e.g. "Hello" bold ≠ "Hello" normal. Chunk gains an optional rich_tokens field so processors can attach rich data alongside the plain-text tokens used by existing rendering code. https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
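A minimal sketch of the idea described above, not the code from this PR: the field names `text` and `formatting` and the `from_dict` constructor are assumptions made for illustration. It shows why storing formatting as a sorted tuple makes the token hashable and order-independent.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RichToken:
    """Text plus formatting; frozen, so instances are hashable."""

    text: str
    # Sorted tuple of (key, value) pairs instead of a dict (dicts are unhashable).
    formatting: Tuple[Tuple[str, str], ...] = ()

    @classmethod
    def from_dict(cls, text: str, formatting: Dict[str, str]) -> "RichToken":
        # Hypothetical helper: sorting makes equality independent of key order.
        return cls(text, tuple(sorted(formatting.items())))


a = RichToken.from_dict("Hello", {"bold": "true", "italic": "true"})
b = RichToken.from_dict("Hello", {"italic": "true", "bold": "true"})
plain = RichToken.from_dict("Hello", {})

print(a == b)      # same text, same formatting
print(a == plain)  # same text, different formatting
```

Because equality and hashing cover both fields, a sequence matcher that compares such tokens treats a formatting difference like any other mismatch.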
docx_parser.py — low-level XML parsing using zipfile + lxml:
- Extracts paragraph and run structure from word/document.xml
- Captures character formatting (bold, italic, underline, strikethrough,
font, size, color, highlight, vert-align) and paragraph formatting
(style, alignment, indentation, spacing, numbering)
- Reads track-change wrappers as-is (accepted state)
- Ignores headers, footers, comments, footnotes
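The run extraction described above might look roughly like the following. This is a sketch only: it swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra dependencies, handles just bold, italic and color, and the function name `extract_runs` is hypothetical. For a real file, the XML string would come from `zipfile.ZipFile(path).read("word/document.xml")`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # ElementTree spells namespaced tags as "{uri}local"


def _is_on(element: ET.Element) -> bool:
    # A toggle such as <w:b/> counts as on unless w:val switches it off.
    return element.get(f"{W}val", "true") not in ("0", "false", "off")


def extract_runs(document_xml: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (text, properties) for every w:r element, in document order."""
    root = ET.fromstring(document_xml)
    runs = []
    for run in root.iter(f"{W}r"):
        text = "".join(t.text or "" for t in run.iter(f"{W}t"))
        props: Dict[str, str] = {}
        rpr = run.find(f"{W}rPr")
        if rpr is not None:
            for tag, name in (("b", "bold"), ("i", "italic")):
                toggle = rpr.find(f"{W}{tag}")
                if toggle is not None and _is_on(toggle):
                    props[name] = "true"
            color = rpr.find(f"{W}color")
            if color is not None:
                props["color"] = color.get(f"{W}val", "")
        runs.append((text, props))
    return runs


SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:rPr><w:b/><w:color w:val="FF0000"/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
runs = extract_runs(SAMPLE)
print(runs)
```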
docx.py — high-level document and processor classes:
- DocxFile(Document): loads a .docx, exposes .text (plain) and
.rich_tokens (flat list of RichToken with merged paragraph + run props)
- DocxProcessor(RedlinesProcessor): word-level diff that compares
RichTokens so both text and formatting changes are detected
Requires lxml (optional dependency: pip install redlines[docx]).
https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
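To illustrate the word-level comparison, here is a small sketch that uses plain `(text, formatting)` tuples as stand-ins for RichToken. The helpers `flatten` and `diff` are hypothetical names, and merging paragraph properties under run properties is an assumption about how the two are combined.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Token = Tuple[str, Tuple[Tuple[str, str], ...]]
Paragraph = Tuple[Dict[str, str], List[Tuple[str, Dict[str, str]]]]


def flatten(paragraphs: List[Paragraph]) -> List[Token]:
    """Split runs into words, each carrying merged paragraph + run properties."""
    tokens: List[Token] = []
    for para_props, runs in paragraphs:
        for text, run_props in runs:
            merged = tuple(sorted({**para_props, **run_props}.items()))
            tokens.extend((word, merged) for word in text.split())
    return tokens


def diff(source: List[Token], test: List[Token]) -> List[Tuple[str, str, str, bool]]:
    """Return (tag, old text, new text, text_changed) for each non-equal opcode."""
    matcher = SequenceMatcher(None, source, test, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old = [text for text, _ in source[i1:i2]]
        new = [text for text, _ in test[j1:j2]]
        # Identical words on both sides mean only the formatting differs.
        changes.append((tag, " ".join(old), " ".join(new), old != new))
    return changes


normal = {"style": "Normal"}
source = flatten([(normal, [("The party hereby agrees to act", {}),
                            ("in good faith", {"italic": "true"})])])
test = flatten([(normal, [("The party consents to act", {}),
                          ("in good faith", {"italic": "true", "bold": "true"})])])
changes = diff(source, test)
for change in changes:
    print(change)
```

The first change is a text replacement; the second has the same words on both sides, so only its formatting differs.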
- Redlines.__init__ auto-detects DocxFile inputs and selects DocxProcessor
- output_json() enriched: when rich_tokens are present, tokens include
formatting dicts, and replace changes include a text_changed flag,
source/test formatting, and a formatting_changes summary
- __init__.py exports new DOCX classes
- pyproject.toml adds optional docx = ["lxml>=4.9.0"] dependency
- 41 new tests covering parser, RichToken, DocxFile, DocxProcessor,
Redlines integration, and edge cases (all pass alongside 112 existing)
https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
- Fix variable shadowing in _extract_run_properties: use distinct names
(toggle, u_val, font_val, etc.) instead of reusing `val` across scopes
where mypy narrows to incompatible types
- Add type: ignore[attr-defined] on lxml imports (C extension without stubs)
- Add type: ignore[assignment] on lxml .get() calls that return Any
- Fix pyproject.toml mypy override to cover both "lxml" and "lxml.*"
https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
Add tests/documents/DocxFile/ with source.docx and test.docx fixtures
representing a short contract with deliberate differences:
- Text change: "hereby agrees" → "consents"
- Formatting-only: "in good faith" italic → bold+italic
- Paragraph style: Normal → Heading2
- Alignment: center → left, color removed
- New paragraph appended
13 new tests in TestDocxFileComparison verify each change type is
detected, formatting_changes summaries are correct, JSON structure is
valid, and token coverage spans the full document.
https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
The existing `# type: ignore[no-untyped-call]` comments didn't cover the `untyped-decorator` error code that newer mypy versions emit for `@cli.command()`. Add both codes so mypy passes cleanly. https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
Formatting-only changes (same text, different formatting) now render
distinctly from text changes in all output formats:
Markdown (default style):
- Formatting-only: blue text + superscript annotation
e.g. <span style='color:blue;...'>text</span><sup>[+bold]</sup>
- Text+formatting: standard del/ins + annotation
- All markdown styles (red-green, none, red, ghfm, bbcode,
streamlit, custom_css) include fmt/fmt_note tag pairs
Rich (terminal):
- Formatting-only: bold blue text + blue annotation
- Text+formatting: red strikethrough + green + blue annotation
The _describe_formatting_changes() helper produces compact labels:
"+bold", "-italic", "style: Normal→Heading2",
"alignment: center→left", "-color: FF0000"
Also adds 8 new tests for markdown/rich output behavior on the
fixture files.
https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz
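The label format quoted in this commit message can be reproduced with a short sketch like the one below. It is not the PR's `_describe_formatting_changes()`: the set of toggle keys, the alphabetical ordering, and the "+key: value" form for a newly added value are assumptions; only the five example labels come from the text above.

```python
from typing import Dict, List

# Assumption: these properties are treated as on/off toggles.
TOGGLES = {"bold", "italic", "underline", "strikethrough"}


def describe_formatting_changes(source: Dict[str, str], test: Dict[str, str]) -> List[str]:
    """Build compact labels for every property that differs between two tokens."""
    labels = []
    for key in sorted(set(source) | set(test)):
        old, new = source.get(key), test.get(key)
        if old == new:
            continue
        if key in TOGGLES:
            labels.append(f"+{key}" if new else f"-{key}")
        elif old is None:
            labels.append(f"+{key}: {new}")  # assumed form, not shown in the source
        elif new is None:
            labels.append(f"-{key}: {old}")
        else:
            labels.append(f"{key}: {old}→{new}")
    return labels


print(describe_formatting_changes({"italic": "true"}, {"italic": "true", "bold": "true"}))
print(describe_formatting_changes({"style": "Normal"}, {"style": "Heading2"}))
print(describe_formatting_changes({"alignment": "center", "color": "FF0000"},
                                  {"alignment": "left"}))
```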
Follow the same pattern as ins_class/del_class — the custom_css markdown style now reads fmt_class and fmt_note_class from RedlinesOptions, defaulting to 'redline-formatting' and 'redline-formatting-note'. https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz