feat: Add RichToken dataclass and extend Chunk for format-aware comparison #86

Open
houfu wants to merge 8 commits into main from claude/docx-redlines-feature-KAzrS

houfu commented Feb 7, 2026

RichToken is a frozen, hashable token that carries both text and formatting
metadata (as a sorted tuple of key/value pairs). This enables SequenceMatcher
to detect differences in formatting — e.g. "Hello" bold ≠ "Hello" normal.

Chunk gains an optional rich_tokens field so processors can attach rich data
alongside the plain-text tokens used by existing rendering code.
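
As a rough illustration of the idea, the sketch below shows how a frozen dataclass with a sorted formatting tuple lets `difflib.SequenceMatcher` tell apart two tokens with the same text. The field names and the `make_token` helper are assumptions for illustration, not necessarily the definitions used in this PR.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Tuple


@dataclass(frozen=True)
class RichToken:
    """Hypothetical shape: text plus formatting as a sorted tuple of pairs."""
    text: str
    formatting: Tuple[Tuple[str, str], ...] = ()


def make_token(text: str, **fmt: str) -> RichToken:
    # Sorting the pairs makes equality and hashing independent of keyword order.
    return RichToken(text, tuple(sorted(fmt.items())))


source = [make_token("Hello", bold="true"), make_token("world")]
test = [make_token("Hello"), make_token("world")]

# frozen=True makes the dataclass hashable, which SequenceMatcher requires.
matcher = SequenceMatcher(None, source, test, autojunk=False)
opcodes = matcher.get_opcodes()
print(opcodes)
```

Because equality covers both fields, the bold "Hello" and the plain "Hello" are reported as a replace, while "world" is reported as equal.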

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

docx_parser.py — low-level XML parsing using zipfile + lxml:
  - Extracts paragraph and run structure from word/document.xml
  - Captures character formatting (bold, italic, underline, strikethrough,
    font, size, color, highlight, vert-align) and paragraph formatting
    (style, alignment, indentation, spacing, numbering)
  - Reads track-change wrappers as-is (accepted state)
  - Ignores headers, footers, comments, footnotes
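
The parsing approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's parser: it swaps in the standard library's `xml.etree.ElementTree` for lxml so it runs without extra dependencies, handles only bold and italic, and the `read_runs` and `make_docx` names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
OFF_VALUES = ("0", "false", "off")


def read_runs(docx_bytes):
    """Return a list of paragraphs, each a list of (text, props) runs."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        runs = []
        # iter() also finds runs nested inside wrapper elements in the paragraph.
        for r in p.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            props = {}
            rpr = r.find("w:rPr", NS)
            if rpr is not None:
                for name, tag in (("bold", "b"), ("italic", "i")):
                    el = rpr.find(f"w:{tag}", NS)
                    if el is not None:
                        # A toggle element with no w:val attribute means "on".
                        val = el.get(f"{{{W}}}val", "true")
                        props[name] = val not in OFF_VALUES
            runs.append((text, props))
        paragraphs.append(runs)
    return paragraphs


def make_docx(document_xml):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


demo_xml = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>"
    '<w:r><w:rPr><w:i w:val="0"/></w:rPr><w:t> world</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
parsed = read_runs(make_docx(demo_xml))
print(parsed)
```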

docx.py — high-level document and processor classes:
  - DocxFile(Document): loads a .docx, exposes .text (plain) and
    .rich_tokens (flat list of RichToken with merged paragraph + run props)
  - DocxProcessor(RedlinesProcessor): word-level diff that compares
    RichTokens so both text and formatting changes are detected
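
One way to picture the "flat list with merged paragraph + run props" is the sketch below. It assumes that run-level properties override paragraph-level ones on a key clash and that tokens are split on whitespace; both are assumptions for illustration, and `flatten` is a hypothetical name.

```python
import re


def flatten(paragraphs):
    """Flatten (paragraph_props, runs) pairs into (word, formatting) tokens."""
    tokens = []
    for para_props, runs in paragraphs:
        for text, run_props in runs:
            # Assumption: run properties win over paragraph properties.
            merged = {**para_props, **run_props}
            # Keys are unique, so sorting never has to compare the values.
            fmt = tuple(sorted(merged.items()))
            # Keep trailing whitespace attached to each word.
            for word in re.findall(r"\S+\s*", text):
                tokens.append((word, fmt))
    return tokens


tokens = flatten([({"style": "Normal"}, [("Hello world", {"bold": True})])])
print(tokens)
```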

Requires lxml (optional dependency: pip install redlines[docx]).

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

- Redlines.__init__ auto-detects DocxFile inputs and selects DocxProcessor
- output_json() enriched: when rich_tokens are present, tokens include
  formatting dicts, and replace changes include text_changed flag,
  source/test formatting, and a formatting_changes summary
- __init__.py exports new DOCX classes
- pyproject.toml adds optional docx = ["lxml>=4.9.0"] dependency
- 41 new tests covering parser, RichToken, DocxFile, DocxProcessor,
  Redlines integration, and edge cases (all pass alongside 112 existing)
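
To make the enriched replace entry more concrete, here is a hedged sketch of how such a record could be assembled. Only `text_changed` and `formatting_changes` are names taken from the description above; the other keys, the `from`/`to` layout, and the `describe_replace` helper are assumptions, and the real `output_json()` structure may differ.

```python
import json
from typing import Any, Dict, Tuple

Formatting = Tuple[Tuple[str, str], ...]


def describe_replace(
    source: Tuple[str, Formatting], test: Tuple[str, Formatting]
) -> Dict[str, Any]:
    """Build a JSON-serialisable summary of one replace change (hypothetical)."""
    source_text, source_fmt = source
    test_text, test_fmt = test
    src, dst = dict(source_fmt), dict(test_fmt)
    # Record only the keys whose values differ between the two sides.
    changes = {
        key: {"from": src.get(key), "to": dst.get(key)}
        for key in sorted(set(src) | set(dst))
        if src.get(key) != dst.get(key)
    }
    return {
        "type": "replace",
        "text_changed": source_text != test_text,
        "source_formatting": src,
        "test_formatting": dst,
        "formatting_changes": changes,
    }


entry = describe_replace(
    ("in good faith", (("italic", "true"),)),
    ("in good faith", (("bold", "true"), ("italic", "true"))),
)
print(json.dumps(entry, indent=2))
```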

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

- Fix variable shadowing in _extract_run_properties: use distinct names
  (toggle, u_val, font_val, etc.) instead of reusing `val` across scopes
  where mypy narrows to incompatible types
- Add type: ignore[attr-defined] on lxml imports (C extension without stubs)
- Add type: ignore[assignment] on lxml .get() calls that return Any
- Fix pyproject.toml mypy override to cover both "lxml" and "lxml.*"
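
The shadowing fix can be illustrated with a simplified stand-in. The function below is hypothetical (it reads from a plain dict rather than XML elements) and only shows the naming pattern: each lookup gets its own variable, so mypy never sees one name assigned values of incompatible types.

```python
from typing import Dict, Union

PropValue = Union[bool, str, float]


def extract_run_properties(attrs: Dict[str, str]) -> Dict[str, PropValue]:
    """Simplified stand-in: one distinct name per lookup instead of a shared `val`."""
    props: Dict[str, PropValue] = {}

    toggle = attrs.get("b")  # inferred as Optional[str]
    if toggle is not None:
        props["bold"] = toggle not in ("0", "false")

    u_val = attrs.get("u")
    if u_val is not None and u_val != "none":
        props["underline"] = u_val

    size_val = attrs.get("sz")
    if size_val is not None:
        # Writing `val = int(val) / 2` on a reused Optional[str] name would be
        # rejected by mypy; a separate float-typed name avoids that.
        size_pt = int(size_val) / 2  # w:sz is expressed in half-points
        props["size"] = size_pt

    return props


result = extract_run_properties({"b": "", "u": "single", "sz": "24"})
print(result)
```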

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

Add tests/documents/DocxFile/ with source.docx and test.docx fixtures
representing a short contract with deliberate differences:
  - Text change: "hereby agrees" → "consents"
  - Formatting-only: "in good faith" italic → bold+italic
  - Paragraph style: Normal → Heading2
  - Alignment: center → left, color removed
  - New paragraph appended

13 new tests in TestDocxFileComparison verify each change type is
detected, formatting_changes summaries are correct, JSON structure
is valid, and token coverage spans the full document.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

The existing `# type: ignore[no-untyped-call]` comments didn't cover
the `untyped-decorator` error code that newer mypy versions emit for
`@cli.command()`. Add both codes so mypy passes cleanly.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

Formatting-only changes (same text, different formatting) now render
distinctly from text changes in all output formats:

Markdown (default style):
  - Formatting-only: blue text + superscript annotation
    e.g. <span style='color:blue;...'>text</span><sup>[+bold]</sup>
  - Text+formatting: standard del/ins + annotation
  - All 7 markdown styles (red-green, none, red, ghfm, bbcode,
    streamlit, custom_css) include fmt/fmt_note tag pairs

Rich (terminal):
  - Formatting-only: bold blue text + blue annotation
  - Text+formatting: red strikethrough + green + blue annotation

The _describe_formatting_changes() helper produces compact labels:
  "+bold", "-italic", "style: Normal→Heading2",
  "alignment: center→left", "-color: FF0000"
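
A helper producing labels of this shape might look like the sketch below. The rules are inferred from the example labels only (boolean toggles become "+key" or "-key", changed values become "key: old→new", removed values become "-key: value"); the "+key: value" form for added values and the alphabetical ordering are assumptions, so the actual `_describe_formatting_changes()` may behave differently.

```python
def describe_formatting_changes(source, test):
    """Return compact labels for the differences between two formatting dicts."""
    labels = []
    for key in sorted(set(source) | set(test)):
        old, new = source.get(key), test.get(key)
        if old == new:
            continue
        if isinstance(old, bool) or isinstance(new, bool):
            # Toggle properties: on is "+key", off or missing is "-key".
            labels.append(("+" if new else "-") + key)
        elif old is None:
            labels.append(f"+{key}: {new}")  # assumed form for added values
        elif new is None:
            labels.append(f"-{key}: {old}")
        else:
            labels.append(f"{key}: {old}→{new}")
    return labels


labels = describe_formatting_changes(
    {"italic": True, "style": "Normal", "alignment": "center", "color": "FF0000"},
    {"bold": True, "style": "Heading2", "alignment": "left"},
)
print(labels)
```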

Also adds 8 new tests for markdown/rich output behavior on the
fixture files.

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz

Follow the same pattern as ins_class/del_class: the custom_css markdown
style now reads fmt_class and fmt_note_class from RedlinesOptions,
defaulting to 'redline-formatting' and 'redline-formatting-note'.
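
As a sketch of how class-based rendering could work, the example below uses a small stand-in options object. The two default class names come from the description above; the `FormatOptions` dataclass, the `render_formatting_change` helper, and the choice of `span`/`sup` tags are assumptions for illustration rather than the library's actual `RedlinesOptions` or markup.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class FormatOptions:
    """Hypothetical stand-in for the two formatting-related options."""
    fmt_class: str = "redline-formatting"
    fmt_note_class: str = "redline-formatting-note"


def render_formatting_change(
    text: str, labels: List[str], options: Optional[FormatOptions] = None
) -> str:
    """Wrap the text and its annotation in elements carrying the CSS classes."""
    opts = options or FormatOptions()
    note = ", ".join(labels)
    return (
        f"<span class='{opts.fmt_class}'>{escape(text)}</span>"
        f"<sup class='{opts.fmt_note_class}'>[{escape(note)}]</sup>"
    )


default_html = render_formatting_change("in good faith", ["+bold"])
custom_html = render_formatting_change(
    "in good faith", ["+bold"], FormatOptions(fmt_class="my-fmt")
)
print(default_html)
```

Keeping the class names in an options object means a stylesheet can restyle formatting-only changes without touching the rendering code.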

https://claude.ai/code/session_01Aj4QM8K7w5SXTzK2cdPDCz