chazeon (Owner) commented Dec 13, 2025

This PR implements a full pipeline that extracts the table of contents (TOC) from Typst, Markdown, and PDF files, converts it to XTC chapters, and preserves those chapters when XTC files are concatenated.

Changes

Pipeline Refactoring (Typst/Markdown → PDF → Images)

  • Switch TypstFileAsset to compile to PDF instead of PNG
  • Switch MarkdownFileAsset to compile to PDF instead of PNG
  • Both now delegate to PDFAsset for rendering and TOC extraction
  • Enables consistent TOC extraction across all document types

Chapter Extraction and Propagation

  • Extract chapters from XTC files when loading (XTContainerAsset)
  • Attach chapter metadata to frames for downstream processing
  • Update extract_chapters_from_toc() to handle both:
    • TOC metadata (from PDF/Typst/Markdown headings)
    • Chapter metadata (from existing XTC files)
  • Automatically adjust page numbers when concatenating
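
The propagation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Frame` and `Chapter` dataclasses, their field names, and this body of `extract_chapters_from_toc()` are all assumptions. The idea is that once metadata travels with each frame, a chapter's page number is simply the frame's position in the concatenated list, so no separate offset bookkeeping is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    start_page: int  # 0-based page index in the output (assumed convention)


@dataclass
class Frame:
    # Hypothetical per-page metadata; the real field names may differ.
    toc: list[str] = field(default_factory=list)       # headings starting on this page
    chapters: list[str] = field(default_factory=list)  # chapter titles read from an XTC file


def extract_chapters_from_toc(frames: list[Frame]) -> list[Chapter]:
    """Collect chapters from frames, whichever kind of metadata they carry."""
    found = []
    for page_index, frame in enumerate(frames):
        # The frame's position in the concatenated list is its new page
        # number, so chapters from later inputs are shifted automatically.
        for title in frame.toc + frame.chapters:
            found.append(Chapter(title, page_index))
    return found


# A 3-page document with TOC headings, followed by a 2-page XTC file.
pdf_frames = [Frame(toc=["Intro"]), Frame(), Frame(toc=["Methods"])]
xtc_frames = [Frame(chapters=["Appendix"]), Frame()]
result = extract_chapters_from_toc(pdf_frames + xtc_frames)
```

Here "Appendix" started on page 0 of its own file and lands on page 3 of the combined output.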

XTC Format Fix

  • Ensure metadata section is written when chapters are present
  • Chapter count is stored in metadata, so metadata must exist
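
As a rough sketch of the rule behind this fix, the writer's decision could look like the function below. The `Book` model and `needs_metadata_section()` are hypothetical names used only for illustration; the actual writer in this repository may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    # Hypothetical container model; field names are illustrative only.
    title: str = ""
    author: str = ""
    chapters: list[str] = field(default_factory=list)


def needs_metadata_section(book: Book) -> bool:
    """Return True when the writer must emit a metadata section."""
    has_text_metadata = bool(book.title or book.author)
    # The chapter count lives in the metadata section, so chapters alone
    # are enough to require it even when title and author are empty.
    return has_text_metadata or bool(book.chapters)
```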

Tests

  • Add tests/test_toc_to_xtc.py: End-to-end TOC → XTC pipeline tests
  • Add tests/test_xtc_concat_chapters.py: Chapter preservation tests
  • All 79 tests pass

Features

✅ TOC extraction from Typst/Markdown/PDF to XTC chapters
✅ Chapter preservation when concatenating XTC files
✅ Page number adjustment for concatenated chapters
✅ Mixed source support (XTC + PDF/Typst/Markdown)
✅ Page subsetting still works with new PDF pipeline

🤖 Generated with Claude Code

chazeon and others added 2 commits December 13, 2025 00:20
- TypstFileAsset now returns PDFAsset instead of list[ImageAsset]
- MarkdownFileAsset now returns PDFAsset instead of list[ImageAsset]
- Assets are now truly atomic - each converts to exactly one next stage
- CLI stack automatically chains conversions for complete pipeline flow
- Enables future parallelization of asset conversions
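
One way to picture the atomic pattern is a set of stages that each know only their immediate successor, plus a driver loop that chains them. The class names below (`TypstFile`, `PDF`, `Images`, `Frames`) and the `convert()` / `run_pipeline()` API are simplified stand-ins assumed for this sketch, not the project's actual classes, and the "compilation" step is faked with a string split.

```python
class Asset:
    """Base class: each asset converts to exactly one next stage."""

    def convert(self):
        raise NotImplementedError


class Frames(Asset):
    """Terminal stage; nothing left to convert."""

    def __init__(self, pages):
        self.pages = pages


class Images(Asset):
    def __init__(self, pages):
        self.pages = pages

    def convert(self):
        return Frames(self.pages)


class PDF(Asset):
    def __init__(self, pages):
        self.pages = pages

    def convert(self):
        return Images(self.pages)


class TypstFile(Asset):
    def __init__(self, source):
        self.source = source

    def convert(self):
        # Stand-in for compilation: one page per form-feed-separated chunk.
        return PDF(self.source.split("\f"))


def run_pipeline(asset):
    """Chain conversions until the terminal Frames stage is reached."""
    stages = [type(asset).__name__]
    while not isinstance(asset, Frames):
        asset = asset.convert()
        stages.append(type(asset).__name__)
    return asset, stages


frames, stages = run_pipeline(TypstFile("page one\fpage two"))
```

Because no stage skips ahead, the driver is the only place that knows the full chain, which is what would make per-asset parallelization straightforward later.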

Fix XTC chapter reading bug
- XTCReader._read_chapters was called with has_chapters (0/1) instead of chapter_count
- Now correctly reads chapter_count from metadata section
- Fixes bug where only 1 chapter was read regardless of actual count
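
The effect of this bug can be reproduced with a toy reader. The binary layout here (one little-endian `uint32` start page per chapter) and the standalone `read_chapters()` function are assumptions made for the example; the real XTC chapter records and `XTCReader._read_chapters` will differ.

```python
import struct


def read_chapters(buf: bytes, count: int) -> list[int]:
    """Read `count` chapter start pages from the front of `buf`."""
    # Assumed toy layout: each chapter is a little-endian uint32 start page.
    return list(struct.unpack_from(f"<{count}I", buf, 0))


chapter_count = 3
has_chapters = 1 if chapter_count else 0  # a 0/1 flag, not a count
buf = struct.pack("<3I", 0, 5, 12)

buggy = read_chapters(buf, has_chapters)   # passes the flag: reads one entry
fixed = read_chapters(buf, chapter_count)  # passes the count: reads all three
```

Passing the flag does not raise an error, which is why the truncation to a single chapter was easy to miss.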

Update tests for atomic pipeline
- Updated 19 tests across 4 test files to follow atomic pipeline pattern
- Tests now explicitly chain: asset → PDF → Images → Frames
- All 79 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>