-
Notifications
You must be signed in to change notification settings - Fork 1
Add Hybrid and Markdown Chunking Strategies with Automatic Fallback #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implement a structure-aware chunking strategy that improves RAG retrieval accuracy for markdown documents by: - Parsing markdown structure (headings, code blocks, lists, blockquotes, tables) - Preserving heading context hierarchy in each chunk - Applying two-pass refinement: split oversized chunks at natural boundaries, merge undersized consecutive chunks that share the same heading context New files: - src/hybrid_chunking.c: Core implementation - test/sql/hybrid_chunking.sql: Comprehensive test suite - test/expected/hybrid_chunking.out: Expected test output Usage: SELECT pgedge_vectorizer.chunk_text(content, 'hybrid', 400, 50);
- Implement pure `markdown` chunking strategy (structure-aware, no refinement) - Add `is_likely_markdown()` detection function that checks for: - Headings (# syntax) - Code fences (``` or ~~~) - Lists, blockquotes, tables, links - Both `hybrid` and `markdown` strategies now automatically fall back to `token_based` chunking when content doesn't appear to be markdown - This avoids unnecessary overhead for plain text documents New features: - `chunk_markdown()` function for simpler structure-aware chunking - Automatic content detection with configurable thresholds - Fallback ensures optimal strategy regardless of content type Updated documentation with strategy comparison and usage examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces two new structure-aware chunking strategies (hybrid and markdown) for markdown documents, with automatic fallback to token-based chunking for plain text. The implementation is inspired by Docling's hybrid chunking approach and aims to improve RAG retrieval accuracy by preserving document structure and heading context.
Changes:
- Added
hybridstrategy with two-pass refinement (splits oversized chunks, merges undersized chunks with same heading context) - Added
markdownstrategy for simpler structure-aware chunking without refinement - Implemented automatic fallback detection that switches to token-based chunking for non-markdown content
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/hybrid_chunking.c | Implements hybrid and markdown chunking strategies with markdown parsing, structure detection, and refinement passes |
| src/pgedge_vectorizer.h | Adds enum types and struct definitions for markdown elements and hybrid chunks |
| src/chunking.c | Registers new strategies in parse_chunk_strategy and routes them in chunk_text |
| test/sql/hybrid_chunking.sql | Comprehensive test suite covering markdown parsing, heading context, fallback behavior, and edge cases |
| test/expected/hybrid_chunking.out | Expected test output for the new chunking strategies |
| docs/configuration.md | Documents new strategies, automatic fallback, and usage examples |
| docs/changelog.md | Adds changelog entries for the new features |
| Makefile | Adds hybrid_chunking.c to build and test targets |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The fallback logic was calling chunk_text() which would dispatch back to chunk_markdown/chunk_hybrid based on the config strategy, causing infinite recursion when plain text was detected. Fixed by: - Exposing chunk_by_tokens() (removed static, added to header) - Calling chunk_by_tokens() directly in fallback cases This ensures plain text is chunked with token-based strategy without re-entering the strategy dispatcher.
Replace tests that output specific chunk content with boolean aggregate checks (bool_or/bool_and) to make tests more robust and avoid failures due to minor output format differences. All 20 tests now return simple t/f boolean results making them easier to maintain and less fragile.
- Fix bounds check order in heading detection to avoid accessing pos[6] - Add null terminator checks before accessing pos[1] and pos[2] in code fence detection - Add null terminator checks in list item detection for both unordered and ordered lists - Improve is_list_item function with clearer bounds checking These fixes address potential security issues flagged by static analysis where array indices were accessed before verifying they were within bounds.
PostgreSQL 18 outputs column headers with trailing spaces. Update the expected output file to match this format, which is consistent with how other test expected files handle column headers.
Replace hardcoded magic number 6 with a named constant MAX_HEADING_LEVELS to improve code maintainability and clarity. This addresses the code review feedback about the hardcoded value appearing in multiple places.
- Remove unused variables (stack_depth, content_len) - Move variable declarations to start of blocks for C89 compatibility - Wrap code blocks to avoid declarations after statements 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. 📝 WalkthroughWalkthroughThis PR introduces hybrid and markdown-aware chunking strategies for text content. A new module implements structure-aware parsing of markdown documents with two-pass refinement (splitting oversized, merging undersized chunks) and automatic fallback to token-based chunking for plain text. Documentation and comprehensive SQL tests are included. Changes
Sequence DiagramsequenceDiagram
participant Client as Application
participant Dispatcher as chunk_text<br/>(chunking.c)
participant Detector as is_likely_markdown<br/>(hybrid_chunking.c)
participant Parser as parse_markdown_structure<br/>(hybrid_chunking.c)
participant Refiner as Refinement Passes<br/>(hybrid_chunking.c)
participant Fallback as chunk_by_tokens<br/>(token-based)
Client->>Dispatcher: chunk_text(content, strategy)
alt Strategy == HYBRID
Dispatcher->>Detector: Check if markdown
alt Is Markdown
Detector-->>Dispatcher: true
Dispatcher->>Parser: Parse markdown elements
Parser-->>Dispatcher: MarkdownElement list
Dispatcher->>Refiner: Pass 1: Split oversized
Refiner->>Refiner: Pass 2: Merge undersized
Refiner-->>Dispatcher: Refined HybridChunk array
else Not Markdown
Detector-->>Dispatcher: false
Dispatcher->>Fallback: Fallback to token-based
Fallback-->>Dispatcher: Text array
end
else Strategy == MARKDOWN
Dispatcher->>Detector: Check if markdown
alt Is Markdown
Dispatcher->>Parser: Parse markdown structure
Parser-->>Dispatcher: Convert to chunks (no refinement)
else Not Markdown
Dispatcher->>Fallback: Fallback to token-based
end
end
Dispatcher-->>Client: ArrayType (text array)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (7)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This PR introduces two new structure-aware chunking strategies inspired by Docling's hybrid chunking approach, significantly improving RAG retrieval accuracy for markdown documents.
Key features:
hybrid strategy: Full two-pass refinement that splits oversized chunks and merges undersized consecutive chunks sharing the same heading context
markdown strategy: Simpler structure-aware chunking without refinement passes
Automatic fallback: Both strategies detect plain text and fall back to token_based chunking to avoid unnecessary overhead
Why This Matters
Traditional token-based chunking destroys semantic context by splitting text at arbitrary boundaries. The new strategies:
Preserve document structure - Headings, code blocks, tables, and lists stay intact
Maintain heading hierarchy - Each chunk includes context like [Context: # Chapter 1 > ## Section 1.1]
Optimize chunk sizes - The hybrid strategy merges small chunks and splits large ones intelligently
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.