@dpage (Member) commented Jan 11, 2026

This PR introduces two new structure-aware chunking strategies inspired by Docling's hybrid chunking approach. Both aim to improve RAG retrieval accuracy for markdown documents.

Key features:

- `hybrid` strategy: Full two-pass refinement that splits oversized chunks and merges undersized consecutive chunks sharing the same heading context
- `markdown` strategy: Simpler structure-aware chunking without refinement passes
- Automatic fallback: Both strategies detect plain text and fall back to `token_based` chunking to avoid unnecessary overhead

Why This Matters

Traditional token-based chunking splits text at arbitrary token boundaries, which can separate content from the headings that give it meaning. The new strategies:

- Preserve document structure: headings, code blocks, tables, and lists stay intact
- Maintain heading hierarchy: each chunk includes context such as `[Context: # Chapter 1 > ## Section 1.1]`
- Optimize chunk sizes: the `hybrid` strategy merges undersized chunks and splits oversized ones at natural boundaries
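To make the heading context concrete, here is a minimal C sketch that assembles a `[Context: ...]` prefix from the headings currently in scope. The helper name `build_context_prefix` and the array-of-strings input are assumptions made for illustration; the actual representation in `src/hybrid_chunking.c` may differ.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the PR): join the active headings,
 * outermost first, into "[Context: h1 > h2 > ...]".
 * Returns the number of characters written, or -1 if buf is too small.
 */
int
build_context_prefix(const char *const *headings, int depth,
                     char *buf, size_t buflen)
{
    size_t used;
    int    i;
    int    n;

    n = snprintf(buf, buflen, "[Context: ");
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    used = (size_t) n;

    for (i = 0; i < depth; i++)
    {
        /* Separate every heading after the first with " > ". */
        n = snprintf(buf + used, buflen - used, "%s%s",
                     i > 0 ? " > " : "", headings[i]);
        if (n < 0 || (size_t) n >= buflen - used)
            return -1;
        used += (size_t) n;
    }

    n = snprintf(buf + used, buflen - used, "]");
    if (n < 0 || (size_t) n >= buflen - used)
        return -1;
    used += (size_t) n;

    return (int) used;
}
```

Called with the headings `"# Chapter 1"` and `"## Section 1.1"`, the sketch produces the prefix format shown in the list above.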

Summary by CodeRabbit

  • New Features

    • Hybrid chunking strategy for structured documents with heading context preservation and two-pass refinement to improve retrieval accuracy
    • Markdown chunking strategy for markdown content offering structure-aware parsing without refinement passes
    • Automatic fallback detection that selects the appropriate strategy based on content type
  • Documentation

    • Added configuration guide for chunking strategies with use case recommendations and examples
  • Tests

    • Added comprehensive regression test suite for new chunking strategies


Implement a structure-aware chunking strategy that improves RAG retrieval
accuracy for markdown documents by:

- Parsing markdown structure (headings, code blocks, lists, blockquotes, tables)
- Preserving heading context hierarchy in each chunk
- Applying two-pass refinement: split oversized chunks at natural boundaries,
  merge undersized consecutive chunks that share the same heading context

New files:
- src/hybrid_chunking.c: Core implementation
- test/sql/hybrid_chunking.sql: Comprehensive test suite
- test/expected/hybrid_chunking.out: Expected test output

Usage: SELECT pgedge_vectorizer.chunk_text(content, 'hybrid', 400, 50);
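The merge half of the two-pass refinement can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SketchChunk` is a simplified stand-in for `HybridChunk`, the `min_tokens` and `max_tokens` thresholds are hypothetical parameters, and the merge rule (merge when either neighbour is undersized, the contexts match, and the result still fits) is one plausible reading of the description above.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for the PR's HybridChunk struct. */
typedef struct SketchChunk
{
    const char *context;    /* heading context, e.g. "# A > ## B" */
    int         tokens;     /* token count of the chunk body */
} SketchChunk;

/*
 * Merge pass (assumed rule): fold a chunk into its predecessor when either
 * of the two is below min_tokens, both share the same heading context, and
 * the combined size still fits in max_tokens. Only token counts are tracked
 * here; real code would also concatenate the chunk text.
 * Compacts the array in place and returns the new chunk count.
 */
int
merge_undersized(SketchChunk *chunks, int n, int min_tokens, int max_tokens)
{
    int out = 0;            /* index of the last chunk kept so far */
    int i;

    if (n <= 0)
        return 0;

    for (i = 1; i < n; i++)
    {
        SketchChunk *prev = &chunks[out];
        SketchChunk *cur = &chunks[i];
        int undersized = prev->tokens < min_tokens || cur->tokens < min_tokens;
        int same_ctx = strcmp(prev->context, cur->context) == 0;

        if (undersized && same_ctx && prev->tokens + cur->tokens <= max_tokens)
            prev->tokens += cur->tokens;    /* merge into predecessor */
        else
            chunks[++out] = *cur;           /* keep as a separate chunk */
    }

    return out + 1;
}
```

Requiring an identical context keeps a merged chunk from straddling two sections, so the heading prefix attached to it stays accurate.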

- Implement pure `markdown` chunking strategy (structure-aware, no refinement)
- Add `is_likely_markdown()` detection function that checks for:
  - Headings (# syntax)
  - Code fences (``` or ~~~)
  - Lists, blockquotes, tables, links
- Both `hybrid` and `markdown` strategies now automatically fall back to
  `token_based` chunking when content doesn't appear to be markdown
- This avoids unnecessary overhead for plain text documents
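A detection heuristic along these lines might look like the sketch below. It is not the PR's `is_likely_markdown()`: the function names, the line-counting approach, and the `threshold` parameter are assumptions, and only headings, code fences, unordered list items, and blockquotes are checked (ordered lists, tables, and links are left out to keep it short).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical: does the line starting at p open with a markdown marker? */
static bool
line_has_markdown_marker(const char *p)
{
    int i = 0;

    /* Allow up to three leading spaces before the marker. */
    while (i < 3 && p[i] == ' ')
        i++;
    p += i;

    /* Heading: one to six '#' characters followed by a space. */
    if (*p == '#')
    {
        int level = 0;

        while (level < 6 && p[level] == '#')
            level++;
        return p[level] == ' ';
    }

    /* Code fence: ``` or ~~~ (strncmp stops at the terminator). */
    if (strncmp(p, "```", 3) == 0 || strncmp(p, "~~~", 3) == 0)
        return true;

    /* Unordered list item: p[0] is non-NUL here, so p[1] is readable. */
    if ((*p == '-' || *p == '*' || *p == '+') && p[1] == ' ')
        return true;

    /* Blockquote. */
    return *p == '>';
}

/* Returns true once at least `threshold` lines carry a markdown marker. */
bool
looks_like_markdown(const char *text, int threshold)
{
    const char *line = text;
    int         hits = 0;

    while (line != NULL && *line != '\0')
    {
        const char *nl;

        if (line_has_markdown_marker(line) && ++hits >= threshold)
            return true;

        nl = strchr(line, '\n');
        line = (nl != NULL) ? nl + 1 : NULL;
    }

    return false;
}
```

A threshold above 1 keeps a single stray `#` or `>` in otherwise plain text from triggering the structure-aware path.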

New features:
- `chunk_markdown()` function for simpler structure-aware chunking
- Automatic content detection with configurable thresholds
- Fallback ensures optimal strategy regardless of content type

Updated documentation with strategy comparison and usage examples.

@dpage requested a review from Copilot January 11, 2026 01:07
Copilot AI left a comment

Pull request overview

This PR introduces two new structure-aware chunking strategies (hybrid and markdown) for markdown documents, with automatic fallback to token-based chunking for plain text. The implementation is inspired by Docling's hybrid chunking approach and aims to improve RAG retrieval accuracy by preserving document structure and heading context.

Changes:

  • Added hybrid strategy with two-pass refinement (splits oversized chunks, merges undersized chunks with same heading context)
  • Added markdown strategy for simpler structure-aware chunking without refinement
  • Implemented automatic fallback detection that switches to token-based chunking for non-markdown content

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/hybrid_chunking.c | Implements hybrid and markdown chunking strategies with markdown parsing, structure detection, and refinement passes |
| src/pgedge_vectorizer.h | Adds enum types and struct definitions for markdown elements and hybrid chunks |
| src/chunking.c | Registers new strategies in parse_chunk_strategy and routes them in chunk_text |
| test/sql/hybrid_chunking.sql | Comprehensive test suite covering markdown parsing, heading context, fallback behavior, and edge cases |
| test/expected/hybrid_chunking.out | Expected test output for the new chunking strategies |
| docs/configuration.md | Documents new strategies, automatic fallback, and usage examples |
| docs/changelog.md | Adds changelog entries for the new features |
| Makefile | Adds hybrid_chunking.c to build and test targets |

claude and others added 6 commits January 11, 2026 01:12
The fallback logic was calling chunk_text() which would dispatch back
to chunk_markdown/chunk_hybrid based on the config strategy, causing
infinite recursion when plain text was detected.

Fixed by:
- Exposing chunk_by_tokens() (removed static, added to header)
- Calling chunk_by_tokens() directly in fallback cases

This ensures plain text is chunked with the token-based strategy without
re-entering the strategy dispatcher.
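The shape of the recursion fix can be shown with a stripped-down sketch. Everything prefixed `sketch_` is hypothetical: the real functions take PostgreSQL types and return arrays, whereas these return a label so the control flow is easy to follow.

```c
#include <stdbool.h>

typedef enum
{
    STRATEGY_TOKEN_BASED,
    STRATEGY_MARKDOWN
} SketchStrategy;

/* Stand-in for the configured strategy that the dispatcher reads. */
static SketchStrategy configured_strategy = STRATEGY_MARKDOWN;

/* Toy detector (assumption): treat a leading '#' as markdown. */
static bool
sketch_is_markdown(const char *text)
{
    return text[0] == '#';
}

/* Exposed (non-static) so the fallback path can call it directly. */
const char *
sketch_chunk_by_tokens(const char *text)
{
    (void) text;
    return "token_based";
}

const char *
sketch_chunk_markdown(const char *text)
{
    if (!sketch_is_markdown(text))
    {
        /*
         * The buggy version called the dispatcher here. With the configured
         * strategy still set to markdown, the dispatcher routed straight
         * back into this function and never terminated.
         */
        return sketch_chunk_by_tokens(text);
    }

    return "markdown";
}

/* Dispatcher: picks an implementation from the configured strategy. */
const char *
sketch_chunk_text(const char *text)
{
    switch (configured_strategy)
    {
        case STRATEGY_MARKDOWN:
            return sketch_chunk_markdown(text);
        case STRATEGY_TOKEN_BASED:
        default:
            return sketch_chunk_by_tokens(text);
    }
}
```

Calling the token-based chunker directly keeps the fallback independent of whatever strategy is configured.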

Replace tests that output specific chunk content with boolean aggregate
checks (bool_or/bool_and) to make tests more robust and avoid failures
due to minor output format differences.

All 20 tests now return simple t/f boolean results, making them easier
to maintain and less fragile.

- Fix bounds check order in heading detection to avoid accessing pos[6]
- Add null terminator checks before accessing pos[1] and pos[2] in code fence detection
- Add null terminator checks in list item detection for both unordered and ordered lists
- Improve is_list_item function with clearer bounds checking

These fixes address potential security issues flagged by static analysis where
array indices were accessed before verifying they were within bounds.
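The pattern behind these fixes is to test the bound (or the terminator) before dereferencing. Below is a minimal sketch, assuming `pos` points at a NUL-terminated string; the function names are hypothetical and the PR's actual checks may be structured differently.

```c
#include <stdbool.h>

#define MAX_HEADING_LEVELS 6

/*
 * Returns the heading level (1..MAX_HEADING_LEVELS), or 0 if pos does not
 * start a heading. Simplification: a space or tab must follow the '#' run.
 */
int
heading_level(const char *pos)
{
    int level = 0;

    /*
     * Bound first, dereference second: once level reaches
     * MAX_HEADING_LEVELS, && short-circuits and the loop stops reading.
     */
    while (level < MAX_HEADING_LEVELS && pos[level] == '#')
        level++;

    if (level == 0)
        return 0;

    /*
     * pos[0..level-1] are all '#', so pos[level] is still inside the
     * string; at worst it is the terminator.
     */
    return (pos[level] == ' ' || pos[level] == '\t') ? level : 0;
}

/* True if pos starts a ``` or ~~~ code fence. */
bool
is_code_fence(const char *pos)
{
    char c = pos[0];

    if (c != '`' && c != '~')
        return false;

    /*
     * c is non-NUL, so pos[1] is readable. If pos[1] is the terminator it
     * cannot equal c, and && stops before pos[2] is touched.
     */
    return pos[1] == c && pos[2] == c;
}
```

With this ordering, a run of seven `#` characters is rejected rather than read past the limit, and a lone backtick at the end of the input never causes a read beyond the terminator.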

PostgreSQL 18 outputs column headers with trailing spaces. Update the
expected output file to match this format, which is consistent with
how other test expected files handle column headers.

Replace hardcoded magic number 6 with a named constant MAX_HEADING_LEVELS
to improve code maintainability and clarity. This addresses the code
review feedback about the hardcoded value appearing in multiple places.

- Remove unused variables (stack_depth, content_len)
- Move variable declarations to start of blocks for C89 compatibility
- Wrap code blocks to avoid declarations after statements
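For readers unfamiliar with the C89 rule, the "wrap in a block" technique looks roughly like this. The function is a made-up example, not code from the PR.

```c
/*
 * Hypothetical example: sum values[skip..n-1], clamping skip to [0, n].
 */
int
sum_after_skip(const int *values, int n, int skip)
{
    int total = 0;      /* C89: declarations open the block */

    /* These are statements; C89 allows no bare declaration after them. */
    if (skip < 0)
        skip = 0;
    if (skip > n)
        skip = n;

    {
        /* A nested block starts a new scope, so declaring here is legal. */
        int i;

        for (i = skip; i < n; i++)
            total += values[i];
    }

    return total;
}
```

The nested braces add no runtime cost; they only give the compiler a new block whose first lines may be declarations.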

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
coderabbitai bot commented Jan 13, 2026

Caution

Review failed

The pull request is closed.

Walkthrough

This PR introduces hybrid and markdown-aware chunking strategies for text content. A new module implements structure-aware parsing of markdown documents with two-pass refinement (splitting oversized, merging undersized chunks) and automatic fallback to token-based chunking for plain text. Documentation and comprehensive SQL tests are included.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build Configuration: Makefile | Added src/hybrid_chunking.o to object list and hybrid_chunking to regression test targets |
| Documentation Updates: docs/changelog.md, docs/configuration.md | Added Unreleased changelog entries describing hybrid and markdown strategies; expanded configuration guide with strategy definitions, use-case mapping, and example usage blocks |
| Core Chunking Module: src/chunking.c | Exported chunk_by_tokens (removed static); added support for CHUNK_STRATEGY_HYBRID and CHUNK_STRATEGY_MARKDOWN in strategy parser and dispatcher |
| Hybrid Chunking Implementation: src/hybrid_chunking.c | New module implementing markdown detection, structure-aware parsing (elements: headings, code blocks, lists, blockquotes, tables), hierarchical heading context tracking, and two-pass chunk refinement (split oversized via token boundaries; merge undersized with identical context) |
| Public API Extensions: src/pgedge_vectorizer.h | Added CHUNK_STRATEGY_HYBRID enum value; new enums: MarkdownElementType; new structs: MarkdownElement, HybridChunk; new function declarations: chunk_by_tokens, chunk_hybrid, chunk_markdown, parse_markdown_structure, free_markdown_elements |
| Test Suite: test/sql/hybrid_chunking.sql | New SQL test file with 20 tests covering hybrid/markdown strategies, heading context preservation, code block integrity, plain text fallback, nested structures, and token-based behavior validation |

Sequence Diagram

sequenceDiagram
    participant Client as Application
    participant Dispatcher as chunk_text<br/>(chunking.c)
    participant Detector as is_likely_markdown<br/>(hybrid_chunking.c)
    participant Parser as parse_markdown_structure<br/>(hybrid_chunking.c)
    participant Refiner as Refinement Passes<br/>(hybrid_chunking.c)
    participant Fallback as chunk_by_tokens<br/>(token-based)

    Client->>Dispatcher: chunk_text(content, strategy)
    alt Strategy == HYBRID
        Dispatcher->>Detector: Check if markdown
        alt Is Markdown
            Detector-->>Dispatcher: true
            Dispatcher->>Parser: Parse markdown elements
            Parser-->>Dispatcher: MarkdownElement list
            Dispatcher->>Refiner: Pass 1: Split oversized
            Refiner->>Refiner: Pass 2: Merge undersized
            Refiner-->>Dispatcher: Refined HybridChunk array
        else Not Markdown
            Detector-->>Dispatcher: false
            Dispatcher->>Fallback: Fallback to token-based
            Fallback-->>Dispatcher: Text array
        end
    else Strategy == MARKDOWN
        Dispatcher->>Detector: Check if markdown
        alt Is Markdown
            Dispatcher->>Parser: Parse markdown structure
            Parser-->>Dispatcher: Convert to chunks (no refinement)
        else Not Markdown
            Dispatcher->>Fallback: Fallback to token-based
        end
    end
    Dispatcher-->>Client: ArrayType (text array)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hop through structures, chunk by chunk,
Headings preserved, no context sunk,
Markdown wisdom in two-pass refine,
Plain text falls back, all chunks align!

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 373e3ab and ce4b90a.

⛔ Files ignored due to path filters (1)
  • test/expected/hybrid_chunking.out is excluded by !**/*.out
📒 Files selected for processing (7)
  • Makefile
  • docs/changelog.md
  • docs/configuration.md
  • src/chunking.c
  • src/hybrid_chunking.c
  • src/pgedge_vectorizer.h
  • test/sql/hybrid_chunking.sql

@dpage dpage merged commit 9210853 into main Jan 13, 2026
7 of 9 checks passed
3 participants