fix(chunking): Fallback to StringChunker for Tree-sitter nodes with no children #145
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #145   +/-   ##
=======================================
  Coverage   99.24%   99.24%
=======================================
  Files          22       22
  Lines        1457     1462    +5
=======================================
+ Hits         1446     1451    +5
  Misses         11       11
Hi, thanks for this PR! This is something that I realised a while ago but somehow forgot to add. I'll add some tests to this PR later (for the coverage) and then merge it. I noticed that in (neo)vim, the |
Ok, I just remembered why I decided to put this off... and this is actually a bug that should be fixed. The |
When a Tree-sitter node has no children, the TreeSitterChunker would previously not yield any chunks for its content. This change adds a check for nodes with no children and falls back to using the StringChunker on the node's text, ensuring the content is processed.
This is useful, for example, for files containing embedded content that the primary Tree-sitter parser can't process. In Vue 3 SFCs (.vue), for instance, you can have embedded JavaScript logic within a <script> tag followed by your HTML template. The parser will parse the HTML part but will only see one 'RawText' node for the embedded JavaScript code. If this part is longer than your configured chunk_size, the TreeSitterChunker won't return it at all, since this node has no children.
With this fix, the embedded content will fall back to naive chunking, which is probably far better than being completely ignored.
In my Neovim environment I somehow managed to get Tree-sitter to parse the embedded content in .vue files, but here, for the TreeSitterChunker, the Tree-sitter 'Vue' grammar apparently is not responsible for handling the embedded content (I guess my nvim-treesitter plugin does some magic to merge multiple Tree-sitter parsers for this kind of embedded content...).
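To make the behaviour described above concrete, here is a minimal sketch of the fallback. It is not the actual TreeSitterChunker code: `Node`, `chunk_string` and `chunk_node` are hypothetical stand-ins (the real chunker works on Tree-sitter nodes and delegates to the StringChunker), and the sketch skips details such as merging small sibling nodes or preserving text between children.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a Tree-sitter node (not the real node type)."""
    text: str
    children: List["Node"] = field(default_factory=list)


def chunk_string(text: str, chunk_size: int) -> Iterator[str]:
    """Naive fixed-size chunking, standing in for the StringChunker."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def chunk_node(node: Node, chunk_size: int) -> Iterator[str]:
    """Yield chunks of at most chunk_size characters for a node."""
    if len(node.text) <= chunk_size:
        # The whole node fits into a single chunk.
        if node.text:
            yield node.text
    elif not node.children:
        # Oversized node with no children: fall back to naive string
        # chunking instead of silently yielding nothing.
        yield from chunk_string(node.text, chunk_size)
    else:
        # Oversized node with children: descend into each child.
        for child in node.children:
            yield from chunk_node(child, chunk_size)


# A .vue-like file: the script body is one childless node of 130 characters.
script = Node(text="const a = 1; " * 10)
template = Node(text="<template><p>hi</p></template>")
root = Node(text=script.text + template.text, children=[script, template])

chunks = list(chunk_node(root, chunk_size=50))
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Without the `elif not node.children` branch, the loop over children would run zero times for the oversized script node and its content would be lost, which is the situation this PR addresses.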