Fix exclude metadata & embedding in llama-stack #50
base: main
Conversation
Walkthrough
Adds per-field metadata exclusion for embeddings and LLM usage by introducing `exclude_embed_metadata` and `exclude_llm_metadata` options, wiring them through DocumentProcessor → _Config → DB backends, adding metadata-removal helpers, `model_copy` wrapping, and an embedding-preparation step for the llama-stack path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DocProc as DocumentProcessor
    participant Config
    participant DB as _LlamaIndexDB/_LlamaStackDB
    participant EmbedSvc as Embeddings
    participant LLM
    User->>DocProc: init(..., exclude_embed_metadata, exclude_llm_metadata)
    DocProc->>Config: store exclusion lists
    User->>DocProc: add_docs(documents)
    DocProc->>DB: add_docs(documents, config)
    DB->>DB: split & prepare nodes
    DB->>DB: exclude_metadata(documents) %% remove specified keys & bind model_copy wrapper
    DB->>DB: attach llm_metadata (filtered) and embed_metadata (filtered)
    alt llama-stack + manual_chunking
        DB->>DB: _calculate_embeddings(documents) %% formats content using embed_metadata
        DB->>EmbedSvc: request embeddings (formatted contents)
        EmbedSvc-->>DB: embeddings
        DB->>DB: remove embed_metadata from docs after embedding
    else other paths
        DB->>EmbedSvc: embeddings requested downstream (embed metadata not included)
    end
    DB->>LLM: pass llm_metadata (filtered) for generation/indexing
    DB-->>DocProc: add_docs complete
```
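The per-field split the diagram describes can be sketched in plain Python. This is a minimal illustration with hypothetical names, not the PR's actual code: one metadata dict is filtered two different ways, one view for the embedding input and one for what the LLM sees.

```python
def split_metadata(metadata, exclude_embed=None, exclude_llm=None):
    """Build separate metadata views for the embedding and LLM paths."""
    def drop(keys):
        keys = set(keys or [])  # None means "exclude nothing"
        return {k: v for k, v in metadata.items() if k not in keys}
    return drop(exclude_embed), drop(exclude_llm)


node_metadata = {
    "title": "Install guide",
    "docs_url": "https://docs.example",
    "file_path": "/tmp/a.md",
}
embed_md, llm_md = split_metadata(
    node_metadata,
    exclude_embed=["file_path"],
    exclude_llm=["file_path", "docs_url"],
)
print(embed_md)  # {'title': 'Install guide', 'docs_url': 'https://docs.example'}
print(llm_md)    # {'title': 'Install guide'}
```

The key design point the diagram captures is that the two exclusion lists are independent: a key can feed the embedding while being hidden from the LLM, or vice versa.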
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_rag_content/document_processor.py (1)
401-417: Break long lines and handle Optional metadata lists in llama-stack `add_docs`

The lines building `embed_metadata`/`llm_metadata` are over 100 chars and also assume `exclude_*_metadata` are non-None. To address Ruff E501 and make the code robust to Optional config:

```diff
-        chunk_metadata = {
-            "document_id": node.ref_doc_id,
-            "chunk_id": node.id_,
-            "source": node.metadata.get("docs_url", node.metadata["title"]),
-        }
-        embed_metadata = self._remove_metadata(node.metadata, self.config.exclude_embed_metadata)
-        llm_metadata = self._remove_metadata(node.metadata, self.config.exclude_llm_metadata)
+        chunk_metadata = {
+            "document_id": node.ref_doc_id,
+            "chunk_id": node.id_,
+            "source": node.metadata.get(
+                "docs_url",
+                node.metadata["title"],
+            ),
+        }
+        embed_metadata = self._remove_metadata(
+            node.metadata,
+            self.config.exclude_embed_metadata,
+        )
+        llm_metadata = self._remove_metadata(
+            node.metadata,
+            self.config.exclude_llm_metadata,
+        )
```

Given the earlier `_remove_metadata` change, this will safely handle `None` while keeping lines under 100 chars.
🧹 Nitpick comments (5)
src/lightspeed_rag_content/utils.py (2)
19-22: DEFAULT_METADATA_EXCLUSSION name has a spelling typo

The constant name has a typo (`EXCLUSSION` vs `EXCLUSION`). It works functionally but may be confusing and propagate into other modules if imported. Consider renaming to `DEFAULT_METADATA_EXCLUSION` and updating internal references now while usage is still small.
45-57: CLI wiring for metadata exclusion looks correct; ensure lint passes

The updated default for `--exclude-metadata` and the new `--exclude-llm-metadata` flag align with the intended behaviour and with the README example (they will pass proper lists into `DocumentProcessor`). No functional issues here.

Ruff/Black are currently failing on this file for import formatting; please run Black/Ruff (or move `import argparse` to the top with a single blank line before constants) to get the checks passing.
src/lightspeed_rag_content/document_processor.py (3)
17-25: Fix import ordering to satisfy Ruff/Black

Ruff and Black are complaining about import formatting; `import types` is currently separated from the other stdlib imports. Grouping stdlib imports together will fix this. You can apply something like:

```diff
-import json
-import logging
-import os
-import tempfile
-import time
-from pathlib import Path
-from typing import TYPE_CHECKING, Any, Optional, Union
-import types
+import json
+import logging
+import os
+import tempfile
+import time
+import types
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Optional, Union
```

Black will handle blank lines between stdlib / third-party / local imports.
148-153: Re-running `exclude_metadata` on every `add_docs` is safe but a bit redundant

Calling `self.exclude_metadata(self._good_nodes)` after extending `_good_nodes` means previously processed nodes will have their exclusions and `model_copy` re-applied every time. It's functionally safe (idempotent assignment and `object.__setattr__`), but a minor inefficiency.

If you care about avoiding the repeated work, you could call `exclude_metadata` only on the `valid_nodes` you just added:

```diff
-        valid_nodes = self._split_and_filter(docs)
-        self._good_nodes.extend(valid_nodes)
-        self.exclude_metadata(self._good_nodes)
+        valid_nodes = self._split_and_filter(docs)
+        self._good_nodes.extend(valid_nodes)
+        self.exclude_metadata(valid_nodes)
```

Not critical, just a small clean-up.
498-529: DocumentProcessor metadata-exclusion parameters: defaults vs Optional semantics

Extending `DocumentProcessor.__init__` with `exclude_embed_metadata` and `exclude_llm_metadata` is consistent with the CLI and README usage. However, these parameters are Optional and passed through `_Config` as-is; lower layers (`_LlamaIndexDB` / `_LlamaStackDB`) previously assumed iterables.

With the `_remove_metadata`/`exclude_metadata` fixes suggested above, `None` is now handled safely and won't break direct `DocumentProcessor` instantiations that don't go through your CLI defaults.

Given that:
- CLI users get the desired behaviour via `DEFAULT_METADATA_EXCLUSSION`.
- Library users can opt in by explicitly passing lists.

No code changes are strictly required here beyond those lower-level fixes, but it's worth keeping this contract in mind if you later want `DocumentProcessor` to apply the same default list automatically.

If you want `DocumentProcessor` to always mirror the CLI defaults, you could coerce `None` to your default list at construction time and import the constant from `utils`:

```diff
+from lightspeed_rag_content.utils import DEFAULT_METADATA_EXCLUSSION
@@
         self.config = _Config(
@@
-            exclude_embed_metadata=exclude_embed_metadata,
-            exclude_llm_metadata=exclude_llm_metadata,
+            exclude_embed_metadata=exclude_embed_metadata or DEFAULT_METADATA_EXCLUSSION,
+            exclude_llm_metadata=exclude_llm_metadata or DEFAULT_METADATA_EXCLUSSION,
         )
```

(Adjust spelling of the constant if you rename it.)
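One caveat with the `or` coercion suggested above (general Python behaviour, nothing specific to this codebase): an explicitly passed empty list is falsy, so `value or DEFAULT` silently replaces it with the defaults, while a `None`-only check preserves the caller's intent. A small sketch (the constant value here is hypothetical):

```python
DEFAULT_METADATA_EXCLUSSION = ["file_path"]  # spelling as in the PR; value assumed

def coerce_or(value):
    """`or` coercion: treats [] and None the same."""
    return value or DEFAULT_METADATA_EXCLUSSION

def coerce_none_only(value):
    """Explicit None check: [] means 'exclude nothing'."""
    return DEFAULT_METADATA_EXCLUSSION if value is None else value

print(coerce_or(None))       # ['file_path']
print(coerce_or([]))         # ['file_path']  <- empty list silently overridden
print(coerce_none_only([]))  # []             <- explicit "exclude nothing" preserved
```

If an empty list is a meaningful opt-out for library users, the `is None` form is the safer default-coercion choice.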
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- README.md (1 hunks)
- src/lightspeed_rag_content/document_processor.py (10 hunks)
- src/lightspeed_rag_content/utils.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Black
src/lightspeed_rag_content/utils.py
[error] 1-1: Black formatting check failed. File would be reformatted by Black.
src/lightspeed_rag_content/document_processor.py
[error] 1-1: Black formatting check failed. File would be reformatted by Black.
🪛 GitHub Actions: Pyright
src/lightspeed_rag_content/document_processor.py
[error] 447-447: pyright: Cannot access attribute 'pop' for class 'str'. Attribute 'pop' is unknown.
[error] 454-454: pyright: Argument of type "Literal['content']" cannot be assigned to parameter "key" of type "SupportsIndex | slice[Any, Any, Any]" in function "__getitem__". Type "Literal['content']" is not assignable to type "SupportsIndex".
[error] 462-462: pyright: "__setitem__" method not defined on type "str" (reportIndexIssue).
[error] 482-482: pyright: Argument of type "list[dict[str, Any] | Unknown]" cannot be assigned to parameter "documents" of type "dict[str, Any]" in function "_calculate_embeddings". List type is not assignable to dict.
🪛 GitHub Actions: Ruff
src/lightspeed_rag_content/utils.py
[error] 16-16: I001 Import block is unsorted or un-formatted. Organize imports.
src/lightspeed_rag_content/document_processor.py
[error] 17-17: I001 Import block is unsorted or un-formatted. Organize imports.
[error] 38-38: I001 Import block is unsorted or un-formatted. Organize imports.
[error] 406-406: E501 Line too long (105 > 100).
[error] 407-407: E501 Line too long (101 > 100).
🪛 GitHub Actions: Type checks
src/lightspeed_rag_content/document_processor.py
[error] 195-195: Mypy error: Function is missing a type annotation [no-untyped-def]
[error] 202-202: Mypy error: Function is missing a type annotation for one or more arguments [no-untyped-def]
[error] 432-432: Mypy error: Function is missing a type annotation for one or more arguments [no-untyped-def]
[error] 454-454: Mypy error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice[Any, Any, Any]" [index]
[error] 462-462: Mypy error: Unsupported target for indexed assignment ("str") [index]
🪛 GitHub Actions: Unit tests
src/lightspeed_rag_content/document_processor.py
[error] 445-445: TypeError: '_SentinelObject' object is not iterable in _calculate_embeddings during manual chunking path (likely caused by incorrect documents input).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pylinter
- GitHub Check: build-and-push-dev
- GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (3)
src/lightspeed_rag_content/document_processor.py (2)
237-241: Embedding metadata templates for llama-stack look reasonableThe new
EMBEDDING_METADATA_SEPARATOR,EMBEDDING_METADATA_TEMPLATE, andEMBEDDING_TEMPLATEconstants give you a clear, deterministic way to flatten metadata into the embedding input. The simple text “key: value” lines separated by newlines should be easy to parse or evolve later if needed.No changes needed here.
421-431: Warning about automatic mode and metadata is useful

The warning log when `manual_chunking` is False clarifies that llama-stack automatic mode doesn't use metadata for embeddings, which is exactly the subtle behaviour this PR is addressing for the manual path.

No changes required here.
README.md (1)
128-138: README example correctly wires new metadata-exclusion parametersThe
custom_processor.pyexample now passesexclude_embed_metadata=args.exclude_metadataandexclude_llm_metadata=args.exclude_llm_metadataintoDocumentProcessor, matching the new CLI flags and constructor signature. This keeps the documentation aligned with the implementation.No changes needed here.
Force-pushed from 9fa1bbf to 7a65151
Actionable comments posted: 0
♻️ Duplicate comments (5)
src/lightspeed_rag_content/document_processor.py (5)
195-203: Add type annotations to satisfy mypy.

While this was flagged in past reviews, adding annotations here improves type safety:

```diff
 def _model_copy_excluding_llm_metadata(
     self, node: TextNode, *args: Any, **kwargs: Any
 ) -> TextNode:
     """Replace node's model_copy to remove metadata."""
     res = self.original_model_copy(node, *args, **kwargs)
-    res.metadata = self._remove_metadata(
-        res.metadata, node.excluded_llm_metadata_keys
-    )
+    remove_keys = getattr(node, "excluded_llm_metadata_keys", None)
+    res.metadata = self._remove_metadata(res.metadata, remove_keys)
     return res
```

Using `getattr` with a default makes it resilient if the attribute is missing.
205-236: LGTM with a note on potential `None` assignment.

The `exclude_metadata` method is well-documented with links to the relevant llama-index code. However, if `config.exclude_embed_metadata` or `config.exclude_llm_metadata` is `None`, assigning it directly to `doc.excluded_embed_metadata_keys` may not match llama-index expectations (which typically expect a list).

Consider guarding:

```python
if self.config.exclude_embed_metadata is not None:
    doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata
```
108-111: Make `_remove_metadata` robust to `None` to prevent `TypeError`.

The `remove` parameter can be `None` when `config.exclude_embed_metadata` or `config.exclude_llm_metadata` is not set (defaults to `None`). The current implementation will raise `TypeError: argument of type 'NoneType' is not iterable` when checking `key not in remove`.

```diff
 @staticmethod
-def _remove_metadata(metadata: dict[str, Any], remove: list[str]) -> dict[str, Any]:
+def _remove_metadata(
+    metadata: dict[str, Any], remove: Optional[list[str]] = None
+) -> dict[str, Any]:
     """Return a metadata dictionary without some keys."""
+    if not remove:
+        return dict(metadata)
     return {key: value for key, value in metadata.items() if key not in remove}
```

This aligns with the `Optional[list[str]]` types used in `DocumentProcessor.__init__`.
439-470: Add a type annotation for the `client` parameter to fix the pipeline failure.

The mypy error on line 439 indicates a missing type annotation. Since the client is dynamically imported and used via duck typing, use `Any`:

```diff
-def _calculate_embeddings(self, client, documents: list[dict[str, Any]]) -> None:
+def _calculate_embeddings(self, client: Any, documents: list[dict[str, Any]]) -> None:
```

This will resolve the pipeline failure.
409-424: Verify `None` is handled when calling `_remove_metadata`.

These calls pass `self.config.exclude_embed_metadata` and `self.config.exclude_llm_metadata`, which can be `None`. If `_remove_metadata` is not updated to handle `None`, this will raise a `TypeError`.

Either:
- Fix `_remove_metadata` to handle `None` (recommended), or
- Use `self.config.exclude_embed_metadata or []` here:

```diff
         embed_metadata = self._remove_metadata(
-            node.metadata, self.config.exclude_embed_metadata
+            node.metadata, self.config.exclude_embed_metadata or []
         )
         llm_metadata = self._remove_metadata(
-            node.metadata, self.config.exclude_llm_metadata
+            node.metadata, self.config.exclude_llm_metadata or []
         )
```
🧹 Nitpick comments (1)
tests/test_document_processor.py (1)
207-245: Consider adding a test case for `None` input to `_remove_metadata`.

The test suite covers empty list and nonexistent keys, but based on past review feedback, `_remove_metadata` should handle `None` gracefully since `config.exclude_embed_metadata` and `config.exclude_llm_metadata` can be `None`. Consider adding a test case:

```python
def test__remove_metadata_none_keys(self):
    """Test that _remove_metadata handles None keys gracefully."""
    metadata = {"key1": "value1", "key2": "value2"}
    result = document_processor._BaseDB._remove_metadata(metadata, None)
    assert result == metadata
```

This would help ensure the `None` handling fix (if applied) is tested.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- README.md (1 hunks)
- src/lightspeed_rag_content/document_processor.py (10 hunks)
- src/lightspeed_rag_content/utils.py (2 hunks)
- tests/test_document_processor.py (4 hunks)
- tests/test_document_processor_llama_index.py (1 hunks)
- tests/test_document_processor_llama_stack.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_rag_content/utils.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_document_processor.py (1)
src/lightspeed_rag_content/document_processor.py (1)
- _remove_metadata (109-111)
tests/test_document_processor_llama_index.py (2)
tests/conftest.py (1)
- RagMockEmbedding (19-28)
src/lightspeed_rag_content/document_processor.py (5)
- DocumentProcessor (503-618)
- exclude_metadata (205-236)
- _model_copy_excluding_llm_metadata (195-203)
- add_docs (148-152)
- add_docs (398-437)
tests/test_document_processor_llama_stack.py (1)
src/lightspeed_rag_content/document_processor.py (4)
- _LlamaStackDB (239-500)
- _calculate_embeddings (439-470)
- add_docs (148-152)
- add_docs (398-437)
🪛 GitHub Actions: Type checks
src/lightspeed_rag_content/document_processor.py
[error] 439-439: Function is missing a type annotation for one or more arguments [no-untyped-def]. Mypy failed on command: 'uv run mypy --explicit-package-bases --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs --ignore-missing-imports --disable-error-code attr-defined src/ scripts/' (exit code 1).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pylinter
- GitHub Check: build-and-push-dev
- GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (16)
README.md (1)
136-138: LGTM! The documentation correctly demonstrates the new `exclude_embed_metadata` and `exclude_llm_metadata` parameters in the `DocumentProcessor` example, which aligns with the bug fix for the `--exclude-metadata` parameter.
tests/test_document_processor.py (1)
84-86: LGTM! The test correctly verifies that `exclude_embed_metadata` and `exclude_llm_metadata` default to `None` when not provided.
tests/test_document_processor_llama_index.py (4)
247-286: LGTM! Comprehensive test that verifies `exclude_metadata` correctly sets `excluded_embed_metadata_keys` and `excluded_llm_metadata_keys` on all nodes.

287-325: LGTM! Good test coverage for verifying that `model_copy` is properly overridden with a bound method.

326-367: LGTM! Thorough test that verifies `_model_copy_excluding_llm_metadata` correctly removes excluded keys while preserving non-excluded keys and text content.

369-427: LGTM! Good coverage for verifying the integration of `exclude_metadata` in the `add_docs` flow and that configuration values are properly stored and defaulted.
tests/test_document_processor_llama_stack.py (6)
86-103: LGTM! Good addition of tiktoken mocking to prevent network calls, and the fixture correctly initializes `exclude_embed_metadata` and `exclude_llm_metadata` as empty lists for the test configuration.

135-163: LGTM! Smart approach using `side_effect` to handle multiple `os.path.exists` calls with different return values based on the path content.

321-345: LGTM! Test expectations correctly updated to include `embed_metadata` in the expected document structure, which aligns with the new embedding metadata handling.

386-437: LGTM! The `_test_save` helper and save tests are properly updated to mock the embeddings response and verify the correct flow for both manual and automatic chunking paths.

439-498: LGTM! Excellent test coverage for `_calculate_embeddings`:
- Verifies `embed_metadata` is popped from documents
- Verifies embeddings are added correctly
- Verifies the formatted data includes metadata in the expected format
- Edge case with empty metadata is properly tested

500-631: LGTM! Thorough test coverage for the exclude metadata functionality in the LlamaStack path:
- `exclude_embed_metadata` only affects `embed_metadata`
- `exclude_llm_metadata` only affects `metadata` (llm metadata)
- Both exclusions can work together independently
- Multiple documents are handled correctly with per-document embeddings
src/lightspeed_rag_content/document_processor.py (4)
22-22: LGTM! The `types` import is needed for `types.MethodType` used in the `exclude_metadata` method.

80-81: LGTM! Storing the original `model_copy` reference in the base class allows subclasses to use it when overriding the method on nodes.

240-243: LGTM! The embedding template constants follow the llama-index metadata formatting pattern and enable proper embedding calculation with metadata.

517-536: LGTM! The new `exclude_embed_metadata` and `exclude_llm_metadata` parameters are correctly added to `DocumentProcessor.__init__` with `Optional[list[str]]` types and properly propagated to the `_Config`.
This patch fixes the `--exclude-metadata` parameter that is currently not being honored. The fix is only for llama-index, because llama-stack is not using the metadata for the embedding and is only using the contents [1].

This patch also adds a CLI argument to exclude metadata from the data that is sent to the LLM (`--exclude-llm-metadata`), and we actually change what is saved in the DB instead of hoping that the user of the DB filters things manually using the node's `excluded_llm_metadata_keys` field or calls method `get_content(metadata_mode=MetadataMode.LLM)` to get a string representation of the contents together with the LLM metadata.

The exclusion arguments default to the llama-index default values, although I would personally add a couple to each of them:
- `exclude_metadata` with ["file_path", "url", "url_reachable"]
- `exclude_llm_metadata` with ["file_path", "url_reachable"]

But they haven't been changed for backward compatibility.

Assisted-by: N/A
Generated-by: Cursor for tests

[1]: https://github.com/llamastack/llama-stack/blob/b1c5b8fa9f2c994f0ed15849fef99b6304440b5b/src/llama_stack/providers/utils/memory/vector_store.py#L283
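For context on the behaviour the commit message describes: in llama-index, `excluded_llm_metadata_keys` only takes effect when the consumer asks for the LLM view of a node; reading the raw metadata bypasses it. A dependency-free mimic of that contract (simplified stand-in, not the real llama-index classes):

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    """Minimal stand-in for a llama-index TextNode (not the real class)."""
    text: str
    metadata: dict
    excluded_llm_metadata_keys: list = field(default_factory=list)

    def get_content_for_llm(self) -> str:
        """Mimic get_content(metadata_mode=MetadataMode.LLM)."""
        visible = {k: v for k, v in self.metadata.items()
                   if k not in self.excluded_llm_metadata_keys}
        meta = "\n".join(f"{k}: {v}" for k, v in visible.items())
        return f"{meta}\n\n{self.text}"

node = FakeNode("Body", {"title": "T", "file_path": "/tmp/x"},
                excluded_llm_metadata_keys=["file_path"])
print(node.get_content_for_llm())  # title line, blank line, Body -- file_path filtered
print(node.metadata)               # raw metadata still contains file_path
```

This is why the patch changes what is actually saved in the DB: a consumer that reads stored metadata directly, rather than going through the LLM-mode accessor, would otherwise still see the excluded keys.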
Llama-stack doesn't use the metadata for the embeddings [1], which is a different behavior from what we had with llama-index. In this patch we fix this by manually calculating the embedding ourselves and passing it to llama-stack. The embedding is calculated using the exact same templates as llama-index to try to keep things as similar as possible.

Unfortunately this is only possible when manual chunking is being used in this script (when configuration option `--auto-chunking` is not provided). Fortunately that is the default behavior.

This patch also adds support for the exclusion of metadata for both embedding and llm.

Assisted-by: N/A
Generated-by: Cursor for tests

[1]: https://github.com/llamastack/llama-stack/blob/b1c5b8fa9f2c994f0ed15849fef99b6304440b5b/src/llama_stack/providers/utils/memory/vector_store.py#L283
Force-pushed from 7a65151 to 4b753dc
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_rag_content/document_processor.py (1)
199-240: Consider defensive handling of None metadata exclusion lists.

The implementation now includes proper type annotations (addressing previous review feedback). However, the code assigns potentially `None` values to `doc.excluded_embed_metadata_keys` and `doc.excluded_llm_metadata_keys` (lines 220, 232). While `_remove_metadata` handles `None` correctly, it's unclear whether llama-index's TextNode expects these attributes to be lists or accepts `None`.

Consider adding guards to only set these attributes when the config values are not `None`:

```diff
 def exclude_metadata(self, documents: list[TextNode]) -> None:
     """Exclude metadata from documents. ... """
     for doc in documents:
-        doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata
+        if self.config.exclude_embed_metadata is not None:
+            doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata
         ...
-        doc.excluded_llm_metadata_keys = self.config.exclude_llm_metadata
+        if self.config.exclude_llm_metadata is not None:
+            doc.excluded_llm_metadata_keys = self.config.exclude_llm_metadata
```

This would be more defensive, though the current implementation appears functional given that `_remove_metadata` handles `None` correctly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- README.md (1 hunks)
- src/lightspeed_rag_content/document_processor.py (10 hunks)
- src/lightspeed_rag_content/utils.py (2 hunks)
- tests/test_document_processor.py (4 hunks)
- tests/test_document_processor_llama_index.py (1 hunks)
- tests/test_document_processor_llama_stack.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- src/lightspeed_rag_content/utils.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_document_processor_llama_stack.py (1)
src/lightspeed_rag_content/document_processor.py (4)
- _LlamaStackDB (243-506)
- _calculate_embeddings (443-476)
- add_docs (152-156)
- add_docs (402-441)
tests/test_document_processor.py (1)
src/lightspeed_rag_content/document_processor.py (1)
- _remove_metadata (109-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
- GitHub Check: Pylinter
- GitHub Check: build-and-push-dev
🔇 Additional comments (19)
tests/test_document_processor.py (2)
84-86: LGTM! The test assertions correctly verify that the new `exclude_embed_metadata` and `exclude_llm_metadata` configuration attributes default to `None` when not provided during initialization. Also applies to: 111-112, 142-143

207-245: LGTM! The new `TestBaseDB` test suite provides comprehensive coverage of `_remove_metadata` behavior, including edge cases like empty removal lists and nonexistent keys. The tests validate both that specified keys are removed and that other keys are preserved correctly.
tests/test_document_processor_llama_index.py (1)
246-427: LGTM! The new test suite for LlamaIndex integration provides excellent coverage of the `exclude_metadata` functionality:
- Configuration propagation through initialization
- Runtime behavior of the `exclude_metadata` method setting node attributes
- `model_copy` override mechanism to filter LLM metadata
- Validation that `_model_copy_excluding_llm_metadata` correctly removes excluded keys while preserving others
- Integration with the `add_docs` workflow
- Default and custom configuration scenarios
The tests are well-structured and appropriately use mocks for unit testing.
tests/test_document_processor_llama_stack.py (7)
86-89: LGTM! Mocking tiktoken to prevent network calls during test initialization is a good practice that ensures tests remain fast and reliable without external dependencies.

101-102: LGTM! The fixture correctly initializes the new `exclude_embed_metadata` and `exclude_llm_metadata` configuration attributes with empty lists. The `test_init_model_path` update with `exists_side_effect` properly handles multiple `os.path.exists` calls (for both embeddings_model_dir and the tiktoken cache) by checking for the "embeddings_model" string in the path. Also applies to: 135-163
321-345: LGTM! The test correctly validates that `embed_metadata` is now included in each document's payload when using manual chunking, containing the appropriate metadata fields (document_id, title, docs_url) that will be used during embedding calculation.
386-437: LGTM! The `_test_save` helper and save tests are correctly updated to:
- Include `embed_metadata` in the document payload
- Mock the embeddings inference response (needed for `_calculate_embeddings`)
- Verify the save flow calls the appropriate llama-stack methods with expected parameters using a more flexible kwargs-based assertion approach
439-499: LGTM! The tests for `_calculate_embeddings` thoroughly validate:
- Removal of `embed_metadata` from documents after processing
- Addition of the `embedding` field with correct values
- Proper formatting of metadata into the embedding input (metadata + content)
- Correct ordering (metadata first, then content)
- Edge case handling with empty metadata

The assertions verify both the data transformation and the llama-stack client API calls.
500-596: LGTM! The metadata exclusion tests provide excellent coverage:
- `exclude_embed_metadata` only affects embed_metadata, not llm_metadata
- `exclude_llm_metadata` only affects llm_metadata (the metadata field), not embed_metadata
- Combined exclusions work independently and correctly
- Each test verifies both removal of excluded keys and preservation of non-excluded keys

The tests confirm the exclusion mechanism is properly isolated between embedding and LLM use cases.
597-631: LGTM! The test for multiple documents validates:
- Each document receives a distinct embedding
- The inference client is called once per document
- `embed_metadata` is properly removed from all documents

This confirms the per-document embedding calculation approach works correctly for multiple documents.
src/lightspeed_rag_content/document_processor.py (9)
22-22: LGTM! The `types` import and storage of the original `TextNode.model_copy` method are necessary for the metadata exclusion mechanism. Storing the original in `_BaseDB.__init__` makes it available to subclasses that need to wrap `model_copy` with metadata-filtering behavior. Also applies to: 80-80
108-115: LGTM! The `_remove_metadata` method is correctly implemented with proper type hints (`Optional[list[str]]`) and guards against `None` or empty removal lists by returning a copy of the original metadata. This addresses the concerns raised in previous reviews about robustness and type safety.

156-156: LGTM! The call to `exclude_metadata` is correctly placed in the `add_docs` flow after nodes are split and filtered. This ensures metadata exclusion settings are applied before embeddings are calculated.
244-247: LGTM! The embedding template constants provide a clear, readable format for incorporating metadata into embeddings: each metadata field on its own line in "key: value" format, followed by a blank line separator, then the content. This structured approach ensures metadata is included in the embedding calculation for the llama-stack backend.
413-428: LGTM! The metadata computation in `add_docs` correctly:
- Derives `embed_metadata` by excluding configured embed metadata keys from node metadata
- Derives `llm_metadata` by excluding configured LLM metadata keys from node metadata
- Adds `document_id` to `llm_metadata` (required by llama-stack)
- Stores `embed_metadata` in the document dictionary for later use during embedding calculation

This enables separate metadata control for embeddings versus LLM retrieval results.
432-432: LGTM! The warning message appropriately informs users that llama-stack's automatic chunking mode does not incorporate metadata into embeddings, helping set correct expectations about the behavior difference between manual and automatic modes.
443-476: LGTM! The `_calculate_embeddings` method is correctly implemented with proper type hints and logic:
- Accepts a `documents: list[dict[str, Any]]` parameter matching actual usage
- Iterates over each document dictionary
- Extracts and removes `embed_metadata` from each document
- Formats metadata fields into a readable string using the defined templates
- Constructs the embedding input by combining formatted metadata and content
- Calls the llama-stack inference API to compute embeddings per document
- Stores computed embeddings back in the document dictionary

This addresses previous review concerns about type mismatches and iteration issues.
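The per-document pop/format/embed/store sequence those bullets describe can be sketched with a stand-in client. The stub and function names below are hypothetical; the real method talks to llama-stack's inference API, which this sketch only imitates:

```python
class StubInferenceClient:
    """Stand-in for the llama-stack inference client (not the real API)."""
    def embeddings(self, contents):
        # Deterministic fake: a one-dimensional embedding per input string.
        return [[float(len(c))] for c in contents]

def calculate_embeddings(client, documents):
    """Pop embed_metadata, fold it into the content, embed, store the result."""
    for doc in documents:
        metadata = doc.pop("embed_metadata", {})  # removed after use
        meta_str = "\n".join(f"{k}: {v}" for k, v in metadata.items())
        formatted = f"{meta_str}\n\n{doc['content']}" if meta_str else doc["content"]
        doc["embedding"] = client.embeddings([formatted])[0]

docs = [{"content": "abc", "embed_metadata": {"title": "T"}}]
calculate_embeddings(StubInferenceClient(), docs)
print(docs[0]["embedding"])          # [13.0] == [len("title: T\n\nabc")]
print("embed_metadata" in docs[0])   # False
```

Popping `embed_metadata` before storing the document mirrors the behaviour the tests assert: the metadata influences the embedding but is not persisted alongside the chunk.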
496-496: LGTM! The call to `_calculate_embeddings` is correctly placed in the save flow for manual chunking mode, computing embeddings (with metadata) before inserting chunks into the vector store. This ensures metadata is incorporated into embeddings for llama-stack.
523-524: LGTM! The new `exclude_embed_metadata` and `exclude_llm_metadata` parameters are properly added to the `DocumentProcessor.__init__` signature with correct type hints (`Optional[list[str]]`) and default values (`None`), and are correctly passed through to the `_Config` object. This cleanly extends the public API to support metadata exclusion. Also applies to: 541-542
Description
Currently the Document Processor is ignoring the `--exclude-metadata` parameter and, in the case of Llama Stack, the embedding is being generated only with the node content, ignoring all the metadata.
Type of change
Checklist before requesting a review
Testing
I have created DBs using llama-index and llama-stack (`--vector-store-type=llamastack-faiss`) and then queried the database using a custom `query_rag.py` that has a fix for llama-stack so it shows the metadata (PR to follow).
Summary by CodeRabbit
New Features
CLI
Documentation
Tests