
Conversation

Contributor

@Akrog Akrog commented Nov 26, 2025

Description

Currently the Document Processor ignores the --exclude-metadata parameter, and in the case of Llama Stack the embedding is being generated only from the node content, ignoring all the metadata.

Type of change

  • Bug fix

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

I have created DBs using llama-index and llama-stack (--vector-store-type=llamastack-faiss) and then queried the database using a custom query_rag.py that has a fix for llama-stack so it shows the metadata (PR to follow).

Summary by CodeRabbit

  • New Features

    • Added configuration to exclude specific metadata from embeddings and LLM outputs; defaults provided.
    • Metadata exclusion is applied during document processing, embedding generation, and storage.
  • CLI

    • New CLI option to specify LLM-metadata exclusions; updated default exclusion list.
  • Documentation

    • README usage example updated to show new constructor parameters.
  • Tests

    • Expanded unit tests covering metadata exclusion, embedding generation, and related flows.



coderabbitai bot commented Nov 26, 2025

Walkthrough

Adds per-field metadata exclusion for embeddings and LLM usage by introducing exclude_embed_metadata and exclude_llm_metadata options, wiring them through DocumentProcessor → _Config → DB backends, adding metadata-removal helpers, model_copy wrapping, and an embedding-preparation step for the llama-stack path.

Changes

  • Docs (README.md): Updated README example to include exclude_embed_metadata and exclude_llm_metadata keyword args for DocumentProcessor.
  • Core: processor & DBs (src/lightspeed_rag_content/document_processor.py): Added exclude_embed_metadata and exclude_llm_metadata to DocumentProcessor.__init__ and propagated them into _Config; added a _remove_metadata utility; added the _model_copy_excluding_llm_metadata binding and exclude_metadata flows on _LlamaIndexDB/_LlamaStackDB; integrated embed-metadata extraction, filtering, and embedding preparation (_calculate_embeddings) for llama-stack; adjusted add_docs/save flows to compute and strip embed/LLM metadata.
  • DB constants & embedding formatting (same file, llama-stack sections): Introduced EMBEDDING_METADATA_SEPARATOR, EMBEDDING_METADATA_TEMPLATE, and EMBEDDING_TEMPLATE, plus logic to format contents including embed metadata when computing embeddings; conditional embedding computation for manual_chunking.
  • Utils / CLI (src/lightspeed_rag_content/utils.py): Added the DEFAULT_METADATA_EXCLUSSION constant; changed the -em/--exclude-metadata default to that constant; added the -elm/--exclude-llm-metadata CLI option.
  • Tests (tests/test_document_processor.py, tests/test_document_processor_llama_index.py, tests/test_document_processor_llama_stack.py): Updated tests to pass the new constructor args; added TestBaseDB for _remove_metadata; extensive tests for exclude-metadata propagation, model_copy wrapping, embed-metadata formatting/embedding behavior, and manual vs. auto chunking flows; updated mocks/assertions for tokenizer/embeddings and vector insert expectations.

Sequence Diagram

sequenceDiagram
    participant User
    participant DocProc as DocumentProcessor
    participant Config
    participant DB as _LlamaIndexDB/_LlamaStackDB
    participant EmbedSvc as Embeddings
    participant LLM

    User->>DocProc: init(..., exclude_embed_metadata, exclude_llm_metadata)
    DocProc->>Config: store exclusion lists

    User->>DocProc: add_docs(documents)
    DocProc->>DB: add_docs(documents, config)

    DB->>DB: split & prepare nodes
    DB->>DB: exclude_metadata(documents)        %% remove specified keys & bind model_copy wrapper
    DB->>DB: attach llm_metadata (filtered) and embed_metadata (filtered)

    alt llama-stack + manual_chunking
        DB->>DB: _calculate_embeddings(documents)   %% formats content using embed_metadata
        DB->>EmbedSvc: request embeddings (formatted contents)
        EmbedSvc-->>DB: embeddings
        DB->>DB: remove embed_metadata from docs after embedding
    else other paths
        DB->>EmbedSvc: embeddings requested downstream (embed metadata not included)
    end

    DB->>LLM: pass llm_metadata (filtered) for generation/indexing
    DB-->>DocProc: add_docs complete
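The manual-chunking branch of the diagram can be sketched roughly as follows. This is an illustrative re-implementation, not the patch's code: the helper name, the dict keys, and the "key: value" formatting are assumptions based on the review's description of _calculate_embeddings.

```python
from typing import Any, Callable


def calculate_embeddings(
    embed_fn: Callable[[str], list[float]],
    documents: list[dict[str, Any]],
) -> None:
    """Embed each chunk's content together with its embed metadata.

    The embed metadata is popped from the document afterward so only the
    embedding (and the LLM-facing metadata) is persisted.
    """
    for doc in documents:
        meta = doc.pop("embed_metadata", {})
        meta_str = "\n".join(f"{key}: {value}" for key, value in meta.items())
        text = f"{meta_str}\n\n{doc['content']}" if meta_str else doc["content"]
        doc["embedding"] = embed_fn(text)
```

The key point the diagram makes is that metadata participates in the embedding input only on this path; the automatic-chunking path hands raw content to llama-stack, which ignores metadata.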

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing attention:
    • Correctness and test coverage of _remove_metadata across DBs.
    • Proper binding/wrapping of TextNode.model_copy and side effects.
    • Formatting templates and ensuring embed metadata is used only when intended and removed afterward.
    • CLI → Config → DB propagation and defaults for exclusion lists.
    • Tests that mock tokenizer/embeddings and verify vector insertion behavior.

Suggested reviewers

  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main bug fix: implementing exclude metadata functionality and embedding fixes for llama-stack.
Docstring Coverage ✅ Passed Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_rag_content/document_processor.py (1)

401-417: Break long lines and handle Optional metadata lists in llama-stack add_docs

Lines building embed_metadata/llm_metadata are over 100 chars and also assume exclude_*_metadata are non‑None. To address Ruff E501 and make the code robust to Optional config:

-                chunk_metadata = {
-                    "document_id": node.ref_doc_id,
-                    "chunk_id": node.id_,
-                    "source": node.metadata.get("docs_url", node.metadata["title"]),
-                }
-                embed_metadata = self._remove_metadata(node.metadata, self.config.exclude_embed_metadata)
-                llm_metadata = self._remove_metadata(node.metadata, self.config.exclude_llm_metadata)
+                chunk_metadata = {
+                    "document_id": node.ref_doc_id,
+                    "chunk_id": node.id_,
+                    "source": node.metadata.get(
+                        "docs_url",
+                        node.metadata["title"],
+                    ),
+                }
+                embed_metadata = self._remove_metadata(
+                    node.metadata,
+                    self.config.exclude_embed_metadata,
+                )
+                llm_metadata = self._remove_metadata(
+                    node.metadata,
+                    self.config.exclude_llm_metadata,
+                )

Given the earlier _remove_metadata change, this will safely handle None while keeping lines under 100 chars.

🧹 Nitpick comments (5)
src/lightspeed_rag_content/utils.py (2)

19-22: DEFAULT_METADATA_EXCLUSSION name has a spelling typo

The constant name has a typo (EXCLUSSION vs EXCLUSION). It works functionally but may be confusing and propagate into other modules if imported. Consider renaming to DEFAULT_METADATA_EXCLUSION and updating internal references now while usage is still small.


45-57: CLI wiring for metadata exclusion looks correct; ensure lint passes

The updated default for --exclude-metadata and the new --exclude-llm-metadata flag align with the intended behaviour and with the README example (they will pass proper lists into DocumentProcessor). No functional issues here.

Ruff/Black are currently failing on this file for import formatting; please run Black/Ruff (or move import argparse to the top with a single blank line before constants) to get the checks passing.

src/lightspeed_rag_content/document_processor.py (3)

17-25: Fix import ordering to satisfy Ruff/Black

Ruff and Black are complaining about import formatting; import types is currently separated from the other stdlib imports. Grouping stdlib imports together will fix this.

You can apply something like:

-import json
-import logging
-import os
-import tempfile
-import time
-from pathlib import Path
-from typing import TYPE_CHECKING, Any, Optional, Union
-import types
+import json
+import logging
+import os
+import tempfile
+import time
+import types
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Optional, Union

Black will handle blank lines between stdlib / third-party / local imports.


148-153: Re-running exclude_metadata on every add_docs is safe but a bit redundant

Calling self.exclude_metadata(self._good_nodes) after extending _good_nodes means previously processed nodes will have their exclusions and model_copy re-applied every time. It’s functionally safe (idempotent assignment and object.__setattr__), but a minor inefficiency.

If you care about avoiding the repeated work, you could call exclude_metadata only on the valid_nodes you just added:

-        valid_nodes = self._split_and_filter(docs)
-        self._good_nodes.extend(valid_nodes)
-        self.exclude_metadata(self._good_nodes)
+        valid_nodes = self._split_and_filter(docs)
+        self._good_nodes.extend(valid_nodes)
+        self.exclude_metadata(valid_nodes)

Not critical, just a small clean‑up.


498-529: DocumentProcessor metadata-exclusion parameters: defaults vs Optional semantics

Extending DocumentProcessor.__init__ with exclude_embed_metadata and exclude_llm_metadata is consistent with the CLI and README usage. However, these parameters are Optional and passed through _Config as-is; lower layers (_LlamaIndexDB / _LlamaStackDB) previously assumed iterables.

With the _remove_metadata / exclude_metadata fixes suggested above, None is now handled safely and won’t break direct DocumentProcessor instantiations that don’t go through your CLI defaults.

Given that:

  • CLI users get the desired behaviour via DEFAULT_METADATA_EXCLUSSION.
  • Library users can opt in by explicitly passing lists.

No code changes are strictly required here beyond those lower-level fixes, but it’s worth keeping this contract in mind if you later want DocumentProcessor to apply the same default list automatically.

If you want DocumentProcessor to always mirror the CLI defaults, you could coerce None to your default list at construction time and import the constant from utils:

-from lightspeed_rag_content.utils import DEFAULT_METADATA_EXCLUSSION
+from lightspeed_rag_content.utils import DEFAULT_METADATA_EXCLUSSION
@@
-        self.config = _Config(
+        self.config = _Config(
@@
-            exclude_embed_metadata=exclude_embed_metadata,
-            exclude_llm_metadata=exclude_llm_metadata,
+            exclude_embed_metadata=exclude_embed_metadata or DEFAULT_METADATA_EXCLUSSION,
+            exclude_llm_metadata=exclude_llm_metadata or DEFAULT_METADATA_EXCLUSSION,
         )

(Adjust spelling of the constant if you rename it.)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef376d0 and 9fa1bbf.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • src/lightspeed_rag_content/document_processor.py (10 hunks)
  • src/lightspeed_rag_content/utils.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Black
src/lightspeed_rag_content/utils.py

[error] 1-1: Black formatting check failed. File would be reformatted by Black.

src/lightspeed_rag_content/document_processor.py

[error] 1-1: Black formatting check failed. File would be reformatted by Black.

🪛 GitHub Actions: Pyright
src/lightspeed_rag_content/document_processor.py

[error] 447-447: pyright: Cannot access attribute 'pop' for class 'str'. Attribute 'pop' is unknown.


[error] 454-454: pyright: Argument of type "Literal['content']" cannot be assigned to parameter "key" of type "SupportsIndex | slice[Any, Any, Any]" in function "__getitem__". Type "Literal['content']" is not assignable to type "SupportsIndex".


[error] 462-462: pyright: "__setitem__" method not defined on type "str" (reportIndexIssue).


[error] 482-482: pyright: Argument of type "list[dict[str, Any] | Unknown]" cannot be assigned to parameter "documents" of type "dict[str, Any]" in function "_calculate_embeddings". List type is not assignable to dict.

🪛 GitHub Actions: Ruff
src/lightspeed_rag_content/utils.py

[error] 16-16: I001 Import block is unsorted or un-formatted. Organize imports.

src/lightspeed_rag_content/document_processor.py

[error] 17-17: I001 Import block is unsorted or un-formatted. Organize imports.


[error] 38-38: I001 Import block is unsorted or un-formatted. Organize imports.


[error] 406-406: E501 Line too long (105 > 100).


[error] 407-407: E501 Line too long (101 > 100).

🪛 GitHub Actions: Type checks
src/lightspeed_rag_content/document_processor.py

[error] 195-195: Mypy error: Function is missing a type annotation [no-untyped-def]


[error] 202-202: Mypy error: Function is missing a type annotation for one or more arguments [no-untyped-def]


[error] 432-432: Mypy error: Function is missing a type annotation for one or more arguments [no-untyped-def]


[error] 454-454: Mypy error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice[Any, Any, Any]" [index]


[error] 462-462: Mypy error: Unsupported target for indexed assignment ("str") [index]

🪛 GitHub Actions: Unit tests
src/lightspeed_rag_content/document_processor.py

[error] 445-445: TypeError: '_SentinelObject' object is not iterable in _calculate_embeddings during manual chunking path (likely caused by incorrect documents input).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pylinter
  • GitHub Check: build-and-push-dev
  • GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (3)
src/lightspeed_rag_content/document_processor.py (2)

237-241: Embedding metadata templates for llama-stack look reasonable

The new EMBEDDING_METADATA_SEPARATOR, EMBEDDING_METADATA_TEMPLATE, and EMBEDDING_TEMPLATE constants give you a clear, deterministic way to flatten metadata into the embedding input. The simple text “key: value” lines separated by newlines should be easy to parse or evolve later if needed.

No changes needed here.
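As a minimal sketch of how such template constants compose, assuming values modeled on llama-index's default metadata formatting (the actual constant values in the patch were not quoted in this review):

```python
# Assumed values, mirroring llama-index's default metadata formatting.
EMBEDDING_METADATA_SEPARATOR = "\n"
EMBEDDING_METADATA_TEMPLATE = "{key}: {value}"
EMBEDDING_TEMPLATE = "{metadata_str}\n\n{content}"


def format_embedding_input(metadata: dict[str, str], content: str) -> str:
    """Flatten metadata into 'key: value' lines and prepend them to the content."""
    metadata_str = EMBEDDING_METADATA_SEPARATOR.join(
        EMBEDDING_METADATA_TEMPLATE.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return EMBEDDING_TEMPLATE.format(metadata_str=metadata_str, content=content)
```

Keeping the flattening deterministic like this means the same node always produces the same embedding input, which matters when comparing llama-index and llama-stack outputs.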


421-431: Warning about automatic mode and metadata is useful

The warning log when manual_chunking is False clarifies that llama-stack automatic mode doesn’t use metadata for embeddings, which is exactly the subtle behaviour this PR is addressing for the manual path.

No changes required here.

README.md (1)

128-138: README example correctly wires new metadata-exclusion parameters

The custom_processor.py example now passes exclude_embed_metadata=args.exclude_metadata and exclude_llm_metadata=args.exclude_llm_metadata into DocumentProcessor, matching the new CLI flags and constructor signature. This keeps the documentation aligned with the implementation.

No changes needed here.

@Akrog Akrog force-pushed the fix-exclude-metadata branch from 9fa1bbf to 7a65151 on November 26, 2025 at 18:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
src/lightspeed_rag_content/document_processor.py (5)

195-203: Add type annotations to satisfy mypy.

While this was flagged in past reviews, adding annotations here improves type safety:

-def _model_copy_excluding_llm_metadata(
-    self, node: TextNode, *args: Any, **kwargs: Any
-) -> TextNode:
+def _model_copy_excluding_llm_metadata(
+    self, node: TextNode, *args: Any, **kwargs: Any
+) -> TextNode:
     """Replace node's model_copy to remove metadata."""
     res = self.original_model_copy(node, *args, **kwargs)
-    res.metadata = self._remove_metadata(
-        res.metadata, node.excluded_llm_metadata_keys
-    )
+    remove_keys = getattr(node, "excluded_llm_metadata_keys", None)
+    res.metadata = self._remove_metadata(res.metadata, remove_keys)
     return res

Using getattr with a default makes it resilient if the attribute is missing.


205-236: LGTM with a note on potential None assignment.

The exclude_metadata method is well-documented with links to the relevant llama-index code. However, if config.exclude_embed_metadata or config.exclude_llm_metadata is None, assigning it directly to doc.excluded_embed_metadata_keys may not match llama-index expectations (which typically expect a list).

Consider guarding:

if self.config.exclude_embed_metadata is not None:
    doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata

108-111: Make _remove_metadata robust to None to prevent TypeError.

The remove parameter can be None when config.exclude_embed_metadata or config.exclude_llm_metadata is not set (defaults to None). The current implementation will raise TypeError: argument of type 'NoneType' is not iterable when checking key not in remove.

 @staticmethod
-def _remove_metadata(metadata: dict[str, Any], remove: list[str]) -> dict[str, Any]:
+def _remove_metadata(
+    metadata: dict[str, Any], remove: Optional[list[str]] = None
+) -> dict[str, Any]:
     """Return a metadata dictionary without some keys."""
+    if not remove:
+        return dict(metadata)
     return {key: value for key, value in metadata.items() if key not in remove}

This aligns with the Optional[list[str]] types used in DocumentProcessor.__init__.
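A standalone re-implementation of the suggested helper, to illustrate the None-tolerant behaviour (names match the diff above, but this sketch is detached from the class for demonstration):

```python
from typing import Any, Optional


def remove_metadata(
    metadata: dict[str, Any], remove: Optional[list[str]] = None
) -> dict[str, Any]:
    """Return a copy of metadata without the listed keys.

    None or an empty list removes nothing, so callers can pass the
    config value straight through without guarding.
    """
    if not remove:
        return dict(metadata)
    return {key: value for key, value in metadata.items() if key not in remove}
```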


439-470: Add type annotation for client parameter to fix pipeline failure.

The mypy error on line 439 indicates a missing type annotation. Since the client is dynamically imported and used via duck typing, use Any:

-def _calculate_embeddings(self, client, documents: list[dict[str, Any]]) -> None:
+def _calculate_embeddings(self, client: Any, documents: list[dict[str, Any]]) -> None:

This will resolve the pipeline failure.


409-424: Verify None is handled when calling _remove_metadata.

These calls pass self.config.exclude_embed_metadata and self.config.exclude_llm_metadata which can be None. If _remove_metadata is not updated to handle None, this will raise a TypeError.

Either:

  1. Fix _remove_metadata to handle None (recommended), or
  2. Use self.config.exclude_embed_metadata or [] here
 embed_metadata = self._remove_metadata(
-    node.metadata, self.config.exclude_embed_metadata
+    node.metadata, self.config.exclude_embed_metadata or []
 )
 llm_metadata = self._remove_metadata(
-    node.metadata, self.config.exclude_llm_metadata
+    node.metadata, self.config.exclude_llm_metadata or []
 )
🧹 Nitpick comments (1)
tests/test_document_processor.py (1)

207-245: Consider adding a test case for None input to _remove_metadata.

The test suite covers empty list and nonexistent keys, but based on past review feedback, _remove_metadata should handle None gracefully since config.exclude_embed_metadata and config.exclude_llm_metadata can be None. Consider adding a test case:

def test__remove_metadata_none_keys(self):
    """Test that _remove_metadata handles None keys gracefully."""
    metadata = {"key1": "value1", "key2": "value2"}
    
    result = document_processor._BaseDB._remove_metadata(metadata, None)
    
    assert result == metadata

This would help ensure the None handling fix (if applied) is tested.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fa1bbf and 7a65151.

📒 Files selected for processing (6)
  • README.md (1 hunks)
  • src/lightspeed_rag_content/document_processor.py (10 hunks)
  • src/lightspeed_rag_content/utils.py (2 hunks)
  • tests/test_document_processor.py (4 hunks)
  • tests/test_document_processor_llama_index.py (1 hunks)
  • tests/test_document_processor_llama_stack.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_rag_content/utils.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_document_processor.py (1)
src/lightspeed_rag_content/document_processor.py (1)
  • _remove_metadata (109-111)
tests/test_document_processor_llama_index.py (2)
tests/conftest.py (1)
  • RagMockEmbedding (19-28)
src/lightspeed_rag_content/document_processor.py (5)
  • DocumentProcessor (503-618)
  • exclude_metadata (205-236)
  • _model_copy_excluding_llm_metadata (195-203)
  • add_docs (148-152)
  • add_docs (398-437)
tests/test_document_processor_llama_stack.py (1)
src/lightspeed_rag_content/document_processor.py (4)
  • _LlamaStackDB (239-500)
  • _calculate_embeddings (439-470)
  • add_docs (148-152)
  • add_docs (398-437)
🪛 GitHub Actions: Type checks
src/lightspeed_rag_content/document_processor.py

[error] 439-439: Function is missing a type annotation for one or more arguments [no-untyped-def]. Mypy failed on command: 'uv run mypy --explicit-package-bases --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs --ignore-missing-imports --disable-error-code attr-defined src/ scripts/' (exit code 1).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pylinter
  • GitHub Check: build-and-push-dev
  • GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (16)
README.md (1)

136-138: LGTM!

The documentation correctly demonstrates the new exclude_embed_metadata and exclude_llm_metadata parameters in the DocumentProcessor example, which aligns with the bug fix for the --exclude-metadata parameter.

tests/test_document_processor.py (1)

84-86: LGTM!

The test correctly verifies that exclude_embed_metadata and exclude_llm_metadata default to None when not provided.

tests/test_document_processor_llama_index.py (4)

247-286: LGTM!

Comprehensive test that verifies exclude_metadata correctly sets excluded_embed_metadata_keys and excluded_llm_metadata_keys on all nodes.


287-325: LGTM!

Good test coverage for verifying that model_copy is properly overridden with a bound method.


326-367: LGTM!

Thorough test that verifies _model_copy_excluding_llm_metadata correctly removes excluded keys while preserving non-excluded keys and text content.


369-427: LGTM!

Good coverage for verifying the integration of exclude_metadata in the add_docs flow and that configuration values are properly stored and defaulted.

tests/test_document_processor_llama_stack.py (6)

86-103: LGTM!

Good addition of tiktoken mocking to prevent network calls and the fixture correctly initializes exclude_embed_metadata and exclude_llm_metadata as empty lists for the test configuration.


135-163: LGTM!

Smart approach using side_effect to handle multiple os.path.exists calls with different return values based on the path content.


321-345: LGTM!

Test expectations correctly updated to include embed_metadata in the expected document structure, which aligns with the new embedding metadata handling.


386-437: LGTM!

The _test_save helper and save tests are properly updated to mock the embeddings response and verify the correct flow for both manual and automatic chunking paths.


439-498: LGTM!

Excellent test coverage for _calculate_embeddings:

  • Verifies embed_metadata is popped from documents
  • Verifies embeddings are added correctly
  • Verifies the formatted data includes metadata in the expected format
  • Edge case with empty metadata is properly tested

500-631: LGTM!

Thorough test coverage for the exclude metadata functionality in the LlamaStack path:

  • exclude_embed_metadata only affects embed_metadata
  • exclude_llm_metadata only affects metadata (llm metadata)
  • Both exclusions can work together independently
  • Multiple documents are handled correctly with per-document embeddings
src/lightspeed_rag_content/document_processor.py (4)

22-22: LGTM!

The types import is needed for types.MethodType used in the exclude_metadata method.
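For readers unfamiliar with the pattern, types.MethodType binds a plain function to a single object, which is how a per-node model_copy override can be installed without touching the class. The following is an illustrative sketch with a stand-in Node class, not the patch's pydantic-based code:

```python
import types


class Node:
    """Stand-in for a llama-index TextNode (illustration only)."""

    def __init__(self, metadata: dict) -> None:
        self.metadata = metadata

    def model_copy(self) -> "Node":
        return Node(dict(self.metadata))


def copy_excluding(self: Node) -> Node:
    # Bound replacement: copy the node, then drop keys marked for exclusion.
    res = Node(dict(self.metadata))
    excluded = getattr(self, "excluded_llm_metadata_keys", []) or []
    res.metadata = {k: v for k, v in res.metadata.items() if k not in excluded}
    return res


node = Node({"title": "Doc", "file_path": "/tmp/x"})
node.excluded_llm_metadata_keys = ["file_path"]
# Override model_copy on this instance only; other Node objects are unaffected.
node.model_copy = types.MethodType(copy_excluding, node)
```

Binding per instance keeps the exclusion behaviour scoped to the nodes the processor has prepared, rather than globally monkey-patching the class.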


80-81: LGTM!

Storing the original model_copy reference in the base class allows subclasses to use it when overriding the method on nodes.


240-243: LGTM!

The embedding template constants follow the llama-index metadata formatting pattern and enable proper embedding calculation with metadata.


517-536: LGTM!

The new exclude_embed_metadata and exclude_llm_metadata parameters are correctly added to DocumentProcessor.__init__ with Optional[list[str]] types and properly propagated to the _Config.

This patch fixes the --exclude-metadata parameter that is currently not
being honored.

The fix is only for llama-index, because llama-stack is not using the
metadata for the embedding and is only using the contents [1].

This patch adds a CLI argument to exclude metadata from the data that is
sent to the LLM (--exclude-llm-metadata), and we actually change what is
saved in the DB instead of hoping that the user of the DB filters things
manually using the node's `excluded_llm_metadata_keys` field or calls
method `get_content(metadata_mode=MetadataMode.LLM)` to get a string
representation of the contents together with the LLM metadata.

The exclusion arguments default to the llama-index default values,
although I would personally add a couple to each of them:

- `exclude_metadata` with ["file_path", "url", "url_reachable"]
- `exclude_llm_metadata` with ["file_path", "url_reachable"]

But they haven't been changed for backward compatibility.

Assisted-by: N/A
Generated-by: Cursor for tests

[1]: https://github.com/llamastack/llama-stack/blob/b1c5b8fa9f2c994f0ed15849fef99b6304440b5b/src/llama_stack/providers/utils/memory/vector_store.py#L283
Llama-stack doesn't use the metadata for the embeddings [1], which is a
different behavior from what we had with llama-index.

In this patch we fix this by manually calculating the embedding
ourselves and passing it to llama-stack.

The embedding is calculated using the exact same templates as
llama-index to try to keep things as similar as possible.

Unfortunately this is only possible when manual chunking is being used
in this script (when configuration option `--auto-chunking` is not
provided).  Fortunately that is the default behavior.

This patch also adds support for the exclusion of metadata for both
embedding and llm.

Assisted-by: N/A
Generated-by: Cursor for tests

[1]: https://github.com/llamastack/llama-stack/blob/b1c5b8fa9f2c994f0ed15849fef99b6304440b5b/src/llama_stack/providers/utils/memory/vector_store.py#L283
@Akrog Akrog force-pushed the fix-exclude-metadata branch from 7a65151 to 4b753dc on November 26, 2025 at 18:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_rag_content/document_processor.py (1)

199-240: Consider defensive handling of None metadata exclusion lists.

The implementation now includes proper type annotations (addressing previous review feedback). However, the code assigns potentially None values to doc.excluded_embed_metadata_keys and doc.excluded_llm_metadata_keys (lines 220, 232). While _remove_metadata handles None correctly, it's unclear whether llama-index's TextNode expects these attributes to be lists or accepts None.

Consider adding guards to only set these attributes when the config values are not None:

     def exclude_metadata(self, documents: list[TextNode]) -> None:
         """Exclude metadata from documents.
         ...
         """
         for doc in documents:
-            doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata
+            if self.config.exclude_embed_metadata is not None:
+                doc.excluded_embed_metadata_keys = self.config.exclude_embed_metadata
             ...
-            doc.excluded_llm_metadata_keys = self.config.exclude_llm_metadata
+            if self.config.exclude_llm_metadata is not None:
+                doc.excluded_llm_metadata_keys = self.config.exclude_llm_metadata

This would be more defensive, though the current implementation appears functional given that _remove_metadata handles None correctly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a65151 and 4b753dc.

📒 Files selected for processing (6)
  • README.md (1 hunks)
  • src/lightspeed_rag_content/document_processor.py (10 hunks)
  • src/lightspeed_rag_content/utils.py (2 hunks)
  • tests/test_document_processor.py (4 hunks)
  • tests/test_document_processor_llama_index.py (1 hunks)
  • tests/test_document_processor_llama_stack.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • src/lightspeed_rag_content/utils.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_document_processor_llama_stack.py (1)
src/lightspeed_rag_content/document_processor.py (4)
  • _LlamaStackDB (243-506)
  • _calculate_embeddings (443-476)
  • add_docs (152-156)
  • add_docs (402-441)
tests/test_document_processor.py (1)
src/lightspeed_rag_content/document_processor.py (1)
  • _remove_metadata (109-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
  • GitHub Check: Pylinter
  • GitHub Check: build-and-push-dev
🔇 Additional comments (19)
tests/test_document_processor.py (2)

84-86: LGTM!

The test assertions correctly verify that the new exclude_embed_metadata and exclude_llm_metadata configuration attributes default to None when not provided during initialization.

Also applies to: 111-112, 142-143


207-245: LGTM!

The new TestBaseDB test suite provides comprehensive coverage of _remove_metadata behavior, including edge cases like empty removal lists and nonexistent keys. The tests validate both that specified keys are removed and that other keys are preserved correctly.

tests/test_document_processor_llama_index.py (1)

246-427: LGTM!

The new test suite for LlamaIndex integration provides excellent coverage of the exclude_metadata functionality:

  • Configuration propagation through initialization
  • Runtime behavior of exclude_metadata method setting node attributes
  • model_copy override mechanism to filter LLM metadata
  • Validation that _model_copy_excluding_llm_metadata correctly removes excluded keys while preserving others
  • Integration with add_docs workflow
  • Default and custom configuration scenarios

The tests are well-structured and appropriately use mocks for unit testing.

tests/test_document_processor_llama_stack.py (7)

86-89: LGTM!

Mocking tiktoken to prevent network calls during test initialization is a good practice that ensures tests remain fast and reliable without external dependencies.


101-102: LGTM!

The fixture correctly initializes the new exclude_embed_metadata and exclude_llm_metadata configuration attributes with empty lists. The test_init_model_path update with exists_side_effect properly handles multiple os.path.exists calls (for both embeddings_model_dir and tiktoken cache) by checking for the "embeddings_model" string in the path.

Also applies to: 135-163


321-345: LGTM!

The test correctly validates that embed_metadata is now included in each document's payload when using manual chunking, containing the appropriate metadata fields (document_id, title, docs_url) that will be used during embedding calculation.


386-437: LGTM!

The _test_save helper and save tests are correctly updated to:

  • Include embed_metadata in document payload
  • Mock the embeddings inference response (needed for _calculate_embeddings)
  • Verify the save flow calls the appropriate llama-stack methods with expected parameters using a more flexible kwargs-based assertion approach

439-499: LGTM!

The tests for _calculate_embeddings thoroughly validate:

  • Removal of embed_metadata from documents after processing
  • Addition of embedding field with correct values
  • Proper formatting of metadata into the embedding input (metadata + content)
  • Correct ordering (metadata first, then content)
  • Edge case handling with empty metadata

The assertions verify both the data transformation and the llama-stack client API calls.


500-596: LGTM!

The metadata exclusion tests provide excellent coverage:

  • exclude_embed_metadata only affects embed_metadata, not llm_metadata
  • exclude_llm_metadata only affects llm_metadata (the metadata field), not embed_metadata
  • Combined exclusions work independently and correctly
  • Each test verifies both removal of excluded keys and preservation of non-excluded keys

The tests confirm the exclusion mechanism is properly isolated between embedding and LLM use cases.


597-631: LGTM!

The test for multiple documents validates:

  • Each document receives a distinct embedding
  • The inference client is called once per document
  • embed_metadata is properly removed from all documents

This confirms the per-document embedding calculation approach works correctly for multiple documents.

src/lightspeed_rag_content/document_processor.py (9)

22-22: LGTM!

The types import and storage of the original TextNode.model_copy method are necessary for the metadata exclusion mechanism. Storing the original in _BaseDB.__init__ makes it available to subclasses that need to wrap model_copy with metadata-filtering behavior.

Also applies to: 80-80
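The wrapping pattern described here can be sketched with a minimal stand-in class. `Node`, `_make_filtered_copy`, and the instance binding below are illustrative assumptions, not the actual implementation:

```python
import types

class Node:
    """Minimal stand-in for llama_index's TextNode."""

    def __init__(self, metadata: dict[str, str]) -> None:
        self.metadata = dict(metadata)

    def model_copy(self) -> "Node":
        return Node(self.metadata)

# Keep a reference to the unwrapped method, as _BaseDB.__init__ does
_original_model_copy = Node.model_copy

def _make_filtered_copy(excluded_keys: list[str]):
    def model_copy(self: Node) -> Node:
        copy = _original_model_copy(self)
        # Drop excluded keys from the copy only; the source node is untouched
        copy.metadata = {
            k: v for k, v in copy.metadata.items() if k not in excluded_keys
        }
        return copy
    return model_copy

node = Node({"title": "Doc", "file_path": "/tmp/doc.md"})
# Bind the wrapper on the instance so copies drop the excluded keys
node.model_copy = types.MethodType(_make_filtered_copy(["file_path"]), node)
```

Keeping the original bound method around is what lets the wrapper delegate the actual copying while filtering only the result.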


108-115: LGTM!

The _remove_metadata method is correctly implemented with proper type hints (Optional[list[str]]) and guards against None or empty removal lists by returning a copy of the original metadata. This addresses the concerns raised in previous reviews about robustness and type safety.


156-156: LGTM!

The call to exclude_metadata is correctly placed in the add_docs flow after nodes are split and filtered. This ensures metadata exclusion settings are applied before embeddings are calculated.


244-247: LGTM!

The embedding template constants provide a clear, readable format for incorporating metadata into embeddings: each metadata field on its own line in "key: value" format, followed by a blank line separator, then the content. This structured approach ensures metadata is included in the embedding calculation for the llama-stack backend.
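Under the format described (constant and function names here are assumptions), the embedding input could be built like:

```python
# Assumed template shapes; the real constants live in the llama-stack backend
METADATA_LINE = "{key}: {value}"
EMBED_INPUT = "{metadata}\n\n{content}"

def build_embed_input(metadata: dict[str, str], content: str) -> str:
    # One "key: value" line per field, a blank-line separator, then the content
    meta_block = "\n".join(
        METADATA_LINE.format(key=k, value=v) for k, v in metadata.items()
    )
    return EMBED_INPUT.format(metadata=meta_block, content=content)
```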


413-428: LGTM!

The metadata computation in add_docs correctly:

  • Derives embed_metadata by excluding configured embed metadata keys from node metadata
  • Derives llm_metadata by excluding configured LLM metadata keys from node metadata
  • Adds document_id to llm_metadata (required by llama-stack)
  • Stores embed_metadata in the document dictionary for later use during embedding calculation

This enables separate metadata control for embeddings versus LLM retrieval results.
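The per-node metadata split can be illustrated as follows (the function name and shapes are hypothetical):

```python
from typing import Any

def split_node_metadata(
    metadata: dict[str, Any],
    node_id: str,
    exclude_embed: list[str],
    exclude_llm: list[str],
) -> tuple[dict[str, Any], dict[str, Any]]:
    # Two independent views of the same node metadata
    embed_metadata = {k: v for k, v in metadata.items() if k not in exclude_embed}
    llm_metadata = {k: v for k, v in metadata.items() if k not in exclude_llm}
    # llama-stack requires a document_id in the retrieval-facing metadata
    llm_metadata["document_id"] = node_id
    return embed_metadata, llm_metadata
```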


432-432: LGTM!

The warning message appropriately informs users that llama-stack's automatic chunking mode does not incorporate metadata into embeddings, helping set correct expectations about the behavior difference between manual and automatic modes.


443-476: LGTM!

The _calculate_embeddings method is correctly implemented with proper type hints and logic:

  • Accepts documents: list[dict[str, Any]] parameter matching actual usage
  • Iterates over each document dictionary
  • Extracts and removes embed_metadata from each document
  • Formats metadata fields into a readable string using the defined templates
  • Constructs embedding input by combining formatted metadata and content
  • Calls llama-stack inference API to compute embeddings per document
  • Stores computed embeddings back in the document dictionary

This addresses previous review concerns about type mismatches and iteration issues.
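A reduced sketch of that per-document flow, where `embed_fn` stands in for the llama-stack inference call (which is not reproduced here):

```python
from typing import Any, Callable

def calculate_embeddings(
    documents: list[dict[str, Any]],
    embed_fn: Callable[[str], list[float]],
) -> None:
    for doc in documents:
        # Extract and remove the embedding-only metadata from the document
        metadata = doc.pop("embed_metadata", {})
        meta_block = "\n".join(f"{k}: {v}" for k, v in metadata.items())
        # Metadata first, blank-line separator, then the chunk content
        text = f"{meta_block}\n\n{doc['content']}" if meta_block else doc["content"]
        # Store the computed embedding back on the document dictionary
        doc["embedding"] = embed_fn(text)
```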


496-496: LGTM!

The call to _calculate_embeddings is correctly placed in the save flow for manual chunking mode, computing embeddings (with metadata) before inserting chunks into the vector store. This ensures metadata is incorporated into embeddings for llama-stack.


523-524: LGTM!

The new exclude_embed_metadata and exclude_llm_metadata parameters are properly added to the DocumentProcessor.__init__ signature with correct type hints (Optional[list[str]]) and default values (None), and are correctly passed through to the _Config object. This cleanly extends the public API to support metadata exclusion.

Also applies to: 541-542
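Reduced to just the two new parameters, the pass-through might look like this (other constructor arguments omitted; `_Config` shown as a plain dataclass for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class _Config:
    exclude_embed_metadata: Optional[list[str]] = None
    exclude_llm_metadata: Optional[list[str]] = None

class DocumentProcessor:
    def __init__(
        self,
        exclude_embed_metadata: Optional[list[str]] = None,
        exclude_llm_metadata: Optional[list[str]] = None,
    ) -> None:
        # New options are forwarded unchanged to the config object
        self._config = _Config(
            exclude_embed_metadata=exclude_embed_metadata,
            exclude_llm_metadata=exclude_llm_metadata,
        )
```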
