28 changes: 20 additions & 8 deletions docs/openapi.json
@@ -3664,7 +3664,7 @@
"rlsapi-v1"
],
"summary": "Infer Endpoint",
"description": "Handle rlsapi v1 /infer requests for stateless inference.\n\nThis endpoint serves requests from the RHEL Lightspeed Command Line Assistant (CLA).\n\nAccepts a question with optional context (stdin, attachments, terminal output,\nsystem info) and returns an LLM-generated response.\n\nArgs:\n infer_request: The inference request containing question and context.\n auth: Authentication tuple from the configured auth provider.\n\nReturns:\n RlsapiV1InferResponse containing the generated response text and request ID.\n\nRaises:\n HTTPException: 503 if the LLM service is unavailable.",
"description": "Handle rlsapi v1 /infer requests for stateless inference.\n\nThis endpoint serves requests from the RHEL Lightspeed Command Line Assistant (CLA).\n\nAccepts a question with optional context (stdin, attachments, terminal output,\nsystem info) and returns an LLM-generated response.\n\nArgs:\n infer_request: The inference request containing question and context.\n request: The FastAPI request object for accessing headers and state.\n background_tasks: FastAPI background tasks for async Splunk event sending.\n auth: Authentication tuple from the configured auth provider.\n\nReturns:\n RlsapiV1InferResponse containing the generated response text and request ID.\n\nRaises:\n HTTPException: 503 if the LLM service is unavailable.",
"operationId": "infer_endpoint_v1_infer_post",
"requestBody": {
"content": {
@@ -4290,7 +4290,7 @@
],
"summary": "Handle A2A Jsonrpc",
"description": "Handle A2A JSON-RPC requests following the A2A protocol specification.\n\nThis endpoint uses the DefaultRequestHandler from the A2A SDK to handle\nall JSON-RPC requests including message/send, message/stream, etc.\n\nThe A2A SDK application is created per-request to include authentication\ncontext while still leveraging FastAPI's authorization middleware.\n\nAutomatically detects streaming requests (message/stream JSON-RPC method)\nand returns a StreamingResponse to enable real-time chunk delivery.\n\nArgs:\n request: FastAPI request object\n auth: Authentication tuple\n mcp_headers: MCP headers for context propagation\n\nReturns:\n JSON-RPC response or streaming response",
"operationId": "handle_a2a_jsonrpc_a2a_get",
"operationId": "handle_a2a_jsonrpc_a2a_post",
"responses": {
"200": {
"description": "Successful Response",
@@ -4308,7 +4308,7 @@
],
"summary": "Handle A2A Jsonrpc",
"description": "Handle A2A JSON-RPC requests following the A2A protocol specification.\n\nThis endpoint uses the DefaultRequestHandler from the A2A SDK to handle\nall JSON-RPC requests including message/send, message/stream, etc.\n\nThe A2A SDK application is created per-request to include authentication\ncontext while still leveraging FastAPI's authorization middleware.\n\nAutomatically detects streaming requests (message/stream JSON-RPC method)\nand returns a StreamingResponse to enable real-time chunk delivery.\n\nArgs:\n request: FastAPI request object\n auth: Authentication tuple\n mcp_headers: MCP headers for context propagation\n\nReturns:\n JSON-RPC response or streaming response",
"operationId": "handle_a2a_jsonrpc_a2a_get",
"operationId": "handle_a2a_jsonrpc_a2a_post",
"responses": {
"200": {
"description": "Successful Response",
@@ -5335,11 +5335,11 @@
"description": "Dimensionality of embedding vectors.",
"default": 768
},
"vector_db_id": {
"vector_store_id": {
"type": "string",
"minLength": 1,
"title": "Vector DB ID",
"description": "Vector DB identification."
"title": "Vector Store ID",
"description": "Vector store identification."
},
"db_path": {
"type": "string",
@@ -5352,7 +5352,7 @@
"type": "object",
"required": [
"rag_id",
"vector_db_id",
"vector_store_id",
"db_path"
],
"title": "ByokRag",
@@ -8431,11 +8431,23 @@
],
"title": "Doc Title",
"description": "Title of the referenced document"
},
"doc_id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"title": "Doc Id",
"description": "ID of the referenced document"
}
},
"type": "object",
"title": "ReferencedDocument",
"description": "Model representing a document referenced in generating a response.\n\nAttributes:\n doc_url: Url to the referenced doc.\n doc_title: Title of the referenced doc."
"description": "Model representing a document referenced in generating a response.\n\nAttributes:\n doc_url: Url to the referenced doc.\n doc_title: Title of the referenced doc.\n doc_id: ID of the referenced doc."
},
"RlsapiV1Attachment": {
"properties": {
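For orientation, the schema change above adds an optional `doc_id` to `ReferencedDocument` alongside `doc_url` and `doc_title`. Below is a minimal Pydantic sketch of what the updated model implies; the field names come from the diff, while the exact types and defaults in `src/models/responses.py` are assumptions, not copied source.

```python
from typing import Optional

from pydantic import AnyUrl, BaseModel


class ReferencedDocument(BaseModel):
    """Sketch of the updated model; not the exact definition in src/models/responses.py."""

    doc_url: Optional[AnyUrl] = None  # URL of the referenced document
    doc_title: Optional[str] = None   # Title of the referenced document
    doc_id: Optional[str] = None      # New field in this PR: ID of the referenced document
```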
6 changes: 5 additions & 1 deletion docs/openapi.md
@@ -3200,6 +3200,8 @@ system info) and returns an LLM-generated response.

Args:
infer_request: The inference request containing question and context.
request: The FastAPI request object for accessing headers and state.
background_tasks: FastAPI background tasks for async Splunk event sending.
auth: Authentication tuple from the configured auth provider.

Returns:
@@ -4184,7 +4186,7 @@ BYOK (Bring Your Own Knowledge) RAG configuration.
| rag_type | string | Type of RAG database. |
| embedding_model | string | Embedding model identification |
| embedding_dimension | integer | Dimensionality of embedding vectors. |
| vector_db_id | string | Vector DB identification. |
| vector_store_id | string | Vector store identification. |
| db_path | string | Path to RAG database. |


@@ -5316,12 +5318,14 @@ Model representing a document referenced in generating a response.
Attributes:
doc_url: Url to the referenced doc.
doc_title: Title of the referenced doc.
doc_id: ID of the referenced doc.


| Field | Type | Description |
|-------|------|-------------|
| doc_url | | URL of the referenced document |
| doc_title | | Title of the referenced document |
| doc_id | | ID of the referenced document |


## RlsapiV1Attachment
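Note that the `vector_db_id` to `vector_store_id` rename in the `ByokRag` schema above is a breaking change for existing BYOK configurations. A hedged illustration of migrating one entry follows; the key names come from the schema, but the values and the dict-based representation are made up for illustration only.

```python
# Pre-rename BYOK RAG entry (illustrative values only).
byok_rag = {
    "rag_id": "my-rag",                     # hypothetical ID
    "embedding_model": "all-MiniLM-L6-v2",  # hypothetical model
    "embedding_dimension": 768,
    "vector_db_id": "my-vector-store",      # old key, removed from the schema
    "db_path": "/var/lib/rag/db",           # hypothetical path
}

# After this PR the schema requires "vector_store_id" instead.
byok_rag["vector_store_id"] = byok_rag.pop("vector_db_id")
```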
37 changes: 0 additions & 37 deletions src/app/endpoints/query.py
@@ -1,8 +1,6 @@
"""Handler for REST API call to provide answer to query."""

import ast
import logging
import re
from datetime import UTC, datetime
from typing import Annotated, Any, Optional

@@ -14,7 +12,6 @@
RateLimitError, # type: ignore
)
from llama_stack_client.types.model_list_response import ModelListResponse
from llama_stack_client.types.shared.interleaved_content_item import TextContentItem
from sqlalchemy.exc import SQLAlchemyError

import constants
@@ -36,7 +33,6 @@
PromptTooLongResponse,
QueryResponse,
QuotaExceededResponse,
ReferencedDocument,
ServiceUnavailableResponse,
UnauthorizedResponse,
UnprocessableEntityResponse,
@@ -553,39 +549,6 @@ def is_input_shield(shield: Shield) -> bool:
return _is_inout_shield(shield) or not is_output_shield(shield)


def parse_metadata_from_text_item(
text_item: TextContentItem,
) -> Optional[ReferencedDocument]:
"""
Parse a single TextContentItem to extract referenced documents.

Args:
text_item (TextContentItem): The TextContentItem containing metadata.

Returns:
ReferencedDocument: A ReferencedDocument object containing 'doc_url' and 'doc_title'
representing the referenced documents found in the metadata.
"""
docs: list[ReferencedDocument] = []
if not isinstance(text_item, TextContentItem):
return docs

metadata_blocks = re.findall(
r"Metadata:\s*({.*?})(?:\n|$)", text_item.text, re.DOTALL
)
for block in metadata_blocks:
try:
data = ast.literal_eval(block)
url = data.get("docs_url")
title = data.get("title")
if url and title:
return ReferencedDocument(doc_url=url, doc_title=title)
logger.debug("Invalid metadata block (missing url or title): %s", block)
except (ValueError, SyntaxError) as e:
logger.debug("Failed to parse metadata block: %s | Error: %s", block, e)
return None


def validate_attachments_metadata(attachments: list[Attachment]) -> None:
"""Validate the attachments metadata provided in the request.

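For context on what was deleted: `parse_metadata_from_text_item` scanned RAG tool output for inline `Metadata: {...}` blocks and pulled `docs_url` and `title` out of them with a regex plus `ast.literal_eval`. A rough illustration of the text format it consumed is below; the sample text and URL are invented, while the regex and keys are taken from the removed code.

```python
import ast
import re

# Invented sample of the tool output the removed helper expected.
sample_text = (
    "Some retrieved chunk text...\n"
    "Metadata: {'docs_url': 'https://example.com/doc', 'title': 'Example Doc'}\n"
)

# Same extraction approach as the removed code: find each block, then literal_eval it.
for block in re.findall(r"Metadata:\s*({.*?})(?:\n|$)", sample_text, re.DOTALL):
    data = ast.literal_eval(block)
    print(data.get("docs_url"), data.get("title"))
```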
71 changes: 14 additions & 57 deletions src/app/endpoints/query_v2.py
@@ -541,11 +541,11 @@ def parse_referenced_documents_from_responses_api(
response: The OpenAI Response API response object

Returns:
list[ReferencedDocument]: List of referenced documents with doc_url and doc_title
list[ReferencedDocument]: List of referenced documents with doc_url, doc_title and doc_id
"""
documents: list[ReferencedDocument] = []
# Use a set to track unique documents by (doc_url, doc_title) tuple
seen_docs: set[tuple[Optional[str], Optional[str]]] = set()
# Use a set to track unique documents by (doc_url, doc_title, doc_id) tuple
seen_docs: set[tuple[Optional[str], Optional[str], Optional[str]]] = set()

# Handle None response (e.g., when agent fails)
if response is None or not response.output:
@@ -560,74 +560,31 @@
for result in results:
# Handle both object and dict access
if isinstance(result, dict):
filename = result.get("filename")
attributes = result.get("attributes", {})
else:
filename = getattr(result, "filename", None)
attributes = getattr(result, "attributes", {})

# Try to get URL from attributes
# Look for common URL fields in attributes
doc_url = (
attributes.get("link")
attributes.get("doc_url")
or attributes.get("docs_url")
or attributes.get("url")
or attributes.get("doc_url")
or attributes.get("link")
)
doc_title = attributes.get("title")
doc_id = attributes.get("document_id") or attributes.get("doc_id")

# If we have at least a filename or url
if filename or doc_url:
if doc_title or doc_url:
# Treat empty string as None for URL to satisfy Optional[AnyUrl]
final_url = doc_url if doc_url else None
if (final_url, filename) not in seen_docs:
if (final_url, doc_title, doc_id) not in seen_docs:
documents.append(
ReferencedDocument(doc_url=final_url, doc_title=filename)
)
seen_docs.add((final_url, filename))

# 2. Parse from message content annotations
@asimurka (Contributor), Jan 29, 2026:

What is the reason for dropping this section?
Also check the OpenAIResponseOutputMessageContent type, which can be part of OpenAIResponseMessage content. Isn't that also relevant here? More specifically, the OpenAIResponseAnnotations object.

@are-ces (Contributor, Author):

We already get the references to the chunks used for the response.

Annotations are useful if we want to determine exactly which files the model used to give the answer (as determined by the model) and at which position in the text.

Personally I think we could use this data, but right now it is not needed: the scope of referenced_docs is to keep data on all chunks retrieved and used as input for reference. The data we get from citations is redundant, since it is a small subset of those docs.

I think we can make use of citations in a separate feature, after agreement across our customers, since it adds new functionality.

elif item_type == "message":
content = getattr(output_item, "content", None)
if isinstance(content, list):
for part in content:
# Skip if part is a string or doesn't have annotations
if isinstance(part, str):
continue

annotations = getattr(part, "annotations", []) or []
for annotation in annotations:
# Handle both object and dict access for annotations
if isinstance(annotation, dict):
anno_type = annotation.get("type")
anno_url = annotation.get("url")
anno_title = annotation.get("title") or annotation.get(
"filename"
)
else:
anno_type = getattr(annotation, "type", None)
anno_url = getattr(annotation, "url", None)
anno_title = getattr(annotation, "title", None) or getattr(
annotation, "filename", None
ReferencedDocument(
doc_url=final_url, doc_title=doc_title, doc_id=doc_id
)

if anno_type == "url_citation":
# Treat empty string as None
final_url = anno_url if anno_url else None
if (final_url, anno_title) not in seen_docs:
documents.append(
ReferencedDocument(
doc_url=final_url, doc_title=anno_title
)
)
seen_docs.add((final_url, anno_title))

elif anno_type == "file_citation":
if (None, anno_title) not in seen_docs:
documents.append(
ReferencedDocument(
doc_url=None, doc_title=anno_title
)
)
seen_docs.add((None, anno_title))
)
Comment on lines +575 to +586
@coderabbitai (bot), Jan 30, 2026:

⚠️ Potential issue | 🟠 Major

Include doc_id-only documents to prevent data loss.

Documents with only document_id but no title or url are currently filtered out. Since doc_id is explicitly extracted and used in deduplication, it should be preserved. Update the condition to include doc_id:

Suggested fix
-                if doc_title or doc_url:
+                if doc_title or doc_url or doc_id:

@are-ces (Contributor, Author):

referenced_docs are used because we need to show in the frontend which documents have been referenced; IMO it does not make sense to return empty documents without a title or URL.

@coderabbitai (bot):

@are-ces, that makes perfect sense! You're right that a document reference with only an ID wouldn't provide meaningful information for the frontend to display. The current condition ensures that only documents with at least a title or URL—something actionable and human-readable—are returned to the UI. Thank you for the clarification!


✏️ Learnings added
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 1073
File: src/app/endpoints/query_v2.py:575-586
Timestamp: 2026-01-30T13:33:34.749Z
Learning: In `src/app/endpoints/query_v2.py`, the `parse_referenced_documents_from_responses_api` function filters referenced documents to include only those with `doc_title` or `doc_url` because these documents are displayed in the frontend. Documents with only `doc_id` are intentionally excluded as they wouldn't provide useful information to end users.


seen_docs.add((final_url, doc_title, doc_id))

return documents

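Taken together, the reworked `parse_referenced_documents_from_responses_api` now builds its list solely from `file_search_call` results: it reads the URL (trying `doc_url`, `docs_url`, `url`, then `link`), the `title`, and the `document_id`/`doc_id` from each result's attributes, keeps only entries that carry a title or URL (per the review discussion above), and dedupes on the `(doc_url, doc_title, doc_id)` tuple. Below is a condensed, self-contained sketch of that flow, assuming results are plain dicts; the real function also handles attribute-style access and returns `ReferencedDocument` models rather than dicts.

```python
from typing import Optional


def extract_referenced_docs(results: list[dict]) -> list[dict]:
    """Condensed sketch of the new extraction and dedup flow (dict results only)."""
    documents: list[dict] = []
    seen: set[tuple[Optional[str], Optional[str], Optional[str]]] = set()

    for result in results:
        attributes = result.get("attributes", {})
        doc_url = (
            attributes.get("doc_url")
            or attributes.get("docs_url")
            or attributes.get("url")
            or attributes.get("link")
        )
        doc_title = attributes.get("title")
        doc_id = attributes.get("document_id") or attributes.get("doc_id")

        # Per the review discussion: skip entries that only have an ID, since the
        # frontend needs a title or URL to display something meaningful.
        if not (doc_title or doc_url):
            continue

        final_url = doc_url or None  # treat empty string as None
        key = (final_url, doc_title, doc_id)
        if key not in seen:
            documents.append(
                {"doc_url": final_url, "doc_title": doc_title, "doc_id": doc_id}
            )
            seen.add(key)

    return documents


# Example: two results describing the same document collapse into one entry.
results = [
    {"attributes": {"doc_url": "https://example.com/a", "title": "Doc A", "document_id": "a1"}},
    {"attributes": {"doc_url": "https://example.com/a", "title": "Doc A", "doc_id": "a1"}},
]
print(extract_referenced_docs(results))
```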