LCORE-1179: Update metadata extraction from chunks #1073

are-ces · 2026-01-28T15:57:27Z

Description

- Extract doc_title, doc_url, and doc_id from the "attributes" in the response.
- Add doc_id to keep a link between rag_chunks and referenced_docs.
- Remove chunk metadata extraction from results that are not "type": "file_search_call".
- Delete the deprecated method in the old query endpoint.

Important change

Each entry in referenced_documents now has an additional field, doc_id, which points to the OpenAI Files object created by llama-stack.

E.g.

"referenced_documents": [
    {
      "doc_url": "https://www.redhat.com/openshift_docs/virt/about_virt/about-virt.txt",
      "doc_title": "About OpenShift Virtualization",
      "doc_id": "file-6376556fe5b8435d830140109b73834e"
    }
],
"rag_chunks": [
    {
      "content": "# About OpenShift [...]",
      "source": "file-6376556fe5b8435d830140109b73834e",
      "score": 2.5908213494396537
    }
]

The reason is to keep a link between referenced_documents and rag_chunks in the query endpoint.
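
A minimal sketch of how a client could use that link, assuming the response has the shape shown in the example above and that a chunk's "source" equals the "doc_id" of the document it came from. The helper name `chunks_by_document` is hypothetical and is not part of this PR.

```python
from collections import defaultdict
from typing import Any


def chunks_by_document(response: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Group rag_chunks under the title of the referenced document they came from.

    Assumption: a chunk's "source" matches a "doc_id" in referenced_documents.
    Chunks without a matching document are grouped under their raw source value.
    """
    # Map doc_id -> display title (fall back to the id when no title is set).
    titles = {
        doc["doc_id"]: doc.get("doc_title") or doc["doc_id"]
        for doc in response.get("referenced_documents", [])
        if doc.get("doc_id")
    }
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for chunk in response.get("rag_chunks", []):
        source = chunk.get("source")
        grouped[titles.get(source, source or "unknown")].append(chunk)
    return dict(grouped)


# Sample payload copied from the example in this description.
example = {
    "referenced_documents": [
        {
            "doc_url": "https://www.redhat.com/openshift_docs/virt/about_virt/about-virt.txt",
            "doc_title": "About OpenShift Virtualization",
            "doc_id": "file-6376556fe5b8435d830140109b73834e",
        }
    ],
    "rag_chunks": [
        {
            "content": "# About OpenShift [...]",
            "source": "file-6376556fe5b8435d830140109b73834e",
            "score": 2.5908213494396537,
        }
    ],
}

for title, chunks in chunks_by_document(example).items():
    print(title, len(chunks))
```

Without doc_id, the only way to relate a chunk to its document would be to compare titles or URLs, which are not guaranteed to be unique.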

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude

Related Tickets & Documents

Related Issue # LCORE-1179
Closes # LCORE-1179

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

I registered a vector store and called the /v1/query and /v1/streaming_query endpoints, checking that the referenced_documents field is well structured.

E.g.

"referenced_documents": [
    {
      "doc_url": "https://www.redhat.com/openshift_docs/virt/about_virt/about-virt.txt",
      "doc_title": "About OpenShift Virtualization",
      "doc_id": "file-6376556fe5b8435d830140109b73834e"
    },

Summary by CodeRabbit

Release Notes

Improvements

Enhanced document reference identification and tracking by adding document ID support alongside document titles and URLs
Refined document extraction and citation logic for improved reliability and consistency in document reference handling
Simplified document processing workflows for better system efficiency



coderabbitai · 2026-01-28T15:57:44Z

Walkthrough

The PR refactors document extraction logic by removing TextContentItem metadata parsing from query.py and consolidating referenced document extraction into query_v2.py. Document identity shifts from filename-based to title and doc_id-based identification, with a new optional doc_id field added to the ReferencedDocument model. All related tests are updated to reflect the new extraction patterns.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Document Extraction Refactoring: `src/app/endpoints/query.py`, `src/app/endpoints/query_v2.py` | Removed parse_metadata_from_text_item function and related imports from query.py; consolidated referenced document extraction in query_v2.py to use doc_title/doc_id instead of filename; updated URL extraction order (doc_url, docs_url, url, link); removed message annotation handling path. |
| Model Updates: `src/models/responses.py` | Added optional doc_id field to ReferencedDocument with docstring and Field description; minor import reordering. |
| Test Updates: `tests/unit/app/endpoints/test_query.py`, `tests/unit/app/endpoints/test_query_v2.py`, `tests/unit/cache/test_postgres_cache.py` | Removed TextContentItem/parse_metadata tests from query tests; updated query_v2 mocks to include doc_url/title/document_id attributes; adjusted assertions for new document identification fields; added rag_chunks extraction validation; updated cache tests to include doc_id field. |
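
The extraction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in query_v2.py: output items and results are modelled as plain dicts, the attribute keys `doc_title` and `doc_id` are taken from the PR description, and the function names `first_url` and `extract_documents` are hypothetical. Only the URL key order (doc_url, docs_url, url, link) and the `file_search_call` filter come from this PR.

```python
from typing import Any, Optional

# URL lookup order as listed in the walkthrough table.
URL_KEYS = ("doc_url", "docs_url", "url", "link")


def first_url(attributes: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty URL-like attribute, checked in URL_KEYS order."""
    for key in URL_KEYS:
        value = attributes.get(key)
        if value:
            return str(value)
    return None


def extract_documents(output_items: list[dict[str, Any]]) -> list[dict[str, Optional[str]]]:
    """Collect document metadata from file_search_call results only.

    Assumption: each result carries an "attributes" dict. De-duplication is
    deliberately left out of this sketch.
    """
    documents: list[dict[str, Optional[str]]] = []
    for item in output_items:
        if item.get("type") != "file_search_call":
            continue  # messages and other output types are ignored
        for result in item.get("results") or []:
            attributes = result.get("attributes") or {}
            documents.append(
                {
                    "doc_url": first_url(attributes),
                    "doc_title": attributes.get("doc_title"),
                    "doc_id": attributes.get("doc_id"),
                }
            )
    return documents


# Hypothetical output items: the message is skipped, the search result is kept.
items = [
    {"type": "message", "content": "ignored"},
    {
        "type": "file_search_call",
        "results": [
            {
                "attributes": {
                    "docs_url": "https://example.com/a",
                    "link": "https://example.com/b",
                    "doc_title": "A",
                    "doc_id": "file-1",
                }
            }
        ],
    },
]
print(extract_documents(items))
```

In this sample `docs_url` wins over `link` because it appears earlier in the lookup order.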

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

[RHDHPAI-1143] Implement referenced_documents caching #643 — Modifies ReferencedDocument usage and propagation across endpoints and caching layers.
LCORE-693: Added rag_chunks to streaming_query #585 — Alters ReferencedDocument field extraction from rag_chunks and metadata, affecting document representation.
LCORE-347: Implement referenced documents support on /query endpoint #572 — Previously added ReferencedDocument and parsing helpers to query.py; this PR removes and consolidates that logic.

Suggested reviewers

tisnik

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Update metadata extraction from chunks' accurately reflects the main change: refactoring how metadata is extracted from API responses, specifically from chunk results with doc_title, doc_url, and doc_id fields. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |



coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/app/endpoints/query_v2.py (1)

547-588: Include doc_id in de-duplication to avoid dropping distinct documents.
If two documents share title/URL (or URL is missing) but have different doc_id values, the current (final_url, doc_title) key will collapse them, which undermines the new doc_id linkage.

✅ Suggested fix

-    # Use a set to track unique documents by (doc_url, doc_title) tuple
-    seen_docs: set[tuple[Optional[str], Optional[str]]] = set()
+    # Use a set to track unique documents by (doc_id, doc_url, doc_title) tuple
+    seen_docs: set[tuple[Optional[str], Optional[str], Optional[str]]] = set()
@@
-                if (final_url, doc_title) not in seen_docs:
+                unique_key = (doc_id, final_url, doc_title)
+                if unique_key not in seen_docs:
                     documents.append(
                         ReferencedDocument(
                             doc_url=final_url, doc_title=doc_title, doc_id=doc_id
                         )
                     )
-                    seen_docs.add((final_url, doc_title))
+                    seen_docs.add(unique_key)
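
A self-contained sketch of the behaviour the suggestion is after, assuming candidates arrive as (doc_id, url, title) tuples. The `ReferencedDocument` dataclass below is only a stand-in for the project's model in src/models/responses.py, and `deduplicate` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReferencedDocument:
    """Stand-in for the real model; field names follow this PR."""

    doc_url: Optional[str] = None
    doc_title: Optional[str] = None
    doc_id: Optional[str] = None


Candidate = tuple[Optional[str], Optional[str], Optional[str]]


def deduplicate(candidates: Iterable[Candidate]) -> list[ReferencedDocument]:
    """Keep the first occurrence of each (doc_id, doc_url, doc_title) triple."""
    seen: set[Candidate] = set()
    documents: list[ReferencedDocument] = []
    for doc_id, url, title in candidates:
        key = (doc_id, url, title)
        if key not in seen:
            seen.add(key)
            documents.append(
                ReferencedDocument(doc_url=url, doc_title=title, doc_id=doc_id)
            )
    return documents


# Two files share a title and have no URL; a (url, title) key would merge them.
docs = deduplicate(
    [
        ("file-1", None, "Release notes"),
        ("file-2", None, "Release notes"),
        ("file-1", None, "Release notes"),  # exact repeat, dropped
    ]
)
print([d.doc_id for d in docs])
```

With doc_id in the key, both files survive and only the exact repeat is dropped.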

asimurka · 2026-01-29T10:55:37Z

src/app/endpoints/query_v2.py

-                        )
-                        seen_docs.add((final_url, filename))
-
-        # 2. Parse from message content annotations


What is the reason for dropping this section?
Also check the OpenAIResponseOutputMessageContent type, which can be part of the OpenAIResponseMessage content. Isn't this also relevant? More specifically, the OpenAIResponseAnnotations object.

are-ces · Jan 30, 2026

We already get the references to the chunks used for the response.

Annotations are useful if we want to determine exactly which files the model used to give the answer (as determined by the model) and at which position in the text.

Personally, I think we could use this data, but right now it is not needed, because the scope of referenced_docs is to keep data about all chunks retrieved and used as input, for reference. The data we get from citations is redundant, since it covers only a small subset of these docs.

I think we can make use of citations in a separate feature, after agreement with our customers, since it adds new functionality.

asimurka

LGTM

Fixed metadata extraction from chunks

244ed89

- now extracting doc_title, doc_url, doc_id from the "attributes" in response
- added doc_id to keep a link between rag_chunks and referenced_docs
- removed chunk metadata extraction from results which are not "type": "file_search_call"

are-ces requested review from asimurka and tisnik January 28, 2026 15:57

coderabbitai bot reviewed Jan 28, 2026


are-ces requested a review from onmete January 28, 2026 16:15

asimurka reviewed Jan 29, 2026


asimurka approved these changes Jan 29, 2026


are-ces marked this pull request as draft January 30, 2026 09:31
