@are-ces are-ces commented Jan 28, 2026

Description

  • now extracting doc_title, doc_url, doc_id from the "attributes" in response
  • added doc_id to keep a link between rag_chunks and referenced_docs
  • removed chunk metadata extraction from results which are not "type": "file_search_call"
  • deleted deprecated method in old query endpoint
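The extraction described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in query_v2.py: it assumes the response output items are plain dicts with "type", "results", and "attributes" keys, and the function name extract_doc_fields is hypothetical.

```python
from typing import Any, Optional


def extract_doc_fields(
    output_items: list[dict[str, Any]],
) -> list[dict[str, Optional[str]]]:
    """Collect doc_title, doc_url and doc_id from file_search_call results only."""
    docs: list[dict[str, Optional[str]]] = []
    for item in output_items:
        # Items of any other type (e.g. messages) are skipped entirely.
        if item.get("type") != "file_search_call":
            continue
        for result in item.get("results") or []:
            # Assumption: the metadata lives under an "attributes" mapping.
            attributes = result.get("attributes") or {}
            docs.append(
                {
                    "doc_title": attributes.get("doc_title"),
                    "doc_url": attributes.get("doc_url"),
                    "doc_id": attributes.get("doc_id"),
                }
            )
    return docs
```

Missing attributes simply come back as None, so callers can decide how to handle incomplete metadata.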

Important change

Each entry in referenced_documents now has an additional field, doc_id, which points to the OpenAI Files object created by llama-stack.

E.g.

"referenced_documents": [
    {
      "doc_url": "https://www.redhat.com/openshift_docs/virt/about_virt/about-virt.txt",
      "doc_title": "About OpenShift Virtualization",
      "doc_id": "file-6376556fe5b8435d830140109b73834e"
    }
],
"rag_chunks": [
    {
      "content": "# About OpenShift [...]",
      "source": "file-6376556fe5b8435d830140109b73834e",
      "score": 2.5908213494396537
    }
]

The reason is to keep a link between referenced_documents and rag_chunks in the query endpoint.
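As a client-side illustration of that link, the sketch below groups chunks under the document they came from. It assumes, based on the example above, that a chunk's source value equals the doc_id of its document; the helper name group_chunks_by_doc is hypothetical and not part of the service.

```python
from typing import Any


def group_chunks_by_doc(response: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Map each referenced doc_id to the rag_chunks whose source matches it."""
    grouped: dict[str, list[dict[str, Any]]] = {
        doc["doc_id"]: []
        for doc in response.get("referenced_documents", [])
        if doc.get("doc_id")
    }
    for chunk in response.get("rag_chunks", []):
        source = chunk.get("source")
        # Chunks whose source has no matching document are left out.
        if source in grouped:
            grouped[source].append(chunk)
    return grouped
```

Documents without a doc_id are ignored here, since the field is optional.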

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude

Related Tickets & Documents

  • Related Issue # LCORE-1179
  • Closes # LCORE-1179

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

I registered a vector store, called the /v1/query and /v1/streaming_query endpoints, and verified that the referenced_docs field is well structured.

E.g.

"referenced_documents": [
    {
      "doc_url": "https://www.redhat.com/openshift_docs/virt/about_virt/about-virt.txt",
      "doc_title": "About OpenShift Virtualization",
      "doc_id": "file-6376556fe5b8435d830140109b73834e"
    },

Summary by CodeRabbit

Release Notes

Improvements

  • Enhanced document reference identification and tracking by adding document ID support alongside document titles and URLs
  • Refined document extraction and citation logic for improved reliability and consistency in document reference handling
  • Simplified document processing workflows for better system efficiency

@are-ces are-ces requested review from asimurka and tisnik January 28, 2026 15:57

coderabbitai bot commented Jan 28, 2026

Walkthrough

The PR refactors document extraction logic by removing TextContentItem metadata parsing from query.py and consolidating referenced document extraction into query_v2.py. Document identity shifts from filename-based to title and doc_id-based identification, with a new optional doc_id field added to the ReferencedDocument model. All related tests are updated to reflect the new extraction patterns.

Changes

Cohort: Document Extraction Refactoring
File(s): src/app/endpoints/query.py, src/app/endpoints/query_v2.py
Summary: Removed parse_metadata_from_text_item function and related imports from query.py; consolidated referenced document extraction in query_v2.py to use doc_title/doc_id instead of filename, updated URL extraction order (doc_url, docs_url, url, link), removed message annotation handling path.

Cohort: Model Updates
File(s): src/models/responses.py
Summary: Added optional doc_id field to ReferencedDocument with docstring and Field description; minor import reordering.

Cohort: Test Updates
File(s): tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/cache/test_postgres_cache.py
Summary: Removed TextContentItem/parse_metadata tests from query tests; updated query_v2 mocks to include doc_url/title/document_id attributes; adjusted assertions for new document identification fields; added rag_chunks extraction validation; updated cache tests to include doc_id field.
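The URL extraction order mentioned in the summary (doc_url, docs_url, url, link) amounts to a first-match lookup. A minimal sketch, assuming the attributes arrive as a plain dict and that empty or non-string values should be skipped (both are assumptions; pick_url and URL_KEYS are hypothetical names):

```python
from typing import Any, Optional

# Keys are tried in this order; the first usable value wins.
URL_KEYS = ("doc_url", "docs_url", "url", "link")


def pick_url(attributes: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty string found under the known URL keys."""
    for key in URL_KEYS:
        value = attributes.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

With this ordering, an attributes dict that carries both url and link resolves to the url value.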

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Update metadata extraction from chunks' accurately reflects the main change: refactoring how metadata is extracted from API responses, specifically from chunk results with doc_title, doc_url, and doc_id fields.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/endpoints/query_v2.py (1)

547-588: Include doc_id in de-duplication to avoid dropping distinct documents.
If two documents share title/URL (or URL is missing) but have different doc_id values, the current (final_url, doc_title) key will collapse them, which undermines the new doc_id linkage.

✅ Suggested fix
-    # Use a set to track unique documents by (doc_url, doc_title) tuple
-    seen_docs: set[tuple[Optional[str], Optional[str]]] = set()
+    # Use a set to track unique documents by (doc_id, doc_url, doc_title) tuple
+    seen_docs: set[tuple[Optional[str], Optional[str], Optional[str]]] = set()
@@
-                if (final_url, doc_title) not in seen_docs:
+                unique_key = (doc_id, final_url, doc_title)
+                if unique_key not in seen_docs:
                     documents.append(
                         ReferencedDocument(
                             doc_url=final_url, doc_title=doc_title, doc_id=doc_id
                         )
                     )
-                    seen_docs.add((final_url, doc_title))
+                    seen_docs.add(unique_key)
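To see why the key matters, here is a small standalone sketch of the two de-duplication strategies. It does not use the real ReferencedDocument model; the Doc tuple and the dedupe helper are hypothetical stand-ins for illustration.

```python
from typing import NamedTuple, Optional


class Doc(NamedTuple):
    doc_id: Optional[str]
    doc_url: Optional[str]
    doc_title: Optional[str]


def dedupe(docs: list[Doc], use_doc_id: bool) -> list[Doc]:
    """Keep the first document seen for each key, preserving input order."""
    seen: set[tuple[Optional[str], ...]] = set()
    unique: list[Doc] = []
    for doc in docs:
        if use_doc_id:
            key = (doc.doc_id, doc.doc_url, doc.doc_title)
        else:
            key = (doc.doc_url, doc.doc_title)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    Doc("file-a", None, "Release notes"),
    Doc("file-b", None, "Release notes"),
]
print(len(dedupe(docs, use_doc_id=False)))  # the two files collapse into one entry
print(len(dedupe(docs, use_doc_id=True)))   # both files are kept
```

Two files that share a title and have no URL are merged under the (doc_url, doc_title) key, but stay distinct once doc_id is part of it.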

@are-ces are-ces requested a review from onmete January 28, 2026 16:15
)
seen_docs.add((final_url, filename))

# 2. Parse from message content annotations

@asimurka asimurka Jan 29, 2026

What is the reason for dropping this section?
Also check the OpenAIResponseOutputMessageContent type, which can be part of OpenAIResponseMessage content. Isn't this also relevant? More specifically, the OpenAIResponseAnnotations object.

Contributor Author

We already get the references to the chunks used for the response.

Annotations are useful if we want to determine exactly which files the model used to produce the answer (as determined by the model), and at which position in the text.

Personally, I think we could use this data, but it is not needed right now: the scope of referenced_docs is to record all chunks retrieved and passed as input, for reference. The data we would get from citations is redundant, since it covers only a small subset of these docs.

I think we can make use of citations in a separate feature, after agreement with our customers, since it adds new functionality.

@asimurka asimurka left a comment

LGTM

@are-ces are-ces marked this pull request as draft January 30, 2026 09:31