-
Notifications
You must be signed in to change notification settings - Fork 70
LCORE-1179: Update metadata extraction from chunks #1073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- now extracting doc_title, doc_url, doc_id from the "attributes" in response - added doc_id to keep a link between rag_chunks and referenced_docs - removed chunk metadata extraction from results which are not "type": "file_search_call"
WalkthroughThe PR refactors document extraction logic by removing TextContentItem metadata parsing from query.py and consolidating referenced document extraction into query_v2.py. Document identity shifts from filename-based to title and doc_id-based identification, with a new optional doc_id field added to the ReferencedDocument model. All related tests are updated to reflect the new extraction patterns. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/endpoints/query_v2.py (1)
547-588: Include doc_id in de-duplication to avoid dropping distinct documents.
If two documents share title/URL (or URL is missing) but have different doc_id values, the current(final_url, doc_title)key will collapse them, which undermines the new doc_id linkage.✅ Suggested fix
- # Use a set to track unique documents by (doc_url, doc_title) tuple - seen_docs: set[tuple[Optional[str], Optional[str]]] = set() + # Use a set to track unique documents by (doc_id, doc_url, doc_title) tuple + seen_docs: set[tuple[Optional[str], Optional[str], Optional[str]]] = set() @@ - if (final_url, doc_title) not in seen_docs: + unique_key = (doc_id, final_url, doc_title) + if unique_key not in seen_docs: documents.append( ReferencedDocument( doc_url=final_url, doc_title=doc_title, doc_id=doc_id ) ) - seen_docs.add((final_url, doc_title)) + seen_docs.add(unique_key)
| ) | ||
| seen_docs.add((final_url, filename)) | ||
|
|
||
| # 2. Parse from message content annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for dropping this section?
Check also OpenAIResponseOutputMessageContent type that can be part of OpenAIResponseMessage content. Isn't this also relevant? More specifically, OpenAIResponseAnnotations object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already get the references to the chunks used for the response.
Annotations is useful if we want to determine exactly which files the model has used to give the answer (determined by the model) and at which position in the text.
Personally I think that we can use this data, but right now it is not needed. Because the scope of referenced_docs is to keep data of all chunks retrieved and used as input for reference. The data that we get from citation is just redundant since it is a small subset of these docs.
I think we can make use of citations in a separate feature, after agreement across our customers since it adds a new functionality.
asimurka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Important change
There is an additional field in the
referenced_documentsfield:doc_id- which points to theOpenAI Filesobject created by llama-stack has been added to the attributes ofreferenced_documents.E.g.
The reason is to keep a link between
referenced_documentsandrag_chunksin the query endpoint.Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
I have registered a vector store, called
/v1/queryand/v1/streaming_queryendpoints, making sure that thereferenced_docsfield is well structured.E.g.
Summary by CodeRabbit
Release Notes
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.