Conversation
…nt and web search

Add a new tool-augmented collection mode alongside the existing bare mode for response collection. When enabled via tool_config, endpoints use a ReAct agent with TavilySearchTool to autonomously search and verify real papers before recommending them, enabling direct comparison of hallucination rates between bare and tool-augmented modes.

- Add ToolConfig schema with tavily_api_key, max_iterations, search_depth
- Add tool_config field to OpenAIEndpoint
- Add tool-specific system prompts (zh/en) instructing models to use search
- Implement _call_endpoint_with_tools using ReActAgent
- Refactor retry logic with explicit rate-limit handling (429)
- Update example config.yaml with tool-augmented endpoint example

Co-authored-by: Cursor <cursoragent@cursor.com>
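The config shape described above can be sketched as follows. This is a hypothetical illustration, not the actual schema: the field names (tavily_api_key, max_iterations, search_depth, tool_config) come from the commit message, but the defaults, types, and the tool_augmented helper are assumptions.

```python
# Hypothetical sketch of the ToolConfig / endpoint wiring; field names
# follow the commit message, defaults and structure are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolConfig:
    tavily_api_key: str = ""
    max_iterations: int = 5       # ReAct loop cap (assumed default)
    search_depth: str = "basic"   # e.g. "basic" or "advanced" (assumed)

@dataclass
class OpenAIEndpoint:
    name: str
    model: str
    tool_config: Optional[ToolConfig] = None  # None => bare mode

    @property
    def tool_augmented(self) -> bool:
        return self.tool_config is not None

bare = OpenAIEndpoint(name="gpt-bare", model="gpt-4o")
tooled = OpenAIEndpoint(name="gpt-tools", model="gpt-4o",
                        tool_config=ToolConfig(tavily_api_key="tvly-..."))
print(bare.tool_augmented, tooled.tool_augmented)  # False True
```

With this shape, whether an endpoint runs in bare or tool-augmented mode is decided purely by the presence of tool_config, so the two modes can share the rest of the endpoint configuration for a fair comparison.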
…oncurrency, and reporting

- Add LaTeX stripping and brace-depth counting for robust BibTeX field extraction
- Refactor response collector to use per-endpoint concurrency semaphores
- Update chart generator and schema for enhanced reporting
- Add resume/ to .gitignore for intermediate output files

Co-authored-by: Cursor <cursoragent@cursor.com>
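Brace-depth counting is needed because BibTeX field values can nest braces (e.g. title = {The {BERT} Model}), so a naive regex up to the first closing brace truncates the value. A minimal sketch of the idea; the function name and exact behavior here are illustrative, not the project's actual implementation:

```python
def extract_brace_value(s: str, start: int) -> str:
    """Return the contents of a brace-delimited BibTeX field value,
    honoring nested braces such as title = {The {BERT} Model}.
    `start` must point at the opening brace."""
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:           # matching close of the opening brace
                return s[start + 1:i]
    raise ValueError("unbalanced braces in BibTeX field")

entry = "title = {The {BERT} Model}"
print(extract_brace_value(entry, entry.index("{")))  # The {BERT} Model
```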
Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to the reference hallucination arena by integrating a tool-augmented evaluation mode. This allows for a more sophisticated assessment of LLMs' ability to recommend real academic papers by enabling them to use web search tools. Alongside this, the changes improve the robustness of BibTeX parsing, refine concurrency handling for API calls, and update reporting capabilities, all while ensuring comprehensive documentation of these new features and methodologies.

Highlights
Changelog
Activity
Code Review
This pull request introduces a significant new feature: a "tool-augmented" mode for the reference hallucination evaluation. This allows models to use a web search tool via a ReAct agent to find and verify papers before recommending them, enabling a direct comparison between a model's performance with and without tools. The implementation is robust, including a major refactoring of response_collector.py to support the new mode, improved concurrency handling with per-endpoint semaphores, and a more flexible retry mechanism. Additionally, the BibTeX extraction logic has been enhanced to handle LaTeX markup and nested braces more effectively. The configuration schema and documentation have been thoroughly updated to reflect these new capabilities. Overall, this is a high-quality contribution that significantly enhances the benchmark's power. I have one suggestion to improve code robustness.
    Returns:
        List of dicts: {query, discipline, num_refs, responses: {endpoint: text}}
    """
Using the internal _value attribute of asyncio.Semaphore is generally discouraged as it's not part of the public API and could change in future Python versions, making the code less robust. A safer and cleaner approach would be to use the max_concurrency value from the endpoint configuration, which is what the semaphore was initialized with and represents the intended concurrency limit.
| """ | |
| concurrency_info = ", ".join(f"{n}={self.endpoints[n].max_concurrency}" for n in self.endpoints) |
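The per-endpoint semaphore pattern behind this suggestion can be sketched as below. EndpointCfg, the endpoint names, and the placeholder API call are illustrative assumptions; the point is that the configured max_concurrency is reported directly instead of reaching into the private semaphore._value attribute:

```python
import asyncio

# Illustrative sketch of per-endpoint concurrency limiting; EndpointCfg
# and the endpoint names are assumptions, not the project's real classes.
class EndpointCfg:
    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.semaphore = asyncio.Semaphore(max_concurrency)

endpoints = {"gpt-4o": EndpointCfg(4), "qwen": EndpointCfg(2)}

# Report the configured limit, not the private semaphore._value:
concurrency_info = ", ".join(
    f"{name}={cfg.max_concurrency}" for name, cfg in endpoints.items()
)
print(concurrency_info)  # gpt-4o=4, qwen=2

async def call_endpoint(name: str) -> str:
    cfg = endpoints[name]
    async with cfg.semaphore:    # at most max_concurrency calls in flight
        await asyncio.sleep(0)   # stand-in for the real API request
        return f"response from {name}"
```

Because the semaphore was initialized from max_concurrency, the two values agree by construction, and the log message no longer depends on an undocumented internal that may change between Python versions.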
…cific params

- Strip thinking/reasoning blocks (e.g. <think>, <thinking>) from model output before BibTeX extraction to avoid false positives from CoT content
- Route non-standard OpenAI SDK params to extra_body in ResponseCollector to prevent TypeError with provider-specific params (enable_thinking, etc.)
- Avoid overriding enable_thinking for Qwen models if already set explicitly

Co-authored-by: Cursor <cursoragent@cursor.com>
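The thinking-block stripping can be sketched with a small regex pass. This is an assumed implementation: the tag names come from the commit message, but the regex and function name are illustrative. Stripping must happen before BibTeX extraction so that @article entries the model merely considered in its chain of thought are not counted as recommendations:

```python
import re

# Strip chain-of-thought blocks before BibTeX extraction (sketch; the
# covered tag set and regex are assumptions based on the commit message).
THINK_RE = re.compile(r"<(think|thinking)>.*?</\1>", re.DOTALL | re.IGNORECASE)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text)

raw = "<think>maybe cite @article{fake,...}</think>@article{real, title={X}}"
print(strip_thinking(raw))  # @article{real, title={X}}
```

The non-greedy `.*?` with re.DOTALL keeps each removal bounded to its own matching close tag, even when the reasoning block spans multiple lines.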
Combine response collection, BibTeX extraction, and verification into a single streaming pipeline that processes each model response as soon as it arrives, rather than waiting for all responses first. This overlaps I/O-bound collection with verification, reducing overall wall-clock time.

- Add on_single_response callback to ResponseCollector
- Implement _collect_and_verify_streaming with async workers
- Refactor evaluate() to use the streaming pipeline for steps 2+3+4
- Support checkpoint resume for partially completed streaming runs

Co-authored-by: Cursor <cursoragent@cursor.com>
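The callback-driven streaming shape can be sketched in a few lines. Only the on_single_response name comes from the commit message; collect, verify, and the toy query strings are illustrative placeholders for the real collector and verifier:

```python
import asyncio

# Toy sketch of the streaming pipeline: each response is handed to the
# verification callback as soon as it arrives, instead of after all
# collection finishes. Function bodies are placeholders, not real logic.
async def collect(queries, on_single_response):
    async def one(q):
        await asyncio.sleep(0)                     # stand-in for API latency
        await on_single_response(q, f"response to {q}")
    await asyncio.gather(*(one(q) for q in queries))

async def main():
    verified = []
    async def verify(query, response):             # runs once per response
        verified.append((query, response))
    await collect(["q1", "q2"], on_single_response=verify)
    return verified

results = asyncio.run(main())
print(results)
```

Because verification is awaited inside the collection tasks, slow verification of one response never blocks collection of the others, which is where the wall-clock savings come from.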
OpenJudge Version
[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-files command