
Feat/ref hallucination tool augmented #125

Merged
XiaoBoAI merged 6 commits into main from feat/ref-hallucination-tool-augmented
Feb 13, 2026

Conversation

@XiaoBoAI
Collaborator

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

XiaoBoAI and others added 2 commits February 13, 2026 12:14
…nt and web search

Add a new tool-augmented collection mode alongside the existing bare mode
for response collection. When enabled via tool_config, endpoints use a
ReAct agent with TavilySearchTool to autonomously search and verify real
papers before recommending them, enabling direct comparison of
hallucination rates between bare and tool-augmented modes.
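
For orientation, a rough sketch of the configuration shape this mode implies; the field names come from this PR, but the defaults, types, and validation below are assumptions rather than the actual schema.py definitions:

    # Hypothetical sketch only -- field names from this PR, defaults/types assumed.
    from typing import Optional

    from pydantic import BaseModel


    class ToolConfig(BaseModel):
        enabled: bool = False                 # switch the endpoint to tool-augmented (ReAct) collection
        tavily_api_key: Optional[str] = None  # API key for the Tavily web-search tool
        max_iterations: int = 5               # cap on ReAct reason/act/observe loops (assumed default)
        search_depth: str = "basic"           # Tavily search depth, e.g. "basic" or "advanced"


    class OpenAIEndpoint(BaseModel):
        name: str
        model: str
        api_key: str
        base_url: Optional[str] = None
        max_concurrency: int = 4                  # per-endpoint semaphore size (assumed default)
        tool_config: Optional[ToolConfig] = None  # absent/disabled -> plain "bare" calls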

- Add ToolConfig schema with tavily_api_key, max_iterations, search_depth
- Add tool_config field to OpenAIEndpoint
- Add tool-specific system prompts (zh/en) instructing models to use search
- Implement _call_endpoint_with_tools using ReActAgent
- Refactor retry logic with explicit rate-limit handling (429); see the sketch after this list
- Update example config.yaml with tool-augmented endpoint example
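
The retry shape, roughly, as a hedged sketch (the helper name and backoff constants are illustrative, not the actual _call_with_retry implementation):

    import asyncio

    import openai


    async def call_with_retry(do_call, max_retries: int = 3, base_delay: float = 2.0):
        """Run an async endpoint call, backing off more aggressively on HTTP 429."""
        for attempt in range(max_retries + 1):
            try:
                return await do_call()
            except openai.RateLimitError:          # 429: rate limited, wait much longer
                if attempt == max_retries:
                    raise
                await asyncio.sleep(base_delay * (2 ** attempt) * 5)
            except openai.APIError:                # other transient API errors
                if attempt == max_retries:
                    raise
                await asyncio.sleep(base_delay * (2 ** attempt))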

Co-authored-by: Cursor <cursoragent@cursor.com>
…oncurrency, and reporting

- Add LaTeX stripping and brace-depth counting for robust BibTeX field extraction (sketched after this list)
- Refactor response collector to use per-endpoint concurrency semaphores
- Update chart generator and schema for enhanced reporting
- Add resume/ to .gitignore for intermediate output files
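
In miniature, the brace-depth and LaTeX-stripping ideas look roughly like this (function names are illustrative, not the actual bib_extractor.py helpers):

    import re


    def extract_braced_value(text: str, start: int) -> str:
        """Return the contents of the {...} value beginning at text[start] == '{',
        tracking nested braces so values like {{LLM}-based {QA}} survive intact."""
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    return text[start + 1:i]
        return text[start + 1:]  # unbalanced braces: fall back to the rest of the string


    def strip_latex(text: str) -> str:
        """Drop common LaTeX markup from a BibTeX field value (simplified)."""
        text = re.sub(r"\\[a-zA-Z]+\s*", "", text)      # commands such as \textit, \emph
        text = text.replace("{", "").replace("}", "")   # grouping braces
        return re.sub(r"\s+", " ", text).strip()


    # e.g. strip_latex(extract_braced_value("title = {A {BERT}-based \\textit{Survey}}", 8))
    # -> "A BERT-based Survey"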

Co-authored-by: Cursor <cursoragent@cursor.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the reference hallucination arena by integrating a tool-augmented evaluation mode. This allows for a more sophisticated assessment of LLMs' ability to recommend real academic papers by enabling them to use web search tools. Alongside this, the changes improve the robustness of BibTeX parsing, refine concurrency handling for API calls, and update reporting capabilities, all while ensuring comprehensive documentation of these new features and methodologies.

Highlights

  • Tool-Augmented Mode for LLM Evaluation: Introduced an optional tool-augmented mode for LLM evaluation in the reference hallucination arena. This mode utilizes a ReAct agent with Tavily web search to enable models to verify and find real papers before recommending them, allowing for direct comparison of 'bare model' vs. 'tool-augmented' hallucination rates.
  • Per-Endpoint Concurrency Control: Implemented per-endpoint concurrency management using asyncio.Semaphore, replacing the previous global concurrency manager. All model endpoints can now run in parallel while each respects its own rate limit, improving the efficiency and robustness of evaluations. A minimal sketch of the pattern follows this list.
  • Robust BibTeX Parsing and LaTeX Stripping: Enhanced the BibTeX extractor to correctly handle nested braces in field values and added a new utility to strip LaTeX markup from BibTeX fields (e.g., titles, authors, journals). This ensures cleaner and more accurate data for downstream verification.
  • Improved CJK Font Support in Reporting: Updated the chart generator to provide more robust CJK font detection and configuration for Matplotlib, ensuring Chinese text renders correctly in evaluation reports.
  • Comprehensive Documentation Updates: Significantly updated the documentation to reflect the new tool-augmented mode, clarify verification logic, explain new configuration options like max_concurrency and tool_config, and provide updated guidance on interpreting results and best practices.
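
A minimal sketch of the per-endpoint semaphore pattern (names and limits are assumptions, not the actual ResponseCollector internals):

    import asyncio


    class PerEndpointLimiter:
        """One semaphore per endpoint, so a slow or rate-limited model never throttles the others."""

        def __init__(self, limits: dict):
            self._sems = {name: asyncio.Semaphore(n) for name, n in limits.items()}

        async def run(self, endpoint: str, coro_factory):
            async with self._sems[endpoint]:
                return await coro_factory()


    async def demo():
        limiter = PerEndpointLimiter({"gpt-4o": 8, "deepseek_with_tools": 2})

        async def fake_call(name, i):
            await asyncio.sleep(0.1)          # stand-in for an API request
            return f"{name}:{i}"

        return await asyncio.gather(
            *[limiter.run("gpt-4o", lambda i=i: fake_call("gpt-4o", i)) for i in range(10)],
            *[limiter.run("deepseek_with_tools", lambda i=i: fake_call("deepseek_with_tools", i)) for i in range(10)],
        )


    print(len(asyncio.run(demo())))  # 20
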
Changelog
  • .gitignore
    • Added '/resume' to the ignore list for intermediate output files.
  • cookbooks/ref_hallucination_arena/collectors/bib_extractor.py
    • Added a static method _strip_latex to remove LaTeX markup from text.
    • Refactored _parse_fields to use a new helper _extract_braced_value for more accurate extraction of brace-delimited BibTeX fields.
    • Applied LaTeX stripping to extracted title, author, eprint (arXiv ID), and journal fields for cleaner data.
  • cookbooks/ref_hallucination_arena/collectors/response_collector.py
    • Removed tenacity import and ConcurrencyManager as global concurrency is replaced by per-endpoint semaphores.
    • Introduced new default system prompts for tool-augmented mode in both Chinese and English.
    • Modified the __init__ method to initialize per-endpoint semaphores and ReActAgent instances for tool-augmented endpoints.
    • Added _create_tool_agent static method to instantiate ReAct agents with TavilySearchTool.
    • Updated _build_system_prompt to dynamically select system prompts based on whether tool-augmented mode is enabled for an endpoint.
    • Refactored the endpoint calling logic into _call_endpoint, _call_with_retry, _do_bare_call, _do_tool_call, and _tool_fallback_summary to support unified retry and tool-augmented calls with fallback summarization.
    • Added _has_bibtex helper to check for BibTeX content in responses.
    • Updated the collect method to use per-endpoint semaphores for managing concurrent requests.
  • cookbooks/ref_hallucination_arena/examples/config.yaml
    • Added max_concurrency field to example target_endpoints configurations.
    • Included a new example endpoint deepseek_with_tools demonstrating the tool_config structure for enabling tool-augmented mode.
    • Removed the global evaluation.max_concurrency setting.
  • cookbooks/ref_hallucination_arena/examples/minimal_config.yaml
    • Updated comments to reflect the new default for endpoint.max_concurrency and the removal of evaluation.max_concurrency.
  • cookbooks/ref_hallucination_arena/reporting/chart_generator.py
    • Enhanced _setup_cjk_font to improve CJK font detection by rebuilding the font cache, searching a broader list of font families, and using keyword-based file path searches as a fallback (a sketch of this fallback follows the changelog).
  • cookbooks/ref_hallucination_arena/schema.py
    • Defined a new Pydantic model ToolConfig to encapsulate configuration for tool-augmented mode, including enabled, tavily_api_key, max_iterations, and search_depth.
    • Added max_concurrency and tool_config fields to the OpenAIEndpoint model.
    • Removed max_concurrency from the EvaluationConfig model.
  • docs/validating_graders/ref_hallucination_arena.md
    • Updated the HuggingFace dataset link to the new repository.
    • Added 'Strict Verification' and 'Tool-augmented Mode' to the key features table.
    • Expanded 'Multi-discipline Coverage' to include Social Science and Interdisciplinary fields.
    • Updated the ResponseCollector description to mention bare and tool-augmented ReAct modes.
    • Added a metadata field to the query item schema.
    • Revised the example configuration to include shuffle, max_queries, per-endpoint max_concurrency, and tool_config for tool-augmented mode.
    • Updated the 'Interpreting Results' section to emphasize 'overall accuracy' and include a link to the OpenJudge Leaderboard.
    • Modified example output for 'Per-Field Accuracy' and 'Per-Discipline Performance'.
    • Clarified the strict verification logic for marking a reference as VERIFIED.
    • Added new solutions for 'Low verification rate overall' and 'Tool mode reaches max iterations' in the 'Error Pattern' table.
    • Updated descriptions for output files.
    • Added a dedicated 'Tool-Augmented Mode' section with detailed explanation, configuration parameters, and a tip about separate prompts.
    • Added new best practices related to tool-augmented mode and per-endpoint concurrency.
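
A hedged sketch of the kind of fallback _setup_cjk_font performs (candidate font names and path keywords here are assumptions; the real chart_generator.py list may differ):

    from typing import Optional

    import matplotlib.pyplot as plt
    from matplotlib import font_manager

    CJK_CANDIDATES = ["Noto Sans CJK SC", "Source Han Sans SC", "PingFang SC",
                      "Microsoft YaHei", "SimHei", "WenQuanYi Zen Hei"]


    def setup_cjk_font() -> Optional[str]:
        installed = {f.name for f in font_manager.fontManager.ttflist}
        chosen = next((name for name in CJK_CANDIDATES if name in installed), None)
        if chosen is None:
            # fallback: keyword search over installed font file paths
            for entry in font_manager.fontManager.ttflist:
                if any(k in entry.fname.lower() for k in ("notosanscjk", "sourcehansans", "wqy")):
                    chosen = entry.name
                    break
        if chosen:
            plt.rcParams["font.sans-serif"] = [chosen] + plt.rcParams["font.sans-serif"]
            plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign renderable with CJK fonts
        return chosen
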
Activity
  • The pull request description is a template and does not contain specific activity details or comments from reviewers.

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: a "tool-augmented" mode for the reference hallucination evaluation. This allows models to use a web search tool via a ReAct agent to find and verify papers before recommending them, enabling a direct comparison between a model's performance with and without tools. The implementation is robust, including a major refactoring of response_collector.py to support the new mode, improved concurrency handling with per-endpoint semaphores, and a more flexible retry mechanism. Additionally, the BibTeX extraction logic has been enhanced to handle LaTeX markup and nested braces more effectively. The configuration schema and documentation have been thoroughly updated to reflect these new capabilities. Overall, this is a high-quality contribution that significantly enhances the benchmark's power. I have one suggestion to improve code robustness.


Referenced code (docstring excerpt):

        Returns:
            List of dicts: {query, discipline, num_refs, responses: {endpoint: text}}
        """

Severity: medium

Using the internal _value attribute of asyncio.Semaphore is generally discouraged as it's not part of the public API and could change in future Python versions, making the code less robust. A safer and cleaner approach would be to use the max_concurrency value from the endpoint configuration, which is what the semaphore was initialized with and represents the intended concurrency limit.

Suggested change:

        """
        concurrency_info = ", ".join(f"{n}={self.endpoints[n].max_concurrency}" for n in self.endpoints)

Collaborator

@ployts left a comment


LGTM

XiaoBoAI and others added 3 commits February 13, 2026 17:22
…cific params

- Strip thinking/reasoning blocks (e.g. <think>, <thinking>) from model
  output before BibTeX extraction to avoid false positives from CoT content;
  see the sketch after this list
- Route non-standard OpenAI SDK params to extra_body in ResponseCollector
  to prevent TypeError with provider-specific params (enable_thinking, etc.)
- Avoid overriding enable_thinking for Qwen models if already set explicitly
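
Roughly, the first two fixes could look like this (the regex, the set of "standard" parameters, and the helper names are illustrative assumptions):

    import re

    from openai import AsyncOpenAI

    THINK_RE = re.compile(r"<think(?:ing)?>.*?</think(?:ing)?>", re.DOTALL | re.IGNORECASE)

    # Anything outside this set (e.g. enable_thinking) is treated as provider-specific
    # and routed through extra_body instead of being passed as a top-level kwarg.
    STANDARD_PARAMS = {"temperature", "top_p", "max_tokens", "stop", "seed"}


    def strip_think_blocks(text: str) -> str:
        """Remove <think>/<thinking> reasoning blocks before BibTeX extraction."""
        return THINK_RE.sub("", text)


    async def call_endpoint(client: AsyncOpenAI, model: str, messages, **params):
        standard = {k: v for k, v in params.items() if k in STANDARD_PARAMS}
        extra = {k: v for k, v in params.items() if k not in STANDARD_PARAMS}
        resp = await client.chat.completions.create(
            model=model, messages=messages, extra_body=extra or None, **standard
        )
        return strip_think_blocks(resp.choices[0].message.content or "")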

Co-authored-by: Cursor <cursoragent@cursor.com>
Combine response collection, BibTeX extraction, and verification into a
single streaming pipeline that processes each model response as soon as
it arrives, rather than waiting for all responses first. This overlaps
I/O-bound collection with verification, reducing overall wall-clock time.

- Add on_single_response callback to ResponseCollector
- Implement _collect_and_verify_streaming with async workers (see the sketch after this list)
- Refactor evaluate() to use the streaming pipeline for steps 2+3+4
- Support checkpoint resume for partially completed streaming runs
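
A toy version of the overlap described above, with hypothetical function names (the real _collect_and_verify_streaming also handles checkpoints and the on_single_response callback):

    import asyncio


    async def collect_and_verify_streaming(queries, endpoints, collect_one, verify_one, num_workers: int = 4):
        """Verify each (query, endpoint) response as soon as it arrives instead of
        waiting for the whole collection phase to finish."""
        queue: asyncio.Queue = asyncio.Queue()
        results = []

        async def producer(query, endpoint):
            response = await collect_one(query, endpoint)   # I/O-bound API call
            await queue.put((query, endpoint, response))

        async def verifier():
            while True:
                item = await queue.get()
                if item is None:                            # sentinel: no more work
                    break
                query, endpoint, response = item
                results.append(await verify_one(query, endpoint, response))

        workers = [asyncio.create_task(verifier()) for _ in range(num_workers)]
        await asyncio.gather(*(producer(q, e) for q in queries for e in endpoints))
        for _ in workers:
            await queue.put(None)                           # one sentinel per worker
        await asyncio.gather(*workers)
        return results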

Co-authored-by: Cursor <cursoragent@cursor.com>
@XiaoBoAI merged commit f1cea02 into main Feb 13, 2026
2 checks passed