Hi team,
First, thank you for the excellent work on the BrowseComp-Plus benchmark. It's a valuable contribution to the fair and transparent evaluation of Deep Research Agents.
I am writing to open a discussion about the evaluation methodology for GPT-OSS.
In the paper and the current benchmark setup, GPT-OSS is evaluated with external retrieval tools. However, since GPT-OSS was trained with a native browsing toolset, I believe an evaluation using those native tools would provide a more authentic and insightful measure of its intended capabilities.
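For concreteness, here is a rough sketch of the two interfaces. The tool names, argument schemas, and the `browser.*` call shapes below are my own approximations for illustration, not the exact definitions used by the benchmark or by the GPT-OSS chat template:

```python
# Illustrative only: names and schemas are assumptions, not the exact
# definitions used by BrowseComp-Plus or the gpt-oss harmony format.

# (a) External retrieval tool: a single function the agent calls against the
#     benchmark's fixed corpus, returning ranked passages.
external_retrieval_tool = {
    "type": "function",
    "function": {
        "name": "search",  # hypothetical name
        "description": "Retrieve top-k passages from the BrowseComp-Plus corpus.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "k": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}

# (b) Native browsing toolset: a stateful `browser` namespace the model was
#     post-trained to call (search / open / find pages), rather than a single
#     generic retrieval function.
native_browser_calls = [
    {"tool": "browser.search", "args": {"query": "..."}},   # search for pages
    {"tool": "browser.open",   "args": {"id": 3}},          # open a result
    {"tool": "browser.find",   "args": {"pattern": "..."}}, # find text in the open page
]
```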
The Challenge
I understand that a key reason for not using the native tools may be the flawed implementations of that toolset in popular serving frameworks such as vLLM and SGLang, which make a correct evaluation difficult.
Proposed Solution & My Work
To address this, I've developed a repository that enables a full and proper evaluation of GPT-OSS on BrowseComp-Plus using its native browsing tools. The implementation works around these framework limitations.
You can find the project, code, and some interesting preliminary findings here: https://github.com/Hannibal046/GPT-OSS-BrowseCompPlus-Eval
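To make the idea concrete, here is a minimal, hypothetical sketch of the kind of adapter such an evaluation needs: intercepting the model's `browser.*` tool calls and answering them from the benchmark's local corpus rather than the live web. All names (`LocalCorpusBackend`, `dispatch`, the naive lexical scorer) are placeholders of my own, not the API of the linked repository or of any serving framework:

```python
# Hypothetical adapter: serve native browser calls from a local corpus.
# Names are placeholders for illustration only.

class LocalCorpusBackend:
    """Answers browser.search / browser.open / browser.find against a fixed corpus."""

    def __init__(self, corpus):  # corpus: dict[doc_id -> document text]
        self.corpus = corpus
        self.open_doc = None

    def search(self, query, k=5):
        # Naive lexical scoring stands in for the benchmark's real retriever.
        scored = sorted(
            self.corpus.items(),
            key=lambda kv: sum(kv[1].lower().count(w) for w in query.lower().split()),
            reverse=True,
        )[:k]
        return [{"id": doc_id, "snippet": text[:200]} for doc_id, text in scored]

    def open(self, doc_id):
        self.open_doc = self.corpus[doc_id]
        return self.open_doc

    def find(self, pattern):
        if self.open_doc is None:
            return []
        return [line for line in self.open_doc.splitlines() if pattern in line]


def dispatch(backend, call):
    """Route a parsed browser.* tool call from the model to the local backend."""
    name, args = call["tool"], call["args"]
    if name == "browser.search":
        return backend.search(args["query"])
    if name == "browser.open":
        return backend.open(args["id"])
    if name == "browser.find":
        return backend.find(args["pattern"])
    raise ValueError(f"unexpected tool call: {name}")
```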
Thanks again for your great work! I'm happy to discuss this further.