Hi team,
First, thank you for the excellent work on the BrowseComp-Plus benchmark. It's a valuable contribution to the fair and transparent evaluation of Deep Research Agents.
I am writing to open a discussion about the evaluation methodology for GPT-OSS.
In the paper and the current benchmark setup, GPT-OSS is evaluated with external retrieval tools. However, since GPT-OSS was trained with a native browsing toolset, I believe an evaluation using those native tools would provide a more authentic and insightful measure of its intended capabilities.
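For concreteness, here is a rough sketch of the two interfaces. The tool names, argument schemas, and the `browser.*` call shapes below are my own approximations for illustration, not the exact definitions used by the benchmark or by the GPT-OSS chat template:

```python
# Illustrative only: names and schemas are assumptions, not the exact
# definitions used by BrowseComp-Plus or the gpt-oss harmony format.

# (a) External retrieval tool: a single function the agent calls against the
#     benchmark's fixed corpus, returning ranked passages.
external_retrieval_tool = {
    "type": "function",
    "function": {
        "name": "search",  # hypothetical name
        "description": "Retrieve top-k passages from the BrowseComp-Plus corpus.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "k": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}

# (b) Native browsing toolset: a stateful `browser` namespace the model was
#     post-trained to call (search / open / find pages), rather than a single
#     generic retrieval function.
native_browser_calls = [
    {"tool": "browser.search", "args": {"query": "..."}},   # search for pages
    {"tool": "browser.open",   "args": {"id": 3}},          # open a result
    {"tool": "browser.find",   "args": {"pattern": "..."}}, # find text in the open page
]
```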
The Challenge
I understand that a key reason for not using the native tools may be the flawed implementations of that toolset in popular serving frameworks such as vLLM and SGLang, which make a correct evaluation difficult.
Proposed Solution & My Work
To address this, I've developed a repository that enables a full and proper evaluation of GPT-OSS on BrowseComp-Plus using its native browsing tools. The implementation works around these framework limitations.
You can find the project, code, and some interesting preliminary findings here: https://github.com/Hannibal046/GPT-OSS-BrowseCompPlus-Eval
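To make the idea concrete, here is a minimal, hypothetical sketch of the kind of adapter such an evaluation needs: intercepting the model's `browser.*` tool calls and answering them from the benchmark's local corpus rather than the live web. All names (`LocalCorpusBackend`, `dispatch`, the naive lexical scorer) are placeholders of my own, not the API of the linked repository or of any serving framework:

```python
# Hypothetical adapter: serve native browser calls from a local corpus.
# Names are placeholders for illustration only.

class LocalCorpusBackend:
    """Answers browser.search / browser.open / browser.find against a fixed corpus."""

    def __init__(self, corpus):  # corpus: dict[doc_id -> document text]
        self.corpus = corpus
        self.open_doc = None

    def search(self, query, k=5):
        # Naive lexical scoring stands in for the benchmark's real retriever.
        scored = sorted(
            self.corpus.items(),
            key=lambda kv: sum(kv[1].lower().count(w) for w in query.lower().split()),
            reverse=True,
        )[:k]
        return [{"id": doc_id, "snippet": text[:200]} for doc_id, text in scored]

    def open(self, doc_id):
        self.open_doc = self.corpus[doc_id]
        return self.open_doc

    def find(self, pattern):
        if self.open_doc is None:
            return []
        return [line for line in self.open_doc.splitlines() if pattern in line]


def dispatch(backend, call):
    """Route a parsed browser.* tool call from the model to the local backend."""
    name, args = call["tool"], call["args"]
    if name == "browser.search":
        return backend.search(args["query"])
    if name == "browser.open":
        return backend.open(args["id"])
    if name == "browser.find":
        return backend.find(args["pattern"])
    raise ValueError(f"unexpected tool call: {name}")
```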
Thanks again for your great work! I'm happy to discuss this further.