
Suggestion: Evaluate GPT-OSS with its native browsing tools for a more accurate assessment #9

@Hannibal046

Description

Hi team,

First, thank you for the excellent work on the BrowseComp-Plus benchmark. It's a valuable contribution to the fair and transparent evaluation of Deep Research Agents.

I am writing to open a discussion about the evaluation methodology for GPT-OSS.

In the paper and current benchmark setup, GPT-OSS is evaluated using external retrieval tools. However, since GPT-OSS was trained natively with a browsing toolset, I believe evaluating it with those native tools would provide a more authentic and insightful measure of its intended capabilities.

The Challenge

I understand that a key reason for not using the native tools may be their flawed implementations in popular serving frameworks such as vLLM and SGLang, which make a correct evaluation difficult.

Proposed Solution & My Work

To address this, I've developed a repository that enables a full and proper evaluation of GPT-OSS on BrowseComp-Plus using its native browsing tools. This implementation works around the existing framework limitations.

You can find the project, code, and some interesting preliminary findings here: https://github.com/Hannibal046/GPT-OSS-BrowseCompPlus-Eval
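For illustration, here is a minimal sketch of what a browsing-style evaluation loop over the BrowseComp-Plus corpus could look like. It mirrors the interface of GPT-OSS's native browser tool (search and open; find is omitted for brevity) via ordinary function calling against an OpenAI-compatible endpoint. The base URL, model id, and the `corpus_search` / `corpus_open` helpers are assumptions made for the sketch, not the implementation in the repository above; a fully native evaluation would instead expose these tools through the harmony format's browser namespace, which is exactly where the serving-framework issues mentioned above come into play.

```python
# Illustrative sketch only: a generic tool-calling loop that mimics the
# search/open interface of GPT-OSS's native browser tool against a local
# OpenAI-compatible server. Endpoint, model id, and corpus helpers are assumed.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search the BrowseComp-Plus corpus and return candidate pages.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "open",
            "description": "Open a page returned by search and return its text.",
            "parameters": {
                "type": "object",
                "properties": {"id": {"type": "string"}},
                "required": ["id"],
            },
        },
    },
]


def corpus_search(query: str) -> list[dict]:
    """Placeholder: plug in the BrowseComp-Plus retriever here."""
    raise NotImplementedError


def corpus_open(doc_id: str) -> str:
    """Placeholder: plug in corpus page lookup here."""
    raise NotImplementedError


def run_tool(name: str, args: dict) -> str:
    if name == "search":
        return json.dumps(corpus_search(args["query"]))
    if name == "open":
        return corpus_open(args["id"])
    return f"unknown tool: {name}"


def answer(question: str, model: str = "openai/gpt-oss-120b", max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""  # model produced a final answer
        messages.append(msg)
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "no answer within step budget"
```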

Thanks again for your great work! I'm happy to discuss this further.
