SuperGPQA eval environment #319
Open
PR Type
📝 General Information
Description
Adds the SuperGPQA benchmark to the eval environments. SuperGPQA is a comprehensive benchmark designed to evaluate the knowledge and reasoning abilities of LLMs across 285 graduate-level disciplines, with at least 50 questions per discipline. Top-performing LLMs currently score around 60% on SuperGPQA, highlighting the benchmark's difficulty and the headroom it leaves for future LLM research.
🔖 Environment Snapshot
Also pasting a sample JSONL saved during testing, as proof that the environment works:
Conceptual examples of the environment scoring a good response and a bad response:
A sample trajectory:
After running through the environment:
Bad example:
Same trajectory as above, but with a different assistant response:
After running through the environment:
✅ Developer & Reviewer Checklist