Supergpqa eval environment #319

srishti-git1110 · 2026-01-16T07:27:34Z

PR Type

RL Environment PR - Complete Environment Snapshot & Zero-Training sections
Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Adds the SuperGPQA benchmark to eval environments. SuperGPQA is a comprehensive benchmark designed to evaluate the knowledge and reasoning abilities of LLMs across 285 graduate-level disciplines. It features at least 50 questions per discipline, covering a broad spectrum of graduate-level topics. Top performing LLMs currently achieve scores around 60 on SuperGPQA, highlighting the benchmark's difficulty and its potential to drive future advancements in LLM research.

🔖 Environment Snapshot

Field	Your Entry
Environment Name	SuperGPQA
Short Description	A benchmark to test knowledge and reasoning abilities of LLMs across 285 graduate-level disciplines
Category	Verifiable-Reasoning
Dataset Needed?	Yes
External Deps	None
Environmental Variables	None specific to the environment itself (standard Atropos LLM server API keys/URLs would be needed for operation).
Compute Footprint Estimate	negligible CPU time per instance for processing, correctness checking etc. Overall latency and VRAM requirement dependent on model size, serving framework/technique

Also pasting a sample jsonl saved during testing (as a proof for the env working):

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>D</answer>", "extracted_answer": "D", "extraction_method": "answer_tag", "is_correct": true, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

Examples of the Environment scoring a good example and a bad example (conceptual):

A sample trajectory:

User: Answer the following multiple-choice question. There is only one correct answer.
Provide your final answer within <answer></answer> tags, containing only the letter (A, B, C, D, E, F, G, H, I, or J.).

Example format:
<answer>A</answer>

Question: The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).
A) the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves
B) all of the above
C) the average of A1 and A2's common-mode rejection ratios
D) the sum of A1 and A2's common-mode rejection ratios
E) the product of A1 and A2's common-mode rejection ratios
F) the square root of the product of A1 and A2's common-mode rejection ratios 
G) the size of A2's common-mode rejection ratio
H) the size of A1's common-mode rejection ratio
I) The difference in the common-mode rejection ratio of A1 and A2 themselves 
J) input resistance

Assistant: "<answer>D</answer>"

After running through the environment:

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>D</answer>", "extracted_answer": "D", "extraction_method": "answer_tag", "is_correct": true, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

Bad example:

Same trajectory as above but the Assistant response is different:

Assistant: "<answer>B</answer>"

After running through the environment:

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>B</answer>", "extracted_answer": "B", "extraction_method": "answer_tag", "is_correct": false, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

✅ Developer & Reviewer Checklist

Code follows project style (black, isort, flake8 pass with pre-commit)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
New and existing unit tests pass locally with my changes
Docstrings added for all new public classes / functions
If .env vars required, did you add it to the .env.example in repo root?

for more information, see https://pre-commit.ci

srishti-git1110 and others added 5 commits January 16, 2026 12:44

supergpqa

4f9e4ea

dataset info

bd2f774

[pre-commit.ci] auto fixes from pre-commit.com hooks

8159240

for more information, see https://pre-commit.ci

pre-commit+bugfixes

f090e79

Merge remote-tracking branch 'origin/main' into supergpqa_eval

ea50029

srishti-git1110 marked this pull request as ready for review January 16, 2026 15:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Supergpqa eval environment #319

Supergpqa eval environment #319

Uh oh!

srishti-git1110 commented Jan 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Supergpqa eval environment #319

Are you sure you want to change the base?

Supergpqa eval environment #319

Uh oh!

Conversation

srishti-git1110 commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

📝 General Information

Description

🔖 Environment Snapshot

✅ Developer & Reviewer Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

srishti-git1110 commented Jan 16, 2026 •

edited

Loading