@srishti-git1110 srishti-git1110 commented Jan 16, 2026

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Adds the SuperGPQA benchmark to the eval environments. SuperGPQA is a benchmark designed to evaluate the knowledge and reasoning abilities of LLMs across 285 graduate-level disciplines, with at least 50 questions per discipline. Top-performing LLMs currently score around 60 on SuperGPQA, which highlights the benchmark's difficulty and its potential to drive future advancements in LLM research.


🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | SuperGPQA |
| Short Description | A benchmark to test the knowledge and reasoning abilities of LLMs across 285 graduate-level disciplines |
| Category | Verifiable-Reasoning |
| Dataset Needed? | Yes |
| External Deps | None |
| Environment Variables | None specific to the environment itself (the standard Atropos LLM server API keys/URLs are needed for operation). |
| Compute Footprint Estimate | Negligible CPU time per instance for processing and correctness checking. Overall latency and VRAM requirements depend on model size and on the serving framework or technique. |

A sample JSONL record saved during testing, as evidence that the environment works:

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>D</answer>", "extracted_answer": "D", "extraction_method": "answer_tag", "is_correct": true, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

Examples of the environment scoring a good response and a bad response (conceptual):

Good example (a sample trajectory):

User: Answer the following multiple-choice question. There is only one correct answer.
Provide your final answer within <answer></answer> tags, containing only the letter (A, B, C, D, E, F, G, H, I, or J.).

Example format:
<answer>A</answer>

Question: The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).
A) the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves
B) all of the above
C) the average of A1 and A2's common-mode rejection ratios
D) the sum of A1 and A2's common-mode rejection ratios
E) the product of A1 and A2's common-mode rejection ratios
F) the square root of the product of A1 and A2's common-mode rejection ratios 
G) the size of A2's common-mode rejection ratio
H) the size of A1's common-mode rejection ratio
I) The difference in the common-mode rejection ratio of A1 and A2 themselves 
J) input resistance

Assistant: "<answer>D</answer>"

After running through the environment:

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>D</answer>", "extracted_answer": "D", "extraction_method": "answer_tag", "is_correct": true, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

Bad example:

Same trajectory as above, but with a different Assistant response:

Assistant: "<answer>B</answer>"

After running through the environment:

{"question": "The common-mode rejection ratio of the first stage amplification circuit in a three-op-amp differential circuit is determined by ( ).", "choices": ["the size of A1's common-mode rejection ratio", "the sum of A1 and A2's common-mode rejection ratios", "the average of A1 and A2's common-mode rejection ratios", "The difference in the common-mode rejection ratio of A1 and A2 themselves", "the square root of the product of A1 and A2's common-mode rejection ratios", "the size of A2's common-mode rejection ratio", "input resistance", "the product of A1 and A2's common-mode rejection ratios", "the absolute value of the difference in the common-mode rejection ratio of A1 and A2 themselves", "all of the above"], "gold_answer": "D", "model_response": "<answer>B</answer>", "extracted_answer": "B", "extraction_method": "answer_tag", "is_correct": false, "subfield": "Circuits and Systems", "finish_reason": "stop", "response_length": 18, "thinking_mode": false, "format_valid": true}

✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars required, did you add it to the .env.example in repo root?

@srishti-git1110 srishti-git1110 marked this pull request as ready for review January 16, 2026 15:09