
Adding example on preset discoverability #78

Open
rsmith49 wants to merge 1 commit into main from examples/add-environment-discovery

Conversation

@rsmith49 (Contributor) commented on Jan 29, 2026

User description

Adds a script that shows how to list env presets, shows all the presets available, and gets at the idea of indexing within those presets for specific tasks. Also highlights the new VMax dataset 🚀

[screenshot: example_6]
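
The "indexing within those presets" idea reduces to addressing one task with a preset-name:index string. A minimal sketch, assuming the ares.make / ares.info calls and the vmax-tasks-mswea preset name referenced in the review discussion below:

import ares

# Look up metadata for a registered preset (name, task count, ...).
info = ares.info("vmax-tasks-mswea")
print(f"{info.name}: {info.num_tasks} tasks")

# Build the environment for one specific task by index.
env = ares.make("vmax-tasks-mswea:0")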


Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
main_("main"):::added
discover_presets_("discover_presets"):::added
show_featured_presets_("show_featured_presets"):::added
prompt_preset_selection_("prompt_preset_selection"):::added
prompt_task_selection_("prompt_task_selection"):::added
run_evaluation_("run_evaluation"):::added
print_step_("print_step"):::modified
list_all_presets_("list_all_presets"):::added
main_ -- "Calls discover_presets to list presets using ares.info." --> discover_presets_
main_ -- "Calls show_featured_presets to display curated preset list." --> show_featured_presets_
main_ -- "Calls prompt_preset_selection to interactively choose preset." --> prompt_preset_selection_
main_ -- "Calls prompt_task_selection to select task index interactively." --> prompt_task_selection_
main_ -- "Calls run_evaluation to execute chosen preset evaluation." --> run_evaluation_
run_evaluation_ -- "Now prints bash command preview when present." --> print_step_
prompt_preset_selection_ -- "Calls list_all_presets to show all presets list." --> list_all_presets_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px

Introduces an interactive example script that demonstrates how to discover and evaluate environment presets within the ARES framework. Enhances the visibility of available datasets, including the new VMax dataset, and improves the formatting of environment information.

Topic: UI Improvements
Details: Updates EnvironmentInfo string formatting in registry.py and enhances print_step in utils.py to better display bash commands from LLM actions.
Modified files (2)
  • examples/utils.py
  • src/ares/registry.py
Latest Contributors (2)
  User                       | Commit                     | Date
  joshua.greaves@gmail.com   | Massively-simplify-the...  | January 29, 2026
  sarvanithin                | Add-register_env-decor...  | January 22, 2026

Topic: Task Discovery
Details: Adds 06_discover_and_evaluate.py to provide an interactive CLI for listing presets, selecting tasks, and running evaluations using ChatCompletionCompatibleLLMClient.
Modified files (2)
  • examples/06_discover_and_evaluate.py
  • examples/README.md
Latest Contributors (1)
  User                       | Commit                     | Date
  joshua.greaves@gmail.com   | Massively-simplify-the...  | January 29, 2026
This pull request is reviewed by Baz.

Comment on lines +31 to +35
import ares
from ares import config
from ares.llms import chat_completions_compatible

from . import utils

Directly importing chat_completions_compatible from ares.llms violates CLAUDE.md guidance for external consumers to use module-level imports (e.g. from ares import llms then llms.chat_completions_compatible), so the example shows a non-compliant import style; can we switch to the module-based import to keep provenance explicit?

Prompt for AI Agents:

In examples/06_discover_and_evaluate.py around lines 31 to 35, the code does a direct
import: `from ares.llms import chat_completions_compatible`, which violates the
CLAUDE.md guideline that external consumers should import the module (e.g., `from ares
import llms`). Change the import to `from ares import llms` and update usages
accordingly (e.g., where `chat_completions_compatible.ChatCompletionCompatibleLLMClient`
is instantiated in run_evaluation, replace it with
`llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient`). Ensure no other
references rely on the old direct import.

Finding type: AI Coding Guidelines
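
Applied to the import block quoted above, the module-level style the comment asks for would look roughly like this (a sketch; constructor arguments are omitted and only the import path changes):

import ares
from ares import config, llms

from . import utils

# In run_evaluation, the client is then reached through the module path, e.g.:
# llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient(...)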

Comment on lines +45 to +53
# Group presets by base dataset (remove agent suffix)
datasets = {}
for preset in all_presets:
    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
    base_name = preset.name.rsplit("-", 1)[0]
    if base_name not in datasets:
        datasets[base_name] = []
    datasets[base_name].append(preset)


discover_presets iterates through all presets and builds datasets[base_name].append(preset) only to later call len(datasets)—the lists are never used, so every preset is stored twice for no reason; can we just track the unique base names (e.g. unique_bases = {preset.name.rsplit("-", 1)[0] for preset in all_presets}) and print len(unique_bases) instead so the discovery stays concise and lightweight?

Suggested change
-   # Group presets by base dataset (remove agent suffix)
-   datasets = {}
-   for preset in all_presets:
-       # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
-       base_name = preset.name.rsplit("-", 1)[0]
-       if base_name not in datasets:
-           datasets[base_name] = []
-       datasets[base_name].append(preset)
+   # Track unique base dataset names (remove agent suffix)
+   datasets = set()
+   for preset in all_presets:
+       # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
+       base_name = preset.name.rsplit("-", 1)[0]
+       datasets.add(base_name)

Finding type: Conciseness
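
The single-expression alternative mentioned in the comment would be (a sketch; all_presets comes from the excerpt above and only the count is used afterward, with an illustrative message):

# Unique base dataset names with the agent suffix stripped
# (e.g. "vmax-tasks-mswea" -> "vmax-tasks").
unique_bases = {preset.name.rsplit("-", 1)[0] for preset in all_presets}
print(f"Found {len(unique_bases)} datasets across {len(all_presets)} presets")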

Comment on lines +74 to +78
for i, name in enumerate(featured_names, 1):
    info = ares.info(name)
    featured_presets.append(name)
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {i}. {info.name}")

show_featured_presets always calls ares.info(name) for each hardcoded featured name, so if a preset like tbench-mswea isn't registered it raises KeyError and aborts the example before users can interact; can we rescue missing presets by skipping with a warning so the example still lists available presets?

Prompt for AI Agents:

In examples/06_discover_and_evaluate.py around lines 74 to 78, the show_featured_presets
function currently calls ares.info(name) for each hardcoded featured name and will raise
KeyError if a preset is missing. Refactor by wrapping the ares.info(name) call in a
try/except KeyError block: on success, append the name to featured_presets and print the
existing info/marker; on KeyError, print a warning message like "⚠️ Featured preset
'{name}' not found, skipping" and continue. Ensure featured_presets only contains names
for which ares.info succeeded so the interactive selection remains valid.

Finding type: Logical Bugs
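
A sketch of the skip-with-warning behaviour described above (names are taken from the excerpt; nothing else changed):

for i, name in enumerate(featured_names, 1):
    try:
        info = ares.info(name)
    except KeyError:
        print(f"⚠️ Featured preset '{name}' not found, skipping")
        continue
    featured_presets.append(name)
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {i}. {info.name}")

Note that the enumerate counter will leave gaps when a preset is skipped; renumber after filtering if contiguous menu numbers matter for the interactive prompt.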

Comment on lines +245 to +249
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)

Quick/CLI mode reuses args.preset/args.task_idx and immediately calls run_evaluation, so an invalid preset raises the ares.make KeyError and an out-of-range index raises an IndexError later instead of giving the user a friendly validation error; can we validate these before invoking run_evaluation?

Prompt for AI Agents:

In examples/06_discover_and_evaluate.py around lines 245 to 249, the quick-mode logic
directly reuses args.preset/args.task_idx and calls await run_evaluation, which allows
invalid preset names (raising KeyError inside ares.make) or out-of-range task indices
(IndexError later) to surface. Before calling run_evaluation, validate the preset by
calling ares.info(preset_name) inside a try/except and return a friendly error (raise
ValueError or print and sys.exit) if not found; then check task_idx against
info.num_tasks and similarly error if out of range. Apply this validation only in the
quick-mode branch so callers get immediate, user-friendly feedback instead of internal
exceptions.

Finding type: Logical Bugs

Comment on lines +244 to +269
# If --quick mode, run with defaults immediately
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
    return

# Interactive mode
discover_presets()
featured_presets = show_featured_presets()

# Determine preset (from args or interactive prompt)
if args.preset:
    preset_name = args.preset
    print(f"\n✓ Using preset from command line: {preset_name}")
else:
    preset_name = prompt_preset_selection(featured_presets)

print(f"\n✓ Selected preset: {preset_name}")

# Determine task index (from args or interactive prompt)
if args.task_idx is not None:
    task_idx = args.task_idx
    print(f"✓ Using task index from command line: {task_idx}")
else:


Important

[Reliability] When a preset/task pair is provided via CLI flags you bypass the interactive prompts that enforce valid task indices. In both the --quick branch (lines 244‑249) and the later if args.task_idx is not None branch (lines 265‑269) the code accepts any integer and immediately passes it to ares.make(f"{preset_name}:{task_idx}"). If the index is out of bounds (e.g. --preset sbv-mswea --task-idx 9999 for a 500-task dataset), IndexSelector raises an IndexError while building the environment and the script crashes with a stack trace instead of a helpful message. Please query info = ares.info(preset_name) when consuming CLI arguments and ensure 0 <= task_idx < info.num_tasks before calling run_evaluation, raising a friendly ValueError (or prompting again) when the index is invalid.


File: examples/06_discover_and_evaluate.py
Line: 269
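
Both review comments above ask for the same guard, so one shared check could serve the --quick branch and the later if args.task_idx is not None branch. A minimal sketch (the helper name and error wording are hypothetical; ares.info and info.num_tasks are the calls the comments reference):

import sys

import ares


def validate_selection(preset_name: str, task_idx: int) -> None:
    """Exit with a friendly message instead of a KeyError/IndexError deep inside ares.make."""
    try:
        info = ares.info(preset_name)
    except KeyError:
        sys.exit(f"Error: unknown preset '{preset_name}'. Run without --quick to browse available presets.")
    if not 0 <= task_idx < info.num_tasks:
        sys.exit(
            f"Error: task index {task_idx} is out of range for '{preset_name}' "
            f"(valid: 0 to {info.num_tasks - 1})."
        )


# Called in both CLI branches before run_evaluation(preset_name, task_idx).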
