Conversation
```python
import ares
from ares import config
from ares.llms import chat_completions_compatible

from . import utils
```
Directly importing chat_completions_compatible from ares.llms violates the CLAUDE.md guidance that external consumers should use module-level imports (e.g. from ares import llms, then llms.chat_completions_compatible), so the example shows a non-compliant import style; can we switch to the module-based import to keep provenance explicit?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 31 to 35, the code does a direct
import: `from ares.llms import chat_completions_compatible`, which violates the
CLAUDE.md guideline that external consumers should import the module (e.g., `from ares
import llms`). Change the import to `from ares import llms` and update usages
accordingly (e.g., where `chat_completions_compatible.ChatCompletionCompatibleLLMClient`
is instantiated in run_evaluation, replace it with
`llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient`). Ensure no other
references rely on the old direct import.
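A minimal sketch of the module-based import style being asked for (the `llm` variable name and the `...` constructor arguments are placeholders, not the example's actual code):

```python
import ares
from ares import config
from ares import llms  # module-level import, per CLAUDE.md

from . import utils

# Later, inside run_evaluation, reach the client through the module path:
llm = llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient(...)
```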
Finding type: AI Coding Guidelines
```python
# Group presets by base dataset (remove agent suffix)
datasets = {}
for preset in all_presets:
    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
    base_name = preset.name.rsplit("-", 1)[0]
    if base_name not in datasets:
        datasets[base_name] = []
    datasets[base_name].append(preset)
```
discover_presets iterates through all presets and builds datasets[base_name].append(preset) only to later call len(datasets); the lists themselves are never read, so every preset is stored for no reason. Can we just track the unique base names (e.g. unique_bases = {preset.name.rsplit("-", 1)[0] for preset in all_presets}) and print len(unique_bases) instead, so the discovery stays concise and lightweight?
Suggested change:

```diff
-# Group presets by base dataset (remove agent suffix)
-datasets = {}
-for preset in all_presets:
-    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
-    base_name = preset.name.rsplit("-", 1)[0]
-    if base_name not in datasets:
-        datasets[base_name] = []
-    datasets[base_name].append(preset)
+# Track unique base dataset names (remove agent suffix)
+datasets = set()
+for preset in all_presets:
+    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
+    base_name = preset.name.rsplit("-", 1)[0]
+    datasets.add(base_name)
```
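If we want it even tighter, the whole loop collapses into the comprehension mentioned above (a sketch; the print format is illustrative and `all_presets` is the iterable the discovery call already produces):

```python
# Track unique base dataset names (agent suffix stripped), e.g. "vmax-tasks-mswea" -> "vmax-tasks"
datasets = {preset.name.rsplit("-", 1)[0] for preset in all_presets}
print(f"Found {len(datasets)} base datasets")
```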
Finding type: Conciseness
```python
for i, name in enumerate(featured_names, 1):
    info = ares.info(name)
    featured_presets.append(name)
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {i}. {info.name}")
```
show_featured_presets always calls ares.info(name) for each hardcoded featured name, so if a preset like tbench-mswea isn't registered it raises KeyError and aborts the example before users can interact. Can we handle missing presets by skipping them with a warning so the example still lists the presets that are available?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 74 to 78, the show_featured_presets
function currently calls ares.info(name) for each hardcoded featured name and will raise
KeyError if a preset is missing. Refactor by wrapping the ares.info(name) call in a
try/except KeyError block: on success, append the name to featured_presets and print the
existing info/marker; on KeyError, print a warning message like "⚠️ Featured preset
'{name}' not found, skipping" and continue. Ensure featured_presets only contains names
for which ares.info succeeded so the interactive selection remains valid.
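A sketch of that guard, following the snippet above (the warning text is illustrative, and numbering is taken from the surviving list so the displayed numbers keep matching selectable entries):

```python
featured_presets = []
for name in featured_names:
    try:
        info = ares.info(name)
    except KeyError:
        print(f"⚠️ Featured preset '{name}' not found, skipping")
        continue
    featured_presets.append(name)  # only presets that resolved stay selectable
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {len(featured_presets)}. {info.name}")
```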
Finding type: Logical Bugs
```python
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
```
Quick/CLI mode reuses args.preset/args.task_idx and immediately calls run_evaluation, so an invalid preset raises the ares.make KeyError and an out-of-range index raises an IndexError later instead of giving the user a friendly validation error; can we validate these before invoking run_evaluation?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 245 to 249, the quick-mode logic
directly reuses args.preset/args.task_idx and calls await run_evaluation, which allows
invalid preset names (raising KeyError inside ares.make) or out-of-range task indices
(IndexError later) to surface. Before calling run_evaluation, validate the preset by
calling ares.info(preset_name) inside a try/except and return a friendly error (raise
ValueError or print and sys.exit) if not found; then check task_idx against
info.num_tasks and similarly error if out of range. Apply this validation only in the
quick-mode branch so callers get immediate, user-friendly feedback instead of internal
exceptions.
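One way the quick-mode validation could look (a sketch: it assumes `sys` is imported, that `ares.info(...)` exposes `num_tasks` as described above, and the error messages are illustrative):

```python
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    try:
        info = ares.info(preset_name)
    except KeyError:
        sys.exit(f"❌ Unknown preset '{preset_name}'. Run without --quick to browse available presets.")
    if not 0 <= task_idx < info.num_tasks:
        sys.exit(f"❌ Task index {task_idx} is out of range for {preset_name} (valid: 0-{info.num_tasks - 1}).")
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
    return
```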
Finding type: Logical Bugs
```python
# If --quick mode, run with defaults immediately
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
    return

# Interactive mode
discover_presets()
featured_presets = show_featured_presets()

# Determine preset (from args or interactive prompt)
if args.preset:
    preset_name = args.preset
    print(f"\n✓ Using preset from command line: {preset_name}")
else:
    preset_name = prompt_preset_selection(featured_presets)

print(f"\n✓ Selected preset: {preset_name}")

# Determine task index (from args or interactive prompt)
if args.task_idx is not None:
    task_idx = args.task_idx
    print(f"✓ Using task index from command line: {task_idx}")
else:
```
[Reliability] When a preset/task pair is provided via CLI flags you bypass the interactive prompts that enforce valid task indices. In both the --quick branch (lines 244‑249) and the later if args.task_idx is not None branch (lines 265‑269) the code accepts any integer and immediately passes it to ares.make(f"{preset_name}:{task_idx}"). If the index is out of bounds (e.g. --preset sbv-mswea --task-idx 9999 for a 500-task dataset), IndexSelector raises an IndexError while building the environment and the script crashes with a stack trace instead of a helpful message. Please query info = ares.info(preset_name) when consuming CLI arguments and ensure 0 <= task_idx < info.num_tasks before calling run_evaluation, raising a friendly ValueError (or prompting again) when the index is invalid.
File: examples/06_discover_and_evaluate.py
Line: 269
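A matching sketch for the non-quick CLI branch (same assumptions about `ares.info` and `num_tasks`; the `ValueError` wording is illustrative):

```python
# Determine task index (from args or interactive prompt)
if args.task_idx is not None:
    info = ares.info(preset_name)  # preset_name has already been selected/validated above
    if not 0 <= args.task_idx < info.num_tasks:
        raise ValueError(
            f"--task-idx {args.task_idx} is out of range for {preset_name}: "
            f"{info.num_tasks} tasks available (valid indices 0-{info.num_tasks - 1})"
        )
    task_idx = args.task_idx
    print(f"✓ Using task index from command line: {task_idx}")
```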
User description
Adds a script that shows how to list env presets, shows all the presets available, and gets at the idea of indexing within those presets for specific tasks. Also highlights the new VMax dataset 🚀
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Call graph of the new and modified functions:
- main calls discover_presets to list presets using ares.info, show_featured_presets to display the curated preset list, prompt_preset_selection to interactively choose a preset, prompt_task_selection to select a task index interactively, and run_evaluation to execute the chosen preset evaluation.
- prompt_preset_selection calls list_all_presets to show the full list of presets.
- run_evaluation uses print_step, which now prints a bash command preview when present.
All of these functions are added by this PR except print_step, which is modified.

Introduces an interactive example script that demonstrates how to discover and evaluate environment presets within the ARES framework. Enhances the visibility of available datasets, including the new VMax dataset, and improves the formatting of environment information.
Improves `EnvironmentInfo` string formatting in `registry.py` and enhances `print_step` in `utils.py` to better display bash commands from LLM actions.
Adds `06_discover_and_evaluate.py` to provide an interactive CLI for listing presets, selecting tasks, and running evaluations using `ChatCompletionCompatibleLLMClient`.