Conversation
```python
import ares
from ares import config
from ares.llms import chat_completions_compatible

from . import utils
```
Directly importing chat_completions_compatible from ares.llms violates the CLAUDE.md guidance that external consumers should use module-level imports (e.g. from ares import llms, then llms.chat_completions_compatible), so the example shows a non-compliant import style; can we switch to the module-based import to keep provenance explicit?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 31 to 35, the code does a direct
import: `from ares.llms import chat_completions_compatible`, which violates the
CLAUDE.md guideline that external consumers should import the module (e.g., `from ares
import llms`). Change the import to `from ares import llms` and update usages
accordingly (e.g., where `chat_completions_compatible.ChatCompletionCompatibleLLMClient`
is instantiated in run_evaluation, replace it with
`llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient`). Ensure no other
references rely on the old direct import.
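A minimal sketch of the module-based import style being asked for (the `llm` variable name and the `...` constructor arguments are placeholders, not the example's actual code):

```python
import ares
from ares import config
from ares import llms  # module-level import, per CLAUDE.md

from . import utils

# Later, inside run_evaluation, reach the client through the module path:
llm = llms.chat_completions_compatible.ChatCompletionCompatibleLLMClient(...)
```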
Finding type: AI Coding Guidelines
```python
# Group presets by base dataset (remove agent suffix)
datasets = {}
for preset in all_presets:
    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
    base_name = preset.name.rsplit("-", 1)[0]
    if base_name not in datasets:
        datasets[base_name] = []
    datasets[base_name].append(preset)
```
discover_presets iterates through all presets and builds datasets[base_name].append(preset) only to later call len(datasets); the lists themselves are never read, so every preset is stored for no reason. Can we just track the unique base names (e.g. unique_bases = {preset.name.rsplit("-", 1)[0] for preset in all_presets}) and print len(unique_bases) instead, so the discovery stays concise and lightweight?
Suggested change:

```diff
-# Group presets by base dataset (remove agent suffix)
-datasets = {}
-for preset in all_presets:
-    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
-    base_name = preset.name.rsplit("-", 1)[0]
-    if base_name not in datasets:
-        datasets[base_name] = []
-    datasets[base_name].append(preset)
+# Track unique base dataset names (remove agent suffix)
+datasets = set()
+for preset in all_presets:
+    # Extract base dataset name (e.g., "vmax-tasks-mswea" -> "vmax-tasks")
+    base_name = preset.name.rsplit("-", 1)[0]
+    datasets.add(base_name)
```
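If we want it even tighter, the whole loop collapses into the comprehension mentioned above (a sketch; the print format is illustrative and `all_presets` is the iterable the discovery call already produces):

```python
# Track unique base dataset names (agent suffix stripped), e.g. "vmax-tasks-mswea" -> "vmax-tasks"
datasets = {preset.name.rsplit("-", 1)[0] for preset in all_presets}
print(f"Found {len(datasets)} base datasets")
```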
Finding type: Conciseness
```python
for i, name in enumerate(featured_names, 1):
    info = ares.info(name)
    featured_presets.append(name)
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {i}. {info.name}")
```
show_featured_presets always calls ares.info(name) for each hardcoded featured name, so if a preset like tbench-mswea isn't registered it raises KeyError and aborts the example before users can interact. Can we handle missing presets by skipping them with a warning so the example still lists the presets that are available?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 74 to 78, the show_featured_presets
function currently calls ares.info(name) for each hardcoded featured name and will raise
KeyError if a preset is missing. Refactor by wrapping the ares.info(name) call in a
try/except KeyError block: on success, append the name to featured_presets and print the
existing info/marker; on KeyError, print a warning message like "⚠️ Featured preset
'{name}' not found, skipping" and continue. Ensure featured_presets only contains names
for which ares.info succeeded so the interactive selection remains valid.
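A sketch of that guard, following the snippet above (the warning text is illustrative, and numbering is taken from the surviving list so the displayed numbers keep matching selectable entries):

```python
featured_presets = []
for name in featured_names:
    try:
        info = ares.info(name)
    except KeyError:
        print(f"⚠️ Featured preset '{name}' not found, skipping")
        continue
    featured_presets.append(name)  # only presets that resolved stay selectable
    marker = "⭐" if name == "vmax-tasks-mswea" else " "
    print(f"{marker} {len(featured_presets)}. {info.name}")
```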
Finding type: Logical Bugs
```python
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
```
Quick/CLI mode reuses args.preset/args.task_idx and immediately calls run_evaluation, so an invalid preset raises the ares.make KeyError and an out-of-range index raises an IndexError later instead of giving the user a friendly validation error; can we validate these before invoking run_evaluation?
Prompt for AI Agents:
In examples/06_discover_and_evaluate.py around lines 245 to 249, the quick-mode logic
directly reuses args.preset/args.task_idx and calls await run_evaluation, which allows
invalid preset names (raising KeyError inside ares.make) or out-of-range task indices
(IndexError later) to surface. Before calling run_evaluation, validate the preset by
calling ares.info(preset_name) inside a try/except and return a friendly error (raise
ValueError or print and sys.exit) if not found; then check task_idx against
info.num_tasks and similarly error if out of range. Apply this validation only in the
quick-mode branch so callers get immediate, user-friendly feedback instead of internal
exceptions.
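One way the quick-mode validation could look (a sketch: it assumes `sys` is imported, that `ares.info(...)` exposes `num_tasks` as described above, and the error messages are illustrative):

```python
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    try:
        info = ares.info(preset_name)
    except KeyError:
        sys.exit(f"❌ Unknown preset '{preset_name}'. Run without --quick to browse available presets.")
    if not 0 <= task_idx < info.num_tasks:
        sys.exit(f"❌ Task index {task_idx} is out of range for {preset_name} (valid: 0-{info.num_tasks - 1}).")
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
    return
```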
Finding type: Logical Bugs
```python
# If --quick mode, run with defaults immediately
if args.quick:
    preset_name = args.preset or "vmax-tasks-mswea"
    task_idx = args.task_idx if args.task_idx is not None else 0
    print(f"🚀 Quick mode: Running {preset_name}:{task_idx}")
    await run_evaluation(preset_name, task_idx)
    return

# Interactive mode
discover_presets()
featured_presets = show_featured_presets()

# Determine preset (from args or interactive prompt)
if args.preset:
    preset_name = args.preset
    print(f"\n✓ Using preset from command line: {preset_name}")
else:
    preset_name = prompt_preset_selection(featured_presets)

print(f"\n✓ Selected preset: {preset_name}")

# Determine task index (from args or interactive prompt)
if args.task_idx is not None:
    task_idx = args.task_idx
    print(f"✓ Using task index from command line: {task_idx}")
else:
```
[Reliability] When a preset/task pair is provided via CLI flags you bypass the interactive prompts that enforce valid task indices. In both the --quick branch (lines 244‑249) and the later if args.task_idx is not None branch (lines 265‑269) the code accepts any integer and immediately passes it to ares.make(f"{preset_name}:{task_idx}"). If the index is out of bounds (e.g. --preset sbv-mswea --task-idx 9999 for a 500-task dataset), IndexSelector raises an IndexError while building the environment and the script crashes with a stack trace instead of a helpful message. Please query info = ares.info(preset_name) when consuming CLI arguments and ensure 0 <= task_idx < info.num_tasks before calling run_evaluation, raising a friendly ValueError (or prompting again) when the index is invalid.
File: examples/06_discover_and_evaluate.py
Line: 269
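A matching sketch for the non-quick CLI branch (same assumptions about `ares.info` and `num_tasks`; the `ValueError` wording is illustrative):

```python
# Determine task index (from args or interactive prompt)
if args.task_idx is not None:
    info = ares.info(preset_name)  # preset_name has already been selected/validated above
    if not 0 <= args.task_idx < info.num_tasks:
        raise ValueError(
            f"--task-idx {args.task_idx} is out of range for {preset_name}: "
            f"{info.num_tasks} tasks available (valid indices 0-{info.num_tasks - 1})"
        )
    task_idx = args.task_idx
    print(f"✓ Using task index from command line: {task_idx}")
```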
User description
Adds a script that shows how to list env presets, shows all the presets available, and gets at the idea of indexing within those presets for specific tasks. Also highlights the new VMax dataset 🚀
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Call graph of the new and modified functions:
- main calls discover_presets to list presets using ares.info, show_featured_presets to display the curated preset list, prompt_preset_selection to interactively choose a preset, prompt_task_selection to select a task index interactively, and run_evaluation to execute the chosen preset evaluation.
- prompt_preset_selection calls list_all_presets to show the full list of presets.
- run_evaluation uses print_step, which now prints a bash command preview when present.
All of these functions are added by this PR except print_step, which is modified.

Introduces an interactive example script that demonstrates how to discover and evaluate environment presets within the ARES framework. Enhances the visibility of available datasets, including the new VMax dataset, and improves the formatting of environment information.
Improves `EnvironmentInfo` string formatting in `registry.py` and enhances `print_step` in `utils.py` to better display bash commands from LLM actions.
Adds `06_discover_and_evaluate.py` to provide an interactive CLI for listing presets, selecting tasks, and running evaluations using `ChatCompletionCompatibleLLMClient`.