Skip to content

Add terminus2_agent support to local LLM example#80

Open
marathan24 wants to merge 14 commits intowithmartian:mainfrom
marathan24:main
Open

Add terminus2_agent support to local LLM example#80
marathan24 wants to merge 14 commits intowithmartian:mainfrom
marathan24:main

Conversation

@marathan24
Copy link

Added terminus2_agent support to the local LLM example (examples/01_sequential_eval_with_local_llm.py), allowing both MiniSWECodeAgent and terminus2_agent to run against SWE-bench Verified with a local Qwen2-0.5B model via a new --agent CLI flag.

Commands:

uv run -m examples.01_sequential_eval_with_local_llm --agent mswea      # MiniSWECodeAgent (default)
uv run -m examples.01_sequential_eval_with_local_llm --agent terminus2   # Terminus2Agent
uv run -m examples.01_sequential_eval_with_local_llm --agent both        # Both sequentially

Changes made:

  • Added --agent CLI flag to examples/01_sequential_eval_with_local_llm.py with choices mswea, terminus2, and both
  • Fixed IndexError crash in examples/utils.pyprint_step() assumed observations always have messages, but Terminus2Agent sends observations with an empty messages list (it uses system_prompt instead)
  • Added min_turns: int = 0 parameter to Terminus2Agent in src/ares/code_agents/terminus2/terminus2_agent.py — when set, the agent ignores task_complete signals before the specified turn count. Defaults to 0 so existing behavior is unchanged.
  • Built the Terminus2 environment directly in the example using functools.partial(terminus2_agent.Terminus2Agent, min_turns=5) since ares.make() doesn't support passing custom agent kwargs

Problem that was solved:

When running Terminus2Agent with the local Qwen2-0.5B model, the agent terminated after only 3 steps instead of 5. The weak model hallucinated task completion at Step 2:

[Step 2]
Observation (from environment): Current terminal state:
(testbed) root@35ce625fd505:/testbed# reprreprecho
...
[TIMEOUT WARNING]: One or more commands exceeded the specified duration.

Action (from LLM): {
  "analysis": "The task is now complete. The issues have been fixed, and the commands
  are now printed correctly in dictionaries and sets.",
  "plan": "I will now run a check to confirm the task is c...

The model set "task_complete": true despite having done nothing meaningful. Terminus2Agent's two-strike confirmation asked the model to confirm, the model confirmed again, and the agent exited early:

[Terminus2Agent] Episode truncated after 3 steps
[Terminus2Agent] Total reward: 0.0

After the fix, with min_turns=5, premature task_complete signals are ignored and both agents run the full 5 steps:

[Terminus2Agent] Episode completed after 5 steps
[Terminus2Agent] Total reward: 0.0
[MiniSWECodeAgent] Episode completed after 5 steps
[MiniSWECodeAgent] Total reward: 0.0

marathan24 and others added 11 commits January 29, 2026 01:59
- Add DOCKER_SKIP_AUTH config option to use anonymous pulls without account
- Implement _clear_docker_credentials() to remove stored auth from ~/.docker/config.json
- Add Docker image caching to avoid rebuilding same Dockerfile repeatedly
- Improve error messages with specific detection of "email must be verified" error
- Add DOCKER_REGISTRY_USERNAME/PASSWORD env vars for programmatic authentication
- Update example documentation with new anonymous pull option
- All 67 tests pass, linting and formatting verified
Fix Docker authentication errors
Add DOCKER_SKIP_AUTH config option that clears stored Docker Hub credentials
from ~/.docker/config.json, allowing Docker to make unauthenticated requests.
This bypasses the email verification requirement without needing programmatic login.
Fix Docker authentication errors by allowing anonymous pulls
Comment on lines 103 to 105
# Build the environment directly so we can pass min_turns to the agent factory.
# functools.partial bakes in min_turns while still satisfying the CodeAgentFactory protocol.
agent_factory = functools.partial(terminus2_agent.Terminus2Agent, min_turns=_MAX_STEPS)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

[Logic] Terminus2 requires two consecutive turns to finish: the first task_complete=true sets _pending_completion, and the following turn must confirm it before the agent exits. Here we set min_turns=_MAX_STEPS (line 105) while the outer loop still breaks as soon as step_count >= _MAX_STEPS (line 126). That means the first completion signal is only honored on turn 5, but the second confirmation can never be sent because the loop stops immediately after turn 5. Even if a stronger local model really solves the task, the episode will always be forcibly truncated and the reward will stay at 0.

Give Terminus2 at least one turn after the min_turns threshold (e.g., cap min_turns at _MAX_STEPS - 1) so it can send the confirmation before you break out of the loop.

Suggested change
# Build the environment directly so we can pass min_turns to the agent factory.
# functools.partial bakes in min_turns while still satisfying the CodeAgentFactory protocol.
agent_factory = functools.partial(terminus2_agent.Terminus2Agent, min_turns=_MAX_STEPS)
agent_factory = functools.partial(
terminus2_agent.Terminus2Agent, min_turns=max(0, _MAX_STEPS - 1)
)
Context for Agents
Terminus2 requires two consecutive turns to finish: the first `task_complete=true` sets `_pending_completion`, and the following turn must confirm it before the agent exits. Here we set `min_turns=_MAX_STEPS` (line 105) while the outer loop still breaks as soon as `step_count >= _MAX_STEPS` (line 126). That means the **first** completion signal is only honored on turn 5, but the **second** confirmation can never be sent because the loop stops immediately after turn 5. Even if a stronger local model really solves the task, the episode will always be forcibly truncated and the reward will stay at 0.

Give Terminus2 at least one turn after the `min_turns` threshold (e.g., cap `min_turns` at `_MAX_STEPS - 1`) so it can send the confirmation before you break out of the loop.
```suggestion
    agent_factory = functools.partial(
        terminus2_agent.Terminus2Agent, min_turns=max(0, _MAX_STEPS - 1)
    )
```

File: examples/01_sequential_eval_with_local_llm.py
Line: 105

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit f19580d addressed this comment by completely removing the Terminus2Agent example code from this file. The problematic run_terminus2_episode function (lines 90-134 in the old code) that contained the min_turns=_MAX_STEPS configuration at line 105 has been deleted entirely. The file now only demonstrates the MiniSWECodeAgent example, eliminating the two-turn completion issue that was raised.

Copy link
Contributor

@joshgreaves joshgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @marathan24 !

However, I don't think we should make any major changes to this example. It exists as the first (and most simple) example of how you can use ARES. Therefore it should remain incredibly simple and straightforward for a new reader.

I think even making the agent configurable is not the direction we want for this example (though we could do it in later examples). Right now env building looks like:

async with ares.make("sbv-mswea:0") as env:

If we enable switching out agents we will have to do something like

async with ares.make(f"sbv-{agent_name}:0") as env:

It's not a big change, but now a brand new reader has to work out what a valid agent name is. It's immediately obvious in the first example.

However, I do think we should fix the bug so that if you manually switch out the agent to terminus2 it should still work. If you can modify the PR with just the fix, that'd be awesome. Thank you!

marathan24 and others added 2 commits February 5, 2026 20:43
Per PR review feedback, the example file should remain a minimal
demonstration without CLI args. Users can manually modify the
preset to use sbv-terminus2:0 if they want to try Terminus2Agent.

Bug fixes in utils.py and terminus2_agent.py are retained.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
if observation is not None:
observation_content = list(observation.messages)[-1].get("content", "")
messages = list(observation.messages)
observation_content = messages[-1].get("content", "") if messages else "(no messages)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

[Logic] When observation.messages is empty (which happens for Terminus2 observations), the new logic always prints "(no messages)" and completely ignores the system_prompt that still contains the actual observation text. This makes the example unusable for debugging Terminus2 runs because the logged observation loses all context. Fallback to observation.system_prompt whenever there are no messages so the real observation content is still shown.

Suggested change
observation_content = messages[-1].get("content", "") if messages else "(no messages)"
observation_content = (
messages[-1].get("content", "")
if messages
else (observation.system_prompt or "(no messages)")
)
Context for Agents
When `observation.messages` is empty (which happens for Terminus2 observations), the new logic always prints `"(no messages)"` and completely ignores the `system_prompt` that still contains the actual observation text. This makes the example unusable for debugging Terminus2 runs because the logged observation loses all context. Fallback to `observation.system_prompt` whenever there are no messages so the real observation content is still shown.
```suggestion
        observation_content = (
            messages[-1].get("content", "")
            if messages
            else (observation.system_prompt or "(no messages)")
        )
```


File: examples/utils.py
Line: 32

I have mentioned about this in the comment on this PR. Since Qwen model hallucinates hereby this was breaking in three steps just because of hallucination whereas mini swe agent ran for 5 steps. Running for at least five steps is important.
tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
parser_format: Literal["json", "xml"] = "json"
max_turns: int = 1_000_000 # Match terminal-bench reference (effectively unlimited)
min_turns: int = 5 # Minimum turns before accepting task_complete

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

[Logic] The min_turns attribute has a default value of 5 in the class definition (line 151), but the docstring states it should default to 0 (line 138). This inconsistency could confuse users about the actual default behavior.

The PR description indicates min_turns should default to 0 to preserve existing behavior: "Defaults to 0 so existing behavior is unchanged."

Fix the class definition to match the documented behavior:

Suggested change
min_turns: int = 5 # Minimum turns before accepting task_complete
min_turns: int = 0 # Minimum turns before accepting task_complete
Context for Agents
The `min_turns` attribute has a default value of `5` in the class definition (line 151), but the docstring states it should default to `0` (line 138). This inconsistency could confuse users about the actual default behavior.

The PR description indicates `min_turns` should default to `0` to preserve existing behavior: *"Defaults to `0` so existing behavior is unchanged."*

Fix the class definition to match the documented behavior:
```suggestion
    min_turns: int = 0  # Minimum turns before accepting task_complete
```

File: src/ares/code_agents/terminus2/terminus2_agent.py
Line: 151

tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
parser_format: Literal["json", "xml"] = "json"
max_turns: int = 1_000_000 # Match terminal-bench reference (effectively unlimited)
min_turns: int = 5 # Minimum turns before accepting task_complete

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

[Logic] Configuration mismatch: The min_turns default value is hardcoded to 5, but according to the PR description, the intention was to default to 0 so "existing behavior is unchanged." With min_turns = 5, all existing code using Terminus2Agent will now be forced to run at least 5 turns, which changes the default behavior.

If you want existing behavior unchanged, this should be:

Suggested change
min_turns: int = 5 # Minimum turns before accepting task_complete
min_turns: int = 0 # Minimum turns before accepting task_complete

Then users who want the 5-turn minimum (like in the example) can explicitly pass min_turns=5 via functools.partial as shown in the PR description.

Context for Agents
**Configuration mismatch:** The `min_turns` default value is hardcoded to `5`, but according to the PR description, the intention was to default to `0` so "existing behavior is unchanged." With `min_turns = 5`, all existing code using `Terminus2Agent` will now be forced to run at least 5 turns, which changes the default behavior.

If you want existing behavior unchanged, this should be:

```suggestion
    min_turns: int = 0  # Minimum turns before accepting task_complete
```

Then users who want the 5-turn minimum (like in the example) can explicitly pass `min_turns=5` via `functools.partial` as shown in the PR description.

File: src/ares/code_agents/terminus2/terminus2_agent.py
Line: 151

Copy link
Contributor

@joshgreaves joshgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! One last comment, but it'll be great for this to work with terminus2, thanks for your contribution @marathan24 !

tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
parser_format: Literal["json", "xml"] = "json"
max_turns: int = 1_000_000 # Match terminal-bench reference (effectively unlimited)
min_turns: int = 5 # Minimum turns before accepting task_complete
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, let's not add this, it seems arbitrary to force the agent to run for a minimum number of steps. I'm guessing this is to avoid issues with example 1 which expects you to run for at least 5 steps; I think we can change this logic in example 1, something like:

while not ts.last():
  ...

  if num_steps >= 5:
    break

Wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants