Add terminus2_agent support to local LLM example by marathan24 · Pull Request #80 · withmartian/ares

marathan24 · 2026-01-31T12:33:17Z

Added terminus2_agent support to the local LLM example (examples/01_sequential_eval_with_local_llm.py), allowing both MiniSWECodeAgent and terminus2_agent to run against SWE-bench Verified with a local Qwen2-0.5B model via a new --agent CLI flag.

Commands:

uv run -m examples.01_sequential_eval_with_local_llm --agent mswea      # MiniSWECodeAgent (default)
uv run -m examples.01_sequential_eval_with_local_llm --agent terminus2   # Terminus2Agent
uv run -m examples.01_sequential_eval_with_local_llm --agent both        # Both sequentially

Changes made:

- Added a --agent CLI flag to examples/01_sequential_eval_with_local_llm.py with choices mswea, terminus2, and both.
- Fixed an IndexError crash in examples/utils.py: print_step() assumed observations always have messages, but Terminus2Agent sends observations with an empty messages list (it uses system_prompt instead).
- Added a min_turns: int = 0 parameter to Terminus2Agent in src/ares/code_agents/terminus2/terminus2_agent.py. When set, the agent ignores task_complete signals before the specified turn count. It defaults to 0 so existing behavior is unchanged.
- Built the Terminus2 environment directly in the example using functools.partial(terminus2_agent.Terminus2Agent, min_turns=5), since ares.make() doesn't support passing custom agent kwargs.
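
As a rough illustration of the --agent flag described above, the sketch below uses argparse with the three choices from the PR description. The helper name parse_agents and the _AGENT_CHOICES mapping are hypothetical and are not taken from the PR's code; only the flag name, its choices, and its default come from the description.

```python
import argparse
from typing import List, Optional

# Hypothetical mapping from the CLI choice to the agents that would be run, in order.
_AGENT_CHOICES = {
    "mswea": ["mswea"],
    "terminus2": ["terminus2"],
    "both": ["mswea", "terminus2"],
}


def parse_agents(argv: Optional[List[str]] = None) -> List[str]:
    """Return the agent names selected by --agent, in run order."""
    parser = argparse.ArgumentParser(description="Sequential eval with a local LLM")
    parser.add_argument(
        "--agent",
        choices=sorted(_AGENT_CHOICES),
        default="mswea",
        help="Which code agent to run (default: mswea).",
    )
    args = parser.parse_args(argv)
    return _AGENT_CHOICES[args.agent]
```

Calling parse_agents(["--agent", "both"]) would yield both agent names, so the caller can loop over them sequentially; an unknown value is rejected by argparse before any environment is built.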

Problem that was solved:

When running Terminus2Agent with the local Qwen2-0.5B model, the agent terminated after only 3 steps instead of 5. The weak model hallucinated task completion at Step 2:

[Step 2]
Observation (from environment): Current terminal state:
(testbed) root@35ce625fd505:/testbed# reprreprecho
...
[TIMEOUT WARNING]: One or more commands exceeded the specified duration.

Action (from LLM): {
  "analysis": "The task is now complete. The issues have been fixed, and the commands
  are now printed correctly in dictionaries and sets.",
  "plan": "I will now run a check to confirm the task is c...

The model set "task_complete": true despite having done nothing meaningful. Terminus2Agent's two-strike confirmation asked the model to confirm, the model confirmed again, and the agent exited early:

[Terminus2Agent] Episode truncated after 3 steps
[Terminus2Agent] Total reward: 0.0

After the fix, with min_turns=5, premature task_complete signals are ignored and both agents run the full 5 steps:

[Terminus2Agent] Episode completed after 5 steps
[Terminus2Agent] Total reward: 0.0

[MiniSWECodeAgent] Episode completed after 5 steps
[MiniSWECodeAgent] Total reward: 0.0

- Add DOCKER_SKIP_AUTH config option to use anonymous pulls without account
- Implement _clear_docker_credentials() to remove stored auth from ~/.docker/config.json
- Add Docker image caching to avoid rebuilding same Dockerfile repeatedly
- Improve error messages with specific detection of "email must be verified" error
- Add DOCKER_REGISTRY_USERNAME/PASSWORD env vars for programmatic authentication
- Update example documentation with new anonymous pull option
- All 67 tests pass, linting and formatting verified

Fix Docker authentication errors

Add DOCKER_SKIP_AUTH config option that clears stored Docker Hub credentials from ~/.docker/config.json, allowing Docker to make unauthenticated requests. This bypasses the email verification requirement without needing programmatic login.
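
A minimal sketch of what clearing stored credentials from a Docker client config file might look like. The function name clear_docker_credentials and its return value are assumptions for illustration, not the commit's actual _clear_docker_credentials(). It only empties the "auths" section of the JSON file and does not touch credential helpers, which a real implementation may need to handle. The demo runs against a temporary file rather than ~/.docker/config.json.

```python
import json
import tempfile
from pathlib import Path


def clear_docker_credentials(config_path: Path) -> bool:
    """Empty the "auths" section of a Docker client config; return True if changed."""
    if not config_path.exists():
        return False
    config = json.loads(config_path.read_text())
    # "auths" holds per-registry credentials in the Docker CLI config format.
    if not config.get("auths"):
        return False
    config["auths"] = {}
    config_path.write_text(json.dumps(config, indent=2))
    return True


# Demo on a throwaway file so no real credentials are touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"auths": {"https://index.docker.io/v1/": {"auth": "PLACEHOLDER"}}})
    )
    changed = clear_docker_credentials(demo_path)
    remaining = json.loads(demo_path.read_text())
```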

Fix Docker authentication errors by allowing anonymous pulls

propel-code-bot · 2026-01-31T12:37:22Z

examples/01_sequential_eval_with_local_llm.py

+    # Build the environment directly so we can pass min_turns to the agent factory.
+    # functools.partial bakes in min_turns while still satisfying the CodeAgentFactory protocol.
+    agent_factory = functools.partial(terminus2_agent.Terminus2Agent, min_turns=_MAX_STEPS)


[Logic] Terminus2 requires two consecutive turns to finish: the first task_complete=true sets _pending_completion, and the following turn must confirm it before the agent exits. Here we set min_turns=_MAX_STEPS (line 105) while the outer loop still breaks as soon as step_count >= _MAX_STEPS (line 126). That means the first completion signal is only honored on turn 5, but the second confirmation can never be sent because the loop stops immediately after turn 5. Even if a stronger local model really solves the task, the episode will always be forcibly truncated and the reward will stay at 0.

Give Terminus2 at least one turn after the min_turns threshold (e.g., cap min_turns at _MAX_STEPS - 1) so it can send the confirmation before you break out of the loop.

Suggested change

     # Build the environment directly so we can pass min_turns to the agent factory.
     # functools.partial bakes in min_turns while still satisfying the CodeAgentFactory protocol.
-    agent_factory = functools.partial(terminus2_agent.Terminus2Agent, min_turns=_MAX_STEPS)
+    agent_factory = functools.partial(
+        terminus2_agent.Terminus2Agent, min_turns=max(0, _MAX_STEPS - 1)
+    )
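
To make the interaction between min_turns and the step cap concrete, here is a toy simulation. ToyAgent is a stand-in written for this illustration and is not the real Terminus2Agent; it assumes a completion signal is honored once the turn number reaches min_turns and that a second consecutive signal confirms it, as the comment above describes. The simulated model claims completion on every turn.

```python
import dataclasses
import functools
from typing import Callable, Tuple


@dataclasses.dataclass
class ToyAgent:
    """Stand-in for an agent with two-turn completion (not the real Terminus2Agent)."""

    min_turns: int = 0
    turn: int = 0
    pending_completion: bool = False

    def step(self, task_complete: bool) -> bool:
        """Advance one turn; return True once completion has been confirmed."""
        self.turn += 1
        if not task_complete or self.turn < self.min_turns:
            # Signal absent, or too early to be honored.
            self.pending_completion = False
            return False
        if self.pending_completion:
            return True  # Second consecutive signal: confirmed.
        self.pending_completion = True  # First signal: wait for confirmation.
        return False


def run_episode(agent_factory: Callable[[], ToyAgent], max_steps: int) -> Tuple[int, bool]:
    """Run a model that always claims completion; return (steps, finished)."""
    agent = agent_factory()
    for step_count in range(1, max_steps + 1):
        if agent.step(task_complete=True):
            return step_count, True
    return max_steps, False  # Loop cap reached before confirmation.


MAX_STEPS = 5
results = {
    n: run_episode(functools.partial(ToyAgent, min_turns=n), MAX_STEPS)
    for n in (0, MAX_STEPS - 1, MAX_STEPS)
}
```

Under these assumptions, min_turns equal to the step cap leaves no turn for the confirmation, so the episode is always truncated, while MAX_STEPS - 1 lets the confirmation land on the final turn.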

Commit f19580d addressed this comment by completely removing the Terminus2Agent example code from this file. The problematic run_terminus2_episode function (lines 90-134 in the old code) that contained the min_turns=_MAX_STEPS configuration at line 105 has been deleted entirely. The file now only demonstrates the MiniSWECodeAgent example, eliminating the two-turn completion issue that was raised.

joshgreaves

Thanks for this PR @marathan24 !

However, I don't think we should make any major changes to this example. It exists as the first (and most simple) example of how you can use ARES. Therefore it should remain incredibly simple and straightforward for a new reader.

I think even making the agent configurable is not the direction we want for this example (though we could do it in later examples). Right now env building looks like:

async with ares.make("sbv-mswea:0") as env:

If we enable switching out agents we will have to do something like

async with ares.make(f"sbv-{agent_name}:0") as env:

It's not a big change, but now a brand new reader has to work out what a valid agent name is. It's immediately obvious in the first example.

However, I do think we should fix the bug so that if you manually switch out the agent to terminus2 it should still work. If you can modify the PR with just the fix, that'd be awesome. Thank you!

Per PR review feedback, the example file should remain a minimal demonstration without CLI args. Users can manually modify the preset to use sbv-terminus2:0 if they want to try Terminus2Agent. Bug fixes in utils.py and terminus2_agent.py are retained. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

propel-code-bot · 2026-02-05T15:52:56Z

examples/utils.py

    if observation is not None:
-        observation_content = list(observation.messages)[-1].get("content", "")
+        messages = list(observation.messages)
+        observation_content = messages[-1].get("content", "") if messages else "(no messages)"


[Logic] When observation.messages is empty (which happens for Terminus2 observations), the new logic always prints "(no messages)" and completely ignores the system_prompt that still contains the actual observation text. This makes the example unusable for debugging Terminus2 runs because the logged observation loses all context. Fall back to observation.system_prompt whenever there are no messages so the real observation content is still shown.

Suggested change

-        observation_content = messages[-1].get("content", "") if messages else "(no messages)"
+        observation_content = (
+            messages[-1].get("content", "")
+            if messages
+            else (observation.system_prompt or "(no messages)")
+        )
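
The fallback suggested above can be exercised in isolation with a small stand-in. ToyObservation is a hypothetical type that carries only the two fields discussed here (messages and system_prompt); the real observation class in ARES may differ, and observation_text is an illustrative helper rather than the actual print_step().

```python
import dataclasses
from typing import Any, Dict, List, Optional


@dataclasses.dataclass
class ToyObservation:
    """Stand-in with only the two fields discussed above (hypothetical shape)."""

    messages: List[Dict[str, Any]] = dataclasses.field(default_factory=list)
    system_prompt: Optional[str] = None


def observation_text(observation: ToyObservation) -> str:
    """Pick the text to log: last message, else system prompt, else a placeholder."""
    messages = list(observation.messages)
    if messages:
        return messages[-1].get("content", "")
    # An empty messages list would raise IndexError on messages[-1], so fall back.
    return observation.system_prompt or "(no messages)"
```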

I mentioned this in my comment on this PR. Because the Qwen model hallucinates task completion, the Terminus2 run was ending after three steps, whereas the mini SWE agent ran for five. Running for at least five steps is important.

propel-code-bot · 2026-02-05T16:27:19Z

src/ares/code_agents/terminus2/terminus2_agent.py

    tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
    parser_format: Literal["json", "xml"] = "json"
    max_turns: int = 1_000_000  # Match terminal-bench reference (effectively unlimited)
+    min_turns: int = 5  # Minimum turns before accepting task_complete


[Logic] The min_turns attribute has a default value of 5 in the class definition (line 151), but the docstring states it should default to 0 (line 138). This inconsistency could confuse users about the actual default behavior.

The PR description indicates min_turns should default to 0 to preserve existing behavior: "Defaults to 0 so existing behavior is unchanged."

Fix the class definition to match the documented behavior:

Suggested change

-    min_turns: int = 5  # Minimum turns before accepting task_complete
+    min_turns: int = 0  # Minimum turns before accepting task_complete

propel-code-bot · 2026-02-05T16:27:20Z

src/ares/code_agents/terminus2/terminus2_agent.py

    tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
    parser_format: Literal["json", "xml"] = "json"
    max_turns: int = 1_000_000  # Match terminal-bench reference (effectively unlimited)
+    min_turns: int = 5  # Minimum turns before accepting task_complete


[Logic] Configuration mismatch: The min_turns default value is hardcoded to 5, but according to the PR description, the intention was to default to 0 so "existing behavior is unchanged." With min_turns = 5, all existing code using Terminus2Agent will now be forced to run at least 5 turns, which changes the default behavior.

If you want existing behavior unchanged, this should be:

Suggested change

-    min_turns: int = 5  # Minimum turns before accepting task_complete
+    min_turns: int = 0  # Minimum turns before accepting task_complete

Then users who want the 5-turn minimum (like in the example) can explicitly pass min_turns=5 via functools.partial as shown in the PR description.

joshgreaves

Thanks! One last comment, but it'll be great for this to work with terminus2, thanks for your contribution @marathan24 !

joshgreaves · 2026-02-07T00:53:56Z

src/ares/code_agents/terminus2/terminus2_agent.py

    tracker: stat_tracker.StatTracker = dataclasses.field(default_factory=stat_tracker.NullStatTracker)
    parser_format: Literal["json", "xml"] = "json"
    max_turns: int = 1_000_000  # Match terminal-bench reference (effectively unlimited)
+    min_turns: int = 5  # Minimum turns before accepting task_complete


Hm, let's not add this, it seems arbitrary to force the agent to run for a minimum number of steps. I'm guessing this is to avoid issues with example 1 which expects you to run for at least 5 steps; I think we can change this logic in example 1, something like:

    while not ts.last():
        ...
        if num_steps >= 5:
            break

Wdyt?
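
A self-contained sketch of the loop shape proposed above: the example stops on whichever comes first, the environment's own end of episode or a fixed step cap, so the agent needs no minimum-turn setting. ToyTimeStep and ToyEnv are simplified synchronous stand-ins invented for this illustration; the real ARES environment is used via async with and takes actions, which this sketch omits.

```python
import dataclasses


@dataclasses.dataclass
class ToyTimeStep:
    """Stand-in timestep exposing only last()."""

    done: bool

    def last(self) -> bool:
        return self.done


class ToyEnv:
    """Stand-in environment that ends the episode after `episode_length` steps."""

    def __init__(self, episode_length: int) -> None:
        self._episode_length = episode_length
        self._steps = 0

    def reset(self) -> ToyTimeStep:
        self._steps = 0
        return ToyTimeStep(done=False)

    def step(self) -> ToyTimeStep:
        self._steps += 1
        return ToyTimeStep(done=self._steps >= self._episode_length)


def run_capped(env: ToyEnv, max_steps: int = 5) -> int:
    """Step until the episode ends or max_steps is reached; return steps taken."""
    ts = env.reset()
    num_steps = 0
    while not ts.last():
        ts = env.step()
        num_steps += 1
        if num_steps >= max_steps:
            break  # Cap enforced by the example, not by the agent.
    return num_steps
```

With this shape, an agent that finishes early simply yields a shorter episode, and a long-running one is cut off at the cap.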

marathan24 and others added 11 commits January 29, 2026 01:59

Merge pull request #1 from marathan24/docker-issue

b2d4996

Fix Docker authentication errors

Merge pull request #2 from marathan24/docker-issue

138201a

Fix Docker authentication errors by allowing anonymous pulls

Merge branch 'withmartian:main' into main

3125b58

Merge branch 'withmartian:main' into main

fa26b87

Merge branch 'withmartian:main' into main

f3a2b3a

Terminus agent can be used to test locally and run environment

fbf9e95

ruff passed

dee4fdb

Terminus failing before completing five steps

0e32c8e

past changes

6c35d9b

propel-code-bot bot reviewed Jan 31, 2026

joshgreaves requested changes Feb 3, 2026

marathan24 and others added 2 commits February 5, 2026 20:43

Changed back to pending completion

5f5a5d5

propel-code-bot bot reviewed Feb 5, 2026

Setting minimum number of turns

7fe61fe

propel-code-bot bot reviewed Feb 5, 2026

joshgreaves reviewed Feb 7, 2026

marathan24 wants to merge 14 commits into withmartian:main from marathan24:main
