feat(swe-bench): add an iterative predictor with ～70% success rate #1414

Jerryguan777 · 2026-01-15T02:26:28Z

Implemented an iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. Achieved 70% success rate (7/10) in initial evaluation.

Add IterativeAgent
Add config_iterative.yml
Add git tools
Add SweBenchPredictorIterativeConfig
Register iterative predictor and git tool
Update README.md

How Has This Been Tested?

export ANTHROPIC_API_KEY=sk-xxxxxx

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml

Running 10 instances...
10 ran successfully, 0 failed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [04:30<00:00, 27.07s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 10
Instances submitted: 10
Instances completed: 10
Instances incomplete: 0
Instances resolved: 7
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 2
Report written to nv_predictor.nat_iterative_1.json
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:194 - Completed swe_bench run nat_iterative_1
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:200 - SWE_bench report and logs written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_reports directory
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:275 - Profiler is not enabled. Skipping profiling.
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:366 - Original config file copied to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_original.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:384 - Effective config (with overrides) saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_effective.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:427 - Configuration metadata saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_metadata.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:449 - Workflow output written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/workflow_output.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:460 - Evaluation results written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_output.json

=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 480.70s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |         0.7 | swe_bench_output.json |

Description

Closes #1397

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

New Features
- Added an iterative LLM-driven predictor for SWE-bench with multi-step planning, configurable step limits/timeouts, per-step observability, robust command validation, and an integrated git-repo workspace tool for isolated setup, checkout, and cleanup.
Documentation
- README and example workflow updated with benchmark context, evaluation guidance, metrics framing, harder benchmarks, and concrete onboarding steps for adding new predictors.
Tests
- Extensive test suite validating iterative flows, safety checks, error recovery, timeouts, workspace isolation, and integrations.

copy-pr-bot · 2026-01-15T02:26:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-01-15T02:26:44Z

Walkthrough

Adds an Iterative SWE-bench predictor and supporting tooling: an LLM-driven IterativeAgent that proposes and executes one bash command per step in isolated git workspaces, iterates with observations until completion or limits, and returns a git-diff patch; includes config, tool registration, example YAML, and tests.

Changes

Cohort / File(s)	Summary
Documentation `examples/evaluation_and_profiling/swe_bench/README.md`	Documented new `iterative` predictor, benchmark context, evaluation guidance, and expanded instructions for adding predictors.
Config schema `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py`	Added `SweBenchPredictorIterativeConfig` (`name="iterative"`) with `llm_name`, `step_limit`, `timeout`; extended predictor union and docstrings.
Config example `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`	New YAML: LLM backends, iterative predictor settings (`step_limit`, `timeout`), `git_repo_tool` registration, dataset filter, evaluator/run settings, concurrency, and output paths.
Iterative predictor core `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`	New `IterativeAgent`, `IterativeAgentConfig`, exception hierarchy, LLM-driven one-command-per-response loop, command extraction/validation, async execution with timeout/truncation, per-step observations/logging, and `SweBenchPredictor` registered as `"iterative"`.
Shell validation `.../predict_iterative/shell_validation.py`	New command-safety module with compiled dangerous/bypass regex lists and `validate_command(command) -> (bool, message)`.
Git workspace tool `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`	Added `RepoContext` dataclass and `RepoManager` with async `setup_repository` (clone/checkout) and `cleanup`, plus async clone/checkout helpers and per-instance workspace path logic.
Tool registration `.../predict_iterative/tools/register.py`	Registered `git_repo_tool` with `GitRepoToolConfig`, exposing async JSON `setup`/`cleanup` operations backed by `RepoManager`, and yielding `FunctionInfo`.
Predictor & tool wiring `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py`, `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py`	Imported/exposed the iterative predictor and `git_repo_tool` into predictor and tool registries.
Package initializers `examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`, `.../tools/__init__.py`	Added package initializers with SPDX/Apache-2.0 license headers.
Tests `examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`	Comprehensive new test suite covering command validation, iterative flows, recovery from malformed outputs, timeouts, workspace isolation, repo operations, and end-to-end mocked LLM interactions.

Sequence Diagram(s)

sequenceDiagram
    participant Client as SWE-Bench Client
    participant Predictor as SweBenchPredictor
    participant Repo as RepoManager
    participant Agent as IterativeAgent
    participant LLM as LLM Backend
    participant Executor as Command Executor

    Client->>Predictor: predict_fn(swebench_input)
    Predictor->>Repo: setup_repository(repo_url, base_commit, instance_id)
    Repo-->>Predictor: RepoContext
    Predictor->>Agent: instantiate + run(task_description)

    loop until COMPLETE or limits
        Agent->>LLM: query(prompt/messages)
        LLM-->>Agent: response (one bash code block)
        Agent->>Executor: execute(bash_command, timeout)
        Executor->>Repo: run command in workspace
        Repo-->>Executor: stdout/stderr/return_code
        Executor-->>Agent: observation (may be truncated)
        Agent->>Agent: add_message(assistant,response)
        Agent->>Agent: add_message(user,observation)
        Agent->>Agent: check for COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
    end

    alt Completed
        Agent-->>Predictor: (patch, status)
    else Error/Timeout/Limits
        Agent-->>Predictor: (error_message, status)
    end

    Predictor->>Repo: cleanup()
    Predictor-->>Client: final patch or error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and concisely summarizes the main change: adding an iterative predictor to swe-bench with a reported 70% success rate, using imperative mood and staying within the 72-character guideline.
Linked Issues check	✅ Passed	The PR implements all core objectives from issue `#1397`: iterative agent with step-by-step execution [`#1397`], test-driven validation and error recovery [`#1397`], dynamic feedback loop replacing one-shot approach [`#1397`], and integration with SweBenchPredictorBase framework [`#1397`]. All critical coding requirements are satisfied.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing the iterative predictor: new IterativeAgent, git tools, configuration, tests, and shell validation. README updates document the new predictor and evaluation context. No unrelated features or scope creep detected.
Docstring Coverage	✅ Passed	Docstring coverage is 96.55% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 1-6: Add the standard SPDX Apache-2.0 license header as the very
first lines of the YAML (before the "llms" key); update the top of the file
containing the "llms" / "claude_sonnet_llm" entries to begin with the SPDX
Apache-2.0 header so the file complies with the repo policy.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 76-79: The async function clone_repository uses the synchronous
blocking call Repo.clone_from which will block the event loop; change the
implementation to run Repo.clone_from in a background thread (e.g., via
asyncio.to_thread) and await that result so the function remains async and
non-blocking. Locate the clone_repository function and replace the direct call
to Repo.clone_from(repo_url, target_path) with an awaited asyncio.to_thread call
(or equivalent executor) that invokes Repo.clone_from, and keep the logger.info
call as-is.
- Around line 82-85: The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 41-53: The git_operations function lacks input validation and
error handling: catch JSONDecodeError around json.loads(args_str) and return or
raise a clear error message, validate presence of 'operation' and for operation
== "setup" ensure required keys 'repo_url' and 'base_commit' exist before
calling repo_manager.setup_repository (raise ValueError or return a descriptive
error if missing), and wrap the repo_manager.setup_repository and
repo_manager.cleanup calls to catch and log exceptions so callers receive
actionable error messages referencing git_operations and
repo_manager.setup_repository/cleanup.

🧹 Nitpick comments (11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
24-25: Unused import: FunctionRef

FunctionRef is imported but not used in this file. Only LLMRef is used for the llm_name field.
🧹 Remove unused import
 from nat.data_models.common import TypedBaseModel
-from nat.data_models.component_ref import FunctionRef
 from nat.data_models.component_ref import LLMRef
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)
37-67: Add type hints per coding guidelines.

The class is missing type hints on __init__, active_repos, and cleanup(). Per coding guidelines, all public APIs require type hints.
📝 Add type hints
 class RepoManager:
+    active_repos: dict[str, RepoContext]
 
-    def __init__(self, workspace_dir: str):
+    def __init__(self, workspace_dir: str) -> None:
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
 
     # ... setup_repository unchanged ...
 
-    async def cleanup(self):
+    async def cleanup(self) -> None:
         """Clean up all managed repositories."""
25-34: Misleading docstring: not a context manager.

The docstring states "Context manager for repository operations" but RepoContext is a plain dataclass without __enter__/__exit__ methods. Consider updating the docstring to reflect its actual purpose as a data container.
📝 Fix docstring
 `@dataclass`
 class RepoContext:
-    """Context manager for repository operations."""
+    """Data container holding repository state and paths."""
     repo_url: str
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
25-29: Redundant _type field.

The _type field is redundant since TypedBaseModel (parent of FunctionBaseConfig) already manages the type discriminator via name="git_repo_tool". This creates potential confusion with two type fields.
🧹 Remove redundant field
 class GitRepoToolConfig(FunctionBaseConfig, name="git_repo_tool"):
     """Configuration for git repository management tool."""
-    _type: typing.Literal["git_repo_tool"] = "git_repo_tool"
     workspace_dir: str = "./.workspace"  # Base directory for cloning repositories
     cleanup_on_exit: bool = True  # Whether to clean up repos after use
32-60: Unused builder parameter is acceptable for interface consistency.

The builder parameter is unused (as flagged by static analysis) but is likely required by the register_function decorator's expected signature. The cleanup pattern using try/finally is well implemented.

Consider adding a return type hint for the async generator:
📝 Add return type hint
+from collections.abc import AsyncGenerator
+
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]:
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (6)
76-81: Consider adding docstrings for the configuration fields.

The dataclass lacks documentation for its fields. While the class docstring exists, individual field descriptions would improve clarity.
📝 Suggested improvement
 `@dataclass`
 class IterativeAgentConfig:
     """Configuration for the iterative agent."""
-    step_limit: int = 250
-    timeout: int = 60
-    max_output_length: int = 10000
+    step_limit: int = 250  # Maximum number of agent steps before termination
+    timeout: int = 60  # Command execution timeout in seconds
+    max_output_length: int = 10000  # Maximum characters before output truncation
105-110: Add type hint for llm parameter.

The llm parameter lacks a type annotation. Per coding guidelines, all public APIs require type hints on parameters.
📝 Suggested fix
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig):
         self.llm = llm
         self.repo_path = repo_path
         self.config = config
-        self.messages: list = []
+        self.messages: list[SystemMessage | HumanMessage | AIMessage] = []
         self.n_steps = 0
Note: Add import typing at the top if not already present. Ideally, use the actual LLM interface type if available from the framework.
360-363: Chain exception and use explicit conversion.

Per coding guidelines, use raise ... from err to preserve the exception chain and use explicit conversion flag instead of str(e).
🔧 Proposed fix
         except Exception as e:
             logger.error("LLM invocation failed: %s", e, exc_info=True)
-            # recoverable error, let the agent continue
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            # recoverable error, let the agent continue
+            raise NonTerminatingException(f"LLM call failed: {e!s}") from e
414-427: Chain exceptions and narrow the exception type.

Multiple issues flagged by static analysis:

Missing exception chaining at lines 425 and 427

Catching broad Exception at line 426 masks specific errors
🔧 Proposed fix
         except (TimeoutError, subprocess.TimeoutExpired) as e:
             # Extract output from exception if available (only subprocess.TimeoutExpired has output attribute)
             if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and e.output:
                 output = e.output.decode("utf-8", errors="replace")
             else:
                 output = ""
             # Format timeout message using template
             timeout_message = self._TIMEOUT_TEMPLATE.format(
                 action=command,
                 output=output
             )
-            raise ExecutionTimeoutError(timeout_message)
-        except Exception as e:
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise ExecutionTimeoutError(timeout_message) from e
+        except OSError as e:
+            raise NonTerminatingException(f"Error executing command: {e!s}") from e
Using OSError (or subprocess.SubprocessError) is more appropriate than catching all exceptions, as it covers typical subprocess failures without masking unexpected errors.
462-464: Remove redundant exception object from logger.exception.

When using logger.exception(), the exception info is automatically included. Including e as an argument is redundant (TRY401).
🔧 Proposed fix
         except Exception as e:
-            logger.exception("Failed to setup repository: %s", e)
-            return f"Error: Failed to setup repository - {str(e)}"
+            logger.exception("Failed to setup repository")
+            return f"Error: Failed to setup repository - {e!s}"
493-495: Remove redundant exception object and use explicit conversion.

Same pattern as above - logger.exception() automatically includes exception info.
🔧 Proposed fix
         except Exception as e:
-            logger.exception(f"Error processing {swebench_input.instance_id}: {e}")
-            return f"Error: {str(e)}"
+            logger.exception("Error processing %s", swebench_input.instance_id)
+            return f"Error: {e!s}"

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e15df42 and f5771ff.

📒 Files selected for processing (10)

examples/evaluation_and_profiling/swe_bench/README.md
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/__init__.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{md,mdx}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx}: Use 'NVIDIA NeMo Agent toolkit' for full name (first use), 'NeMo Agent toolkit' or 'the toolkit' for subsequent references, and 'Toolkit' (capital T) in titles/headings, 'toolkit' (lowercase t) in body text
Never use deprecated names: 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' in documentation; update any occurrences unless intentionally referring to deprecated versions or implementing compatibility layers

Files:

examples/evaluation_and_profiling/swe_bench/README.md

**/*.{md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx,rst}: Documentation must be clear, comprehensive, and free of TODOs, FIXMEs, placeholder text, offensive or outdated terms, and spelling mistakes
Do not use words listed in 'ci/vale/styles/config/vocabularies/nat/reject.txt' in documentation
Words listed in 'ci/vale/styles/config/vocabularies/nat/accept.txt' are acceptable even if they appear to be spelling mistakes

Files:

examples/evaluation_and_profiling/swe_bench/README.md

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

examples/evaluation_and_profiling/swe_bench/README.md
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

examples/evaluation_and_profiling/swe_bench/README.md
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
in that situation no return type hint is needed).
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").

Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

examples/evaluation_and_profiling/swe_bench/README.md
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

If an example contains Python code, it should be placed in a subdirectory named src/ and should
contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.

If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
be checked into git-lfs.

Files:

examples/evaluation_and_profiling/swe_bench/README.md
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

🧠 Learnings (1)

📚 Learning: 2025-12-12T20:49:44.305Z

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

🧬 Code graph analysis (5)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (2)

src/nat/data_models/common.py (3)

TypedBaseModel (96-171)

static_type (157-158)

discriminator (165-171)

src/nat/data_models/component_ref.py (2)

FunctionRef (94-102)

LLMRef (116-124)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (3)

src/nat/builder/function_info.py (2)

FunctionInfo (290-625)

from_fn (552-625)

src/nat/data_models/function.py (1)

FunctionBaseConfig (26-36)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)

RepoManager (37-67)

setup_repository (44-58)

cleanup (60-67)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

src/nat/runtime/runner.py (1)

context (93-94)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

SweBenchPredictor (431-502)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)

src/nat/builder/builder.py (1)

Builder (84-811)

src/nat/builder/framework_enum.py (1)

LLMFrameworkEnum (19-27)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

SweBenchWorkflowConfig (51-52)

🪛 Ruff (0.14.11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)

53-53: Avoid specifying long messages outside the exception class

(TRY003)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)

327-327: Avoid specifying long messages outside the exception class

(TRY003)

359-359: Consider moving this statement to an else block

(TRY300)

363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

363-363: Avoid specifying long messages outside the exception class

(TRY003)

363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)

379-379: subprocess call with shell=True identified, security issue

(S602)

412-412: Consider moving this statement to an else block

(TRY300)

425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

426-426: Do not catch blind exception: Exception

(BLE001)

427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

427-427: Avoid specifying long messages outside the exception class

(TRY003)

427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)

463-463: Redundant exception object included in logging.exception call

(TRY401)

464-464: Use explicit conversion flag

Replace with conversion flag

(RUF010)

494-494: Redundant exception object included in logging.exception call

(TRY401)

495-495: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🔇 Additional comments (11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

40-49: LGTM!

The SweBenchPredictorIterativeConfig follows the established pattern, with appropriate type hints and field descriptions. The discriminated union is correctly extended to include the new iterative variant.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

70-73: LGTM!

Simple helper with proper type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py (1)

19-19: LGTM!

The import correctly triggers registration of the git_repo_tool via its decorator, following the established pattern in this file.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)

20-20: The import follows the established pattern for predictor registration.

The addition of IterativePredictor on line 20 mirrors the existing GoldPredictor import on line 19, with proper flake8: noqa directives to allow unused imports (which are intentionally present to trigger registration side-effects). The file maintains proper Apache 2.0 licensing and copyright headers.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (4)

1-14: LGTM!

License header is correctly formatted with SPDX identifier and Apache 2.0 license text.

52-74: LGTM!

Well-structured exception hierarchy with clear separation between recoverable (NonTerminatingException) and terminal (TerminatingException) conditions. This enables clean control flow in the agent loop.

376-389: Acknowledged: shell=True is intentional for this agent design.

The static analysis flags shell=True as a security concern (S602). However, this is expected behavior since the agent is explicitly designed to execute arbitrary bash commands provided by the LLM. The commands are sandboxed within the repository workspace directory.

497-502: LGTM!

The _build_task_description method is clean and well-typed with appropriate type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (2)

23-35: LGTM!

Workflow and functions configuration is well-structured with appropriate iterative predictor settings and git tool configuration. The 2-space indentation follows repository conventions. Based on learnings, this matches the existing YAML style in the repository.

37-70: LGTM!

Evaluation configuration is properly structured with dataset filtering for focused testing. The single-instance allowlist is appropriate for initial development/testing.

examples/evaluation_and_profiling/swe_bench/README.md (1)

159-159: LGTM!

Documentation entry for the iterative predictor follows the established pattern and provides clear, concise description with appropriate links to implementation and configuration files.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

coderabbitai · 2026-01-15T02:32:12Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

+async def checkout_commit(repo: Repo, commit_hash: str):
+    """Checkout a specific commit in the repository."""
+    logger.info("Checking out commit %s", commit_hash)
+    repo.git.checkout(commit_hash)


⚠️ Potential issue | 🟠 Major

Same blocking I/O issue; also missing return type hint.

repo.git.checkout() is synchronous. Wrap with asyncio.to_thread(). Per coding guidelines, add return type hint.

🔧 Proposed fix

-async def checkout_commit(repo: Repo, commit_hash: str): +async def checkout_commit(repo: Repo, commit_hash: str) -> None: """Checkout a specific commit in the repository.""" logger.info("Checking out commit %s", commit_hash) - repo.git.checkout(commit_hash) + await asyncio.to_thread(repo.git.checkout, commit_hash)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py` around lines 82 - 85, The checkout_commit function performs blocking I/O by calling the synchronous repo.git.checkout; change checkout_commit to have an explicit return type hint (-> None) and call the blocking operation inside asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout, commit_hash)) so the checkout runs off the event loop; keep the logger.info call and docstring unchanged and reference the function name checkout_commit and the blocking call repo.git.checkout when making the change.

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py`:
- Around line 69-74: Docstring for SweBenchWorkflowConfig incorrectly lists a
"full" predictor option that doesn't exist; update the docstring to reflect the
actual predictor options defined in SweBenchPredictorConfig (e.g., "gold,
skeleton, or iterative") or enumerate the exact union variants from
SweBenchPredictorConfig so the docs and code are consistent; locate the class
SweBenchWorkflowConfig and its docstring and replace "full, gold, skeleton, or
iterative" with the correct set of predictor types from SweBenchPredictorConfig.
- Around line 24-25: Remove the unused import FunctionRef from the top of the
module: delete the "FunctionRef" import token in the import statement that
currently reads "from nat.data_models.component_ref import FunctionRef" so only
LLMRef remains imported (referenced symbol: FunctionRef).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 71-74: get_repo_path currently builds a path from only the repo
name causing collisions; update get_repo_path to parse the repo URL and extract
the owner/organization component (e.g., the segment immediately preceding the
repo name for HTTPS and the part after ":" for SSH forms) and return
Path(workspace_dir) / owner / repo_name so repositories with the same name under
different orgs are distinct; ensure you handle URLs like
"https://host/org/repo.git" and "git@host:org/repo.git" and strip ".git" from
repo_name.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 32-38: The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.

♻️ Duplicate comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
16-28: Duplicate llms key causes configuration to be overwritten.

The YAML has two separate llms: keys (lines 16 and 23). In YAML, duplicate keys at the same level cause the second to overwrite the first, meaning nim_llm will be silently discarded and only claude_sonnet_llm will be available.

Additionally, nim_llm uses 1-space indentation while claude_sonnet_llm uses 2-space indentation. Per learnings, the repository uses 2-space indentation for YAML files.
🔧 Proposed fix - merge into single llms block with consistent 2-space indentation
-llms:
- nim_llm:
-   _type: nim
-   model_name: mistralai/mistral-nemotron
-   temperature: 0.6
-   max_tokens: 4096    
-
-llms:
-  claude_sonnet_llm:
-    _type: litellm
-    model_name: anthropic/claude-sonnet-4-5-20250929
-    temperature: 0.0
-    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
+llms:
+  nim_llm:
+    _type: nim
+    model_name: mistralai/mistral-nemotron
+    temperature: 0.6
+    max_tokens: 4096
+
+  claude_sonnet_llm:
+    _type: litellm
+    model_name: anthropic/claude-sonnet-4-5-20250929
+    temperature: 0.0
+    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running

🧹 Nitpick comments (3)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)
84-88: Add return type hint for checkout_commit.

Per coding guidelines, all public APIs require type hints on return values. This function returns None implicitly.
🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
     logger.info("Checking out commit %s", commit_hash)
     # Use asyncio.to_thread to avoid blocking the event loop during checkout
     await asyncio.to_thread(repo.git.checkout, commit_hash)
38-43: Add type hint for active_repos dictionary.

Per coding guidelines, type hints are required. The dictionary maps repo paths to RepoContext objects.
🔧 Proposed fix
 class RepoManager:

     def __init__(self, workspace_dir: str):
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
80-82: Remove trailing blank lines.

YAMLlint reports too many blank lines at the end of the file. Files should end with a single newline.
🔧 Proposed fix
   evaluators:
     swe_bench:
       _type: swe_bench
       run_id: nat_iterative_1
       clean: true
-

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5771ff and a39f138.

📒 Files selected for processing (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧰 Additional context used

📓 Path-based instructions (6)

**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
in that situation no return type hint is needed).
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").

Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

If an example contains Python code, it should be placed in a subdirectory named src/ and should
contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.

If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
be checked into git-lfs.

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

🧠 Learnings (6)

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-03T18:42:23.494Z

Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-11-05T11:45:35.119Z

Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-12T20:49:44.305Z

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

🧬 Code graph analysis (3)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

src/nat/data_models/function.py (1)

FunctionBaseConfig (26-36)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)

RepoManager (38-68)

setup_repository (45-59)

cleanup (61-68)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)

src/nat/data_models/common.py (4)

BaseModelRegistryTag (91-93)

TypedBaseModel (96-171)

static_type (157-158)

discriminator (165-171)

src/nat/data_models/component_ref.py (2)

FunctionRef (94-102)

LLMRef (116-124)

src/nat/data_models/function.py (1)

FunctionBaseConfig (26-36)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

src/nat/runtime/runner.py (1)

context (93-94)

🪛 Ruff (0.14.11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)

45-45: Avoid specifying long messages outside the exception class

(TRY003)

51-51: Avoid specifying long messages outside the exception class

(TRY003)

59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)

[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (5)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

77-81: LGTM - blocking I/O properly handled.

The use of asyncio.to_thread correctly offloads the blocking Repo.clone_from call to a thread pool, preventing event loop blocking.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Input validation improvements look good.

The JSON parsing is now wrapped in a try-except, and required keys are validated before use. This addresses the past review feedback.

61-66: Good resource cleanup pattern with finally block.

The use of try/finally ensures repo_manager.cleanup() is called even if an exception occurs during the tool's lifecycle, preventing orphaned repository directories.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

51-61: LGTM - SweBenchPredictorIterativeConfig is well-structured.

The configuration properly uses LLMRef for the LLM reference, and Field with descriptions for step_limit and timeout. The defaults (250 steps, 60s timeout) are reasonable for iterative problem solving.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

1-14: LGTM - SPDX license header is present.

The required Apache-2.0 license header has been added, addressing the past review feedback.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

coderabbitai · 2026-01-15T06:12:12Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

+@register_function(config_type=GitRepoToolConfig)
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+    """Git repository management tool for SWE Bench."""
+    import json
+
+    from .git_tool import RepoManager
+    repo_manager = RepoManager(tool_config.workspace_dir)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, find and read the register.py file fd "register.py" -p "examples/evaluation_and_profiling/swe_bench" # Get the full content to understand context find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py" -type f -exec wc -l {} \;

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 450

🏁 Script executed:

# Search for all `@register_function` decorated functions to check builder parameter usage rg "@register_function" -A 3 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 50381

🏁 Script executed:

# Find the definition of register_function decorator rg "def register_function" -B 2 -A 10 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 6591

🏁 Script executed:

cat -n examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3277

Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is not used in this function. Follow the established pattern in the codebase by renaming it to _builder to indicate intentional non-use.

🔧 Suggested fix

`@register_function`(config_type=GitRepoToolConfig) -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): """Git repository management tool for SWE Bench."""

🧰 Tools

🪛 Ruff (0.14.11)

33-33: Unused function argument: builder

(ARG001)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py` around lines 32 - 38, The git_repo_tool function declares an unused parameter named builder; rename it to _builder to follow the codebase convention for intentionally unused parameters (update the function signature async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any references in the decorator/register_function call if necessary) so linters/readers know it is intentionally unused.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 16-28: The YAML defines two separate top-level llms mappings which
causes the first (nim_llm) to be overwritten by the second (claude_sonnet_llm);
merge both entries under a single llms key so both nim_llm and claude_sonnet_llm
are present, and fix nim_llm’s indentation to match the file’s 2-space style;
locate the nim_llm and claude_sonnet_llm blocks and combine them into one llms
mapping preserving their model_name, temperature, max_tokens and api_key fields.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 33-35: RepoContext.__post_init__ sets repo_path = base_path /
repo_name which omits the organization and disagrees with get_repo_path; update
__post_init__ so repo_path includes the org (e.g., repo_path = base_path /
self.org_name / self.repo_name) or call the existing get_repo_path logic to
compute it, ensuring RepoContext.repo_path matches the path used by
setup_repository and clone operations.

♻️ Duplicate comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
88-92: Add missing return type hint.

Per coding guidelines, all public functions require type hints. This async function returns None.
🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""

🧹 Nitpick comments (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
81-82: Remove extra trailing blank line.

YAMLlint reports too many blank lines at the end. Files should end with exactly one newline.
🔧 Proposed fix
       clean: true
-
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
1-1: Copyright year should be updated to 2025-2026.

Other files in this PR use 2025-2026 in the copyright header. This file uses only 2025.
🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
32-34: Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is unused. Follow the codebase convention by renaming to _builder.
🔧 Proposed fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
1-1: Copyright year should be updated to 2025-2026.

For consistency with other files in this PR.
🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a39f138 and dbc2dd6.

📒 Files selected for processing (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧰 Additional context used

📓 Path-based instructions (6)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
in that situation no return type hint is needed).
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").

Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

If an example contains Python code, it should be placed in a subdirectory named src/ and should
contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.

If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
be checked into git-lfs.

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧠 Learnings (6)

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-03T18:42:23.494Z

Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-11-05T11:45:35.119Z

Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-12T20:49:44.305Z

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

🧬 Code graph analysis (3)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)

src/nat/data_models/common.py (4)

BaseModelRegistryTag (91-93)

TypedBaseModel (96-171)

static_type (157-158)

discriminator (165-171)

src/nat/data_models/component_ref.py (1)

LLMRef (116-124)

src/nat/data_models/function.py (1)

FunctionBaseConfig (26-36)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

src/nat/runtime/runner.py (1)

context (93-94)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)

RepoManager (38-68)

setup_repository (45-59)

cleanup (61-68)

🪛 Ruff (0.14.11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)

45-45: Avoid specifying long messages outside the exception class

(TRY003)

51-51: Avoid specifying long messages outside the exception class

(TRY003)

59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)

[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (6)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)

50-60: LGTM! Well-structured iterative predictor configuration.

The new SweBenchPredictorIterativeConfig is properly defined with appropriate type hints and sensible defaults. The LLMRef type correctly references the component reference pattern used elsewhere in the codebase.

Minor nit: Line 57 has trailing whitespace after the closing triple quotes.

62-66: Discriminated union correctly extended.

The SweBenchPredictorConfig union properly includes the new iterative config using the established Tag/Discriminator pattern.

68-74: Docstring now correctly lists available predictor types.

The docstring accurately reflects the available options: "gold, skeleton, or iterative".

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Good input validation and error handling.

The function properly validates JSON input, checks for required keys on the setup operation, and provides clear error messages. This addresses the concerns from prior reviews.

61-66: Correct async generator pattern with guaranteed cleanup.

The try/finally ensures cleanup() runs when the tool is disposed, respecting the cleanup_on_exit config flag.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

81-85: LGTM! Blocking I/O correctly wrapped with asyncio.to_thread.

The clone operation now properly yields to the event loop, addressing the prior review concern.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

coderabbitai

Actionable comments posted: 6

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 1-2: Update the SPDX copyright header to include 2026 by modifying
the SPDX-FileCopyrightText line (and/or SPDX-License-Identifier line if needed)
so the year range reads "2025, 2026" or "2025-2026" as per project convention;
locate the top-of-file SPDX header in predict_iterative.py and adjust the year
accordingly.
- Around line 414-425: The except block handling subprocess.TimeoutExpired
should guard the type of e.output before decoding: in the except (TimeoutError,
subprocess.TimeoutExpired) handler (inside the predict_iterative code path),
check if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and
e.output, then if isinstance(e.output, bytes) decode with .decode("utf-8",
errors="replace") else treat e.output as the string directly; assign that to the
local output variable used in self._TIMEOUT_TEMPLATE and raise
ExecutionTimeoutError(timeout_message) as before to avoid calling .decode on a
str when text=True was used.
- Around line 449-451: The parsing of swebench_input.instance_id into repo_name
using split('-')[0] breaks for hyphenated org/repo names; update the extraction
to take the suffix after the last hyphen (use rsplit('-', 1)) so repo_name
contains the full org__repo string, then continue to split into org, repo and
build repo_url as before (look for repo_name, org, repo, repo_url in
predict_iterative.py and replace the split logic accordingly).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 1-2: Update the SPDX header to include 2026 by changing the
copyright year range in the top-of-file comments in git_tool.py (the two SPDX
comment lines); for example replace the single year "2025" with a range like
"2025-2026" (or include "2026" alongside 2025) so the SPDX-FileCopyrightText and
SPDX-License-Identifier lines reflect the 2026 change.
- Around line 76-98: The get_repo_path function incorrectly parses SSH-style
URLs like "git@github.com:org/repo.git"; update get_repo_path to normalize
repo_url first (handle git@...:org/repo.git by replacing the first colon after
host with a slash or by extracting the substring after the last ':' when an '@'
is present) so that parts = repo_url.rstrip('/').split('/') yields correct
segments; keep existing logic using repo_name, org_name, workspace_dir and
instance_id unchanged but derive org_name and repo_name from the normalized URL
to support both SSH and HTTP(S) formats.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 1-2: Update the SPDX header to include 2026 by modifying the
existing copyright SPDX lines at the top of the file (the lines containing
"SPDX-FileCopyrightText" and "SPDX-License-Identifier") so the copyright year
range includes 2026 (e.g., change "2025" to "2025, 2026" or "2025-2026")
ensuring the SPDX tags remain intact.

♻️ Duplicate comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
32-33: Rename unused builder parameter to _builder.

The parameter is required by the decorator but unused. Prefixing with _ clarifies intent and avoids lint warnings.
🔧 Suggested fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""

🧹 Nitpick comments (2)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
66-73: Avoid blocking the event loop in cleanup.

shutil.rmtree is synchronous and can block. Run it in a thread (or make cleanup sync) to keep async responsiveness. Based on learnings, avoid blocking I/O in async paths.
♻️ Proposed fix
     async def cleanup(self):
         """Clean up all managed repositories."""
         import shutil
         for repo_path_str in list(self.active_repos.keys()):
             repo_path = Path(repo_path_str)
             if repo_path.exists():
-                shutil.rmtree(repo_path)
+                await asyncio.to_thread(shutil.rmtree, repo_path)
         self.active_repos.clear()
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
105-110: Add a type hint for llm in IterativeAgent.__init__.

Public APIs should carry parameter type hints; use typing.Any (or a protocol) to make intent explicit. As per coding guidelines, add missing type annotations.
🔧 Proposed fix
-import asyncio
-import json
-import logging
-import re
-import subprocess
+import asyncio
+import json
+import logging
+import re
+import subprocess
+import typing
@@
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig):
         self.llm = llm

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbc2dd6 and 2ed8b41.

📒 Files selected for processing (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧰 Additional context used

📓 Path-based instructions (6)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
in that situation no return type hint is needed).
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").

Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

If an example contains Python code, it should be placed in a subdirectory named src/ and should
contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.

If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
be checked into git-lfs.

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧠 Learnings (6)

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-03T18:42:23.494Z

Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-11-05T11:45:35.119Z

Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2025-12-12T20:49:44.305Z

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

🧬 Code graph analysis (2)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)

src/nat/builder/builder.py (1)

Builder (84-811)

src/nat/builder/framework_enum.py (1)

LLMFrameworkEnum (19-27)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

SweBenchWorkflowConfig (68-74)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)

RepoManager (34-73)

setup_repository (41-64)

cleanup (66-73)

🪛 Ruff (0.14.11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)

327-327: Avoid specifying long messages outside the exception class

(TRY003)

359-359: Consider moving this statement to an else block

(TRY300)

363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

363-363: Avoid specifying long messages outside the exception class

(TRY003)

363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)

379-379: subprocess call with shell=True identified, security issue

(S602)

412-412: Consider moving this statement to an else block

(TRY300)

425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

426-426: Do not catch blind exception: Exception

(BLE001)

427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

427-427: Avoid specifying long messages outside the exception class

(TRY003)

427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)

464-464: Redundant exception object included in logging.exception call

(TRY401)

465-465: Use explicit conversion flag

Replace with conversion flag

(RUF010)

495-495: Redundant exception object included in logging.exception call

(TRY401)

496-496: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)

45-45: Avoid specifying long messages outside the exception class

(TRY003)

51-51: Avoid specifying long messages outside the exception class

(TRY003)

63-63: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

1-77: Config layout and defaults look good.

LLM backends, iterative predictor settings, tool registration, and evaluator settings are coherent and well-structured.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 105-113: Update the public methods __init__ and add_message to
include explicit type hints and Google-style docstrings: annotate llm with
typing.Any (or the concrete LLM interface), add return type None for both
__init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a
suitable typed container) and self.n_steps as int; update the __init__ signature
(def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) ->
None) and add_message signature (def add_message(self, role: str, content: str)
-> None), and add Google-style docstrings for both methods including an Args
section for parameters and a Returns section (even if None).
- Around line 478-481: The SweBenchPredictorIterativeConfig is missing a
max_output_length field and thus the IterativeAgentConfig isn't receiving a
custom truncation limit; add max_output_length: int = Field(default=10000,
description="Maximum output length for command results") to the
SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the
IterativeAgentConfig construction in predict_iterative.py (the agent_config =
IterativeAgentConfig(...) block) to include
max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or
self.config.predictor.max_output_length) so the value is passed through to
IterativeAgentConfig.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 16-18: Add a Google-style module docstring at the top of
register.py (immediately after the existing first line) describing the module as
a public registry for tools used by the full predictor, its purpose, what it
exports/registry behavior, and any important usage notes; ensure the docstring
follows Google docstring conventions (short summary line, blank line, more
detailed description and any public symbols) so the module is properly
documented for public consumption and tooling.

♻️ Duplicate comments (2)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
33-33: Rename unused builder to _builder.

This parameter is required by the decorator interface but unused; prefixing with _ matches the existing convention.
🔧 Suggested fix
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]:
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder) -> AsyncIterator[FunctionInfo]:
32-34: Add return type hint to git_repo_tool.

The function is a public API (decorated with @register_function) and yields FunctionInfo, but lacks a return type hint. Per coding guidelines, all public APIs require type hints on return values.
♻️ Suggested update
-import typing
+import typing
+from collections.abc import AsyncIterator
@@
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]:

🧹 Nitpick comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
360-427: Preserve exception context when wrapping errors.

Chaining with from e keeps the original stack trace for debugging.
🔧 Suggested change
@@
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            raise NonTerminatingException(f"LLM call failed: {str(e)}") from e
@@
-            raise ExecutionTimeoutError(timeout_message)
+            raise ExecutionTimeoutError(timeout_message) from e
@@
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise NonTerminatingException(f"Error executing command: {str(e)}") from e
As per coding guidelines, preserve exception context when wrapping errors.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed8b41 and 1abbd6e.

📒 Files selected for processing (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🚧 Files skipped from review as they are similar to previous changes (2)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

🧰 Additional context used

📓 Path-based instructions (6)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
in that situation no return type hint is needed).
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").

Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

If an example contains Python code, it should be placed in a subdirectory named src/ and should
contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.

If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
be checked into git-lfs.

Files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧠 Learnings (1)

📚 Learning: 2026-01-05T15:46:49.677Z

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Confirm that copyright years are up-to-date whenever a file is changed

Applied to files:

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

🧬 Code graph analysis (2)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)

src/nat/builder/builder.py (1)

Builder (84-811)

src/nat/builder/framework_enum.py (1)

LLMFrameworkEnum (19-27)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

SweBenchWorkflowConfig (68-74)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (4)

src/nat/builder/function_info.py (2)

FunctionInfo (290-625)

from_fn (552-625)

src/nat/agent/react_agent/output_parser.py (1)

_type (103-104)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)

RepoManager (35-74)

setup_repository (42-65)

cleanup (67-74)

src/nat/runtime/runner.py (1)

context (93-94)

🪛 Ruff (0.14.11)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)

327-327: Avoid specifying long messages outside the exception class

(TRY003)

359-359: Consider moving this statement to an else block

(TRY300)

363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

363-363: Avoid specifying long messages outside the exception class

(TRY003)

363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)

379-379: subprocess call with shell=True identified, security issue

(S602)

412-412: Consider moving this statement to an else block

(TRY300)

425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

426-426: Do not catch blind exception: Exception

(BLE001)

427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

427-427: Avoid specifying long messages outside the exception class

(TRY003)

427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)

464-464: Redundant exception object included in logging.exception call

(TRY401)

465-465: Use explicit conversion flag

Replace with conversion flag

(RUF010)

495-495: Redundant exception object included in logging.exception call

(TRY401)

496-496: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)

45-45: Avoid specifying long messages outside the exception class

(TRY003)

51-51: Avoid specifying long messages outside the exception class

(TRY003)

63-63: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

25-29: Config model looks good.

Clear defaults and simple surface area for the tool configuration.

40-63: Input validation and error messages are solid.

Good defensive handling for malformed JSON and unknown operations.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

76-81: Config defaults look sensible.

The step limit, timeout, and output limits are reasonable starting points.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-01-16T05:51:38Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+        self.llm = llm
+        self.repo_path = repo_path
+        self.config = config
+        self.messages: list = []
+        self.n_steps = 0
+
+    def add_message(self, role: str, content: str):
+        """Add a message to the conversation and print it for debugging."""


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and read the relevant section find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" -type f

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 182

🏁 Script executed:

# Read the file to see the current state around lines 105-113 cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '100,120p'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1222

🏁 Script executed:

# Check the import statements at the top of the file head -30 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1169

🏁 Script executed:

# Check for typing imports in the file head -50 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | tail -25

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 744

🏁 Script executed:

# Find the class definition and verify this is a public class grep -n "^class " "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 456

🏁 Script executed:

# Get more context around the __init__ and add_message methods cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '60,125p'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3254

Add type hints and Google‑style docstrings for public methods.

__init__ and add_message are public methods that require complete type hints and Google‑style docstrings per coding guidelines. Currently:

__init__ is missing type hint for llm and return type annotation

add_message is missing return type annotation and complete Google‑style docstring (needs Args section)

✏️ Proposed update

@@ -from dataclasses import dataclass +from dataclasses import dataclass +from typing import Any @@ - def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig): + def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None: + """Initialize the iterative agent. + + Args: + llm: LLM instance used for chat completion. + repo_path: Path to the repository workspace. + config: Agent configuration. + """ @@ - def add_message(self, role: str, content: str): - """Add a message to the conversation and print it for debugging.""" + def add_message(self, role: str, content: str) -> None: + """Add a message to the conversation and print it for debugging. + + Args: + role: Message role ("system", "user"/"human", or "assistant"/"ai"). + content: Message text. + """

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py` around lines 105 - 113, Update the public methods __init__ and add_message to include explicit type hints and Google-style docstrings: annotate llm with typing.Any (or the concrete LLM interface), add return type None for both __init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a suitable typed container) and self.n_steps as int; update the __init__ signature (def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None) and add_message signature (def add_message(self, role: str, content: str) -> None), and add Google-style docstrings for both methods including an Args section for parameters and a Returns section (even if None).

coderabbitai · 2026-01-16T05:51:38Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+            agent_config = IterativeAgentConfig(
+                step_limit=getattr(self.config.predictor, 'step_limit', 250),
+                timeout=getattr(self.config.predictor, 'timeout', 60)
+            )


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "max_output_length" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 389

🏁 Script executed:

# Find the IterativeAgentConfig definition fd -t f "*.py" examples/evaluation_and_profiling/swe_bench/src | xargs rg -l "class IterativeAgentConfig"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 481

🏁 Script executed:

# Search for max_output_length more broadly in the codebase rg -n "max_output_length" --type py | head -50

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 389

🏁 Script executed:

head -100 examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | tail -30

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1225

🏁 Script executed:

sed -n '70,90p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 656

🏁 Script executed:

sed -n '470,490p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 851

🏁 Script executed:

# Find IterativeAgentConfig definition rg -A 10 "class IterativeAgentConfig" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1580

🏁 Script executed:

# Find the predictor config model definition rg -B 5 -A 20 "class.*PredictorConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51

🏁 Script executed:

# Also search for config field definitions related to max_output_length rg -B 3 -A 3 "max_output_length" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2114

🏁 Script executed:

sed -n '1,100p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3486

🏁 Script executed:

# Search for config class definitions in the file rg "^class.*Config" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py -A 5

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 229

🏁 Script executed:

# Look for where self.config is defined rg -B 5 "self.config.predictor" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | head -30

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 601

🏁 Script executed:

# Look for config model definitions rg -B 2 -A 15 "class SweBenchWorkflowConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1030

🏁 Script executed:

# Search for config files and their definitions fd -t f "config.py" examples/evaluation_and_profiling/swe_bench/src

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 140

🏁 Script executed:

# Check what's in the config module cat examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3137

🏁 Script executed:

# Verify max_output_length is actually used in IterativeAgent sed -n '395,410p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 851

Add max_output_length to the predictor config model and pass it through to IterativeAgentConfig.

The predictor config (SweBenchPredictorIterativeConfig) is missing the max_output_length field, so the suggested getattr fix alone won't enable custom truncation limits. You need to add this field to the config model first, then pass it through here.

🔧 Required changes

In nat_swe_bench/config.py, add to SweBenchPredictorIterativeConfig:

max_output_length: int = Field(default=10000, description="Maximum output length for command results")

Then in predict_iterative.py at lines 478-481:

agent_config = IterativeAgentConfig( step_limit=getattr(self.config.predictor, 'step_limit', 250), - timeout=getattr(self.config.predictor, 'timeout', 60) + timeout=getattr(self.config.predictor, 'timeout', 60), + max_output_length=getattr(self.config.predictor, 'max_output_length', 10000), )

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py` around lines 478 - 481, The SweBenchPredictorIterativeConfig is missing a max_output_length field and thus the IterativeAgentConfig isn't receiving a custom truncation limit; add max_output_length: int = Field(default=10000, description="Maximum output length for command results") to the SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the IterativeAgentConfig construction in predict_iterative.py (the agent_config = IterativeAgentConfig(...) block) to include max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or self.config.predictor.max_output_length) so the value is passed through to IterativeAgentConfig.

coderabbitai · 2026-01-16T05:51:38Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

+# Register all the tools needed by the full predictor without loading the dependencies.
+import typing
+


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if this file exists and read its content cd examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools ls -la register.py wc -l register.py head -50 register.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2189

🏁 Script executed:

# Check the coding guidelines for examples directory cat .cursor/rules/general.mdc | grep -A 20 "examples/\*\*/\*"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51

🏁 Script executed:

# Find Python files in examples directory and check their docstrings find examples -name "*.py" -type f | head -10

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 822

🏁 Script executed:

# Check a few example Python files to see their docstring patterns fd -e py examples | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && head -20 {}'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51

🏁 Script executed:

# Check specific example files for module docstrings echo "=== A2A register.py ===" && head -25 examples/A2A/math_assistant_a2a/src/nat_math_assistant_a2a/register.py echo "" echo "=== HITL register.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/register.py echo "" echo "=== extract_por_tool.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/extract_por_tool.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2989

🏁 Script executed:

# Check the .cursor/rules/general.mdc file more thoroughly for any examples directory guidance cat .cursor/rules/general.mdc | tail -50

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2989

🏁 Script executed:

# Also check if there are any other examples modules with docstrings grep -r '"""' examples --include="*.py" | grep -A 2 "^examples" | head -20

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1960

Add a module docstring for this public registry.

Public modules require Google-style docstrings. This file currently starts with a comment only; add a docstring following the first line to describe the module's purpose.

✏️ Proposed docstring

@@ -# Register all the tools needed by the full predictor without loading the dependencies. +"""Register iterative SWE-bench tool integrations.""" + +# Register all the tools needed by the full predictor without loading the dependencies.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py` around lines 16 - 18, Add a Google-style module docstring at the top of register.py (immediately after the existing first line) describing the module as a public registry for tools used by the full predictor, its purpose, what it exports/registry behavior, and any important usage notes; ensure the docstring follows Google docstring conventions (short summary line, blank line, more detailed description and any public symbols) so the module is properly documented for public consumption and tooling.

willkill07 · 2026-01-16T21:25:36Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

+  _type: swe_bench
+  predictor:
+    _type: iterative
+    llm_name: "claude_sonnet_llm"   # "nim_llm" or "claude_sonnet_llm" or "openai_llm"


Could we have the default be openai_llm or nim_llm ?

willkill07 · 2026-01-16T21:26:08Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

+llms:
+  nim_llm:
+    _type: nim
+    model_name: mistralai/mistral-nemotron


could we switch this to the new(er) Nemotron Nano 3 model?

Hi @willkill07,
Thanks for the feedback. I've set the default LLM to openai_llm. I also tried the latest nvidia/nemotron-3-nano-30b-a3b, but response.content is always empty. Any suggestion?
btw, I also tried nvidia/llama-3.3-nemotron-super-49b-v1.5 and it worked well. Successfully resolved one issue in 19 steps.

willkill07 · 2026-01-16T21:26:36Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

+    _type: litellm
+    model_name: openai/gpt-5.2


Suggested change

_type: litellm

model_name: openai/gpt-5.2

_type: openai

model_name: gpt-5.2

willkill07 · 2026-01-16T21:27:10Z

...valuation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py

All files (even if no code) must contain the license headers

willkill07 · 2026-01-16T21:27:26Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

@@ -0,0 +1,505 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This code was introduced in 2026

Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

willkill07 · 2026-01-16T21:28:00Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

@@ -0,0 +1,70 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

willkill07 · 2026-01-16T21:28:19Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

@@ -0,0 +1,118 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

willkill07 · 2026-01-16T21:28:29Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/__init__.py

Need license header

willkill07 · 2026-01-16T21:28:44Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

@@ -0,0 +1,78 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

coderabbitai

Actionable comments posted: 6

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 353-363: The exception wrapper in _query_llm currently raises
NonTerminatingException without chaining the original exception; change the
raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so
the original traceback is preserved, and apply the same fix to the other
NonTerminatingException raise site in this module (the other error-handling
block that currently does "raise NonTerminatingException(...)" in the same
class/file).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 35-41: Add a Google-style class docstring to the public class
RepoManager describing its responsibility (managing local git repositories
within a workspace), what it does on init (creates workspace directory), and
list key arguments/attributes: Args: workspace_dir (str) — path to workspace;
Attributes: workspace (Path) and active_repos (dict). Place the docstring
directly below the class RepoManager declaration and keep it concise (one-line
summary + short paragraph + Args and Attributes sections).
- Around line 1-16: Add a Google-style module docstring immediately below the
SPDX header in git_tool.py that briefly describes the module's purpose (public
Git tooling utilities used by the iterative predictor), lists the primary public
symbols it provides (e.g., any public functions or classes in this module), and
any usage/behavior notes; ensure it follows Google docstring conventions and
sits at the top of the file before imports (i.e., above the existing import
asyncio).
- Around line 77-104: The get_repo_path function must validate repo_url and
sanitise instance_id to prevent path traversal and index errors: ensure repo_url
yields at least two path components before accessing parts[-2] (raise ValueError
if malformed), and validate instance_id (if provided) to be a safe token—trim
whitespace and reject values containing path separators ('/', '\'), '..', or any
characters outside an allowed set such as alphanumerics, hyphen and underscore
(or use a simple regex like ^[A-Za-z0-9_-]+$); only use the sanitized
instance_id when constructing the Path and document/raise a clear error for
invalid inputs so callers cannot escape workspace directories via crafted inputs
in get_repo_path, repo_name, org_name, and instance_id.
- Around line 67-74: The cleanup method lacks a return type and performs
blocking filesystem work; update async def cleanup(self) to async def
cleanup(self) -> None and offload synchronous shutil.rmtree calls to a thread
using asyncio.to_thread (import asyncio if not already), e.g. await
asyncio.to_thread(shutil.rmtree, repo_path) while keeping the loop over
self.active_repos and clearing it afterward; ensure shutil is imported/import
available in the scope where cleanup is defined (or keep the local import) and
await the to_thread calls for each repo_path in the cleanup method.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 33-41: The docstring for git_repo_tool currently describes the
wrapper but should document the logical JSON interface presented to the LLM:
update the git_repo_tool docstring to explain the JSON payload accepted by the
inner async function git_operations (keys: "operation" with allowed operations,
"repo_url" for the git repository, "base_commit" for the commit/branch/ref, and
"instance_id" if applicable), sketch expected types/values and example payloads,
and mention that RepoManager (constructed from tool_config.workspace_dir) will
perform the requested operations; do not describe builder or wrapper internals,
only the LLM-visible JSON interface.

♻️ Duplicate comments (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

1-18: Add a module docstring for the tool registry.

This public module should expose a concise Google-style docstring after the SPDX header.
✏️ Proposed fix
@@
-# Register all the tools needed by the full predictor without loading the dependencies.
+"""Register iterative SWE-bench tool integrations."""
+
+# Register all the tools needed by the full predictor without loading the dependencies.
As per coding guidelines, public modules need Google-style docstrings.

32-34: Prefix unused builder parameter with underscore.

This parameter is required by the decorator but unused; underscore it to signal intent.
🔧 Proposed fix
@@
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (2)

105-130: Add type hints and Google-style docstrings for public methods.

__init__ and add_message are public; they should include explicit type hints and Google-style docstrings, plus typed attributes.

✏️ Proposed fix

@@
-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Any

@@
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None:
+        """Initialize the iterative agent.
+
+        Args:
+            llm: LLM instance used for chat completion.
+            repo_path: Path to the repository workspace.
+            config: Agent configuration.
+        """
         self.llm = llm
         self.repo_path = repo_path
         self.config = config
-        self.messages: list = []
-        self.n_steps = 0
+        self.messages: list[AIMessage | HumanMessage | SystemMessage] = []
+        self.n_steps: int = 0
@@
-    def add_message(self, role: str, content: str):
-        """Add a message to the conversation and print it for debugging."""
+    def add_message(self, role: str, content: str) -> None:
+        """Add a message to the conversation and print it for debugging.
+
+        Args:
+            role: Message role ("system", "user"/"human", or "assistant"/"ai").
+            content: Message text.
+        """

As per coding guidelines, public methods need type hints and Google-style docstrings.

477-481: Pass through max_output_length to the agent config.

Right now the predictor ignores any configured output limit; pass the value through so it can be tuned.

🔧 Proposed fix

@@
             agent_config = IterativeAgentConfig(
-                step_limit=getattr(self.config.predictor, 'step_limit', 250),
-                timeout=getattr(self.config.predictor, 'timeout', 60)
+                step_limit=getattr(self.config.predictor, "step_limit", 250),
+                timeout=getattr(self.config.predictor, "timeout", 60),
+                max_output_length=getattr(self.config.predictor, "max_output_length", 10000),
             )

🧹 Nitpick comments (1)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

379-389: Run LLM-provided commands inside a hardened sandbox.

Because commands are executed with shell=True from model output, ensure this runs in a tightly isolated environment (drop privileges, restrict filesystem scope, and disable network) to contain potential misuse.

coderabbitai · 2026-01-17T04:28:54Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+    async def _query_llm(self) -> str:
+        """Query LLM and return response content."""
+        try:
+            response = await self.llm.ainvoke(self.messages)                    
+            content = response.content if hasattr(response, 'content') else str(response)
+            self.add_message("assistant", content)
+            return content
+        except Exception as e:
+            logger.error("LLM invocation failed: %s", e, exc_info=True)
+            # recoverable error, let the agent continue
+            raise NonTerminatingException(f"LLM call failed: {str(e)}")


⚠️ Potential issue | 🟡 Minor

Preserve exception context when wrapping errors.

Wrapping exceptions without from e loses the original traceback and makes debugging harder; chain the cause explicitly.

♻️ Proposed fix

@@ - raise NonTerminatingException(f"LLM call failed: {str(e)}") + raise NonTerminatingException(f"LLM call failed: {e}") from e @@ - raise ExecutionTimeoutError(timeout_message) + raise ExecutionTimeoutError(timeout_message) from e @@ - raise NonTerminatingException(f"Error executing command: {str(e)}") + raise NonTerminatingException(f"Error executing command: {e}") from e

As per coding guidelines, preserve stack traces when re-raising.
Also applies to: 414-427

🧰 Tools

🪛 Ruff (0.14.13)

359-359: Consider moving this statement to an else block

(TRY300)

363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

363-363: Avoid specifying long messages outside the exception class

(TRY003)

363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py` around lines 353 - 363, The exception wrapper in _query_llm currently raises NonTerminatingException without chaining the original exception; change the raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so the original traceback is preserved, and apply the same fix to the other NonTerminatingException raise site in this module (the other error-handling block that currently does "raise NonTerminatingException(...)" in the same class/file).

coderabbitai · 2026-01-17T04:28:54Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

+# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import asyncio


⚠️ Potential issue | 🟡 Minor

Add a module docstring for public git tooling.

This public module lacks a Google-style docstring; add one just below the SPDX header.

✏️ Proposed fix

@@ # limitations under the License. - +"""Git repository utilities for the iterative SWE-bench predictor.""" +

As per coding guidelines, public modules need Google-style docstrings.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py` around lines 1 - 16, Add a Google-style module docstring immediately below the SPDX header in git_tool.py that briefly describes the module's purpose (public Git tooling utilities used by the iterative predictor), lists the primary public symbols it provides (e.g., any public functions or classes in this module), and any usage/behavior notes; ensure it follows Google docstring conventions and sits at the top of the file before imports (i.e., above the existing import asyncio).

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

AnuradhaKaruppiah · 2026-01-21T00:29:51Z

Achieved 70% success rate (7/10) in initial evaluation.

70% is very impressive. If you have the compute resources do you think you would be able to get a run on all 500?

AnuradhaKaruppiah · 2026-01-21T00:46:15Z

/ok to test 4a1f4af

Jerryguan777 · 2026-01-22T19:20:58Z

Hi @AnuradhaKaruppiah ,
I run the evaluation with GPT 5.2 on hf://datasets/princeton-nlp/SWE-bench_Lite/data/test-00000-of-00001.parquet, here is a summary:

Success rate: 53.7% (158/294)
Note: Some failures were due to eval environment issues, the actually success rate may be higher.

Patch generation stage:

Totally 300 instances, 6 skipped, 294 finished

Evaluation Stage

    "total_instances": 294,
    "submitted_instances": 294,
    "completed_instances": 274,
    "resolved_instances": 158,
    "unresolved_instances": 116,
    "empty_patch_instances": 0,
    "error_instances": 20,

Achieved 70% success rate (7/10) in initial evaluation.

70% is very impressive. If you have the compute resources do you think you would be able to get a run on all 500?

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 109-115: The pre-clone cleanup in clone_repository currently calls
blocking shutil.rmtree inside an async function; move that call into a
background thread using asyncio.to_thread so the event loop isn't blocked. Wrap
the existing os.path.exists(target_path) check and shutil.rmtree(target_path)
invocation in await asyncio.to_thread(...) (or call asyncio.to_thread only for
rmtree if you prefer to keep the exists check synchronous), then proceed to
await asyncio.to_thread(Repo.clone_from, repo_url, target_path) as already done
to perform the clone.

♻️ Duplicate comments (4)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (4)

1-15: Add a Google-style module docstring.
Public modules should declare their purpose and key symbols; insert a short docstring under the SPDX header. As per coding guidelines, add a Google-style module docstring.

✏️ Proposed fix

 # limitations under the License.
 
+"""Git repository utilities for the iterative SWE-bench predictor.
+
+Provides `RepoContext`, `RepoManager`, `get_repo_path`, `clone_repository`, and
+`checkout_commit`.
+"""
+
 import asyncio

37-42: Document RepoManager as a public class.
Add a concise Google-style class docstring with key responsibilities and attributes. As per coding guidelines.

✏️ Proposed fix

 class RepoManager:
+    """Manage cloned repositories within a workspace.
+
+    Args:
+        workspace_dir: Base workspace directory for clones.
+
+    Attributes:
+        workspace: Workspace directory as a `Path`.
+        active_repos: Mapping of repo paths to `RepoContext`.
+    """
 
     def __init__(self, workspace_dir: str):

69-76: Avoid blocking shutil.rmtree in async cleanup.
shutil.rmtree is blocking; offload to a thread to keep the event loop responsive. Based on learnings.

♻️ Proposed fix

     async def cleanup(self):
         """Clean up all managed repositories."""
-        import shutil
         for repo_path_str in list(self.active_repos.keys()):
             repo_path = Path(repo_path_str)
             if repo_path.exists():
-                shutil.rmtree(repo_path)
+                await asyncio.to_thread(shutil.rmtree, repo_path)
         self.active_repos.clear()

92-103: Sanitise instance_id and validate repo_url parts.
Prevent path traversal and malformed URLs from escaping the workspace. As per coding guidelines.

🔒 Proposed fix

-    parts = path.strip("/").split("/")
-    repo_name = parts[-1].replace('.git', '')
-    org_name = parts[-2]  # Organization name
+    parts = [p for p in path.strip("/").split("/") if p]
+    if len(parts) < 2:
+        raise ValueError(f"Invalid repo URL: {repo_url}")
+    repo_name = parts[-1].removesuffix(".git")
+    org_name = parts[-2]  # Organization name
@@
-    if instance_id:
-        return Path(workspace_dir) / instance_id / org_name / repo_name
+    if instance_id:
+        safe_instance_id = Path(instance_id).name
+        if safe_instance_id != instance_id or safe_instance_id in {".", ".."}:
+            raise ValueError("instance_id must be a single safe path segment")
+        return Path(workspace_dir) / safe_instance_id / org_name / repo_name

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

mnajafian-nv

Great work. Implementation is clean and demonstrates NAT integration patterns. Left inline comments on technical improvements.

Critical (Must Fix)

Security: Command Injection Risk

subprocess.run(shell=True) with LLM-generated commands creates vulnerability
See inline comment with validation approach using command allowlist
Risk: Data loss, unauthorized access, system compromise

Test Coverage

Zero test coverage for iterative predictor, git tool, or agent loop
Complex async code with external dependencies (git, subprocess), multiple error paths, state management
Required:

test_iterative_agent_basic_flow() : End-to-end execution
test_format_error_recovery() : LLM output validation
test_timeout_handling() : Command timeout scenarios
test_workspace_isolation() : Concurrent instance isolation
test_repo_setup_and_checkout() : Git operations
test_cleanup() : Resource cleanup
Minimum: One integration test with mocked LLM on SWE-bench Lite example
NAT requires ≥80% coverage per general.mdc

Reliability: Git Operations

Clone and checkout operations lack timeout and error handling
See inline comments for implementation

Important (Should Fix)
4. Documentation

Ensure all public methods have complete docstrings with Args/Returns/Raises sections
Check: validate_command(), git_operations(), helper functions

Optional (P2)
5. Metrics Logging

Track efficiency metrics (steps, time, exit status) to demonstrate NAT value beyond correctness
Helps differentiate from simpler alternatives on saturating benchmarks
See inline suggestion in IterativeAgent.run()

Benchmark Context

70% result reflects foundation model capabilities (Sonnet 4.5/GPT-5.2) on saturating benchmark
Consider adding context about SWE-bench Lite saturation in README

mnajafian-nv · 2026-01-26T17:33:57Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+    step_limit: int = 250
+    timeout: int = 60
+    max_output_length: int = 10000
+


Suggested change

@dataclass

class IterativeAgentConfig:

"""Configuration for the iterative agent."""

step_limit: int = 250

timeout: int = 60

max_output_length: int = 10000

# ========== ADD THIS NEW CODE HERE (after line 265) ==========

# Allowed command prefixes for security

ALLOWED_COMMANDS = {

'ls', 'cat', 'grep', 'find', 'head', 'tail', 'sed', 'awk',

'echo', 'cd', 'pwd', 'mkdir', 'touch', 'cp', 'mv',

'git', 'python', 'python3', 'pytest', 'pip',

'nl', 'wc', 'diff', 'sort', 'uniq', 'cut',

}

# Dangerous patterns that should be blocked

DANGEROUS_PATTERNS = [

'rm -rf /',

'rm -rf ~',

'rm -rf *',

'> /dev/',

'dd if=',

'mkfs',

'fdisk',

':(){:|:&};:', # Fork bomb

'chmod 777',

'wget http', # Prevent downloading arbitrary code

'curl http',

]

def validate_command(command: str) -> tuple[bool, str]:

"""Validate that a command is safe to execute.

Args:

command: The bash command to validate

Returns:

Tuple of (is_valid, error_message). error_message is empty string if valid.

"""

command_lower = command.lower().strip()

# Extract base command (first word, handling pipes and chains)

# This handles: "cd dir && ls", "cat file | grep pattern", etc.

base_cmd = command.strip().split()[0] if command.strip() else ""

# Remove common prefixes

for prefix in ['sudo', 'time', 'env']:

if base_cmd == prefix and len(command.split()) > 1:

base_cmd = command.split()[1]

break

# Check if base command is in allowlist

if base_cmd and base_cmd not in ALLOWED_COMMANDS:

return False, f"Command '{base_cmd}' is not in the allowed list. Allowed: {sorted(ALLOWED_COMMANDS)}"

# Check for dangerous patterns

for pattern in DANGEROUS_PATTERNS:

if pattern in command_lower:

return False, f"Command contains dangerous pattern: '{pattern}'"

# Block sudo

if 'sudo' in command_lower.split():

return False, "sudo is not allowed"

# Block rm -rf with very short paths (likely dangerous)

if 'rm -rf' in command_lower or 'rm -fr' in command_lower:

# Extract what comes after rm -rf

parts = command_lower.split('rm -rf')[-1].split('rm -fr')[-1].strip().split()[0] if 'rm -rf' in command_lower or 'rm -fr' in command_lower else ''

if parts in ['/', '~', '*', '.', '..'] or len(parts) < 3:

return False, f"Dangerous rm -rf pattern detected: 'rm -rf {parts}'"

return True, ""

To address command injection risk introduced by line 381-382

# Current code return subprocess.run( command, shell=True, # Dangerous cwd=str(self.repo_path), ... )

Using subprocess.run(shell=True) with LLM-generated commands creates potential command injection vulnerability. While the command comes from LLM (not direct user input), malicious prompts or prompt injection attacks could cause the LLM to generate destructive commands (rm -rf /, data exfiltration, etc.) which could lead to data loss, unauthorized access, or system compromise

Hi @mnajafian-nv , thanks for your detailed security review. Here are my thoughts on this:

The block list is absolutely necessary. I‘v added all the patterns you listed, plus a few edge cases

It's better to not use the allow list. It would limit the capability and flexibility of LLM. As stated in the prompt, LLM has "a full linux shell environment" and can even install new tools if needed. The LLMs might need commands we can't anticipate upfront. So I went with a comprehensive block list, instead of a restrictive allow list.

It's better to keep shell=True. We still need this because the LLM generated commands usually use shell features like pipes (|), commands chains (&&), redirect (>), heredocs, etc. Without shell=True, these wouldn't work and the agent would be severely limited.

A long term solution is running the shell commands in a sandbox or container, which would provide both security and flexibility. But it's out of the scope of this PR ...

Here is what I implemented:
Add a comprehensive block list including commands like below.

Destructive commands: rm -rf /, rm -rf ~, rm -rf *, mkfs, fdisk, dd to devices

Privilege escalation: sudo, su, doas, pkexec, chmod 777, chown root

Sensitive file access: /etc/shadow, ~/.ssh/, ~/.aws/

Network exfiltration: wget http(s)://, curl http(s)://, netcat reverse shells

Fork bomb: :(){:|:&};:

The validation runs before the command execution. It blocks dangerous commands and raise a FormatErorr so the agent can try a different approach.

mnajafian-nv · 2026-01-26T17:38:56Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+                shell=True,
+                cwd=str(self.repo_path),


Command Injection Risk: Using subprocess.run(shell=True) with LLM-generated commands creates potential command injection vulnerability. While the command comes from LLM (not direct user input), malicious prompts or prompt injection attacks could cause the LLM to generate destructive commands (rm -rf /, data exfiltration, etc.).

Mitigation Options:

Option A (Recommended): Add command allowlist/validation I suggested in line 82

Option B: Use shell=False with shlex.split (breaks compatibility with complex bash commands)

Option C: Run in isolated container/sandbox (best for production)

Totally agree that shell=True with LLM-generated commands introduce real security concerns.
For this PR, I went with Option A with a comprehensive block list ...
For option C, I agree it's the best solution for long term or for production. Actually several days ago, I created a sandbox agent example for this repo which run shell / python in a sandbox. If you are interested in it, please take a look here: #1485
I didn't integrate the sandbox directly into this swe agent in this PR to keep the scope focused. But I am happy to follow up and add it if we decide a strong isolation is required for this swe agent.

mnajafian-nv · 2026-01-26T17:41:47Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

+async def clone_repository(repo_url: str, target_path: Path) -> Repo:
+    """Clone a repository to the specified path."""
+    logger.info("Cloning repository %s to %s", repo_url, target_path)
+    if os.path.exists(target_path):
+        await asyncio.to_thread(shutil.rmtree, target_path)
+    # Use asyncio.to_thread to avoid blocking the event loop during clone operation
+    return await asyncio.to_thread(Repo.clone_from, repo_url, target_path)


Suggested change

async def clone_repository(repo_url: str, target_path: Path) -> Repo:

"""Clone a repository to the specified path."""

logger.info("Cloning repository %s to %s", repo_url, target_path)

if os.path.exists(target_path):

await asyncio.to_thread(shutil.rmtree, target_path)

# Use asyncio.to_thread to avoid blocking the event loop during clone operation

return await asyncio.to_thread(Repo.clone_from, repo_url, target_path)

async def clone_repository(repo_url: str, target_path: Path, timeout: int = 300) -> Repo:

"""Clone a repository with timeout and validation."""

logger.info("Cloning repository %s to %s", repo_url, target_path)

# Validate URL

if not (repo_url.startswith('https://') or repo_url.startswith('git@')):

raise ValueError(f"Invalid repository URL: {repo_url}")

# Clean existing path

if target_path.exists():

await asyncio.to_thread(shutil.rmtree, target_path)

try:

# Clone with timeout

repo = await asyncio.wait_for(

asyncio.to_thread(Repo.clone_from, repo_url, target_path),

timeout=timeout

)

logger.info("Successfully cloned %s", repo_url)

return repo

except asyncio.TimeoutError:

logger.error("Clone timeout for %s after %ds", repo_url, timeout)

if target_path.exists():

await asyncio.to_thread(shutil.rmtree, target_path)

raise

except Exception as e:

logger.error("Clone failed for %s: %s", repo_url, e)

if target_path.exists():

await asyncio.to_thread(shutil.rmtree, target_path)

raise

Git clone and checkout operations have no timeout and minimal error handling. Clone operations can: either Hang indefinitely on network issues or Fail on large repos or Succeed partially leaving corrupted state and causes workflow hangs, wasted compute, unpredictable behavior

Thanks for catching this.
I have updated the clone_repository() as you suggested.
Also the checkout_commit() has the same issue, I also add timeout and error handling for it.

mnajafian-nv · 2026-01-26T17:47:32Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py

+
+    async def cleanup(self):
+        """Clean up all managed repositories."""
+        import shutil


Suggested change

import shutil

Already imported at line 21

fixed. Thanks

mnajafian-nv · 2026-01-26T18:02:22Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+        except Exception as e:
+            logger.exception("Failed to setup repository: %s", e)
+            return f"Error: Failed to setup repository - {str(e)}"


Suggested change

except Exception as e:

logger.exception("Failed to setup repository: %s", e)

return f"Error: Failed to setup repository - {str(e)}"

except GitCommandError as e:

logger.error("Git operation failed: %s", e, exc_info=True)

return f"Error: Git operation failed - {e.stderr}"

except OSError as e:

logger.error("Filesystem error: %s", e, exc_info=True)

return f"Error: Workspace setup failed - {str(e)}"

except Exception as e:

logger.exception("Unexpected error during repo setup: %s", e)

return f"Error: Setup failed - {str(e)}"

Generic exception catches everything (network errors, git errors, path errors) without distinction.

updated as suggested. Thanks

mnajafian-nv · 2026-01-26T18:06:55Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

+    try:
+        yield FunctionInfo.from_fn(git_operations,
+                                   description="Git repository management tool that accepts JSON string arguments")
+    finally:
+        if tool_config.cleanup_on_exit:
+            await repo_manager.cleanup()


Suggested change

try:

yield FunctionInfo.from_fn(git_operations,

description="Git repository management tool that accepts JSON string arguments")

finally:

if tool_config.cleanup_on_exit:

await repo_manager.cleanup()

try:

yield FunctionInfo.from_fn(git_operations,

description="Git repository management tool that accepts JSON string arguments")

finally:

if tool_config.cleanup_on_exit:

try:

await repo_manager.cleanup()

logger.info("Workspace cleanup completed successfully")

except Exception as e:

logger.error("Workspace cleanup failed: %s", e, exc_info=True)

# Don't raise - allow graceful degradation

If cleanup fails, exception is swallowed, hence there is resource leak risk here. Also, cleanup only happens on normal exit, not on crashes.

updated as suggested. Thanks

mnajafian-nv · 2026-01-26T18:09:41Z

examples/evaluation_and_profiling/swe_bench/README.md

 - `gold` - Uses the patch from the `SWEBenchInput` instance, bypassing problem-solving logic. See [predict_gold_stub.py](src/nat_swe_bench/predictors/predict_gold/predict_gold_stub.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_gold.yml`.
 - `skeleton` - Skeleton code for creating a problem-solving workflow. This code can be copied to create a net-new predictor. See [predict_skeleton.py](src/nat_swe_bench/predictors/predict_skeleton/predict_skeleton.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_skeleton.yml`.
+- `iterative` - Iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. See [predict_iterative.py](src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`.



Suggested change

### Benchmark Context (January 2026)

The iterative predictor achieves 70% success rate on SWE-bench Lite, which primarily reflects the capabilities of modern foundation models (Claude Sonnet 4.5, GPT-5.2) rather than framework-specific innovations. SWE-bench Lite is approaching saturation at 70-80% with simple agent architectures.

**For evaluating framework improvements beyond task correctness, consider tracking:**

- **Efficiency metrics:** Tokens consumed, steps taken, cost per solution

- **Reliability metrics:** Success rate variance over multiple runs

- **Harder benchmarks:** SWE-bench Verified (currently ~35% SOTA, not saturated) or full SWE-bench dataset (2,294 problems)

This positions the iterative predictor as a reference implementation demonstrating NeMo Agent toolkit's builder pattern and tool integration capabilities.

The PR description mentions 70% success rate but doesn't provide context about what this means for benchmark saturation. I recommend adding the baove to the Readme to provide context

updated as suggested. Thanks

mnajafian-nv · 2026-01-26T18:15:28Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

+    step_limit: 100
+    timeout: 60


The config sets step_limit to 100, but the dataclass default is 250. Document which takes precedence.

Added comments in config_iterative.yml:
step_limit: 100 # Overrides default (250)

mnajafian-nv · 2026-01-26T18:19:45Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+
+<format_example>
+THOUGHT: Your reasoning and analysis here
+


The system prompt is solid but could be enhanced with prompt engineering:

Few-shot examples of successful task completions

Explicit guidance on when to submit (after verification)

Tips for handling common errors

Thanks for the suggestions. I considered these but decided to keep the prompt as-is now.

few shot examples. The 'Recommended workflow' section in the prompt provides a high-level guidance similar to few shot examples but with low token cost and more flexibility.

guidance to submit. The prompt already include verification step in the 'Recommended workflow' section and a guidance of when to submit. Adding more might be redundant.

Error handling tips. Modern models lie GPT 5.2 and Sonnet 4.5 can already reason about error by returncode and stderr output. Specifying more might reduce the agent's problem-solving flexibility.

Thanks for the suggestions.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Around line 41-69: Rename the four pytest fixtures to use the fixture_ prefix
on the function name and supply the original fixture name via the decorator's
name argument; e.g., change the mock_llm fixture to a function named
fixture_mock_llm decorated with `@pytest.fixture`(name="mock_llm"), do the same
for agent_config -> fixture_agent_config (name="agent_config"), temp_repo_path
-> fixture_temp_repo_path (name="temp_repo_path"), and agent -> fixture_agent
(name="agent"); keep the same return values and dependencies (AsyncMock,
IterativeAgentConfig, tmp_path usage, and IterativeAgent construction) so
callers using the original fixture names continue to work.
- Around line 498-513: Replace the bare try/except in
test_register_cleanup_error_handling with an explicit assertion that the mocked
cleanup raises the expected exception: use pytest.raises(Exception) as excinfo
around await manager.cleanup() to assert the exception is raised, and add an
assertion on the exception message (e.g., "Cleanup failed") to make the intent
explicit; reference the RepoManager.cleanup mock and the async
test_register_cleanup_error_handling coroutine when locating where to change the
test.

🧹 Nitpick comments (2)

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py (2)
72-82: Add return type hint to helper function.

The create_llm_response helper function is missing a return type hint. While this is test code, adding type hints improves code clarity and aligns with best practices.
🔧 Proposed addition of return type hint
-def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50):
+def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50) -> MagicMock:
     """Helper to create a mock LLM response with token usage."""
     response = MagicMock()
695-695: Use underscore for intentionally unused variable.

The patch_result variable is unpacked but never used. Use an underscore to indicate it's intentionally unused.
🧹 Proposed fix for unused variable
-        exit_status, patch_result = await agent.run("Fix the bug in main.py")
+        exit_status, _ = await agent.run("Fix the bug in main.py")

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

Jerryguan777 · 2026-01-28T22:49:28Z

@mnajafian-nv Thank you so much for the detailed review feedback. It's really helpful.
I checked all the items you listed and here is a summary:

Command Injection Risk - I added a comprehensive block list as suggested but decided to keep shell=True and didn't add the allow list to keep the flexibility of agent and to allow agent to use shell features like pipe, redirect, heredocs, etc.
Test Coverage - I added a comprehensive unit test for the iterative predictor and git tools, achieving coverage 82% on git_tools and 73% on iterative predictor. ( iterative predictor's core functions are covered)
Reliability of Git operation - updated as suggested.
Documentation - added docstrings for all public methods
Metrics logging - added logger info key metrics (steps, time, total input tokens, total output tokens) when patch generation is finished
Bench Saturation - updated README as suggested

For more details please check my inline comments.

Great work. Implementation is clean and demonstrates NAT integration patterns. Left inline comments on technical improvements.

Critical (Must Fix)

Security: Command Injection Risk

subprocess.run(shell=True) with LLM-generated commands creates vulnerability

See inline comment with validation approach using command allowlist

Risk: Data loss, unauthorized access, system compromise

Test Coverage

Zero test coverage for iterative predictor, git tool, or agent loop

Complex async code with external dependencies (git, subprocess), multiple error paths, state management

Required:
test_iterative_agent_basic_flow() : End-to-end execution
test_format_error_recovery() : LLM output validation
test_timeout_handling() : Command timeout scenarios
test_workspace_isolation() : Concurrent instance isolation
test_repo_setup_and_checkout() : Git operations
test_cleanup() : Resource cleanup
Minimum: One integration test with mocked LLM on SWE-bench Lite example
NAT requires ≥80% coverage per general.mdc
Reliability: Git Operations

Clone and checkout operations lack timeout and error handling

See inline comments for implementation

Important (Should Fix) 4. Documentation

Ensure all public methods have complete docstrings with Args/Returns/Raises sections

Check: validate_command(), git_operations(), helper functions

Optional (P2) 5. Metrics Logging

Track efficiency metrics (steps, time, exit status) to demonstrate NAT value beyond correctness

Helps differentiate from simpler alternatives on saturating benchmarks

See inline suggestion in IterativeAgent.run()

Benchmark Context

70% result reflects foundation model capabilities (Sonnet 4.5/GPT-5.2) on saturating benchmark

Consider adding context about SWE-bench Lite saturation in README

review-notebook-app · 2026-01-28T23:01:04Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Line 695: The tuple assignment from agent.run currently binds an unused
variable patch_result; to satisfy ruff RUF059, rename the unused second binding
to a prefixed underscore (e.g., _patch_result or _) in the assignment where
exit_status, patch_result = await agent.run("Fix the bug in main.py") occurs so
that agent.run remains called and exit_status is preserved while ruff no longer
flags the unused variable.
- Around line 596-602: The test creates an unused binding `MockRepo` when
patching 'nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo',
causing a ruff F841 lint error; remove the unused name by changing the context
manager from "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as
MockRepo:" to simply "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'):" (or
"as _" if you prefer), leaving the rest of the test (the `mock_repo` MagicMock,
the patched asyncio.wait_for, and the call to clone_repository) unchanged.
- Around line 498-505: The test test_register_cleanup_error_handling uses a
hardcoded path when instantiating RepoManager (manager =
RepoManager("/tmp/test")), which triggers ruff S108; update the test signature
to accept the tmp_path pytest fixture and instantiate the manager with
RepoManager(str(tmp_path)) instead of the hardcoded string so it follows the
same pattern as other tests and avoids using /tmp directly.
- Around line 329-379: Replace hardcoded "/tmp/workspace" usage in the tests for
get_repo_path with pytest's tmp_path fixture to avoid predictable temp paths;
update each test to accept tmp_path and set workspace = tmp_path / "workspace"
(convert to str when calling get_repo_path if it expects a string), and update
the expected assertions (e.g., for test_no_instance_id_uses_default_path use
str(tmp_path / "workspace" / "myorg" / "myrepo") rather than "/tmp/..."). Ensure
all tests referencing get_repo_path (function name) use the tmp_path fixture
consistently and adapt any string comparisons accordingly.

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

coderabbitai · 2026-01-28T23:04:15Z

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

+        with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as MockRepo:
+            mock_repo = MagicMock()
+
+            with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.asyncio.wait_for') as mock_wait:
+                mock_wait.return_value = mock_repo
+
+                result = await clone_repository("https://github.com/org/repo", target_path)


⚠️ Potential issue | 🟡 Minor

Drop unused MockRepo binding.
The MockRepo variable is unused; ruff F841 will fail linting.

🛠️ Proposed fix

- with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as MockRepo: + with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'): mock_repo = MagicMock()

🧰 Tools

🪛 Ruff (0.14.14)

596-596: Local variable MockRepo is assigned to but never used

Remove assignment to unused variable MockRepo

(F841)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py` around lines 596 - 602, The test creates an unused binding `MockRepo` when patching 'nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo', causing a ruff F841 lint error; remove the unused name by changing the context manager from "with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as MockRepo:" to simply "with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'):" (or "as _" if you prefer), leaving the rest of the test (the `mock_repo` MagicMock, the patched asyncio.wait_for, and the call to clone_repository) unchanged.

coderabbitai · 2026-01-28T23:04:15Z

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

+                MagicMock(returncode=0, stdout="COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\ndiff --git a/main.py\n-old\n+new"),
+            ]
+
+            exit_status, patch_result = await agent.run("Fix the bug in main.py")


⚠️ Potential issue | 🟡 Minor

Prefix unused patch_result to satisfy ruff.
patch_result is unused; ruff RUF059 suggests prefixing with _.

🛠️ Proposed fix

- exit_status, patch_result = await agent.run("Fix the bug in main.py") + exit_status, _patch_result = await agent.run("Fix the bug in main.py")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

exit_status, patch_result = await agent.run("Fix the bug in main.py")

exit_status, _patch_result = await agent.run("Fix the bug in main.py")

🧰 Tools

🪛 Ruff (0.14.14)

695-695: Unpacked variable patch_result is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py` at line 695, The tuple assignment from agent.run currently binds an unused variable patch_result; to satisfy ruff RUF059, rename the unused second binding to a prefixed underscore (e.g., _patch_result or _) in the assignment where exit_status, patch_result = await agent.run("Fix the bug in main.py") occurs so that agent.run remains called and exit_status is preserved while ruff no longer flags the unused variable.

- Add IterativeAgent - Add config_iterative.yml - Add git tools - Add SweBenchPredictorIterativeConfig - Register iterative predictor and git tool - Update README.md Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Add instance_id to workspace path to prevent git conflicts between parallel instances. Each instance now uses .workspace/{instance_id}/org/repo instead of sharing .workspace/org/repo. Performance: 10 instances in ~8 min (vs ~30 min sequential) Results: 7/10 resolved (70%) with Claude Sonnet Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

…mmands Add security validation for LLM-generated bash commands before execution. Uses a blocklist approach to block dangerous patterns while maintaining full shell functionality for the agent. Blocked patterns include: - Destructive commands: rm -rf /, mkfs, fdisk, dd to devices - Privilege escalation: sudo, su, doas, pkexec, chmod 777 - Sensitive file access: /etc/shadow, ~/.ssh/, ~/.aws/ - Network exfiltration: wget/curl downloads, netcat reverse shells - Fork bombs Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Add timeout parameter to clone_repository (default 600s) and checkout_commit (default 120s) to prevent indefinite hangs - Add URL validation for repository URLs - Clean up partial clones on timeout or error - Improve logging for success and failure cases Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Handle GitCommandError with stderr details - Handle OSError for filesystem issues - Keep generic Exception as fallback Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Wrap cleanup in try-except to prevent masking original exceptions - Log success/failure for debugging - Don't raise on cleanup failure to allow graceful degradation Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Explain that 70% success rate reflects model capabilities rather than framework innovations, and provide guidance on metrics and harder benchmarks for evaluating framework improvements. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Add Args/Returns/Raises documentation to: - predict_iterative.py: add_message, run, _query_llm, _execute_action, predict_fn, _build_task_description, _build_prompts - git_tool.py: RepoManager.__init__, setup_repository, cleanup - register.py: git_repo_tool, git_operations Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Track steps, tokens, and time in IterativeAgent: - Add total_input_tokens and total_output_tokens counters - Extract token usage from LLM response metadata (OpenAI/Anthropic) - Log summary at agent completion with steps, tokens, time, status Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Add 51 unit tests covering: - Command validation security checks (safe/dangerous commands) - Basic agent flow (submission, token accumulation, step limits) - Format error recovery (no bash block, multiple blocks, dangerous commands) - Timeout handling (command timeout, timeout message content) - Workspace isolation (instance_id path separation) - Git operations (clone, checkout, URL validation, timeout) - Resource cleanup (repo cleanup, missing directory handling) - Integration test with mocked LLM Coverage: git_tool.py 82%, predict_iterative.py 73% (SweBenchPredictor.predict_fn requires full NAT environment) Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Rename pytest fixtures to use fixture_ prefix and name argument: - mock_llm -> fixture_mock_llm with name="mock_llm" - agent_config -> fixture_agent_config with name="agent_config" - temp_repo_path -> fixture_temp_repo_path with name="temp_repo_path" - agent -> fixture_agent with name="agent" Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Replace try-except-pass pattern with pytest.raises for clearer test intent. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Replace hardcoded /tmp paths with tmp_path fixture (S108) - Remove unused MockRepo variable (F841) - Prefix unused _patch_result with underscore (RUF059) Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

coderabbitai

Actionable comments posted: 8

🤖 Fix all issues with AI agents

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 22-31: Update the api_key environment variable names in the YAML
entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix;
replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the
claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in
the openai_llm block to follow the NAT_ environment variable naming convention.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`:
- Around line 1-14: The package initializer (__init__.py) for the
predict_iterative module is missing a public module docstring; add a short
Google-style module docstring at the top of __init__.py that briefly describes
the package purpose, the main functionality it exposes (e.g., iterative
predictors for NAT SWE benchmarking), and any important public symbols or
modules contained within the package; ensure the docstring follows Google-style
(one-line summary, blank line, short elaboration) and is placed before any code
or headers.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 16-26: Update the module docstring in predict_iterative.py to
replace the term “NAT” with the repo-preferred names: use “NeMo Agent Toolkit”
(or `nat`) where referring to the toolkit/API/CLI, and follow naming conventions
elsewhere (package name: `nvidia-nat`, env var prefix: `NAT_`, and
URLs/directories: `NeMo-Agent-Toolkit`) so the top-level docstring and any
mentions inside the file reflect these conventions.
- Around line 649-657: The three except blocks in the repo setup currently log
without full tracebacks for GitCommandError and OSError; update the
GitCommandError and OSError handlers to use logger.exception(...) (similar to
the generic Exception handler) so full stack traces are captured before
returning the error strings. Specifically, in the except GitCommandError as e
and except OSError as e blocks replace logger.error(...) with
logger.exception(...) and keep the same return values (referencing
GitCommandError, OSError, and the logger used in predict_iterative.py).
- Around line 86-172: The DANGEROUS_PATTERNS list in predict_iterative.py is
missing a rule to block shell metacharacters which allows command
chaining/substitution when using shell=True; add a compiled regex tuple to
DANGEROUS_PATTERNS (near other patterns) that detects pipes, semicolons,
backticks, dollar-parenthesis command substitution, and standalone redirection
chars while permitting the safe tokens && and || (e.g. a regex that flags any
occurrence of [|;`$()<>] but uses negative lookahead to allow && and ||), and
include an appropriate error message like "Shell metacharacters and command
chaining are not allowed"; ensure the new tuple uses re.compile and follows the
(re.Pattern, str) shape used by other entries.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 95-99: The except block that logs cleanup failures should use
logger.exception to record the full stack trace when not re-raising; update the
except handling around repo_manager.cleanup() to call
logger.exception("Workspace cleanup failed") (or similar) instead of
logger.error(..., exc_info=True) so the stack trace is captured automatically by
the logger.

In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Around line 41-83: Update the fixtures and helper to include the requested
type hints: change fixture_mock_llm() to return AsyncMock,
fixture_agent_config() to return IterativeAgentConfig,
fixture_temp_repo_path(tmp_path: Path) to return Path, fixture_agent(mock_llm:
AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) to return
IterativeAgent, and create_llm_response(content: str, input_tokens: int = 100,
output_tokens: int = 50) -> MagicMock; keep the existing AsyncMock,
IterativeAgentConfig, Path, IterativeAgent, and MagicMock imports and only add
the annotations on the functions named fixture_mock_llm, fixture_agent_config,
fixture_temp_repo_path, fixture_agent, and create_llm_response.
- Around line 140-142: Remove the unnecessary `@pytest.mark.asyncio` decorator
from all async test functions (e.g., test_basic_flow_to_submission and the other
async test coroutines in this file) because pytest-asyncio is configured with
asyncio_mode = auto and will auto-detect async tests; locate each async def
test_* coroutine and delete the `@pytest.mark.asyncio` line above it so the tests
remain async def functions without the decorator.

🧹 Nitpick comments (1)

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py (1)
617-635: Make the truncation test resilient to config changes.

The test hardcodes lengths and the expected elided count; it will break if max_output_length changes. Compute the long output length and expected elided count dynamically.
♻️ Suggested improvement
-        long_output = "x" * 2000  # Config has max_output_length=1000
+        long_output = "x" * (agent.config.max_output_length + 1000)

@@
-        assert "1000 characters elided" in truncated
+        assert f"{len(long_output) - max_length} characters elided" in truncated

coderabbitai · 2026-02-01T19:39:01Z

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

+  claude_sonnet_llm:
+    _type: litellm
+    model_name: anthropic/claude-sonnet-4-5-20250929
+    temperature: 0.0
+    api_key: "${ANTHROPIC_API_KEY}"
+  openai_llm:
+    _type: openai
+    model_name: gpt-5.2
+    temperature: 0.0
+    api_key: "${OPENAI_API_KEY}"


⚠️ Potential issue | 🟡 Minor

Use NAT_-prefixed environment variables for API keys.

✏️ Suggested update

- api_key: "${ANTHROPIC_API_KEY}" + api_key: "${NAT_ANTHROPIC_API_KEY}" @@ - api_key: "${OPENAI_API_KEY}" + api_key: "${NAT_OPENAI_API_KEY}"

As per coding guidelines, use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml` around lines 22 - 31, Update the api_key environment variable names in the YAML entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix; replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in the openai_llm block to follow the NAT_ environment variable naming convention.

I'll keep the standard environment variable names (OPENAI_API_KEY, ANTHROPIC_API_KEY) for consistency with:

Other examples in this repository (e.g., nat_autogen_demo, agno_personal_finance)

Industry-standard conventions that users likely already have configured

The NAT_ prefix pattern in this project is used for NAT-specific configs (OAuth, user IDs), not third-party API keys

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for API key environment variable usage patterns in examples directory echo "=== Searching for ANTHROPIC_API_KEY usage ===" rg 'ANTHROPIC_API_KEY' examples/ -n -C2 --type yaml echo -e "\n=== Searching for OPENAI_API_KEY usage ===" rg 'OPENAI_API_KEY' examples/ -n -C2 --type yaml echo -e "\n=== Searching for NAT_ prefixed API keys ===" rg 'NAT_.*API_KEY' examples/ -n -C2 --type yaml echo -e "\n=== Checking specific examples mentioned by user ===" fd -t f . examples/advanced_agents/nat_autogen_demo examples/advanced_agents/agno_personal_finance -e yml -e yaml -x cat {}

Length of output: 3286

@Jerryguan777, thank you for the clarification! You're absolutely right—I verified that other examples in the repository (nat_autogen_demo, agno_personal_finance) consistently use the standard OPENAI_API_KEY naming convention, and there are no NAT_-prefixed third-party API keys in the examples directory.

Your reasoning makes sense: using industry-standard environment variable names improves developer experience since users likely already have them configured. I appreciate you distinguishing between NAT-specific configuration variables and third-party service API keys.

🧠 Learnings used

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Applies to **/*.{py,md,toml,yaml,yml} : Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Applies to docs/source/**/*.md : Never use deprecated names 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' when referring to NeMo Agent Toolkit; update any occurrences found in documentation unless intentionally referencing deprecated names or compatibility layers.

Learnt from: willkill07 Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1512 File: docs/source/run-workflows/launching-ui.md:20-20 Timestamp: 2026-01-30T15:03:44.777Z Learning: In the NVIDIA/NeMo-Agent-Toolkit repository documentation, "NeMo Agent Toolkit" should always be spelled with a capital "T" in "Toolkit". The documentation guideline states: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit."

Learnt from: AnuradhaKaruppiah Repo: NVIDIA/NeMo-Agent-Toolkit PR: 701 File: examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-with-custom-post-process.yml:16-18 Timestamp: 2025-08-22T20:29:13.433Z Learning: In the NeMo-Agent-Toolkit project, configuration consistency across example files is prioritized over removing unused config options, even when those options aren't currently implemented in the runtime.

Learnt from: Jerryguan777 Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1485 File: examples/advanced_agents/sandbox_agent/docker-compose.yaml:31-33 Timestamp: 2026-01-26T01:05:01.390Z Learning: In the Sandbox Agent (examples/advanced_agents/sandbox_agent), host-side tools like web_search (Tavily) and youtube_transcript run on the host machine, not inside the Docker sandbox container. Therefore, their API keys (e.g., TAVILY_API_KEY) do not need to be passed as environment variables in docker-compose.yaml.

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Never commit API keys, credentials or personal data; use environment variables or '.env' files excluded from Git.

Learnt from: zterek Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243 File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98 Timestamp: 2025-12-12T20:49:51.370Z Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, most YAML configuration files use 2-space indentation rather than 4-space indentation. When reviewing YAML files, prefer consistency with the existing codebase style (2-space indentation) over the stated coding guideline (4-space indentation) until a repository-wide standardization is performed.

coderabbitai · 2026-02-01T19:39:01Z

...valuation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py

+# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.


⚠️ Potential issue | 🟡 Minor

Add a module docstring for the package initializer.
This public module currently has only the SPDX header; add a short Google-style module docstring describing the package.

✏️ Suggested docstring

@@ # limitations under the License. +"""Iterative SWE-bench predictor package.""" +

As per coding guidelines, Provide Google-style docstrings for every public module, class, function and CLI command.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py` around lines 1 - 14, The package initializer (__init__.py) for the predict_iterative module is missing a public module docstring; add a short Google-style module docstring at the top of __init__.py that briefly describes the package purpose, the main functionality it exposes (e.g., iterative predictors for NAT SWE benchmarking), and any important public symbols or modules contained within the package; ensure the docstring follows Google-style (one-line summary, blank line, short elaboration) and is placed before any code or headers.

Acknowledged. Keeping as-is for consistency with other __init__.py files in this project.

@Jerryguan777, understood. I'll skip this suggestion.

🧠 Learnings used

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Applies to {src,packages}/**/*.py : Provide Google-style docstrings for every public module, class, function and CLI command.

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

coderabbitai · 2026-02-01T19:39:01Z

..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

+        except GitCommandError as e:
+            logger.error("Git operation failed: %s", e, exc_info=True)
+            return f"Error: Git operation failed - {e.stderr}"
+        except OSError as e:
+            logger.error("Filesystem error: %s", e, exc_info=True)
+            return f"Error: Workspace setup failed - {str(e)}"
+        except Exception as e:
+            logger.exception("Unexpected error during repo setup: %s", e)
+            return f"Error: Setup failed - {str(e)}"


⚠️ Potential issue | 🟡 Minor

Use logger.exception when returning after errors.
You swallow these exceptions, so prefer logger.exception to capture full tracebacks.

🔧 Suggested change

- except GitCommandError as e: - logger.error("Git operation failed: %s", e, exc_info=True) + except GitCommandError as e: + logger.exception("Git operation failed") return f"Error: Git operation failed - {e.stderr}" - except OSError as e: - logger.error("Filesystem error: %s", e, exc_info=True) + except OSError as e: + logger.exception("Filesystem error") return f"Error: Workspace setup failed - {str(e)}"

As per coding guidelines, When catching and logging without re-raising, use logger.exception() to capture full stack trace information.

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 654-654: Use explicit conversion flag

Replace with conversion flag

(RUF010)

[warning] 656-656: Redundant exception object included in logging.exception call

(TRY401)

[warning] 657-657: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py` around lines 649 - 657, The three except blocks in the repo setup currently log without full tracebacks for GitCommandError and OSError; update the GitCommandError and OSError handlers to use logger.exception(...) (similar to the generic Exception handler) so full stack traces are captured before returning the error strings. Specifically, in the except GitCommandError as e and except OSError as e blocks replace logger.error(...) with logger.exception(...) and keep the same return values (referencing GitCommandError, OSError, and the logger used in predict_iterative.py).

Current code already captures full stack traces via exc_info=True.
logger.exception() is just syntactic sugar for logger.error(..., exc_info=True).
Keeping as-is for clarity - the explicit exc_info=True makes the intent more obvious.

@Jerryguan777, you're absolutely right that logger.error(..., exc_info=True) and logger.exception() are functionally equivalent—both capture the full stack trace.

That said, the project's coding guidelines do specify using logger.exception() when catching and logging without re-raising:

When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

The guideline appears to prioritize consistency across the codebase over individual preference. But if you feel the explicit form better serves readability for your use case, that's your call as the author.

🧠 Learnings used

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Applies to **/*.py : Preserve stack traces and prevent duplicate logging when handling exceptions. Use bare 'raise' statements when re-raising to maintain original stack trace, and use 'logger.error()' (not 'logger.exception()') to avoid duplicate logging. When catching and logging without re-raising, use 'logger.exception()' to capture full stack trace.

coderabbitai · 2026-02-01T19:39:01Z

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

+            try:
+                await repo_manager.cleanup()
+                logger.info("Workspace cleanup completed successfully")
+            except Exception as e:
+                logger.error("Workspace cleanup failed: %s", e, exc_info=True)


⚠️ Potential issue | 🟡 Minor

Use logger.exception when swallowing cleanup errors.
You’re not re-raising here, so prefer logger.exception to capture stack traces.

🔧 Suggested change

- except Exception as e: - logger.error("Workspace cleanup failed: %s", e, exc_info=True) + except Exception: + logger.exception("Workspace cleanup failed")

As per coding guidelines, When catching and logging without re-raising, use logger.exception() to capture full stack trace information.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py` around lines 95 - 99, The except block that logs cleanup failures should use logger.exception to record the full stack trace when not re-raising; update the except handling around repo_manager.cleanup() to call logger.exception("Workspace cleanup failed") (or similar) instead of logger.error(..., exc_info=True) so the stack trace is captured automatically by the logger.

Duplicate of previous comment. exc_info=True already captures full stack traces.

@Jerryguan777, you're absolutely right that both approaches capture the full stack trace. The coding guideline preference for logger.exception() is primarily about consistency across the codebase, but functionally they're equivalent. I'll defer to your choice here.

🐰💭

/\_/\ ( o.o ) Noted! > ^ <

🧠 Learnings used

Learnt from: CR Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0 File: .cursor/rules/general.mdc:0-0 Timestamp: 2026-01-31T03:35:56.118Z Learning: Applies to **/*.py : Preserve stack traces and prevent duplicate logging when handling exceptions. Use bare 'raise' statements when re-raising to maintain original stack trace, and use 'logger.error()' (not 'logger.exception()') to avoid duplicate logging. When catching and logging without re-raising, use 'logger.exception()' to capture full stack trace.

coderabbitai · 2026-02-01T19:39:01Z

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

+@pytest.fixture(name="mock_llm")
+def fixture_mock_llm():
+    """Create a mock LLM that returns configurable responses."""
+    llm = AsyncMock()
+    return llm
+
+
+@pytest.fixture(name="agent_config")
+def fixture_agent_config():
+    """Create a default agent configuration for testing."""
+    return IterativeAgentConfig(
+        step_limit=10,
+        timeout=5,
+        max_output_length=1000
+    )
+
+
+@pytest.fixture(name="temp_repo_path")
+def fixture_temp_repo_path(tmp_path):
+    """Create a temporary directory to simulate a repository."""
+    repo_path = tmp_path / "test_repo"
+    repo_path.mkdir()
+    return repo_path
+
+
+@pytest.fixture(name="agent")
+def fixture_agent(mock_llm, temp_repo_path, agent_config):
+    """Create an IterativeAgent instance with mocked dependencies."""
+    return IterativeAgent(mock_llm, temp_repo_path, agent_config)
+
+
+def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50):
+    """Helper to create a mock LLM response with token usage."""
+    response = MagicMock()
+    response.content = content
+    response.response_metadata = {
+        'token_usage': {
+            'prompt_tokens': input_tokens,
+            'completion_tokens': output_tokens
+        }
+    }
+    return response
+


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py | head -100

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3850

Add type hints to fixture and helper function signatures.

All fixtures and the helper function lack proper type annotations. Add return type hints for all fixtures and the helper, and add parameter type hints for fixture parameters (particularly tmp_path: Path).

Required additions:

fixture_mock_llm() -> AsyncMock

fixture_agent_config() -> IterativeAgentConfig

fixture_temp_repo_path(tmp_path: Path) -> Path

fixture_agent(mock_llm: AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) -> IterativeAgent

create_llm_response(...) -> MagicMock

All required types are already imported.

🤖 Prompt for AI Agents

In `@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py` around lines 41 - 83, Update the fixtures and helper to include the requested type hints: change fixture_mock_llm() to return AsyncMock, fixture_agent_config() to return IterativeAgentConfig, fixture_temp_repo_path(tmp_path: Path) to return Path, fixture_agent(mock_llm: AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) to return IterativeAgent, and create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50) -> MagicMock; keep the existing AsyncMock, IterativeAgentConfig, Path, IterativeAgent, and MagicMock imports and only add the annotations on the functions named fixture_mock_llm, fixture_agent_config, fixture_temp_repo_path, fixture_agent, and create_llm_response.

Won't fix - consistent with other test files in this project which don't use type hints for fixtures.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check other test files in the swe_bench tests directory for fixture type hint patterns echo "=== Checking fixture definitions in other test files ===" fd -e py -p 'examples/evaluation_and_profiling/swe_bench/tests/' --exec grep -l '@pytest.fixture' {} \; echo -e "\n=== Sample fixture definitions from other test files ===" # Look for fixture definitions with type hints fd -e py -p 'examples/evaluation_and_profiling/swe_bench/tests/' -x rg -A 3 '@pytest\.fixture' {} \; | head -100

Length of output: 911

🏁 Script executed:

#!/bin/bash # Check for other test files in the broader examples directory echo "=== All test files in examples directory ===" fd -e py -p 'examples/' -t f | rg 'test.*\.py$' echo -e "\n=== Sample fixture definitions from broader examples tests ===" fd -e py -p 'examples/' -t f | rg 'test.*\.py$' | xargs rg -A 3 '@pytest\.fixture' | head -150

Length of output: 19305

@Jerryguan777, I appreciate your response. However, the evidence shows that most test files in this project actually do use type hints for fixtures. For example:

examples/observability/simple_calculator_observability/tests/test_simple_calc_observability.py has extensive type hints: def question_fixture() -> str:, def config_dir_fixture(examples_dir: Path) -> Path:, etc.

examples/documentation_guides/tests/test_custom_workflow.py uses type hints consistently

examples/front_ends/simple_calculator_custom_routes/tests/test_simple_calculator_custom_routes.py also uses type hints

The project's coding guidelines explicitly state: "Python methods should use type hints for all parameters and return values."

That said, I understand your decision. I won't pursue this further.

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py

Add BYPASS_PATTERNS to detect shell metacharacter tricks that could evade command-based filtering: - Command substitution ($() and backticks) - Piping to shell interpreters (bash, sh, python, etc.) - Base64 encoded command execution - eval/exec execution - Here-string to interpreter - Process substitution - Hex escape execution - Environment variable injection - Xargs with shell execution - Source/dot command execution Refactor validation code into separate shell_validation.py module for better maintainability. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Rename "NAT framework" to "NeMo Agent Toolkit" for consistency - Remove redundant @pytest.mark.asyncio decorators (asyncio_mode=auto) - Regenerate uv.lock Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

- Validate instance_id to prevent path traversal attacks (.., /, \) - Validate repo_url to prevent IndexError from malformed URLs - Add tests for both validations Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Jerryguan777 requested a review from a team as a code owner January 15, 2026 02:26

coderabbitai bot reviewed Jan 15, 2026

View reviewed changes

Jerryguan777 force-pushed the feat/iterative-predictor branch from a39f138 to dbc2dd6 Compare January 15, 2026 06:29

coderabbitai bot reviewed Jan 15, 2026

View reviewed changes

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml Outdated Show resolved Hide resolved

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py Outdated Show resolved Hide resolved

Jerryguan777 mentioned this pull request Jan 15, 2026

Add Iterative Predictor for Improved SWE-bench Issue Resolution #1397

Open

2 tasks

Jerryguan777 changed the title ~~feat(swe-bench): implement iterative predictor for SWE-bench~~ feat(swe-bench): add an iterative predictor with ～70% success rate Jan 16, 2026

coderabbitai bot reviewed Jan 16, 2026

View reviewed changes

Jerryguan777 force-pushed the feat/iterative-predictor branch from 2ed8b41 to 1abbd6e Compare January 16, 2026 05:40

coderabbitai bot reviewed Jan 16, 2026

View reviewed changes

willkill07 reviewed Jan 17, 2026

View reviewed changes

coderabbitai bot reviewed Jan 17, 2026

View reviewed changes

AnuradhaKaruppiah assigned Jerryguan777 Jan 21, 2026

AnuradhaKaruppiah added non-breaking Non-breaking change feature request New feature or request labels Jan 21, 2026

coderabbitai bot reviewed Jan 22, 2026

View reviewed changes

...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py Outdated Show resolved Hide resolved

mnajafian-nv reviewed Jan 26, 2026

View reviewed changes

Jerryguan777 force-pushed the feat/iterative-predictor branch from 012d855 to af6e2f5 Compare January 28, 2026 22:28

coderabbitai bot reviewed Jan 28, 2026

View reviewed changes

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py Outdated Show resolved Hide resolved

examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py Outdated Show resolved Hide resolved

Jerryguan777 force-pushed the feat/iterative-predictor branch from cc296ff to 28e6f47 Compare January 28, 2026 23:00

Jerryguan777 requested a review from a team as a code owner January 28, 2026 23:00

coderabbitai bot reviewed Jan 28, 2026

View reviewed changes

feat(swe-bench): implement iterative predictor for SWE-bench

658a8f9

- Add IterativeAgent - Add config_iterative.yml - Add git tools - Add SweBenchPredictorIterativeConfig - Register iterative predictor and git tool - Update README.md Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Jerryguan777 added 18 commits February 1, 2026 11:26

fix CI warnings when PR

2395ac4

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

change nim_llm to nemotron-3-nano, add and fix license headers

6a69861

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

clean up existed repo folder when downloading

f80efce

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix CI warning: shutil.rmtree in an async function blocks the event loop

f8f430a

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix(swe-bench): remove redundant shutil import in cleanup method

61b02b2

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix(swe-bench): add specific exception handling for repo setup

26fda75

- Handle GitCommandError with stderr details - Handle OSError for filesystem issues - Keep generic Exception as fallback Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix(swe-bench): add error handling for workspace cleanup

03e0b46

- Wrap cleanup in try-except to prevent masking original exceptions - Log success/failure for debugging - Don't raise on cleanup failure to allow graceful degradation Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

docs(swe-bench): clarify step_limit override in config

3f62472

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

refactor(swe-bench): use pytest.raises for explicit exception assertion

4569449

Replace try-except-pass pattern with pytest.raises for clearer test intent. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix(swe-bench): resolve ruff linting warnings in tests

a140745

- Replace hardcoded /tmp paths with tmp_path fixture (S108) - Remove unused MockRepo variable (F841) - Prefix unused _patch_result with underscore (RUF059) Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

Jerryguan777 force-pushed the feat/iterative-predictor branch from e89ebed to a140745 Compare February 1, 2026 19:26

coderabbitai bot added the DO NOT MERGE PR should not be merged; see PR for details label Feb 1, 2026

coderabbitai bot reviewed Feb 1, 2026

View reviewed changes

Jerryguan777 added 3 commits February 1, 2026 21:06

chore(swe-bench): code review fixes

e8e7e82

- Rename "NAT framework" to "NeMo Agent Toolkit" for consistency - Remove redundant @pytest.mark.asyncio decorators (asyncio_mode=auto) - Regenerate uv.lock Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

fix(swe-bench): add input sanitization for get_repo_path

329c712

- Validate instance_id to prevent path traversal attacks (.., /, \) - Validate repo_url to prevent IndexError from malformed URLs - Add tests for both validations Signed-off-by: Jerry Guan <jerryguan777@gmail.com>

		# Register all the tools needed by the full predictor without loading the dependencies.
		import typing

		@@ -0,0 +1,505 @@
		# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

		@@ -0,0 +1,70 @@
		# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

		@@ -0,0 +1,118 @@
		# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

		@@ -0,0 +1,78 @@
		# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

+@dataclass
+class IterativeAgentConfig:
+    """Configuration for the iterative agent."""
+    step_limit: int = 250
+    timeout: int = 60
+    max_output_length: int = 10000
+# ========== ADD THIS NEW CODE HERE (after line 265) ==========
+# Allowed command prefixes for security
+ALLOWED_COMMANDS = {
+    'ls', 'cat', 'grep', 'find', 'head', 'tail', 'sed', 'awk',
+    'echo', 'cd', 'pwd', 'mkdir', 'touch', 'cp', 'mv',
+    'git', 'python', 'python3', 'pytest', 'pip',
+    'nl', 'wc', 'diff', 'sort', 'uniq', 'cut',
+}
+# Dangerous patterns that should be blocked
+DANGEROUS_PATTERNS = [
+    'rm -rf /',
+    'rm -rf ~',
+    'rm -rf *',
+    '> /dev/',
+    'dd if=',
+    'mkfs',
+    'fdisk',
+    ':(){:|:&};:',  # Fork bomb
+    'chmod 777',
+    'wget http',  # Prevent downloading arbitrary code
+    'curl http',
+]
+def validate_command(command: str) -> tuple[bool, str]:
+    """Validate that a command is safe to execute.
+    Args:
+        command: The bash command to validate
+    Returns:
+        Tuple of (is_valid, error_message). error_message is empty string if valid.
+    """
+    command_lower = command.lower().strip()
+    # Extract base command (first word, handling pipes and chains)
+    # This handles: "cd dir && ls", "cat file | grep pattern", etc.
+    base_cmd = command.strip().split()[0] if command.strip() else ""
+    # Remove common prefixes
+    for prefix in ['sudo', 'time', 'env']:
+        if base_cmd == prefix and len(command.split()) > 1:
+            base_cmd = command.split()[1]
+            break
+    # Check if base command is in allowlist
+    if base_cmd and base_cmd not in ALLOWED_COMMANDS:
+        return False, f"Command '{base_cmd}' is not in the allowed list. Allowed: {sorted(ALLOWED_COMMANDS)}"
+    # Check for dangerous patterns
+    for pattern in DANGEROUS_PATTERNS:
+        if pattern in command_lower:
+            return False, f"Command contains dangerous pattern: '{pattern}'"
+    # Block sudo
+    if 'sudo' in command_lower.split():
+        return False, "sudo is not allowed"
+    # Block rm -rf with very short paths (likely dangerous)
+    if 'rm -rf' in command_lower or 'rm -fr' in command_lower:
+        # Extract what comes after rm -rf
+        parts = command_lower.split('rm -rf')[-1].split('rm -fr')[-1].strip().split()[0] if 'rm -rf' in command_lower or 'rm -fr' in command_lower else ''
+        if parts in ['/', '~', '*', '.', '..'] or len(parts) < 3:
+            return False, f"Dangerous rm -rf pattern detected: 'rm -rf {parts}'"
+    return True, ""

-async def clone_repository(repo_url: str, target_path: Path) -> Repo:
-    """Clone a repository to the specified path."""
-    logger.info("Cloning repository %s to %s", repo_url, target_path)
-    if os.path.exists(target_path):
-        await asyncio.to_thread(shutil.rmtree, target_path)
-    # Use asyncio.to_thread to avoid blocking the event loop during clone operation
-    return await asyncio.to_thread(Repo.clone_from, repo_url, target_path)
+async def clone_repository(repo_url: str, target_path: Path, timeout: int = 300) -> Repo:
+  """Clone a repository with timeout and validation."""
+  logger.info("Cloning repository %s to %s", repo_url, target_path)
+  # Validate URL
+  if not (repo_url.startswith('https://') or repo_url.startswith('git@')):
+      raise ValueError(f"Invalid repository URL: {repo_url}")
+  # Clean existing path
+  if target_path.exists():
+      await asyncio.to_thread(shutil.rmtree, target_path)
+  try:
+      # Clone with timeout
+      repo = await asyncio.wait_for(
+          asyncio.to_thread(Repo.clone_from, repo_url, target_path),
+          timeout=timeout
+      )
+      logger.info("Successfully cloned %s", repo_url)
+      return repo
+  except asyncio.TimeoutError:
+      logger.error("Clone timeout for %s after %ds", repo_url, timeout)
+      if target_path.exists():
+          await asyncio.to_thread(shutil.rmtree, target_path)
+      raise
+  except Exception as e:
+      logger.error("Clone failed for %s: %s", repo_url, e)
+      if target_path.exists():
+          await asyncio.to_thread(shutil.rmtree, target_path)
+      raise

+### Benchmark Context (January 2026)
+The iterative predictor achieves 70% success rate on SWE-bench Lite, which primarily reflects the capabilities of modern foundation models (Claude Sonnet 4.5, GPT-5.2) rather than framework-specific innovations. SWE-bench Lite is approaching saturation at 70-80% with simple agent architectures.
+**For evaluating framework improvements beyond task correctness, consider tracking:**
+- **Efficiency metrics:** Tokens consumed, steps taken, cost per solution
+- **Reliability metrics:** Success rate variance over multiple runs
+- **Harder benchmarks:** SWE-bench Verified (currently ~35% SOTA, not saturated) or full SWE-bench dataset (2,294 problems)
+This positions the iterative predictor as a reference implementation demonstrating NeMo Agent toolkit's builder pattern and tool integration capabilities.

	exit_status, patch_result = await agent.run("Fix the bug in main.py")
	exit_status, _patch_result = await agent.run("Fix the bug in main.py")

feat(swe-bench): add an iterative predictor with ～70% success rate #1414

Are you sure you want to change the base?

feat(swe-bench): add an iterative predictor with ～70% success rate #1414

Uh oh!

Conversation

Jerryguan777 commented Jan 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How Has This Been Tested?

Description

By Submitting this PR I confirm:

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Jan 15, 2026

Uh oh!

coderabbitai bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jerryguan777 commented Jan 15, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 15, 2026 •

edited

Loading

AnuradhaKaruppiah commented Jan 21, 2026 •

edited

Loading