-
Notifications
You must be signed in to change notification settings - Fork 498
feat(swe-bench): add an iterative predictor with ~70% success rate #1414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
feat(swe-bench): add an iterative predictor with ~70% success rate #1414
Conversation
WalkthroughAdds an Iterative SWE-bench predictor and supporting tooling: an LLM-driven IterativeAgent that proposes and executes one bash command per step in isolated git workspaces, iterates with observations until completion or limits, and returns a git-diff patch; includes config, tool registration, example YAML, and tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as SWE-Bench Client
participant Predictor as SweBenchPredictor
participant Repo as RepoManager
participant Agent as IterativeAgent
participant LLM as LLM Backend
participant Executor as Command Executor
Client->>Predictor: predict_fn(swebench_input)
Predictor->>Repo: setup_repository(repo_url, base_commit, instance_id)
Repo-->>Predictor: RepoContext
Predictor->>Agent: instantiate + run(task_description)
loop until COMPLETE or limits
Agent->>LLM: query(prompt/messages)
LLM-->>Agent: response (one bash code block)
Agent->>Executor: execute(bash_command, timeout)
Executor->>Repo: run command in workspace
Repo-->>Executor: stdout/stderr/return_code
Executor-->>Agent: observation (may be truncated)
Agent->>Agent: add_message(assistant,response)
Agent->>Agent: add_message(user,observation)
Agent->>Agent: check for COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
end
alt Completed
Agent-->>Predictor: (patch, status)
else Error/Timeout/Limits
Agent-->>Predictor: (error_message, status)
end
Predictor->>Repo: cleanup()
Predictor-->>Client: final patch or error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 1-6: Add the standard SPDX Apache-2.0 license header as the very
first lines of the YAML (before the "llms" key); update the top of the file
containing the "llms" / "claude_sonnet_llm" entries to begin with the SPDX
Apache-2.0 header so the file complies with the repo policy.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 76-79: The async function clone_repository uses the synchronous
blocking call Repo.clone_from which will block the event loop; change the
implementation to run Repo.clone_from in a background thread (e.g., via
asyncio.to_thread) and await that result so the function remains async and
non-blocking. Locate the clone_repository function and replace the direct call
to Repo.clone_from(repo_url, target_path) with an awaited asyncio.to_thread call
(or equivalent executor) that invokes Repo.clone_from, and keep the logger.info
call as-is.
- Around line 82-85: The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 41-53: The git_operations function lacks input validation and
error handling: catch JSONDecodeError around json.loads(args_str) and return or
raise a clear error message, validate presence of 'operation' and for operation
== "setup" ensure required keys 'repo_url' and 'base_commit' exist before
calling repo_manager.setup_repository (raise ValueError or return a descriptive
error if missing), and wrap the repo_manager.setup_repository and
repo_manager.cleanup calls to catch and log exceptions so callers receive
actionable error messages referencing git_operations and
repo_manager.setup_repository/cleanup.
🧹 Nitpick comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
24-25: Unused import:FunctionRef
FunctionRefis imported but not used in this file. OnlyLLMRefis used for thellm_namefield.🧹 Remove unused import
from nat.data_models.common import TypedBaseModel -from nat.data_models.component_ref import FunctionRef from nat.data_models.component_ref import LLMRefexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)
37-67: Add type hints per coding guidelines.The class is missing type hints on
__init__,active_repos, andcleanup(). Per coding guidelines, all public APIs require type hints.📝 Add type hints
class RepoManager: + active_repos: dict[str, RepoContext] - def __init__(self, workspace_dir: str): + def __init__(self, workspace_dir: str) -> None: self.workspace = Path(workspace_dir) self.workspace.mkdir(parents=True, exist_ok=True) - self.active_repos = {} + self.active_repos: dict[str, RepoContext] = {} # ... setup_repository unchanged ... - async def cleanup(self): + async def cleanup(self) -> None: """Clean up all managed repositories."""
25-34: Misleading docstring: not a context manager.The docstring states "Context manager for repository operations" but
RepoContextis a plain dataclass without__enter__/__exit__methods. Consider updating the docstring to reflect its actual purpose as a data container.📝 Fix docstring
`@dataclass` class RepoContext: - """Context manager for repository operations.""" + """Data container holding repository state and paths.""" repo_url: strexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
25-29: Redundant_typefield.The
_typefield is redundant sinceTypedBaseModel(parent ofFunctionBaseConfig) already manages the type discriminator vianame="git_repo_tool". This creates potential confusion with two type fields.🧹 Remove redundant field
class GitRepoToolConfig(FunctionBaseConfig, name="git_repo_tool"): """Configuration for git repository management tool.""" - _type: typing.Literal["git_repo_tool"] = "git_repo_tool" workspace_dir: str = "./.workspace" # Base directory for cloning repositories cleanup_on_exit: bool = True # Whether to clean up repos after use
32-60: Unusedbuilderparameter is acceptable for interface consistency.The
builderparameter is unused (as flagged by static analysis) but is likely required by theregister_functiondecorator's expected signature. The cleanup pattern usingtry/finallyis well implemented.Consider adding a return type hint for the async generator:
📝 Add return type hint
+from collections.abc import AsyncGenerator + `@register_function`(config_type=GitRepoToolConfig) -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]: """Git repository management tool for SWE Bench."""examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (6)
76-81: Consider adding docstrings for the configuration fields.The dataclass lacks documentation for its fields. While the class docstring exists, individual field descriptions would improve clarity.
📝 Suggested improvement
`@dataclass` class IterativeAgentConfig: """Configuration for the iterative agent.""" - step_limit: int = 250 - timeout: int = 60 - max_output_length: int = 10000 + step_limit: int = 250 # Maximum number of agent steps before termination + timeout: int = 60 # Command execution timeout in seconds + max_output_length: int = 10000 # Maximum characters before output truncation
105-110: Add type hint forllmparameter.The
llmparameter lacks a type annotation. Per coding guidelines, all public APIs require type hints on parameters.📝 Suggested fix
- def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig): + def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig): self.llm = llm self.repo_path = repo_path self.config = config - self.messages: list = [] + self.messages: list[SystemMessage | HumanMessage | AIMessage] = [] self.n_steps = 0Note: Add
import typingat the top if not already present. Ideally, use the actual LLM interface type if available from the framework.
360-363: Chain exception and use explicit conversion.Per coding guidelines, use
raise ... from errto preserve the exception chain and use explicit conversion flag instead ofstr(e).🔧 Proposed fix
except Exception as e: logger.error("LLM invocation failed: %s", e, exc_info=True) - # recoverable error, let the agent continue - raise NonTerminatingException(f"LLM call failed: {str(e)}") + # recoverable error, let the agent continue + raise NonTerminatingException(f"LLM call failed: {e!s}") from e
414-427: Chain exceptions and narrow the exception type.Multiple issues flagged by static analysis:
- Missing exception chaining at lines 425 and 427
- Catching broad
Exceptionat line 426 masks specific errors🔧 Proposed fix
except (TimeoutError, subprocess.TimeoutExpired) as e: # Extract output from exception if available (only subprocess.TimeoutExpired has output attribute) if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and e.output: output = e.output.decode("utf-8", errors="replace") else: output = "" # Format timeout message using template timeout_message = self._TIMEOUT_TEMPLATE.format( action=command, output=output ) - raise ExecutionTimeoutError(timeout_message) - except Exception as e: - raise NonTerminatingException(f"Error executing command: {str(e)}") + raise ExecutionTimeoutError(timeout_message) from e + except OSError as e: + raise NonTerminatingException(f"Error executing command: {e!s}") from eUsing
OSError(orsubprocess.SubprocessError) is more appropriate than catching all exceptions, as it covers typical subprocess failures without masking unexpected errors.
462-464: Remove redundant exception object fromlogger.exception.When using
logger.exception(), the exception info is automatically included. Includingeas an argument is redundant (TRY401).🔧 Proposed fix
except Exception as e: - logger.exception("Failed to setup repository: %s", e) - return f"Error: Failed to setup repository - {str(e)}" + logger.exception("Failed to setup repository") + return f"Error: Failed to setup repository - {e!s}"
493-495: Remove redundant exception object and use explicit conversion.Same pattern as above -
logger.exception()automatically includes exception info.🔧 Proposed fix
except Exception as e: - logger.exception(f"Error processing {swebench_input.instance_id}: {e}") - return f"Error: {str(e)}" + logger.exception("Error processing %s", swebench_input.instance_id) + return f"Error: {e!s}"
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
examples/evaluation_and_profiling/swe_bench/README.mdexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/__init__.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{md,mdx}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{md,mdx}: Use 'NVIDIA NeMo Agent toolkit' for full name (first use), 'NeMo Agent toolkit' or 'the toolkit' for subsequent references, and 'Toolkit' (capital T) in titles/headings, 'toolkit' (lowercase t) in body text
Never use deprecated names: 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' in documentation; update any occurrences unless intentionally referring to deprecated versions or implementing compatibility layers
Files:
examples/evaluation_and_profiling/swe_bench/README.md
**/*.{md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{md,mdx,rst}: Documentation must be clear, comprehensive, and free of TODOs, FIXMEs, placeholder text, offensive or outdated terms, and spelling mistakes
Do not use words listed in 'ci/vale/styles/config/vocabularies/nat/reject.txt' in documentation
Words listed in 'ci/vale/styles/config/vocabularies/nat/accept.txt' are acceptable even if they appear to be spelling mistakes
Files:
examples/evaluation_and_profiling/swe_bench/README.md
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template
Files:
examples/evaluation_and_profiling/swe_bench/README.mdexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.{py,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs
Files:
examples/evaluation_and_profiling/swe_bench/README.mdexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of
None,
in that situation no return type hint is needed).
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.
- Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").- Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...
Files:
examples/evaluation_and_profiling/swe_bench/README.mdexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/**/*
⚙️ CodeRabbit configuration file
examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.
- If an example contains Python code, it should be placed in a subdirectory named
src/and should
contain apyproject.tomlfile. Optionally, it might also contain scripts in ascripts/directory.- If an example contains YAML files, they should be placed in a subdirectory named
configs/. - If an example contains sample data files, they should be placed in a subdirectory nameddata/, and should
be checked into git-lfs.
Files:
examples/evaluation_and_profiling/swe_bench/README.mdexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*.{py,yaml,yml,json,toml}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Indent with 4 spaces (never tabs) and ensure every file ends with a single newline
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧠 Learnings (1)
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (2)
src/nat/data_models/common.py (3)
TypedBaseModel(96-171)static_type(157-158)discriminator(165-171)src/nat/data_models/component_ref.py (2)
FunctionRef(94-102)LLMRef(116-124)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (3)
src/nat/builder/function_info.py (2)
FunctionInfo(290-625)from_fn(552-625)src/nat/data_models/function.py (1)
FunctionBaseConfig(26-36)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
RepoManager(37-67)setup_repository(44-58)cleanup(60-67)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
context(93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
SweBenchPredictor(431-502)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
Builder(84-811)src/nat/builder/framework_enum.py (1)
LLMFrameworkEnum(19-27)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
SweBenchWorkflowConfig(51-52)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
33-33: Unused function argument: builder
(ARG001)
53-53: Avoid specifying long messages outside the exception class
(TRY003)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
127-127: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
359-359: Consider moving this statement to an else block
(TRY300)
363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
363-363: Avoid specifying long messages outside the exception class
(TRY003)
363-363: Use explicit conversion flag
Replace with conversion flag
(RUF010)
379-379: subprocess call with shell=True identified, security issue
(S602)
412-412: Consider moving this statement to an else block
(TRY300)
425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
426-426: Do not catch blind exception: Exception
(BLE001)
427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
427-427: Use explicit conversion flag
Replace with conversion flag
(RUF010)
463-463: Redundant exception object included in logging.exception call
(TRY401)
464-464: Use explicit conversion flag
Replace with conversion flag
(RUF010)
494-494: Redundant exception object included in logging.exception call
(TRY401)
495-495: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🔇 Additional comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
40-49: LGTM!The
SweBenchPredictorIterativeConfigfollows the established pattern, with appropriate type hints and field descriptions. The discriminated union is correctly extended to include the new iterative variant.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
70-73: LGTM!Simple helper with proper type hints.
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py (1)
19-19: LGTM!The import correctly triggers registration of the
git_repo_toolvia its decorator, following the established pattern in this file.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)
20-20: The import follows the established pattern for predictor registration.The addition of
IterativePredictoron line 20 mirrors the existingGoldPredictorimport on line 19, with properflake8: noqadirectives to allow unused imports (which are intentionally present to trigger registration side-effects). The file maintains proper Apache 2.0 licensing and copyright headers.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (4)
1-14: LGTM!License header is correctly formatted with SPDX identifier and Apache 2.0 license text.
52-74: LGTM!Well-structured exception hierarchy with clear separation between recoverable (
NonTerminatingException) and terminal (TerminatingException) conditions. This enables clean control flow in the agent loop.
376-389: Acknowledged:shell=Trueis intentional for this agent design.The static analysis flags
shell=Trueas a security concern (S602). However, this is expected behavior since the agent is explicitly designed to execute arbitrary bash commands provided by the LLM. The commands are sandboxed within the repository workspace directory.
497-502: LGTM!The
_build_task_descriptionmethod is clean and well-typed with appropriate type hints.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (2)
23-35: LGTM!Workflow and functions configuration is well-structured with appropriate iterative predictor settings and git tool configuration. The 2-space indentation follows repository conventions. Based on learnings, this matches the existing YAML style in the repository.
37-70: LGTM!Evaluation configuration is properly structured with dataset filtering for focused testing. The single-instance allowlist is appropriate for initial development/testing.
examples/evaluation_and_profiling/swe_bench/README.md (1)
159-159: LGTM!Documentation entry for the iterative predictor follows the established pattern and provides clear, concise description with appropriate links to implementation and configuration files.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
| async def checkout_commit(repo: Repo, commit_hash: str): | ||
| """Checkout a specific commit in the repository.""" | ||
| logger.info("Checking out commit %s", commit_hash) | ||
| repo.git.checkout(commit_hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same blocking I/O issue; also missing return type hint.
repo.git.checkout() is synchronous. Wrap with asyncio.to_thread(). Per coding guidelines, add return type hint.
🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
"""Checkout a specific commit in the repository."""
logger.info("Checking out commit %s", commit_hash)
- repo.git.checkout(commit_hash)
+ await asyncio.to_thread(repo.git.checkout, commit_hash)🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 82 - 85, The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py`:
- Around line 69-74: Docstring for SweBenchWorkflowConfig incorrectly lists a
"full" predictor option that doesn't exist; update the docstring to reflect the
actual predictor options defined in SweBenchPredictorConfig (e.g., "gold,
skeleton, or iterative") or enumerate the exact union variants from
SweBenchPredictorConfig so the docs and code are consistent; locate the class
SweBenchWorkflowConfig and its docstring and replace "full, gold, skeleton, or
iterative" with the correct set of predictor types from SweBenchPredictorConfig.
- Around line 24-25: Remove the unused import FunctionRef from the top of the
module: delete the "FunctionRef" import token in the import statement that
currently reads "from nat.data_models.component_ref import FunctionRef" so only
LLMRef remains imported (referenced symbol: FunctionRef).
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 71-74: get_repo_path currently builds a path from only the repo
name causing collisions; update get_repo_path to parse the repo URL and extract
the owner/organization component (e.g., the segment immediately preceding the
repo name for HTTPS and the part after ":" for SSH forms) and return
Path(workspace_dir) / owner / repo_name so repositories with the same name under
different orgs are distinct; ensure you handle URLs like
"https://host/org/repo.git" and "git@host:org/repo.git" and strip ".git" from
repo_name.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 32-38: The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
16-28: Duplicatellmskey causes configuration to be overwritten.The YAML has two separate
llms:keys (lines 16 and 23). In YAML, duplicate keys at the same level cause the second to overwrite the first, meaningnim_llmwill be silently discarded and onlyclaude_sonnet_llmwill be available.Additionally,
nim_llmuses 1-space indentation whileclaude_sonnet_llmuses 2-space indentation. Per learnings, the repository uses 2-space indentation for YAML files.🔧 Proposed fix - merge into single llms block with consistent 2-space indentation
-llms: - nim_llm: - _type: nim - model_name: mistralai/mistral-nemotron - temperature: 0.6 - max_tokens: 4096 - -llms: - claude_sonnet_llm: - _type: litellm - model_name: anthropic/claude-sonnet-4-5-20250929 - temperature: 0.0 - api_key: "${ANTHROPIC_API_KEY}" # Set this environment variable before running +llms: + nim_llm: + _type: nim + model_name: mistralai/mistral-nemotron + temperature: 0.6 + max_tokens: 4096 + + claude_sonnet_llm: + _type: litellm + model_name: anthropic/claude-sonnet-4-5-20250929 + temperature: 0.0 + api_key: "${ANTHROPIC_API_KEY}" # Set this environment variable before running
🧹 Nitpick comments (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)
84-88: Add return type hint forcheckout_commit.Per coding guidelines, all public APIs require type hints on return values. This function returns
Noneimplicitly.🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str): +async def checkout_commit(repo: Repo, commit_hash: str) -> None: """Checkout a specific commit in the repository.""" logger.info("Checking out commit %s", commit_hash) # Use asyncio.to_thread to avoid blocking the event loop during checkout await asyncio.to_thread(repo.git.checkout, commit_hash)
38-43: Add type hint foractive_reposdictionary.Per coding guidelines, type hints are required. The dictionary maps repo paths to
RepoContextobjects.🔧 Proposed fix
class RepoManager: def __init__(self, workspace_dir: str): self.workspace = Path(workspace_dir) self.workspace.mkdir(parents=True, exist_ok=True) - self.active_repos = {} + self.active_repos: dict[str, RepoContext] = {}examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
80-82: Remove trailing blank lines.YAMLlint reports too many blank lines at the end of the file. Files should end with a single newline.
🔧 Proposed fix
evaluators: swe_bench: _type: swe_bench run_id: nat_iterative_1 clean: true -
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml,json,toml}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Indent with 4 spaces (never tabs) and ensure every file ends with a single newline
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of
None,
in that situation no return type hint is needed).
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.
- Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").- Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/**/*
⚙️ CodeRabbit configuration file
examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.
- If an example contains Python code, it should be placed in a subdirectory named
src/and should
contain apyproject.tomlfile. Optionally, it might also contain scripts in ascripts/directory.- If an example contains YAML files, they should be placed in a subdirectory named
configs/. - If an example contains sample data files, they should be placed in a subdirectory nameddata/, and should
be checked into git-lfs.
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
src/nat/data_models/function.py (1)
FunctionBaseConfig(26-36)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
RepoManager(38-68)setup_repository(45-59)cleanup(61-68)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
BaseModelRegistryTag(91-93)TypedBaseModel(96-171)static_type(157-158)discriminator(165-171)src/nat/data_models/component_ref.py (2)
FunctionRef(94-102)LLMRef(116-124)src/nat/data_models/function.py (1)
FunctionBaseConfig(26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
context(93-94)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
33-33: Unused function argument: builder
(ARG001)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
59-59: Avoid specifying long messages outside the exception class
(TRY003)
🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
[error] 23-23: duplication of key "llms" in mapping
(key-duplicates)
[warning] 82-82: too many blank lines (2 > 0)
(empty-lines)
🔇 Additional comments (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
77-81: LGTM - blocking I/O properly handled.The use of
asyncio.to_threadcorrectly offloads the blockingRepo.clone_fromcall to a thread pool, preventing event loop blocking.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
41-59: Input validation improvements look good.The JSON parsing is now wrapped in a try-except, and required keys are validated before use. This addresses the past review feedback.
61-66: Good resource cleanup pattern withfinallyblock.The use of
try/finallyensuresrepo_manager.cleanup()is called even if an exception occurs during the tool's lifecycle, preventing orphaned repository directories.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
51-61: LGTM -SweBenchPredictorIterativeConfigis well-structured.The configuration properly uses
LLMReffor the LLM reference, andFieldwith descriptions forstep_limitandtimeout. The defaults (250 steps, 60s timeout) are reasonable for iterative problem solving.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
1-14: LGTM - SPDX license header is present.The required Apache-2.0 license header has been added, addressing the past review feedback.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
| @register_function(config_type=GitRepoToolConfig) | ||
| async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): | ||
| """Git repository management tool for SWE Bench.""" | ||
| import json | ||
|
|
||
| from .git_tool import RepoManager | ||
| repo_manager = RepoManager(tool_config.workspace_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find and read the register.py file
fd "register.py" -p "examples/evaluation_and_profiling/swe_bench"
# Get the full content to understand context
find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py" -type f -exec wc -l {} \;Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 450
🏁 Script executed:
# Search for all `@register_function` decorated functions to check builder parameter usage
rg "@register_function" -A 3 --type=pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 50381
🏁 Script executed:
# Find the definition of register_function decorator
rg "def register_function" -B 2 -A 10 --type=pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 6591
🏁 Script executed:
cat -n examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3277
Prefix unused builder parameter with underscore.
The builder parameter is required by the @register_function interface but is not used in this function. Follow the established pattern in the codebase by renaming it to _builder to indicate intentional non-use.
🔧 Suggested fix
`@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
"""Git repository management tool for SWE Bench."""🧰 Tools
🪛 Ruff (0.14.11)
33-33: Unused function argument: builder
(ARG001)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 32 - 38, The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.
a39f138 to
dbc2dd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 16-28: The YAML defines two separate top-level llms mappings which
causes the first (nim_llm) to be overwritten by the second (claude_sonnet_llm);
merge both entries under a single llms key so both nim_llm and claude_sonnet_llm
are present, and fix nim_llm’s indentation to match the file’s 2-space style;
locate the nim_llm and claude_sonnet_llm blocks and combine them into one llms
mapping preserving their model_name, temperature, max_tokens and api_key fields.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 33-35: RepoContext.__post_init__ sets repo_path = base_path /
repo_name which omits the organization and disagrees with get_repo_path; update
__post_init__ so repo_path includes the org (e.g., repo_path = base_path /
self.org_name / self.repo_name) or call the existing get_repo_path logic to
compute it, ensuring RepoContext.repo_path matches the path used by
setup_repository and clone operations.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
88-92: Add missing return type hint.Per coding guidelines, all public functions require type hints. This async function returns
None.🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str): +async def checkout_commit(repo: Repo, commit_hash: str) -> None: """Checkout a specific commit in the repository."""
🧹 Nitpick comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
81-82: Remove extra trailing blank line.YAMLlint reports too many blank lines at the end. Files should end with exactly one newline.
🔧 Proposed fix
clean: true -examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
1-1: Copyright year should be updated to 2025-2026.Other files in this PR use
2025-2026in the copyright header. This file uses only2025.🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
32-34: Prefix unusedbuilderparameter with underscore.The
builderparameter is required by the@register_functioninterface but is unused. Follow the codebase convention by renaming to_builder.🔧 Proposed fix
`@register_function`(config_type=GitRepoToolConfig) -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): """Git repository management tool for SWE Bench."""examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
1-1: Copyright year should be updated to 2025-2026.For consistency with other files in this PR.
🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Indent with 4 spaces (never tabs) and ensure every file ends with a single newline
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of
None,
in that situation no return type hint is needed).
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.
- Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").- Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*
⚙️ CodeRabbit configuration file
examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.
- If an example contains Python code, it should be placed in a subdirectory named
src/and should
contain apyproject.tomlfile. Optionally, it might also contain scripts in ascripts/directory.- If an example contains YAML files, they should be placed in a subdirectory named
configs/. - If an example contains sample data files, they should be placed in a subdirectory nameddata/, and should
be checked into git-lfs.
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
BaseModelRegistryTag(91-93)TypedBaseModel(96-171)static_type(157-158)discriminator(165-171)src/nat/data_models/component_ref.py (1)
LLMRef(116-124)src/nat/data_models/function.py (1)
FunctionBaseConfig(26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
context(93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
RepoManager(38-68)setup_repository(45-59)cleanup(61-68)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
33-33: Unused function argument: builder
(ARG001)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
59-59: Avoid specifying long messages outside the exception class
(TRY003)
🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
[error] 23-23: duplication of key "llms" in mapping
(key-duplicates)
[warning] 82-82: too many blank lines (2 > 0)
(empty-lines)
🔇 Additional comments (6)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
50-60: LGTM! Well-structured iterative predictor configuration.The new
SweBenchPredictorIterativeConfigis properly defined with appropriate type hints and sensible defaults. TheLLMReftype correctly references the component reference pattern used elsewhere in the codebase.Minor nit: Line 57 has trailing whitespace after the closing triple quotes.
62-66: Discriminated union correctly extended.The
SweBenchPredictorConfigunion properly includes the new iterative config using the establishedTag/Discriminatorpattern.
68-74: Docstring now correctly lists available predictor types.The docstring accurately reflects the available options: "gold, skeleton, or iterative".
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
41-59: Good input validation and error handling.The function properly validates JSON input, checks for required keys on the
setupoperation, and provides clear error messages. This addresses the concerns from prior reviews.
61-66: Correct async generator pattern with guaranteed cleanup.The
try/finallyensurescleanup()runs when the tool is disposed, respecting thecleanup_on_exitconfig flag.examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
81-85: LGTM! Blocking I/O correctly wrapped withasyncio.to_thread.The clone operation now properly yields to the event loop, addressing the prior review concern.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 1-2: Update the SPDX copyright header to include 2026 by modifying
the SPDX-FileCopyrightText line (and/or SPDX-License-Identifier line if needed)
so the year range reads "2025, 2026" or "2025-2026" as per project convention;
locate the top-of-file SPDX header in predict_iterative.py and adjust the year
accordingly.
- Around line 414-425: The except block handling subprocess.TimeoutExpired
should guard the type of e.output before decoding: in the except (TimeoutError,
subprocess.TimeoutExpired) handler (inside the predict_iterative code path),
check if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and
e.output, then if isinstance(e.output, bytes) decode with .decode("utf-8",
errors="replace") else treat e.output as the string directly; assign that to the
local output variable used in self._TIMEOUT_TEMPLATE and raise
ExecutionTimeoutError(timeout_message) as before to avoid calling .decode on a
str when text=True was used.
- Around line 449-451: The parsing of swebench_input.instance_id into repo_name
using split('-')[0] breaks for hyphenated org/repo names; update the extraction
to take the suffix after the last hyphen (use rsplit('-', 1)) so repo_name
contains the full org__repo string, then continue to split into org, repo and
build repo_url as before (look for repo_name, org, repo, repo_url in
predict_iterative.py and replace the split logic accordingly).
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 1-2: Update the SPDX header to include 2026 by changing the
copyright year range in the top-of-file comments in git_tool.py (the two SPDX
comment lines); for example replace the single year "2025" with a range like
"2025-2026" (or include "2026" alongside 2025) so the SPDX-FileCopyrightText and
SPDX-License-Identifier lines reflect the 2026 change.
- Around line 76-98: The get_repo_path function incorrectly parses SSH-style
URLs like "git@github.com:org/repo.git"; update get_repo_path to normalize
repo_url first (handle git@...:org/repo.git by replacing the first colon after
host with a slash or by extracting the substring after the last ':' when an '@'
is present) so that parts = repo_url.rstrip('/').split('/') yields correct
segments; keep existing logic using repo_name, org_name, workspace_dir and
instance_id unchanged but derive org_name and repo_name from the normalized URL
to support both SSH and HTTP(S) formats.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 1-2: Update the SPDX header to include 2026 by modifying the
existing copyright SPDX lines at the top of the file (the lines containing
"SPDX-FileCopyrightText" and "SPDX-License-Identifier") so the copyright year
range includes 2026 (e.g., change "2025" to "2025, 2026" or "2025-2026")
ensuring the SPDX tags remain intact.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
32-33: Rename unusedbuilderparameter to_builder.The parameter is required by the decorator but unused. Prefixing with
_clarifies intent and avoids lint warnings.🔧 Suggested fix
`@register_function`(config_type=GitRepoToolConfig) -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): """Git repository management tool for SWE Bench."""
🧹 Nitpick comments (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
66-73: Avoid blocking the event loop incleanup.
shutil.rmtreeis synchronous and can block. Run it in a thread (or makecleanupsync) to keep async responsiveness. Based on learnings, avoid blocking I/O in async paths.♻️ Proposed fix
async def cleanup(self): """Clean up all managed repositories.""" import shutil for repo_path_str in list(self.active_repos.keys()): repo_path = Path(repo_path_str) if repo_path.exists(): - shutil.rmtree(repo_path) + await asyncio.to_thread(shutil.rmtree, repo_path) self.active_repos.clear()examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
105-110: Add a type hint forllminIterativeAgent.__init__.Public APIs should carry parameter type hints; use
typing.Any(or a protocol) to make intent explicit. As per coding guidelines, add missing type annotations.🔧 Proposed fix
-import asyncio -import json -import logging -import re -import subprocess +import asyncio +import json +import logging +import re +import subprocess +import typing @@ - def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig): + def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig): self.llm = llm
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Indent with 4 spaces (never tabs) and ensure every file ends with a single newline
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of
None,
in that situation no return type hint is needed).
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.
- Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").- Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*
⚙️ CodeRabbit configuration file
examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.
- If an example contains Python code, it should be placed in a subdirectory named
src/and should
contain apyproject.tomlfile. Optionally, it might also contain scripts in ascripts/directory.- If an example contains YAML files, they should be placed in a subdirectory named
configs/. - If an example contains sample data files, they should be placed in a subdirectory nameddata/, and should
be checked into git-lfs.
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
Builder(84-811)src/nat/builder/framework_enum.py (1)
LLMFrameworkEnum(19-27)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
SweBenchWorkflowConfig(68-74)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
RepoManager(34-73)setup_repository(41-64)cleanup(66-73)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
127-127: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
359-359: Consider moving this statement to an else block
(TRY300)
363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
363-363: Avoid specifying long messages outside the exception class
(TRY003)
363-363: Use explicit conversion flag
Replace with conversion flag
(RUF010)
379-379: subprocess call with shell=True identified, security issue
(S602)
412-412: Consider moving this statement to an else block
(TRY300)
425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
426-426: Do not catch blind exception: Exception
(BLE001)
427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
427-427: Use explicit conversion flag
Replace with conversion flag
(RUF010)
464-464: Redundant exception object included in logging.exception call
(TRY401)
465-465: Use explicit conversion flag
Replace with conversion flag
(RUF010)
495-495: Redundant exception object included in logging.exception call
(TRY401)
496-496: Use explicit conversion flag
Replace with conversion flag
(RUF010)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
33-33: Unused function argument: builder
(ARG001)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)
1-77: Config layout and defaults look good.LLM backends, iterative predictor settings, tool registration, and evaluator settings are coherent and well-structured.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
Outdated
Show resolved
Hide resolved
..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
Show resolved
Hide resolved
..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
Outdated
Show resolved
Hide resolved
2ed8b41 to
1abbd6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 105-113: Update the public methods __init__ and add_message to
include explicit type hints and Google-style docstrings: annotate llm with
typing.Any (or the concrete LLM interface), add return type None for both
__init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a
suitable typed container) and self.n_steps as int; update the __init__ signature
(def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) ->
None) and add_message signature (def add_message(self, role: str, content: str)
-> None), and add Google-style docstrings for both methods including an Args
section for parameters and a Returns section (even if None).
- Around line 478-481: The SweBenchPredictorIterativeConfig is missing a
max_output_length field and thus the IterativeAgentConfig isn't receiving a
custom truncation limit; add max_output_length: int = Field(default=10000,
description="Maximum output length for command results") to the
SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the
IterativeAgentConfig construction in predict_iterative.py (the agent_config =
IterativeAgentConfig(...) block) to include
max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or
self.config.predictor.max_output_length) so the value is passed through to
IterativeAgentConfig.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 16-18: Add a Google-style module docstring at the top of
register.py (immediately after the existing first line) describing the module as
a public registry for tools used by the full predictor, its purpose, what it
exports/registry behavior, and any important usage notes; ensure the docstring
follows Google docstring conventions (short summary line, blank line, more
detailed description and any public symbols) so the module is properly
documented for public consumption and tooling.
♻️ Duplicate comments (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
33-33: Rename unusedbuilderto_builder.This parameter is required by the decorator interface but unused; prefixing with
_matches the existing convention.🔧 Suggested fix
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]: +async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder) -> AsyncIterator[FunctionInfo]:
32-34: Add return type hint togit_repo_tool.The function is a public API (decorated with
@register_function) and yieldsFunctionInfo, but lacks a return type hint. Per coding guidelines, all public APIs require type hints on return values.♻️ Suggested update
-import typing +import typing +from collections.abc import AsyncIterator @@ -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]:
🧹 Nitpick comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
360-427: Preserve exception context when wrapping errors.Chaining with
from ekeeps the original stack trace for debugging.🔧 Suggested change
@@ - raise NonTerminatingException(f"LLM call failed: {str(e)}") + raise NonTerminatingException(f"LLM call failed: {str(e)}") from e @@ - raise ExecutionTimeoutError(timeout_message) + raise ExecutionTimeoutError(timeout_message) from e @@ - raise NonTerminatingException(f"Error executing command: {str(e)}") + raise NonTerminatingException(f"Error executing command: {str(e)}") from eAs per coding guidelines, preserve exception context when wrapping errors.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.ymlexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
- examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Indent with 4 spaces (never tabs) and ensure every file ends with a single newline
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of
None,
in that situation no return type hint is needed).
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.
- Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
(ex: "the system's performance" should be "the performance of the system").- Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
The exception to this rule is when referring to package names or code identifiers that contain "nat", th...
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*
⚙️ CodeRabbit configuration file
examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.
- If an example contains Python code, it should be placed in a subdirectory named
src/and should
contain apyproject.tomlfile. Optionally, it might also contain scripts in ascripts/directory.- If an example contains YAML files, they should be placed in a subdirectory named
configs/. - If an example contains sample data files, they should be placed in a subdirectory nameddata/, and should
be checked into git-lfs.
Files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (1)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Confirm that copyright years are up-to-date whenever a file is changed
Applied to files:
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyexamples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
Builder(84-811)src/nat/builder/framework_enum.py (1)
LLMFrameworkEnum(19-27)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
SweBenchWorkflowConfig(68-74)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (4)
src/nat/builder/function_info.py (2)
FunctionInfo(290-625)from_fn(552-625)src/nat/agent/react_agent/output_parser.py (1)
_type(103-104)examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
RepoManager(35-74)setup_repository(42-65)cleanup(67-74)src/nat/runtime/runner.py (1)
context(93-94)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
127-127: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
359-359: Consider moving this statement to an else block
(TRY300)
363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
363-363: Avoid specifying long messages outside the exception class
(TRY003)
363-363: Use explicit conversion flag
Replace with conversion flag
(RUF010)
379-379: subprocess call with shell=True identified, security issue
(S602)
412-412: Consider moving this statement to an else block
(TRY300)
425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
426-426: Do not catch blind exception: Exception
(BLE001)
427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
427-427: Use explicit conversion flag
Replace with conversion flag
(RUF010)
464-464: Redundant exception object included in logging.exception call
(TRY401)
465-465: Use explicit conversion flag
Replace with conversion flag
(RUF010)
495-495: Redundant exception object included in logging.exception call
(TRY401)
496-496: Use explicit conversion flag
Replace with conversion flag
(RUF010)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
33-33: Unused function argument: builder
(ARG001)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
25-29: Config model looks good.Clear defaults and simple surface area for the tool configuration.
40-63: Input validation and error messages are solid.Good defensive handling for malformed JSON and unknown operations.
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
76-81: Config defaults look sensible.The step limit, timeout, and output limits are reasonable starting points.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig): | ||
| self.llm = llm | ||
| self.repo_path = repo_path | ||
| self.config = config | ||
| self.messages: list = [] | ||
| self.n_steps = 0 | ||
|
|
||
| def add_message(self, role: str, content: str): | ||
| """Add a message to the conversation and print it for debugging.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and read the relevant section
find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" -type fRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 182
🏁 Script executed:
# Read the file to see the current state around lines 105-113
cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '100,120p'Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1222
🏁 Script executed:
# Check the import statements at the top of the file
head -30 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1169
🏁 Script executed:
# Check for typing imports in the file
head -50 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | tail -25Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 744
🏁 Script executed:
# Find the class definition and verify this is a public class
grep -n "^class " "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 456
🏁 Script executed:
# Get more context around the __init__ and add_message methods
cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '60,125p'Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3254
Add type hints and Google‑style docstrings for public methods.
__init__ and add_message are public methods that require complete type hints and Google‑style docstrings per coding guidelines. Currently:
__init__is missing type hint forllmand return type annotationadd_messageis missing return type annotation and complete Google‑style docstring (needs Args section)
✏️ Proposed update
@@
-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Any
@@
- def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+ def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None:
+ """Initialize the iterative agent.
+
+ Args:
+ llm: LLM instance used for chat completion.
+ repo_path: Path to the repository workspace.
+ config: Agent configuration.
+ """
@@
- def add_message(self, role: str, content: str):
- """Add a message to the conversation and print it for debugging."""
+ def add_message(self, role: str, content: str) -> None:
+ """Add a message to the conversation and print it for debugging.
+
+ Args:
+ role: Message role ("system", "user"/"human", or "assistant"/"ai").
+ content: Message text.
+ """🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 105 - 113, Update the public methods __init__ and add_message to
include explicit type hints and Google-style docstrings: annotate llm with
typing.Any (or the concrete LLM interface), add return type None for both
__init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a
suitable typed container) and self.n_steps as int; update the __init__ signature
(def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) ->
None) and add_message signature (def add_message(self, role: str, content: str)
-> None), and add Google-style docstrings for both methods including an Args
section for parameters and a Returns section (even if None).
| agent_config = IterativeAgentConfig( | ||
| step_limit=getattr(self.config.predictor, 'step_limit', 250), | ||
| timeout=getattr(self.config.predictor, 'timeout', 60) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n "max_output_length" examples/evaluation_and_profiling/swe_bench/src -g "*.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 389
🏁 Script executed:
# Find the IterativeAgentConfig definition
fd -t f "*.py" examples/evaluation_and_profiling/swe_bench/src | xargs rg -l "class IterativeAgentConfig"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 481
🏁 Script executed:
# Search for max_output_length more broadly in the codebase
rg -n "max_output_length" --type py | head -50Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 389
🏁 Script executed:
head -100 examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | tail -30Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1225
🏁 Script executed:
sed -n '70,90p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 656
🏁 Script executed:
sed -n '470,490p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 851
🏁 Script executed:
# Find IterativeAgentConfig definition
rg -A 10 "class IterativeAgentConfig" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1580
🏁 Script executed:
# Find the predictor config model definition
rg -B 5 -A 20 "class.*PredictorConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 51
🏁 Script executed:
# Also search for config field definitions related to max_output_length
rg -B 3 -A 3 "max_output_length" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2114
🏁 Script executed:
sed -n '1,100p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3486
🏁 Script executed:
# Search for config class definitions in the file
rg "^class.*Config" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py -A 5Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 229
🏁 Script executed:
# Look for where self.config is defined
rg -B 5 "self.config.predictor" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | head -30Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 601
🏁 Script executed:
# Look for config model definitions
rg -B 2 -A 15 "class SweBenchWorkflowConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1030
🏁 Script executed:
# Search for config files and their definitions
fd -t f "config.py" examples/evaluation_and_profiling/swe_bench/srcRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 140
🏁 Script executed:
# Check what's in the config module
cat examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3137
🏁 Script executed:
# Verify max_output_length is actually used in IterativeAgent
sed -n '395,410p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 851
Add max_output_length to the predictor config model and pass it through to IterativeAgentConfig.
The predictor config (SweBenchPredictorIterativeConfig) is missing the max_output_length field, so the suggested getattr fix alone won't enable custom truncation limits. You need to add this field to the config model first, then pass it through here.
🔧 Required changes
In nat_swe_bench/config.py, add to SweBenchPredictorIterativeConfig:
max_output_length: int = Field(default=10000, description="Maximum output length for command results")Then in predict_iterative.py at lines 478-481:
agent_config = IterativeAgentConfig(
step_limit=getattr(self.config.predictor, 'step_limit', 250),
- timeout=getattr(self.config.predictor, 'timeout', 60)
+ timeout=getattr(self.config.predictor, 'timeout', 60),
+ max_output_length=getattr(self.config.predictor, 'max_output_length', 10000),
)🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 478 - 481, The SweBenchPredictorIterativeConfig is missing a
max_output_length field and thus the IterativeAgentConfig isn't receiving a
custom truncation limit; add max_output_length: int = Field(default=10000,
description="Maximum output length for command results") to the
SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the
IterativeAgentConfig construction in predict_iterative.py (the agent_config =
IterativeAgentConfig(...) block) to include
max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or
self.config.predictor.max_output_length) so the value is passed through to
IterativeAgentConfig.
| # Register all the tools needed by the full predictor without loading the dependencies. | ||
| import typing | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's check if this file exists and read its content
cd examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools
ls -la register.py
wc -l register.py
head -50 register.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2189
🏁 Script executed:
# Check the coding guidelines for examples directory
cat .cursor/rules/general.mdc | grep -A 20 "examples/\*\*/\*"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 51
🏁 Script executed:
# Find Python files in examples directory and check their docstrings
find examples -name "*.py" -type f | head -10Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 822
🏁 Script executed:
# Check a few example Python files to see their docstring patterns
fd -e py examples | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && head -20 {}'Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 51
🏁 Script executed:
# Check specific example files for module docstrings
echo "=== A2A register.py ===" && head -25 examples/A2A/math_assistant_a2a/src/nat_math_assistant_a2a/register.py
echo ""
echo "=== HITL register.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/register.py
echo ""
echo "=== extract_por_tool.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/extract_por_tool.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2989
🏁 Script executed:
# Check the .cursor/rules/general.mdc file more thoroughly for any examples directory guidance
cat .cursor/rules/general.mdc | tail -50Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2989
🏁 Script executed:
# Also check if there are any other examples modules with docstrings
grep -r '"""' examples --include="*.py" | grep -A 2 "^examples" | head -20Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1960
Add a module docstring for this public registry.
Public modules require Google-style docstrings. This file currently starts with a comment only; add a docstring following the first line to describe the module's purpose.
✏️ Proposed docstring
@@
-# Register all the tools needed by the full predictor without loading the dependencies.
+"""Register iterative SWE-bench tool integrations."""
+
+# Register all the tools needed by the full predictor without loading the dependencies.🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 16 - 18, Add a Google-style module docstring at the top of
register.py (immediately after the existing first line) describing the module as
a public registry for tools used by the full predictor, its purpose, what it
exports/registry behavior, and any important usage notes; ensure the docstring
follows Google docstring conventions (short summary line, blank line, more
detailed description and any public symbols) so the module is properly
documented for public consumption and tooling.
| _type: swe_bench | ||
| predictor: | ||
| _type: iterative | ||
| llm_name: "claude_sonnet_llm" # "nim_llm" or "claude_sonnet_llm" or "openai_llm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have the default be openai_llm or nim_llm ?
| llms: | ||
| nim_llm: | ||
| _type: nim | ||
| model_name: mistralai/mistral-nemotron |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we switch this to the new(er) Nemotron Nano 3 model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @willkill07,
Thanks for the feedback. I've set the default LLM to openai_llm. I also tried the latest nvidia/nemotron-3-nano-30b-a3b, but response.content is always empty. Any suggestion?
btw, I also tried nvidia/llama-3.3-nemotron-super-49b-v1.5 and it worked well. Successfully resolved one issue in 19 steps.
| _type: litellm | ||
| model_name: openai/gpt-5.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| _type: litellm | |
| model_name: openai/gpt-5.2 | |
| _type: openai | |
| model_name: gpt-5.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All files (even if no code) must contain the license headers
| @@ -0,0 +1,505 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was introduced in 2026
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| @@ -0,0 +1,70 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| @@ -0,0 +1,118 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need license header
| @@ -0,0 +1,78 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 353-363: The exception wrapper in _query_llm currently raises
NonTerminatingException without chaining the original exception; change the
raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so
the original traceback is preserved, and apply the same fix to the other
NonTerminatingException raise site in this module (the other error-handling
block that currently does "raise NonTerminatingException(...)" in the same
class/file).
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 35-41: Add a Google-style class docstring to the public class
RepoManager describing its responsibility (managing local git repositories
within a workspace), what it does on init (creates workspace directory), and
list key arguments/attributes: Args: workspace_dir (str) — path to workspace;
Attributes: workspace (Path) and active_repos (dict). Place the docstring
directly below the class RepoManager declaration and keep it concise (one-line
summary + short paragraph + Args and Attributes sections).
- Around line 1-16: Add a Google-style module docstring immediately below the
SPDX header in git_tool.py that briefly describes the module's purpose (public
Git tooling utilities used by the iterative predictor), lists the primary public
symbols it provides (e.g., any public functions or classes in this module), and
any usage/behavior notes; ensure it follows Google docstring conventions and
sits at the top of the file before imports (i.e., above the existing import
asyncio).
- Around line 77-104: The get_repo_path function must validate repo_url and
sanitise instance_id to prevent path traversal and index errors: ensure repo_url
yields at least two path components before accessing parts[-2] (raise ValueError
if malformed), and validate instance_id (if provided) to be a safe token—trim
whitespace and reject values containing path separators ('/', '\'), '..', or any
characters outside an allowed set such as alphanumerics, hyphen and underscore
(or use a simple regex like ^[A-Za-z0-9_-]+$); only use the sanitized
instance_id when constructing the Path and document/raise a clear error for
invalid inputs so callers cannot escape workspace directories via crafted inputs
in get_repo_path, repo_name, org_name, and instance_id.
- Around line 67-74: The cleanup method lacks a return type and performs
blocking filesystem work; update async def cleanup(self) to async def
cleanup(self) -> None and offload synchronous shutil.rmtree calls to a thread
using asyncio.to_thread (import asyncio if not already), e.g. await
asyncio.to_thread(shutil.rmtree, repo_path) while keeping the loop over
self.active_repos and clearing it afterward; ensure shutil is imported/import
available in the scope where cleanup is defined (or keep the local import) and
await the to_thread calls for each repo_path in the cleanup method.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 33-41: The docstring for git_repo_tool currently describes the
wrapper but should document the logical JSON interface presented to the LLM:
update the git_repo_tool docstring to explain the JSON payload accepted by the
inner async function git_operations (keys: "operation" with allowed operations,
"repo_url" for the git repository, "base_commit" for the commit/branch/ref, and
"instance_id" if applicable), sketch expected types/values and example payloads,
and mention that RepoManager (constructed from tool_config.workspace_dir) will
perform the requested operations; do not describe builder or wrapper internals,
only the LLM-visible JSON interface.
♻️ Duplicate comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
1-18: Add a module docstring for the tool registry.This public module should expose a concise Google-style docstring after the SPDX header.
As per coding guidelines, public modules need Google-style docstrings.✏️ Proposed fix
@@ -# Register all the tools needed by the full predictor without loading the dependencies. +"""Register iterative SWE-bench tool integrations.""" + +# Register all the tools needed by the full predictor without loading the dependencies.
32-34: Prefix unusedbuilderparameter with underscore.This parameter is required by the decorator but unused; underscore it to signal intent.
🔧 Proposed fix
@@ `@register_function`(config_type=GitRepoToolConfig) -async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder): +async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (2)
105-130: Add type hints and Google-style docstrings for public methods.
__init__andadd_messageare public; they should include explicit type hints and Google-style docstrings, plus typed attributes.As per coding guidelines, public methods need type hints and Google-style docstrings.✏️ Proposed fix
@@ -from dataclasses import dataclass +from dataclasses import dataclass +from typing import Any@@ - def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig): + def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None: + """Initialize the iterative agent. + + Args: + llm: LLM instance used for chat completion. + repo_path: Path to the repository workspace. + config: Agent configuration. + """ self.llm = llm self.repo_path = repo_path self.config = config - self.messages: list = [] - self.n_steps = 0 + self.messages: list[AIMessage | HumanMessage | SystemMessage] = [] + self.n_steps: int = 0 @@ - def add_message(self, role: str, content: str): - """Add a message to the conversation and print it for debugging.""" + def add_message(self, role: str, content: str) -> None: + """Add a message to the conversation and print it for debugging. + + Args: + role: Message role ("system", "user"/"human", or "assistant"/"ai"). + content: Message text. + """
477-481: Pass throughmax_output_lengthto the agent config.Right now the predictor ignores any configured output limit; pass the value through so it can be tuned.
🔧 Proposed fix
@@ agent_config = IterativeAgentConfig( - step_limit=getattr(self.config.predictor, 'step_limit', 250), - timeout=getattr(self.config.predictor, 'timeout', 60) + step_limit=getattr(self.config.predictor, "step_limit", 250), + timeout=getattr(self.config.predictor, "timeout", 60), + max_output_length=getattr(self.config.predictor, "max_output_length", 10000), )
🧹 Nitpick comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
379-389: Run LLM-provided commands inside a hardened sandbox.Because commands are executed with
shell=Truefrom model output, ensure this runs in a tightly isolated environment (drop privileges, restrict filesystem scope, and disable network) to contain potential misuse.
| async def _query_llm(self) -> str: | ||
| """Query LLM and return response content.""" | ||
| try: | ||
| response = await self.llm.ainvoke(self.messages) | ||
| content = response.content if hasattr(response, 'content') else str(response) | ||
| self.add_message("assistant", content) | ||
| return content | ||
| except Exception as e: | ||
| logger.error("LLM invocation failed: %s", e, exc_info=True) | ||
| # recoverable error, let the agent continue | ||
| raise NonTerminatingException(f"LLM call failed: {str(e)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve exception context when wrapping errors.
Wrapping exceptions without from e loses the original traceback and makes debugging harder; chain the cause explicitly.
♻️ Proposed fix
@@
- raise NonTerminatingException(f"LLM call failed: {str(e)}")
+ raise NonTerminatingException(f"LLM call failed: {e}") from e
@@
- raise ExecutionTimeoutError(timeout_message)
+ raise ExecutionTimeoutError(timeout_message) from e
@@
- raise NonTerminatingException(f"Error executing command: {str(e)}")
+ raise NonTerminatingException(f"Error executing command: {e}") from eAlso applies to: 414-427
🧰 Tools
🪛 Ruff (0.14.13)
359-359: Consider moving this statement to an else block
(TRY300)
363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
363-363: Avoid specifying long messages outside the exception class
(TRY003)
363-363: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 353 - 363, The exception wrapper in _query_llm currently raises
NonTerminatingException without chaining the original exception; change the
raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so
the original traceback is preserved, and apply the same fix to the other
NonTerminatingException raise site in this module (the other error-handling
block that currently does "raise NonTerminatingException(...)" in the same
class/file).
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| import asyncio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a module docstring for public git tooling.
This public module lacks a Google-style docstring; add one just below the SPDX header.
✏️ Proposed fix
@@
# limitations under the License.
-
+"""Git repository utilities for the iterative SWE-bench predictor."""
+🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 1 - 16, Add a Google-style module docstring immediately below the
SPDX header in git_tool.py that briefly describes the module's purpose (public
Git tooling utilities used by the iterative predictor), lists the primary public
symbols it provides (e.g., any public functions or classes in this module), and
any usage/behavior notes; ensure it follows Google docstring conventions and
sits at the top of the file before imports (i.e., above the existing import
asyncio).
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Show resolved
Hide resolved
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
Show resolved
Hide resolved
70% is very impressive. If you have the compute resources do you think you would be able to get a run on all 500? |
|
/ok to test 4a1f4af |
|
Hi @AnuradhaKaruppiah , Success rate: 53.7% (158/294) Patch generation stage:Evaluation Stage
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 109-115: The pre-clone cleanup in clone_repository currently calls
blocking shutil.rmtree inside an async function; move that call into a
background thread using asyncio.to_thread so the event loop isn't blocked. Wrap
the existing os.path.exists(target_path) check and shutil.rmtree(target_path)
invocation in await asyncio.to_thread(...) (or call asyncio.to_thread only for
rmtree if you prefer to keep the exists check synchronous), then proceed to
await asyncio.to_thread(Repo.clone_from, repo_url, target_path) as already done
to perform the clone.
♻️ Duplicate comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (4)
1-15: Add a Google-style module docstring.
Public modules should declare their purpose and key symbols; insert a short docstring under the SPDX header. As per coding guidelines, add a Google-style module docstring.✏️ Proposed fix
# limitations under the License. +"""Git repository utilities for the iterative SWE-bench predictor. + +Provides `RepoContext`, `RepoManager`, `get_repo_path`, `clone_repository`, and +`checkout_commit`. +""" + import asyncio
37-42: DocumentRepoManageras a public class.
Add a concise Google-style class docstring with key responsibilities and attributes. As per coding guidelines.✏️ Proposed fix
class RepoManager: + """Manage cloned repositories within a workspace. + + Args: + workspace_dir: Base workspace directory for clones. + + Attributes: + workspace: Workspace directory as a `Path`. + active_repos: Mapping of repo paths to `RepoContext`. + """ def __init__(self, workspace_dir: str):
69-76: Avoid blockingshutil.rmtreein async cleanup.
shutil.rmtreeis blocking; offload to a thread to keep the event loop responsive. Based on learnings.♻️ Proposed fix
async def cleanup(self): """Clean up all managed repositories.""" - import shutil for repo_path_str in list(self.active_repos.keys()): repo_path = Path(repo_path_str) if repo_path.exists(): - shutil.rmtree(repo_path) + await asyncio.to_thread(shutil.rmtree, repo_path) self.active_repos.clear()
92-103: Sanitiseinstance_idand validaterepo_urlparts.
Prevent path traversal and malformed URLs from escaping the workspace. As per coding guidelines.🔒 Proposed fix
- parts = path.strip("/").split("/") - repo_name = parts[-1].replace('.git', '') - org_name = parts[-2] # Organization name + parts = [p for p in path.strip("/").split("/") if p] + if len(parts) < 2: + raise ValueError(f"Invalid repo URL: {repo_url}") + repo_name = parts[-1].removesuffix(".git") + org_name = parts[-2] # Organization name @@ - if instance_id: - return Path(workspace_dir) / instance_id / org_name / repo_name + if instance_id: + safe_instance_id = Path(instance_id).name + if safe_instance_id != instance_id or safe_instance_id in {".", ".."}: + raise ValueError("instance_id must be a single safe path segment") + return Path(workspace_dir) / safe_instance_id / org_name / repo_name
...ion_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Outdated
Show resolved
Hide resolved
mnajafian-nv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Implementation is clean and demonstrates NAT integration patterns. Left inline comments on technical improvements.
Critical (Must Fix)
- Security: Command Injection Risk
- subprocess.run(shell=True) with LLM-generated commands creates vulnerability
- See inline comment with validation approach using command allowlist
- Risk: Data loss, unauthorized access, system compromise
- Test Coverage
- Zero test coverage for iterative predictor, git tool, or agent loop
- Complex async code with external dependencies (git, subprocess), multiple error paths, state management
- Required:
test_iterative_agent_basic_flow() : End-to-end execution
test_format_error_recovery() : LLM output validation
test_timeout_handling() : Command timeout scenarios
test_workspace_isolation() : Concurrent instance isolation
test_repo_setup_and_checkout() : Git operations
test_cleanup() : Resource cleanup
Minimum: One integration test with mocked LLM on SWE-bench Lite example
NAT requires ≥80% coverage per general.mdc
- Reliability: Git Operations
- Clone and checkout operations lack timeout and error handling
- See inline comments for implementation
Important (Should Fix)
4. Documentation
- Ensure all public methods have complete docstrings with Args/Returns/Raises sections
- Check: validate_command(), git_operations(), helper functions
Optional (P2)
5. Metrics Logging
- Track efficiency metrics (steps, time, exit status) to demonstrate NAT value beyond correctness
- Helps differentiate from simpler alternatives on saturating benchmarks
- See inline suggestion in IterativeAgent.run()
- Benchmark Context
- 70% result reflects foundation model capabilities (Sonnet 4.5/GPT-5.2) on saturating benchmark
- Consider adding context about SWE-bench Lite saturation in README
| step_limit: int = 250 | ||
| timeout: int = 60 | ||
| max_output_length: int = 10000 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @dataclass | |
| class IterativeAgentConfig: | |
| """Configuration for the iterative agent.""" | |
| step_limit: int = 250 | |
| timeout: int = 60 | |
| max_output_length: int = 10000 | |
| # ========== ADD THIS NEW CODE HERE (after line 265) ========== | |
| # Allowed command prefixes for security | |
| ALLOWED_COMMANDS = { | |
| 'ls', 'cat', 'grep', 'find', 'head', 'tail', 'sed', 'awk', | |
| 'echo', 'cd', 'pwd', 'mkdir', 'touch', 'cp', 'mv', | |
| 'git', 'python', 'python3', 'pytest', 'pip', | |
| 'nl', 'wc', 'diff', 'sort', 'uniq', 'cut', | |
| } | |
| # Dangerous patterns that should be blocked | |
| DANGEROUS_PATTERNS = [ | |
| 'rm -rf /', | |
| 'rm -rf ~', | |
| 'rm -rf *', | |
| '> /dev/', | |
| 'dd if=', | |
| 'mkfs', | |
| 'fdisk', | |
| ':(){:|:&};:', # Fork bomb | |
| 'chmod 777', | |
| 'wget http', # Prevent downloading arbitrary code | |
| 'curl http', | |
| ] | |
| def validate_command(command: str) -> tuple[bool, str]: | |
| """Validate that a command is safe to execute. | |
| Args: | |
| command: The bash command to validate | |
| Returns: | |
| Tuple of (is_valid, error_message). error_message is empty string if valid. | |
| """ | |
| command_lower = command.lower().strip() | |
| # Extract base command (first word, handling pipes and chains) | |
| # This handles: "cd dir && ls", "cat file | grep pattern", etc. | |
| base_cmd = command.strip().split()[0] if command.strip() else "" | |
| # Remove common prefixes | |
| for prefix in ['sudo', 'time', 'env']: | |
| if base_cmd == prefix and len(command.split()) > 1: | |
| base_cmd = command.split()[1] | |
| break | |
| # Check if base command is in allowlist | |
| if base_cmd and base_cmd not in ALLOWED_COMMANDS: | |
| return False, f"Command '{base_cmd}' is not in the allowed list. Allowed: {sorted(ALLOWED_COMMANDS)}" | |
| # Check for dangerous patterns | |
| for pattern in DANGEROUS_PATTERNS: | |
| if pattern in command_lower: | |
| return False, f"Command contains dangerous pattern: '{pattern}'" | |
| # Block sudo | |
| if 'sudo' in command_lower.split(): | |
| return False, "sudo is not allowed" | |
| # Block rm -rf with very short paths (likely dangerous) | |
| if 'rm -rf' in command_lower or 'rm -fr' in command_lower: | |
| # Extract what comes after rm -rf | |
| parts = command_lower.split('rm -rf')[-1].split('rm -fr')[-1].strip().split()[0] if 'rm -rf' in command_lower or 'rm -fr' in command_lower else '' | |
| if parts in ['/', '~', '*', '.', '..'] or len(parts) < 3: | |
| return False, f"Dangerous rm -rf pattern detected: 'rm -rf {parts}'" | |
| return True, "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address command injection risk introduced by line 381-382
# Current code
return subprocess.run(
command,
shell=True, # Dangerous
cwd=str(self.repo_path),
...
)
Using subprocess.run(shell=True) with LLM-generated commands creates potential command injection vulnerability. While the command comes from LLM (not direct user input), malicious prompts or prompt injection attacks could cause the LLM to generate destructive commands (rm -rf /, data exfiltration, etc.) which could lead to data loss, unauthorized access, or system compromise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mnajafian-nv , thanks for your detailed security review. Here are my thoughts on this:
- The block list is absolutely necessary. I‘v added all the patterns you listed, plus a few edge cases
- It's better to not use the allow list. It would limit the capability and flexibility of LLM. As stated in the prompt, LLM has "a full linux shell environment" and can even install new tools if needed. The LLMs might need commands we can't anticipate upfront. So I went with a comprehensive block list, instead of a restrictive allow list.
- It's better to keep shell=True. We still need this because the LLM generated commands usually use shell features like pipes (|), commands chains (&&), redirect (>), heredocs, etc. Without shell=True, these wouldn't work and the agent would be severely limited.
- A long term solution is running the shell commands in a sandbox or container, which would provide both security and flexibility. But it's out of the scope of this PR ...
Here is what I implemented:
Add a comprehensive block list including commands like below.
- Destructive commands: rm -rf /, rm -rf ~, rm -rf *, mkfs, fdisk, dd to devices
- Privilege escalation: sudo, su, doas, pkexec, chmod 777, chown root
- Sensitive file access: /etc/shadow, ~/.ssh/, ~/.aws/
- Network exfiltration: wget http(s)://, curl http(s)://, netcat reverse shells
- Fork bomb: :(){:|:&};:
The validation runs before the command execution. It blocks dangerous commands and raise a FormatErorr so the agent can try a different approach.
| shell=True, | ||
| cwd=str(self.repo_path), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Command Injection Risk: Using subprocess.run(shell=True) with LLM-generated commands creates potential command injection vulnerability. While the command comes from LLM (not direct user input), malicious prompts or prompt injection attacks could cause the LLM to generate destructive commands (rm -rf /, data exfiltration, etc.).
Mitigation Options:
-
Option A (Recommended): Add command allowlist/validation I suggested in line 82
-
Option B: Use shell=False with shlex.split (breaks compatibility with complex bash commands)
-
Option C: Run in isolated container/sandbox (best for production)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree that shell=True with LLM-generated commands introduce real security concerns.
For this PR, I went with Option A with a comprehensive block list ...
For option C, I agree it's the best solution for long term or for production. Actually several days ago, I created a sandbox agent example for this repo which run shell / python in a sandbox. If you are interested in it, please take a look here: #1485
I didn't integrate the sandbox directly into this swe agent in this PR to keep the scope focused. But I am happy to follow up and add it if we decide a strong isolation is required for this swe agent.
| async def clone_repository(repo_url: str, target_path: Path) -> Repo: | ||
| """Clone a repository to the specified path.""" | ||
| logger.info("Cloning repository %s to %s", repo_url, target_path) | ||
| if os.path.exists(target_path): | ||
| await asyncio.to_thread(shutil.rmtree, target_path) | ||
| # Use asyncio.to_thread to avoid blocking the event loop during clone operation | ||
| return await asyncio.to_thread(Repo.clone_from, repo_url, target_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| async def clone_repository(repo_url: str, target_path: Path) -> Repo: | |
| """Clone a repository to the specified path.""" | |
| logger.info("Cloning repository %s to %s", repo_url, target_path) | |
| if os.path.exists(target_path): | |
| await asyncio.to_thread(shutil.rmtree, target_path) | |
| # Use asyncio.to_thread to avoid blocking the event loop during clone operation | |
| return await asyncio.to_thread(Repo.clone_from, repo_url, target_path) | |
| async def clone_repository(repo_url: str, target_path: Path, timeout: int = 300) -> Repo: | |
| """Clone a repository with timeout and validation.""" | |
| logger.info("Cloning repository %s to %s", repo_url, target_path) | |
| # Validate URL | |
| if not (repo_url.startswith('https://') or repo_url.startswith('git@')): | |
| raise ValueError(f"Invalid repository URL: {repo_url}") | |
| # Clean existing path | |
| if target_path.exists(): | |
| await asyncio.to_thread(shutil.rmtree, target_path) | |
| try: | |
| # Clone with timeout | |
| repo = await asyncio.wait_for( | |
| asyncio.to_thread(Repo.clone_from, repo_url, target_path), | |
| timeout=timeout | |
| ) | |
| logger.info("Successfully cloned %s", repo_url) | |
| return repo | |
| except asyncio.TimeoutError: | |
| logger.error("Clone timeout for %s after %ds", repo_url, timeout) | |
| if target_path.exists(): | |
| await asyncio.to_thread(shutil.rmtree, target_path) | |
| raise | |
| except Exception as e: | |
| logger.error("Clone failed for %s: %s", repo_url, e) | |
| if target_path.exists(): | |
| await asyncio.to_thread(shutil.rmtree, target_path) | |
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Git clone and checkout operations have no timeout and minimal error handling. Clone operations can: either Hang indefinitely on network issues or Fail on large repos or Succeed partially leaving corrupted state and causes workflow hangs, wasted compute, unpredictable behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this.
I have updated the clone_repository() as you suggested.
Also the checkout_commit() has the same issue, I also add timeout and error handling for it.
|
|
||
| async def cleanup(self): | ||
| """Clean up all managed repositories.""" | ||
| import shutil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| import shutil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already imported at line 21
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. Thanks
| except Exception as e: | ||
| logger.exception("Failed to setup repository: %s", e) | ||
| return f"Error: Failed to setup repository - {str(e)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| except Exception as e: | |
| logger.exception("Failed to setup repository: %s", e) | |
| return f"Error: Failed to setup repository - {str(e)}" | |
| except GitCommandError as e: | |
| logger.error("Git operation failed: %s", e, exc_info=True) | |
| return f"Error: Git operation failed - {e.stderr}" | |
| except OSError as e: | |
| logger.error("Filesystem error: %s", e, exc_info=True) | |
| return f"Error: Workspace setup failed - {str(e)}" | |
| except Exception as e: | |
| logger.exception("Unexpected error during repo setup: %s", e) | |
| return f"Error: Setup failed - {str(e)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generic exception catches everything (network errors, git errors, path errors) without distinction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated as suggested. Thanks
| try: | ||
| yield FunctionInfo.from_fn(git_operations, | ||
| description="Git repository management tool that accepts JSON string arguments") | ||
| finally: | ||
| if tool_config.cleanup_on_exit: | ||
| await repo_manager.cleanup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| try: | |
| yield FunctionInfo.from_fn(git_operations, | |
| description="Git repository management tool that accepts JSON string arguments") | |
| finally: | |
| if tool_config.cleanup_on_exit: | |
| await repo_manager.cleanup() | |
| try: | |
| yield FunctionInfo.from_fn(git_operations, | |
| description="Git repository management tool that accepts JSON string arguments") | |
| finally: | |
| if tool_config.cleanup_on_exit: | |
| try: | |
| await repo_manager.cleanup() | |
| logger.info("Workspace cleanup completed successfully") | |
| except Exception as e: | |
| logger.error("Workspace cleanup failed: %s", e, exc_info=True) | |
| # Don't raise - allow graceful degradation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cleanup fails, exception is swallowed, hence there is resource leak risk here. Also, cleanup only happens on normal exit, not on crashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated as suggested. Thanks
| - `gold` - Uses the patch from the `SWEBenchInput` instance, bypassing problem-solving logic. See [predict_gold_stub.py](src/nat_swe_bench/predictors/predict_gold/predict_gold_stub.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_gold.yml`. | ||
| - `skeleton` - Skeleton code for creating a problem-solving workflow. This code can be copied to create a net-new predictor. See [predict_skeleton.py](src/nat_swe_bench/predictors/predict_skeleton/predict_skeleton.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_skeleton.yml`. | ||
| - `iterative` - Iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. See [predict_iterative.py](src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Benchmark Context (January 2026) | |
| The iterative predictor achieves 70% success rate on SWE-bench Lite, which primarily reflects the capabilities of modern foundation models (Claude Sonnet 4.5, GPT-5.2) rather than framework-specific innovations. SWE-bench Lite is approaching saturation at 70-80% with simple agent architectures. | |
| **For evaluating framework improvements beyond task correctness, consider tracking:** | |
| - **Efficiency metrics:** Tokens consumed, steps taken, cost per solution | |
| - **Reliability metrics:** Success rate variance over multiple runs | |
| - **Harder benchmarks:** SWE-bench Verified (currently ~35% SOTA, not saturated) or full SWE-bench dataset (2,294 problems) | |
| This positions the iterative predictor as a reference implementation demonstrating NeMo Agent toolkit's builder pattern and tool integration capabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR description mentions 70% success rate but doesn't provide context about what this means for benchmark saturation. I recommend adding the baove to the Readme to provide context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated as suggested. Thanks
| step_limit: 100 | ||
| timeout: 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config sets step_limit to 100, but the dataclass default is 250. Document which takes precedence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments in config_iterative.yml:
step_limit: 100 # Overrides default (250)
|
|
||
| <format_example> | ||
| THOUGHT: Your reasoning and analysis here | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The system prompt is solid but could be enhanced with prompt engineering:
- Few-shot examples of successful task completions
- Explicit guidance on when to submit (after verification)
- Tips for handling common errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions. I considered these but decided to keep the prompt as-is now.
- few shot examples. The 'Recommended workflow' section in the prompt provides a high-level guidance similar to few shot examples but with low token cost and more flexibility.
- guidance to submit. The prompt already include verification step in the 'Recommended workflow' section and a guidance of when to submit. Adding more might be redundant.
- Error handling tips. Modern models lie GPT 5.2 and Sonnet 4.5 can already reason about error by returncode and stderr output. Specifying more might reduce the agent's problem-solving flexibility.
Thanks for the suggestions.
012d855 to
af6e2f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Around line 41-69: Rename the four pytest fixtures to use the fixture_ prefix
on the function name and supply the original fixture name via the decorator's
name argument; e.g., change the mock_llm fixture to a function named
fixture_mock_llm decorated with `@pytest.fixture`(name="mock_llm"), do the same
for agent_config -> fixture_agent_config (name="agent_config"), temp_repo_path
-> fixture_temp_repo_path (name="temp_repo_path"), and agent -> fixture_agent
(name="agent"); keep the same return values and dependencies (AsyncMock,
IterativeAgentConfig, tmp_path usage, and IterativeAgent construction) so
callers using the original fixture names continue to work.
- Around line 498-513: Replace the bare try/except in
test_register_cleanup_error_handling with an explicit assertion that the mocked
cleanup raises the expected exception: use pytest.raises(Exception) as excinfo
around await manager.cleanup() to assert the exception is raised, and add an
assertion on the exception message (e.g., "Cleanup failed") to make the intent
explicit; reference the RepoManager.cleanup mock and the async
test_register_cleanup_error_handling coroutine when locating where to change the
test.
🧹 Nitpick comments (2)
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py (2)
72-82: Add return type hint to helper function.The
create_llm_responsehelper function is missing a return type hint. While this is test code, adding type hints improves code clarity and aligns with best practices.🔧 Proposed addition of return type hint
-def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50): +def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50) -> MagicMock: """Helper to create a mock LLM response with token usage.""" response = MagicMock()
695-695: Use underscore for intentionally unused variable.The
patch_resultvariable is unpacked but never used. Use an underscore to indicate it's intentionally unused.🧹 Proposed fix for unused variable
- exit_status, patch_result = await agent.run("Fix the bug in main.py") + exit_status, _ = await agent.run("Fix the bug in main.py")
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py
Outdated
Show resolved
Hide resolved
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py
Outdated
Show resolved
Hide resolved
|
@mnajafian-nv Thank you so much for the detailed review feedback. It's really helpful.
For more details please check my inline comments.
|
cc296ff to
28e6f47
Compare
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Line 695: The tuple assignment from agent.run currently binds an unused
variable patch_result; to satisfy ruff RUF059, rename the unused second binding
to a prefixed underscore (e.g., _patch_result or _) in the assignment where
exit_status, patch_result = await agent.run("Fix the bug in main.py") occurs so
that agent.run remains called and exit_status is preserved while ruff no longer
flags the unused variable.
- Around line 596-602: The test creates an unused binding `MockRepo` when
patching 'nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo',
causing a ruff F841 lint error; remove the unused name by changing the context
manager from "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as
MockRepo:" to simply "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'):" (or
"as _" if you prefer), leaving the rest of the test (the `mock_repo` MagicMock,
the patched asyncio.wait_for, and the call to clone_repository) unchanged.
- Around line 498-505: The test test_register_cleanup_error_handling uses a
hardcoded path when instantiating RepoManager (manager =
RepoManager("/tmp/test")), which triggers ruff S108; update the test signature
to accept the tmp_path pytest fixture and instantiate the manager with
RepoManager(str(tmp_path)) instead of the hardcoded string so it follows the
same pattern as other tests and avoids using /tmp directly.
- Around line 329-379: Replace hardcoded "/tmp/workspace" usage in the tests for
get_repo_path with pytest's tmp_path fixture to avoid predictable temp paths;
update each test to accept tmp_path and set workspace = tmp_path / "workspace"
(convert to str when calling get_repo_path if it expects a string), and update
the expected assertions (e.g., for test_no_instance_id_uses_default_path use
str(tmp_path / "workspace" / "myorg" / "myrepo") rather than "/tmp/..."). Ensure
all tests referencing get_repo_path (function name) use the tmp_path fixture
consistently and adapt any string comparisons accordingly.
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py
Outdated
Show resolved
Hide resolved
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py
Outdated
Show resolved
Hide resolved
| with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as MockRepo: | ||
| mock_repo = MagicMock() | ||
|
|
||
| with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.asyncio.wait_for') as mock_wait: | ||
| mock_wait.return_value = mock_repo | ||
|
|
||
| result = await clone_repository("https://github.com/org/repo", target_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop unused MockRepo binding.
The MockRepo variable is unused; ruff F841 will fail linting.
🛠️ Proposed fix
- with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as MockRepo:
+ with patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'):
mock_repo = MagicMock()🧰 Tools
🪛 Ruff (0.14.14)
596-596: Local variable MockRepo is assigned to but never used
Remove assignment to unused variable MockRepo
(F841)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`
around lines 596 - 602, The test creates an unused binding `MockRepo` when
patching 'nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo',
causing a ruff F841 lint error; remove the unused name by changing the context
manager from "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo') as
MockRepo:" to simply "with
patch('nat_swe_bench.predictors.predict_iterative.tools.git_tool.Repo'):" (or
"as _" if you prefer), leaving the rest of the test (the `mock_repo` MagicMock,
the patched asyncio.wait_for, and the call to clone_repository) unchanged.
| MagicMock(returncode=0, stdout="COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\ndiff --git a/main.py\n-old\n+new"), | ||
| ] | ||
|
|
||
| exit_status, patch_result = await agent.run("Fix the bug in main.py") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix unused patch_result to satisfy ruff.
patch_result is unused; ruff RUF059 suggests prefixing with _.
🛠️ Proposed fix
- exit_status, patch_result = await agent.run("Fix the bug in main.py")
+ exit_status, _patch_result = await agent.run("Fix the bug in main.py")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| exit_status, patch_result = await agent.run("Fix the bug in main.py") | |
| exit_status, _patch_result = await agent.run("Fix the bug in main.py") |
🧰 Tools
🪛 Ruff (0.14.14)
695-695: Unpacked variable patch_result is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`
at line 695, The tuple assignment from agent.run currently binds an unused
variable patch_result; to satisfy ruff RUF059, rename the unused second binding
to a prefixed underscore (e.g., _patch_result or _) in the assignment where
exit_status, patch_result = await agent.run("Fix the bug in main.py") occurs so
that agent.run remains called and exit_status is preserved while ruff no longer
flags the unused variable.
- Add IterativeAgent - Add config_iterative.yml - Add git tools - Add SweBenchPredictorIterativeConfig - Register iterative predictor and git tool - Update README.md Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Add instance_id to workspace path to prevent git conflicts between
parallel instances. Each instance now uses .workspace/{instance_id}/org/repo
instead of sharing .workspace/org/repo.
Performance: 10 instances in ~8 min (vs ~30 min sequential)
Results: 7/10 resolved (70%) with Claude Sonnet
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
…mmands Add security validation for LLM-generated bash commands before execution. Uses a blocklist approach to block dangerous patterns while maintaining full shell functionality for the agent. Blocked patterns include: - Destructive commands: rm -rf /, mkfs, fdisk, dd to devices - Privilege escalation: sudo, su, doas, pkexec, chmod 777 - Sensitive file access: /etc/shadow, ~/.ssh/, ~/.aws/ - Network exfiltration: wget/curl downloads, netcat reverse shells - Fork bombs Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Add timeout parameter to clone_repository (default 600s) and checkout_commit (default 120s) to prevent indefinite hangs - Add URL validation for repository URLs - Clean up partial clones on timeout or error - Improve logging for success and failure cases Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Handle GitCommandError with stderr details - Handle OSError for filesystem issues - Keep generic Exception as fallback Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Wrap cleanup in try-except to prevent masking original exceptions - Log success/failure for debugging - Don't raise on cleanup failure to allow graceful degradation Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Explain that 70% success rate reflects model capabilities rather than framework innovations, and provide guidance on metrics and harder benchmarks for evaluating framework improvements. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Add Args/Returns/Raises documentation to: - predict_iterative.py: add_message, run, _query_llm, _execute_action, predict_fn, _build_task_description, _build_prompts - git_tool.py: RepoManager.__init__, setup_repository, cleanup - register.py: git_repo_tool, git_operations Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Track steps, tokens, and time in IterativeAgent: - Add total_input_tokens and total_output_tokens counters - Extract token usage from LLM response metadata (OpenAI/Anthropic) - Log summary at agent completion with steps, tokens, time, status Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Add 51 unit tests covering: - Command validation security checks (safe/dangerous commands) - Basic agent flow (submission, token accumulation, step limits) - Format error recovery (no bash block, multiple blocks, dangerous commands) - Timeout handling (command timeout, timeout message content) - Workspace isolation (instance_id path separation) - Git operations (clone, checkout, URL validation, timeout) - Resource cleanup (repo cleanup, missing directory handling) - Integration test with mocked LLM Coverage: git_tool.py 82%, predict_iterative.py 73% (SweBenchPredictor.predict_fn requires full NAT environment) Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Rename pytest fixtures to use fixture_ prefix and name argument: - mock_llm -> fixture_mock_llm with name="mock_llm" - agent_config -> fixture_agent_config with name="agent_config" - temp_repo_path -> fixture_temp_repo_path with name="temp_repo_path" - agent -> fixture_agent with name="agent" Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Replace try-except-pass pattern with pytest.raises for clearer test intent. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Replace hardcoded /tmp paths with tmp_path fixture (S108) - Remove unused MockRepo variable (F841) - Prefix unused _patch_result with underscore (RUF059) Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
e89ebed to
a140745
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 22-31: Update the api_key environment variable names in the YAML
entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix;
replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the
claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in
the openai_llm block to follow the NAT_ environment variable naming convention.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`:
- Around line 1-14: The package initializer (__init__.py) for the
predict_iterative module is missing a public module docstring; add a short
Google-style module docstring at the top of __init__.py that briefly describes
the package purpose, the main functionality it exposes (e.g., iterative
predictors for NAT SWE benchmarking), and any important public symbols or
modules contained within the package; ensure the docstring follows Google-style
(one-line summary, blank line, short elaboration) and is placed before any code
or headers.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 16-26: Update the module docstring in predict_iterative.py to
replace the term “NAT” with the repo-preferred names: use “NeMo Agent Toolkit”
(or `nat`) where referring to the toolkit/API/CLI, and follow naming conventions
elsewhere (package name: `nvidia-nat`, env var prefix: `NAT_`, and
URLs/directories: `NeMo-Agent-Toolkit`) so the top-level docstring and any
mentions inside the file reflect these conventions.
- Around line 649-657: The three except blocks in the repo setup currently log
without full tracebacks for GitCommandError and OSError; update the
GitCommandError and OSError handlers to use logger.exception(...) (similar to
the generic Exception handler) so full stack traces are captured before
returning the error strings. Specifically, in the except GitCommandError as e
and except OSError as e blocks replace logger.error(...) with
logger.exception(...) and keep the same return values (referencing
GitCommandError, OSError, and the logger used in predict_iterative.py).
- Around line 86-172: The DANGEROUS_PATTERNS list in predict_iterative.py is
missing a rule to block shell metacharacters which allows command
chaining/substitution when using shell=True; add a compiled regex tuple to
DANGEROUS_PATTERNS (near other patterns) that detects pipes, semicolons,
backticks, dollar-parenthesis command substitution, and standalone redirection
chars while permitting the safe tokens && and || (e.g. a regex that flags any
occurrence of [|;`$()<>] but uses negative lookahead to allow && and ||), and
include an appropriate error message like "Shell metacharacters and command
chaining are not allowed"; ensure the new tuple uses re.compile and follows the
(re.Pattern, str) shape used by other entries.
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 95-99: The except block that logs cleanup failures should use
logger.exception to record the full stack trace when not re-raising; update the
except handling around repo_manager.cleanup() to call
logger.exception("Workspace cleanup failed") (or similar) instead of
logger.error(..., exc_info=True) so the stack trace is captured automatically by
the logger.
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`:
- Around line 41-83: Update the fixtures and helper to include the requested
type hints: change fixture_mock_llm() to return AsyncMock,
fixture_agent_config() to return IterativeAgentConfig,
fixture_temp_repo_path(tmp_path: Path) to return Path, fixture_agent(mock_llm:
AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) to return
IterativeAgent, and create_llm_response(content: str, input_tokens: int = 100,
output_tokens: int = 50) -> MagicMock; keep the existing AsyncMock,
IterativeAgentConfig, Path, IterativeAgent, and MagicMock imports and only add
the annotations on the functions named fixture_mock_llm, fixture_agent_config,
fixture_temp_repo_path, fixture_agent, and create_llm_response.
- Around line 140-142: Remove the unnecessary `@pytest.mark.asyncio` decorator
from all async test functions (e.g., test_basic_flow_to_submission and the other
async test coroutines in this file) because pytest-asyncio is configured with
asyncio_mode = auto and will auto-detect async tests; locate each async def
test_* coroutine and delete the `@pytest.mark.asyncio` line above it so the tests
remain async def functions without the decorator.
🧹 Nitpick comments (1)
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py (1)
617-635: Make the truncation test resilient to config changes.The test hardcodes lengths and the expected elided count; it will break if
max_output_lengthchanges. Compute the long output length and expected elided count dynamically.♻️ Suggested improvement
- long_output = "x" * 2000 # Config has max_output_length=1000 + long_output = "x" * (agent.config.max_output_length + 1000) @@ - assert "1000 characters elided" in truncated + assert f"{len(long_output) - max_length} characters elided" in truncated
| claude_sonnet_llm: | ||
| _type: litellm | ||
| model_name: anthropic/claude-sonnet-4-5-20250929 | ||
| temperature: 0.0 | ||
| api_key: "${ANTHROPIC_API_KEY}" | ||
| openai_llm: | ||
| _type: openai | ||
| model_name: gpt-5.2 | ||
| temperature: 0.0 | ||
| api_key: "${OPENAI_API_KEY}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use NAT_-prefixed environment variables for API keys.
✏️ Suggested update
- api_key: "${ANTHROPIC_API_KEY}"
+ api_key: "${NAT_ANTHROPIC_API_KEY}"
@@
- api_key: "${OPENAI_API_KEY}"
+ api_key: "${NAT_OPENAI_API_KEY}"As per coding guidelines, use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`
around lines 22 - 31, Update the api_key environment variable names in the YAML
entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix;
replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the
claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in
the openai_llm block to follow the NAT_ environment variable naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll keep the standard environment variable names (OPENAI_API_KEY, ANTHROPIC_API_KEY) for consistency with:
- Other examples in this repository (e.g.,
nat_autogen_demo,agno_personal_finance) - Industry-standard conventions that users likely already have configured
- The
NAT_prefix pattern in this project is used for NAT-specific configs (OAuth, user IDs), not third-party API keys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for API key environment variable usage patterns in examples directory
echo "=== Searching for ANTHROPIC_API_KEY usage ==="
rg 'ANTHROPIC_API_KEY' examples/ -n -C2 --type yaml
echo -e "\n=== Searching for OPENAI_API_KEY usage ==="
rg 'OPENAI_API_KEY' examples/ -n -C2 --type yaml
echo -e "\n=== Searching for NAT_ prefixed API keys ==="
rg 'NAT_.*API_KEY' examples/ -n -C2 --type yaml
echo -e "\n=== Checking specific examples mentioned by user ==="
fd -t f . examples/advanced_agents/nat_autogen_demo examples/advanced_agents/agno_personal_finance -e yml -e yaml -x cat {}
Length of output: 3286
@Jerryguan777, thank you for the clarification! You're absolutely right—I verified that other examples in the repository (nat_autogen_demo, agno_personal_finance) consistently use the standard OPENAI_API_KEY naming convention, and there are no NAT_-prefixed third-party API keys in the examples directory.
Your reasoning makes sense: using industry-standard environment variable names improves developer experience since users likely already have them configured. I appreciate you distinguishing between NAT-specific configuration variables and third-party service API keys.
🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to **/*.{py,md,toml,yaml,yml} : Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to docs/source/**/*.md : Never use deprecated names 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' when referring to NeMo Agent Toolkit; update any occurrences found in documentation unless intentionally referencing deprecated names or compatibility layers.
Learnt from: willkill07
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1512
File: docs/source/run-workflows/launching-ui.md:20-20
Timestamp: 2026-01-30T15:03:44.777Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository documentation, "NeMo Agent Toolkit" should always be spelled with a capital "T" in "Toolkit". The documentation guideline states: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit."
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 701
File: examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-with-custom-post-process.yml:16-18
Timestamp: 2025-08-22T20:29:13.433Z
Learning: In the NeMo-Agent-Toolkit project, configuration consistency across example files is prioritized over removing unused config options, even when those options aren't currently implemented in the runtime.
Learnt from: Jerryguan777
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1485
File: examples/advanced_agents/sandbox_agent/docker-compose.yaml:31-33
Timestamp: 2026-01-26T01:05:01.390Z
Learning: In the Sandbox Agent (examples/advanced_agents/sandbox_agent), host-side tools like web_search (Tavily) and youtube_transcript run on the host machine, not inside the Docker sandbox container. Therefore, their API keys (e.g., TAVILY_API_KEY) do not need to be passed as environment variables in docker-compose.yaml.
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Never commit API keys, credentials or personal data; use environment variables or '.env' files excluded from Git.
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:51.370Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, most YAML configuration files use 2-space indentation rather than 4-space indentation. When reviewing YAML files, prefer consistency with the existing codebase style (2-space indentation) over the stated coding guideline (4-space indentation) until a repository-wide standardization is performed.
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a module docstring for the package initializer.
This public module currently has only the SPDX header; add a short Google-style module docstring describing the package.
✏️ Suggested docstring
@@
# limitations under the License.
+"""Iterative SWE-bench predictor package."""
+As per coding guidelines, Provide Google-style docstrings for every public module, class, function and CLI command.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`
around lines 1 - 14, The package initializer (__init__.py) for the
predict_iterative module is missing a public module docstring; add a short
Google-style module docstring at the top of __init__.py that briefly describes
the package purpose, the main functionality it exposes (e.g., iterative
predictors for NAT SWE benchmarking), and any important public symbols or
modules contained within the package; ensure the docstring follows Google-style
(one-line summary, blank line, short elaboration) and is placed before any code
or headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledged. Keeping as-is for consistency with other __init__.py files in this project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jerryguan777, understood. I'll skip this suggestion.
🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to {src,packages}/**/*.py : Provide Google-style docstrings for every public module, class, function and CLI command.
..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
Show resolved
Hide resolved
..._and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
Outdated
Show resolved
Hide resolved
| except GitCommandError as e: | ||
| logger.error("Git operation failed: %s", e, exc_info=True) | ||
| return f"Error: Git operation failed - {e.stderr}" | ||
| except OSError as e: | ||
| logger.error("Filesystem error: %s", e, exc_info=True) | ||
| return f"Error: Workspace setup failed - {str(e)}" | ||
| except Exception as e: | ||
| logger.exception("Unexpected error during repo setup: %s", e) | ||
| return f"Error: Setup failed - {str(e)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logger.exception when returning after errors.
You swallow these exceptions, so prefer logger.exception to capture full tracebacks.
🔧 Suggested change
- except GitCommandError as e:
- logger.error("Git operation failed: %s", e, exc_info=True)
+ except GitCommandError as e:
+ logger.exception("Git operation failed")
return f"Error: Git operation failed - {e.stderr}"
- except OSError as e:
- logger.error("Filesystem error: %s", e, exc_info=True)
+ except OSError as e:
+ logger.exception("Filesystem error")
return f"Error: Workspace setup failed - {str(e)}"As per coding guidelines, When catching and logging without re-raising, use logger.exception() to capture full stack trace information.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 654-654: Use explicit conversion flag
Replace with conversion flag
(RUF010)
[warning] 656-656: Redundant exception object included in logging.exception call
(TRY401)
[warning] 657-657: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 649 - 657, The three except blocks in the repo setup currently log
without full tracebacks for GitCommandError and OSError; update the
GitCommandError and OSError handlers to use logger.exception(...) (similar to
the generic Exception handler) so full stack traces are captured before
returning the error strings. Specifically, in the except GitCommandError as e
and except OSError as e blocks replace logger.error(...) with
logger.exception(...) and keep the same return values (referencing
GitCommandError, OSError, and the logger used in predict_iterative.py).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code already captures full stack traces via exc_info=True.
logger.exception() is just syntactic sugar for logger.error(..., exc_info=True).
Keeping as-is for clarity - the explicit exc_info=True makes the intent more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jerryguan777, you're absolutely right that logger.error(..., exc_info=True) and logger.exception() are functionally equivalent—both capture the full stack trace.
That said, the project's coding guidelines do specify using logger.exception() when catching and logging without re-raising:
When catching and logging exceptions without re-raising: always use
logger.exception()to capture the full stack trace information.
The guideline appears to prioritize consistency across the codebase over individual preference. But if you feel the explicit form better serves readability for your use case, that's your call as the author.
🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to **/*.py : Preserve stack traces and prevent duplicate logging when handling exceptions. Use bare 'raise' statements when re-raising to maintain original stack trace, and use 'logger.error()' (not 'logger.exception()') to avoid duplicate logging. When catching and logging without re-raising, use 'logger.exception()' to capture full stack trace.
| try: | ||
| await repo_manager.cleanup() | ||
| logger.info("Workspace cleanup completed successfully") | ||
| except Exception as e: | ||
| logger.error("Workspace cleanup failed: %s", e, exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logger.exception when swallowing cleanup errors.
You’re not re-raising here, so prefer logger.exception to capture stack traces.
🔧 Suggested change
- except Exception as e:
- logger.error("Workspace cleanup failed: %s", e, exc_info=True)
+ except Exception:
+ logger.exception("Workspace cleanup failed")As per coding guidelines, When catching and logging without re-raising, use logger.exception() to capture full stack trace information.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 95 - 99, The except block that logs cleanup failures should use
logger.exception to record the full stack trace when not re-raising; update the
except handling around repo_manager.cleanup() to call
logger.exception("Workspace cleanup failed") (or similar) instead of
logger.error(..., exc_info=True) so the stack trace is captured automatically by
the logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate of previous comment. exc_info=True already captures full stack traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jerryguan777, you're absolutely right that both approaches capture the full stack trace. The coding guideline preference for logger.exception() is primarily about consistency across the codebase, but functionally they're equivalent. I'll defer to your choice here.
🐰💭
/\_/\
( o.o ) Noted!
> ^ <
🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to **/*.py : Preserve stack traces and prevent duplicate logging when handling exceptions. Use bare 'raise' statements when re-raising to maintain original stack trace, and use 'logger.error()' (not 'logger.exception()') to avoid duplicate logging. When catching and logging without re-raising, use 'logger.exception()' to capture full stack trace.
| @pytest.fixture(name="mock_llm") | ||
| def fixture_mock_llm(): | ||
| """Create a mock LLM that returns configurable responses.""" | ||
| llm = AsyncMock() | ||
| return llm | ||
|
|
||
|
|
||
| @pytest.fixture(name="agent_config") | ||
| def fixture_agent_config(): | ||
| """Create a default agent configuration for testing.""" | ||
| return IterativeAgentConfig( | ||
| step_limit=10, | ||
| timeout=5, | ||
| max_output_length=1000 | ||
| ) | ||
|
|
||
|
|
||
| @pytest.fixture(name="temp_repo_path") | ||
| def fixture_temp_repo_path(tmp_path): | ||
| """Create a temporary directory to simulate a repository.""" | ||
| repo_path = tmp_path / "test_repo" | ||
| repo_path.mkdir() | ||
| return repo_path | ||
|
|
||
|
|
||
| @pytest.fixture(name="agent") | ||
| def fixture_agent(mock_llm, temp_repo_path, agent_config): | ||
| """Create an IterativeAgent instance with mocked dependencies.""" | ||
| return IterativeAgent(mock_llm, temp_repo_path, agent_config) | ||
|
|
||
|
|
||
| def create_llm_response(content: str, input_tokens: int = 100, output_tokens: int = 50): | ||
| """Helper to create a mock LLM response with token usage.""" | ||
| response = MagicMock() | ||
| response.content = content | ||
| response.response_metadata = { | ||
| 'token_usage': { | ||
| 'prompt_tokens': input_tokens, | ||
| 'completion_tokens': output_tokens | ||
| } | ||
| } | ||
| return response | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
cat -n examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py | head -100Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3850
Add type hints to fixture and helper function signatures.
All fixtures and the helper function lack proper type annotations. Add return type hints for all fixtures and the helper, and add parameter type hints for fixture parameters (particularly tmp_path: Path).
Required additions:
fixture_mock_llm() -> AsyncMockfixture_agent_config() -> IterativeAgentConfigfixture_temp_repo_path(tmp_path: Path) -> Pathfixture_agent(mock_llm: AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) -> IterativeAgentcreate_llm_response(...) -> MagicMock
All required types are already imported.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py`
around lines 41 - 83, Update the fixtures and helper to include the requested
type hints: change fixture_mock_llm() to return AsyncMock,
fixture_agent_config() to return IterativeAgentConfig,
fixture_temp_repo_path(tmp_path: Path) to return Path, fixture_agent(mock_llm:
AsyncMock, temp_repo_path: Path, agent_config: IterativeAgentConfig) to return
IterativeAgent, and create_llm_response(content: str, input_tokens: int = 100,
output_tokens: int = 50) -> MagicMock; keep the existing AsyncMock,
IterativeAgentConfig, Path, IterativeAgent, and MagicMock imports and only add
the annotations on the functions named fixture_mock_llm, fixture_agent_config,
fixture_temp_repo_path, fixture_agent, and create_llm_response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't fix - consistent with other test files in this project which don't use type hints for fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check other test files in the swe_bench tests directory for fixture type hint patterns
echo "=== Checking fixture definitions in other test files ==="
fd -e py -p 'examples/evaluation_and_profiling/swe_bench/tests/' --exec grep -l '@pytest.fixture' {} \;
echo -e "\n=== Sample fixture definitions from other test files ==="
# Look for fixture definitions with type hints
fd -e py -p 'examples/evaluation_and_profiling/swe_bench/tests/' -x rg -A 3 '@pytest\.fixture' {} \; | head -100
Length of output: 911
🏁 Script executed:
#!/bin/bash
# Check for other test files in the broader examples directory
echo "=== All test files in examples directory ==="
fd -e py -p 'examples/' -t f | rg 'test.*\.py$'
echo -e "\n=== Sample fixture definitions from broader examples tests ==="
fd -e py -p 'examples/' -t f | rg 'test.*\.py$' | xargs rg -A 3 '@pytest\.fixture' | head -150
Length of output: 19305
@Jerryguan777, I appreciate your response. However, the evidence shows that most test files in this project actually do use type hints for fixtures. For example:
examples/observability/simple_calculator_observability/tests/test_simple_calc_observability.pyhas extensive type hints:def question_fixture() -> str:,def config_dir_fixture(examples_dir: Path) -> Path:, etc.examples/documentation_guides/tests/test_custom_workflow.pyuses type hints consistentlyexamples/front_ends/simple_calculator_custom_routes/tests/test_simple_calculator_custom_routes.pyalso uses type hints
The project's coding guidelines explicitly state: "Python methods should use type hints for all parameters and return values."
That said, I understand your decision. I won't pursue this further.
examples/evaluation_and_profiling/swe_bench/tests/test_iterative_predictor.py
Outdated
Show resolved
Hide resolved
Add BYPASS_PATTERNS to detect shell metacharacter tricks that could evade command-based filtering: - Command substitution ($() and backticks) - Piping to shell interpreters (bash, sh, python, etc.) - Base64 encoded command execution - eval/exec execution - Here-string to interpreter - Process substitution - Hex escape execution - Environment variable injection - Xargs with shell execution - Source/dot command execution Refactor validation code into separate shell_validation.py module for better maintainability. Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Rename "NAT framework" to "NeMo Agent Toolkit" for consistency - Remove redundant @pytest.mark.asyncio decorators (asyncio_mode=auto) - Regenerate uv.lock Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
- Validate instance_id to prevent path traversal attacks (.., /, \) - Validate repo_url to prevent IndexError from malformed URLs - Add tests for both validations Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Implemented an iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. Achieved 70% success rate (7/10) in initial evaluation.
How Has This Been Tested?
Description
Closes #1397
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Documentation
Tests