@srivatsankrishnan
Contributor

Summary

Added MegatronBridge as a native CloudAI SlurmSystem workload with a SlurmCommandGenStrategy that runs Megatron-Bridge’s scripts/performance/setup_experiment.py using CloudAI-managed installs (Git clone + dedicated venv). Implemented Slurm job-id retrieval for Megatron-Bridge by generating a readable wrapper script (megatron_bridge_submit_and_parse_jobid.sh) that redirects launcher output to megatron_bridge_launcher.log and extracts the "Job id:" value from that log so that CloudAI can track the Slurm job.
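
As a rough illustration of the job-id extraction step described above, the following Python sketch scans launcher log text for a `Job id:` marker. The exact log line format, the regular expression, and the function name `extract_job_id` are assumptions made for illustration; the PR performs this step in a generated shell wrapper, not in Python.

```python
import re
from typing import Optional

# Assumed pattern: the launcher prints "Job id: <digits>" somewhere in its output.
_JOB_ID_RE = re.compile(r"Job id:\s*(\d+)")


def extract_job_id(log_text: str) -> Optional[int]:
    """Return the last job id found in the launcher log text, or None if absent."""
    matches = _JOB_ID_RE.findall(log_text)
    return int(matches[-1]) if matches else None


sample_log = "Launching experiment...\nJob id: 586542\nDone.\n"
print(extract_job_id(sample_log))
print(extract_job_id("no id in this output"))
```

Taking the last match rather than the first is a design choice of this sketch: if the launcher ever printed more than one id, the most recent submission would win.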

Known Issues: There are some issues with Megatron-Bridge's override logic when default values are passed to it. Once the issue is root-caused (on the Megatron-Bridge side), we should follow up with another PR.

Test Plan

  • CI/CD
  • Internal Cluster
 cloudai run --system-config ../cloudaix/conf/common/system/lyris.toml --tests-dir conf/experimental/megatron_bridge/test --test-scenario conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml 
[INFO] System Name: lyris
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: megatron_bridge_qwen_30b
[INFO] Checking if workloads components are installed.
[INFO] Test Scenario: megatron_bridge_qwen_30b

Section Name: megatron_bridge_qwen_30b
  Test Name: megatron_bridge_qwen_30b
  Description: Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Scenario results will be stored at: results/megatron_bridge_qwen_30b_2025-12-22_15-26-23
[INFO] Starting test: megatron_bridge_qwen_30b (results at: results/megatron_bridge_qwen_30b_2025-12-22_15-26-23/megatron_bridge_qwen_30b/0)
[INFO] Running test: megatron_bridge_qwen_30b
[INFO] Submitted slurm job: 586542
[INFO] Job completed: megatron_bridge_qwen_30b (iteration 1 of 1)
[INFO] Generated scenario report at results/megatron_bridge_qwen_30b_2025-12-22_15-26-23/megatron_bridge_qwen_30b.html
[INFO] Scenario results                                                                                                       
┌──────────────────────────┬────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Case                     │ Status │ Details                                                                         │
├──────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ megatron_bridge_qwen_30b │ PASSED │ results/megatron_bridge_qwen_30b_2025-12-22_15-26-23/megatron_bridge_qwen_30b/0 │
└──────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────┘

[INFO] All jobs are complete.


@coderabbitai
Contributor

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Adds a Megatron-Bridge workload: new TOML configs and test scenario, a workload package (CmdArgs/TestDefinition), Slurm command generator, report generation strategy, registration updates, and unit tests for reporting and Slurm command generation.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Configuration | conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml | New test configuration and test scenario for a Qwen3 30B A3B Megatron-Bridge run (metadata, cmd_args, 2-node scenario). |
| Workload package exports | src/cloudai/workloads/megatron_bridge/__init__.py | New package initializer re-exporting MegatronBridge public types and strategies (CmdArgs, TestDefinition, report & slurm strategies). |
| Core workload implementation | src/cloudai/workloads/megatron_bridge/megatron_bridge.py | Adds MegatronBridgeCmdArgs and MegatronBridgeTestDefinition, including docker/python/git properties, installable resolution, ref inference/mapping, and extensive multi-rule constraint validation. |
| Reporting | src/cloudai/workloads/megatron_bridge/report_generation_strategy.py | Adds MegatronBridgeReportGenerationStrategy to discover launcher logs, extract step time and TFLOP/s per GPU samples, compute statistics, and emit report.txt and get_metric API. |
| Slurm command generation | src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py | Adds MegatronBridgeSlurmCommandGenStrategy to locate repo/venv, build and wrap launcher command, normalize flags (recompute/cuda-graph), enforce hf_token presence, write generated_command.sh and dump metadata. |
| Registration | src/cloudai/registration.py | Registers MegatronBridge test definition, Slurm command-gen strategy, and report generation strategy with existing registries and SlurmSystem. |
| Tests (new) | tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py, tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py | New unit tests for report parsing/generation and Slurm command generation (hf_token validation, wrapper behavior, normalization, detach handling, generated command contents). |
| Tests (updated) | tests/test_init.py, tests/test_test_scenario.py, tests/test_cloudaigym.py, tests/test_test_definitions.py | Tests updated to include MegatronBridge registrations and expectations (counts, reporter list), minor assertion cleanup, and a conditional skip for MegatronBridge tests lacking hf_token. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • DeepEP benchmark #723 — overlapping changes to registration and workload export surfaces; likely touches same registries and exports.

Suggested reviewers

  • srinivas212
  • TaekyungHeo
  • amaslenn

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title "Megatron Bridge in CloudAI" directly and clearly describes the main change: adding Megatron Bridge support as a native CloudAI workload. |
| Description check | ✅ Passed | The description is well-detailed and directly related to the changeset, covering the implementation of MegatronBridge as a SlurmSystem workload, the wrapper script functionality, test plans, and known issues. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 120a3c5 and cfa33b9.

📒 Files selected for processing (4)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • tests/test_test_definitions.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:148-155
Timestamp: 2025-12-23T00:28:50.788Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, when constraint_check is called, CloudAIGym has already resolved list-valued fields (like tp, pp, cp, etc.) to static scalar values. The _as_int helper using cast() is safe because it will only receive int or None at runtime, never List[int].
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:11.471Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
📚 Learning: 2025-12-17T22:24:51.805Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.

Applied to files:

  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • tests/test_test_definitions.py
📚 Learning: 2025-12-23T00:23:11.471Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:11.471Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-17T22:02:45.215Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 756
File: src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py:65-85
Timestamp: 2025-12-17T22:02:45.215Z
Learning: In CloudAI's DSE flow for the Aiconfigurator workload (src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py), list-valued parameters in AiconfiguratorCmdArgs (such as batch_size, ctx_tokens, tp, pp, dp, etc. in Agg and Disagg models) are scalarized by apply_params_set before gen_exec_command is called, so these fields are guaranteed to be scalar integers at command generation time.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-23T00:28:50.788Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:148-155
Timestamp: 2025-12-23T00:28:50.788Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, when constraint_check is called, CloudAIGym has already resolved list-valued fields (like tp, pp, cp, etc.) to static scalar values. The _as_int helper using cast() is safe because it will only receive int or None at runtime, never List[int].

Applied to files:

  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-05T13:58:27.113Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/ai_dynamo.py:34-63
Timestamp: 2025-12-05T13:58:27.113Z
Learning: In the AI Dynamo workload (src/cloudai/workloads/ai_dynamo/), list-valued fields in WorkerBaseArgs (e.g., num_nodes: int | list[int], data_parallel_size: int | list[int] | None) are unrolled at the cmd_gen/json_gen level, so downstream code only receives scalar values.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
🧬 Code graph analysis (3)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (3)
src/cloudai/systems/slurm/slurm_system.py (1)
  • SlurmSystem (96-743)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (5)
  • MegatronBridgeCmdArgs (26-91)
  • MegatronBridgeTestDefinition (94-433)
  • python_executable (116-119)
  • megatron_bridge_repo (122-129)
  • docker_image (110-113)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
  • MegatronBridgeSlurmCommandGenStrategy (32-286)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (3)
src/cloudai/models/scenario.py (2)
  • TestRunDetails (201-253)
  • from_test_run (241-253)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
  • SlurmCommandGenStrategy (32-472)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (5)
  • MegatronBridgeCmdArgs (26-91)
  • MegatronBridgeTestDefinition (94-433)
  • megatron_bridge_repo (122-129)
  • python_executable (116-119)
  • docker_image (110-113)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
src/cloudai/_core/installables.py (4)
  • DockerImage (36-84)
  • GitRepo (87-115)
  • Installable (25-32)
  • PythonExecutable (119-145)
src/cloudai/models/workload.py (2)
  • CmdArgs (26-29)
  • TestDefinition (89-141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Greptile Review
  • GitHub Check: Run pytest (3.10)
🔇 Additional comments (4)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (1)

26-433: LGTM! Comprehensive Megatron-Bridge workload implementation.

The implementation is well-structured with:

  • Proper field validation (hf_token requirement)
  • Lazy initialization of installables (docker_image, python_executable, megatron_bridge_repo)
  • Comprehensive constraint validation (17 constraints) with detailed error logging
  • Safe handling of potentially list-typed fields (confirmed by learnings that CloudAIGym resolves lists before constraint_check)

The constraint_check method is appropriately thorough given the complexity of distributed training configurations, with proper division-by-zero guards and clear error messages.

Based on learnings, the use of cast() in _as_int and bool() in _as_bool is safe because CloudAIGym resolves list-valued fields to static scalar values before constraint_check is called.

tests/test_test_definitions.py (1)

91-94: LGTM! Appropriate test skip for credential-required config.

The skip prevents test failures when the MegatronBridge example config has an empty hf_token placeholder. This is a reasonable approach for configs requiring user-provided credentials, with a clear message directing users to set the token.

tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (1)

31-191: LGTM! Comprehensive test coverage for command generation.

The test suite thoroughly validates:

  • Schema validation (hf_token requirement)
  • Default value emission behavior (ensuring defaults aren't forced)
  • Container image path handling (local vs. installed paths)
  • Argument normalization (CUDA graph scope)
  • Flag handling (detach/no-detach/omit variations)
  • Command file generation

The detach flag test properly reconstructs cmd_args (lines 160-165) to ensure fields_set is correctly populated, addressing the past review concern about Pydantic's model_fields_set behavior.

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)

32-286: LGTM! Robust Slurm command generation strategy.

The implementation demonstrates strong engineering practices:

  • Error handling: Clear RuntimeError messages for missing required fields and installation issues (lines 168-175, 237-240)
  • Troubleshooting support: Warnings for missing installs with actionable guidance (lines 51-54, 59-62)
  • Job tracking: Well-structured wrapper script that captures Slurm job ID for CloudAI integration (lines 104-150)
  • Path resolution: Proper fallback logic for local vs. installed container images (lines 177-182)
  • Normalization: Consistent handling of list/string arguments for recompute_modules and CUDA graph scopes

The command construction respects Pydantic's fields_set to avoid emitting default values, and the detach flag handling correctly uses flag presence rather than boolean values.



@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review December 23, 2025 00:09
@greptile-apps
Contributor

greptile-apps bot commented Dec 23, 2025

Greptile Summary

  • Adds MegatronBridge as a native CloudAI workload for Slurm systems, enabling distributed AI training benchmarks through CloudAI's standardized test framework with automatic Git repository management and version mapping
  • Implements sophisticated Slurm job ID tracking by generating wrapper scripts that capture Megatron-Bridge launcher output and parse job IDs for CloudAI's job monitoring system
  • Introduces comprehensive constraint validation with 17 different checks for complex distributed training configurations including tensor/pipeline/context parallelism compatibility and FSDP requirements

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py | New file implementing complex wrapper script generation and job ID parsing logic with hardcoded log format dependencies |
| src/cloudai/workloads/megatron_bridge/megatron_bridge.py | New file with extensive constraint validation system for distributed training configs and CloudAI-managed installation logic |
| conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml | New config file containing hardcoded placeholder HF token that requires manual replacement before use |

Confidence score: 3/5

  • This PR requires careful review due to complex integration logic and hardcoded dependencies that could cause runtime failures
  • Score reflects sophisticated but brittle job ID parsing mechanism, extensive constraint validation system with potential edge cases, and hardcoded log filename dependencies between components
  • Pay close attention to slurm_command_gen_strategy.py for wrapper script logic and megatron_bridge.py for constraint validation edge cases

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI_CLI as "CloudAI CLI"
    participant TestDefinition as "MegatronBridgeTestDefinition"
    participant CommandGenStrategy as "MegatronBridgeSlurmCommandGenStrategy"
    participant Installer as "SlurmInstaller"
    participant SlurmRunner as "SlurmRunner"
    participant MegatronBridge as "Megatron-Bridge Launcher"
    participant SlurmCluster as "Slurm Cluster"
    
    User->>CloudAI_CLI: "cloudai run --system-config system.toml --test-scenario scenario.toml"
    CloudAI_CLI->>TestDefinition: "Parse test configuration"
    TestDefinition->>TestDefinition: "Validate cmd_args (hf_token, constraints)"
    TestDefinition->>TestDefinition: "Setup installables (docker_image, nemo_run_repo, megatron_bridge_repo)"
    
    CloudAI_CLI->>Installer: "Install dependencies"
    Installer->>Installer: "Clone Megatron-Bridge repo"
    Installer->>Installer: "Create Python venv with NeMo-Run"
    Installer->>Installer: "Cache container image"
    
    CloudAI_CLI->>CommandGenStrategy: "Generate execution command"
    CommandGenStrategy->>CommandGenStrategy: "Build launcher command parts"
    CommandGenStrategy->>CommandGenStrategy: "Create wrapper script (megatron_bridge_submit_and_parse_jobid.sh)"
    CommandGenStrategy->>SlurmRunner: "Return wrapped command"
    
    SlurmRunner->>MegatronBridge: "Execute wrapper script"
    MegatronBridge->>MegatronBridge: "Run setup_experiment.py"
    MegatronBridge->>SlurmCluster: "Submit training job via sbatch"
    SlurmCluster-->>MegatronBridge: "Job ID"
    MegatronBridge->>MegatronBridge: "Log output to megatron_bridge_launcher.log"
    MegatronBridge->>SlurmRunner: "Echo 'Submitted batch job <ID>'"
    
    SlurmRunner->>SlurmRunner: "Parse job ID and track job"
    SlurmRunner-->>User: "Job completion status"

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 151-152 (link)

    logic: _as_int accepts List[int] but only casts without checking type, so tp=[1,2,4] sweeps will pass through incorrectly as list objects

  2. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 154-155 (link)

    logic: _as_bool has same issue - use_megatron_fsdp=[true, false] sweeps will incorrectly evaluate as truthy list

12 files reviewed, 2 comments


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbf9891 and 78b919e.

📒 Files selected for processing (12)
  • conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
  • src/cloudai/registration.py
  • src/cloudai/workloads/megatron_bridge/__init__.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/report_generation_strategy.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • tests/test_cloudaigym.py
  • tests/test_init.py
  • tests/test_test_scenario.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • tests/test_cloudaigym.py
  • tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py
  • tests/test_init.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/report_generation_strategy.py
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/test_test_scenario.py
  • src/cloudai/workloads/megatron_bridge/__init__.py
  • src/cloudai/registration.py
📚 Learning: 2025-12-17T22:02:45.215Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 756
File: src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py:65-85
Timestamp: 2025-12-17T22:02:45.215Z
Learning: In CloudAI's DSE flow for the Aiconfigurator workload (src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py), list-valued parameters in AiconfiguratorCmdArgs (such as batch_size, ctx_tokens, tp, pp, dp, etc. in Agg and Disagg models) are scalarized by apply_params_set before gen_exec_command is called, so these fields are guaranteed to be scalar integers at command generation time.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
🧬 Code graph analysis (5)
tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py (2)
src/cloudai/_core/test_scenario.py (1)
  • TestRun (58-174)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (3)
  • MegatronBridgeReportGenerationStrategy (28-163)
  • can_handle_directory (47-48)
  • generate_report (73-136)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
src/cloudai/_core/installables.py (4)
  • DockerImage (36-84)
  • GitRepo (87-115)
  • Installable (25-32)
  • PythonExecutable (119-145)
src/cloudai/models/workload.py (2)
  • CmdArgs (26-29)
  • TestDefinition (89-141)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
src/cloudai/_core/report_generation_strategy.py (1)
  • ReportGenerationStrategy (24-40)
src/cloudai/workloads/megatron_bridge/__init__.py (3)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
  • MegatronBridgeCmdArgs (26-89)
  • MegatronBridgeTestDefinition (92-431)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
  • MegatronBridgeReportGenerationStrategy (28-163)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
  • MegatronBridgeSlurmCommandGenStrategy (32-281)
src/cloudai/registration.py (4)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
  • MegatronBridgeReportGenerationStrategy (28-163)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
  • MegatronBridgeSlurmCommandGenStrategy (32-281)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (1)
  • MegatronBridgeTestDefinition (92-431)
src/cloudai/_core/registry.py (2)
  • add_command_gen_strategy (251-259)
  • add_report (207-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (21)
conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml (1)

17-22: LGTM!

The test scenario configuration is minimal and correctly references the test definition. The structure aligns with other scenario files in the codebase.

src/cloudai/registration.py (1)

94-98: LGTM!

The MegatronBridge registrations follow the established patterns in the codebase:

  • Import grouped with other workload imports
  • Command generation strategy registered for SlurmSystem
  • Test definition and report strategy properly registered

Also applies to: 193-195, 245-245, 262-262

conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml (2)

17-35: LGTM for experimental configuration.

The test configuration is well-structured for a Qwen 30B model run. The parameters align with the 2-node scenario (8 GPUs / 4 GPUs per node = 2 nodes).


35-35: No action needed — hf_token validation is already implemented.

The placeholder hf_token = "REPLACE_ME_WITH_HF_TOKEN" is already caught at runtime. The _build_launcher_parts method in MegatronBridgeSlurmCommandGenStrategy explicitly validates this value at line 155 and raises a RuntimeError with a clear message: "HuggingFace token is required. Please set cmd_args.hf_token to a real token string (not 'REPLACE_ME_WITH_HF_TOKEN') in your local test TOML."

Likely an incorrect or invalid review comment.

tests/test_test_scenario.py (1)

52-52: LGTM!

The test updates correctly integrate MegatronBridge:

  • Import added in proper alphabetical position
  • Reporter count incremented from 15 to 16
  • Parametrized test case added for the new test definition and report strategy mapping

Also applies to: 475-475, 485-485

tests/test_init.py (1)

52-55: LGTM! MegatronBridge registration follows the established pattern.

The import, command generation strategy mapping, test definition count update, and test definition entry are all consistent with the existing workload registrations.

Also applies to: 136-136, 220-220, 235-235

tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py (2)

27-42: LGTM! The fixture provides a realistic test setup.

The log content accurately reflects the Megatron-Bridge output format with Step Time and GPU utilization metrics, which the report generation strategy parses.


45-57: LGTM! Tests cover the essential report generation behavior.

The tests verify both can_handle_directory() and the content of the generated report, aligning with the behavioral documentation approach mentioned in the learnings.
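
To make the report-generation behavior discussed above more concrete, here is a minimal sketch of extracting step time and per-GPU throughput samples from log text and summarizing them. The log line format in `_LINE_RE`, the function name `summarize`, and the output keys are hypothetical; the actual Megatron-Bridge log format and the fields written by MegatronBridgeReportGenerationStrategy are not shown in this conversation and may differ.

```python
import re
import statistics
from typing import Dict, List

# Hypothetical log line format; the real Megatron-Bridge output may differ.
_LINE_RE = re.compile(
    r"Step Time\s*:\s*(?P<step>[0-9.]+)s.*?(?P<tflops>[0-9.]+)\s*TFLOP/s/GPU"
)


def _stats(values: List[float]) -> Dict[str, float]:
    """Summarize a list of samples; an empty list yields an empty summary."""
    if not values:
        return {}
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def summarize(log_text: str) -> Dict[str, Dict[str, float]]:
    """Collect step-time and TFLOP/s/GPU samples and compute basic statistics."""
    step_times: List[float] = []
    tflops: List[float] = []
    for line in log_text.splitlines():
        match = _LINE_RE.search(line)
        if match:
            step_times.append(float(match.group("step")))
            tflops.append(float(match.group("tflops")))
    return {"step_time_s": _stats(step_times), "tflops_per_gpu": _stats(tflops)}


sample = (
    "Step Time : 1.0s GPU utilization: 600.0 TFLOP/s/GPU\n"
    "unrelated line\n"
    "Step Time : 2.0s GPU utilization: 400.0 TFLOP/s/GPU\n"
)
print(summarize(sample))
```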

src/cloudai/workloads/megatron_bridge/__init__.py (1)

17-26: LGTM! Clean package initialization.

The __all__ exports are alphabetically ordered and include all necessary public entities for the Megatron Bridge workload.

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (4)

104-150: LGTM! Well-designed wrapper script for job ID extraction.

The script properly:

  • Sets strict mode (set -euo pipefail)
  • Redirects launcher output to a log file
  • Parses the job ID with graceful fallback (|| true)
  • Emits CloudAI-compatible output or fails with diagnostic info

152-159: Good validation: Rejecting placeholder HF tokens prevents common misconfiguration.

Clear error message guides users to set a real token in their TOML configuration.


191-204: LGTM! Clean helper functions for flag handling.

The add() and add_field() helpers properly handle None, booleans (converted to "true"/"false"), and the fields_set logic to avoid emitting defaults.


271-275: LGTM! Detach flag handling is correct.

The logic properly emits --detach for True, --no-detach for False, and nothing when the field is not explicitly set, avoiding Megatron-Bridge's default override issues mentioned in the PR notes.
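
The flag-handling behavior described in the comments above can be sketched in plain Python. This is an illustration only: `build_flags`, the `--<name>` flag spelling, and the use of a plain `set` in place of Pydantic's `model_fields_set` are assumptions, not the code from slurm_command_gen_strategy.py.

```python
from typing import Any, Dict, List, Set


def build_flags(values: Dict[str, Any], fields_set: Set[str]) -> List[str]:
    """Emit CLI flags only for fields the user explicitly set.

    `fields_set` stands in for the set of explicitly provided field names.
    """
    parts: List[str] = []
    for name, value in values.items():
        if name not in fields_set or value is None:
            continue  # not explicitly set: leave the launcher's own default in effect
        if name == "detach":
            # Presence-style flag: --detach or --no-detach, never "--detach true".
            parts.append("--detach" if value else "--no-detach")
        elif isinstance(value, bool):
            parts.extend([f"--{name}", "true" if value else "false"])
        else:
            parts.extend([f"--{name}", str(value)])
    return parts


# "pp" has a value but was not explicitly set, so it is not emitted.
print(build_flags({"detach": False, "tp": 2, "pp": 1}, {"detach", "tp"}))
```

Skipping fields that were never set keeps CloudAI from forcing its own defaults onto the launcher, which is the behavior the PR notes tie to the Megatron-Bridge override issue.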

tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (3)

88-126: LGTM! Good coverage of the fields_set logic.

By constructing MegatronBridgeCmdArgs with only the required fields, this test correctly verifies that optional fields not in fields_set are not emitted in the command.


127-145: LGTM! Container path and CUDA graph scope tests verify key normalization behavior.

Both tests validate that user-provided values are correctly passed through or normalized in the generated wrapper script.


173-183: LGTM! Verifies the generated command file is written correctly.

The test confirms that generated_command.sh is created and contains the expected wrapper script invocation.

src/cloudai/workloads/megatron_bridge/megatron_bridge.py (5)

1-23: LGTM!

License header and imports are appropriate for this module.


26-90: LGTM!

The MegatronBridgeCmdArgs class is well-structured with appropriate field definitions supporting both scalar values and lists for sweep configurations. The hf_token validator correctly sanitizes input.


413-431: LGTM!

The constraint aggregation logic is correct. All 17 constraints are properly combined, and the approach of logging each failure individually before returning provides excellent debugging visibility.


232-235: Edge case: cuda_graph_scope with only empty/whitespace values.

The cuda_graphs determination checks len(scopes) > 0, but if _normalize_str_list filters out all empty segments, an original non-empty input like cuda_graph_scope=" " would result in scopes = []. This is likely the intended behavior, but worth confirming the edge case is handled as expected.
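
The edge case above can be reproduced with a small stand-in for the normalization helper. The function below is a hypothetical reimplementation (comma splitting, whitespace stripping, dropping empty segments) written only to show how a whitespace-only input ends up as an empty list; the real `_normalize_str_list` may behave differently, and the scope names are placeholders.

```python
from typing import List, Union


def normalize_str_list(value: Union[str, List[str], None]) -> List[str]:
    """Split a comma-separated string (or accept a list) and drop empty segments."""
    if value is None:
        return []
    items = value.split(",") if isinstance(value, str) else list(value)
    return [item.strip() for item in items if item and item.strip()]


scopes = normalize_str_list(" ")
cuda_graphs = len(scopes) > 0  # whitespace-only input is treated as "no scopes"
print(scopes, cuda_graphs)
print(normalize_str_list("scope_a, scope_b ,"))
```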


246-251: Consider simplifying the nested ternary for readability.

The constraint logic on line 247 is difficult to parse at a glance due to nested ternaries:

constraint6 = pp == 1 and cp == 1 and (vp == 1 if vp is not None else True) if fsdp else True
🔎 Proposed refactor for clarity
-        constraint6 = pp == 1 and cp == 1 and (vp == 1 if vp is not None else True) if fsdp else True
+        if fsdp:
+            constraint6 = pp == 1 and cp == 1 and (vp is None or vp == 1)
+        else:
+            constraint6 = True
⛔ Skipped due to learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
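For reference, the one-line form and the proposed refactor of constraint6 can be checked for equivalence over a small grid of values. The function names and the value grid below are assumptions made for this sketch; the expressions themselves are copied from the comment above.

```python
from itertools import product


def constraint6_original(pp, cp, vp, fsdp):
    # Python parses this as: (pp == 1 and cp == 1 and (...)) if fsdp else True
    return pp == 1 and cp == 1 and (vp == 1 if vp is not None else True) if fsdp else True


def constraint6_refactored(pp, cp, vp, fsdp):
    if fsdp:
        return pp == 1 and cp == 1 and (vp is None or vp == 1)
    return True


# Exhaustively compare both forms on a small, illustrative grid.
for pp, cp, vp, fsdp in product([1, 2], [1, 2], [None, 1, 2], [False, True]):
    assert constraint6_original(pp, cp, vp, fsdp) == constraint6_refactored(pp, cp, vp, fsdp)
```

The conditional expression binds more loosely than `and`, so the trailing `if fsdp else True` applies to the whole conjunction, which is why the two forms agree.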

@srivatsankrishnan
Contributor Author

Additional Comments (2)

  1. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 151-152 (link)
    logic: _as_int accepts List[int] but only casts without checking type, so tp=[1,2,4] sweeps will pass through incorrectly as list objects
  2. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 154-155 (link)
    logic: _as_bool has same issue - use_megatron_fsdp=[true, false] sweeps will incorrectly evaluate as truthy list

12 files reviewed, 2 comments


This is fine. The list resolution happens in CloudAIGym before calling constraint checks. So at this point there is no question of values coming in as sweeps (lists). You can ignore this.
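The ordering described in this reply (resolve sweep lists first, then run constraint checks on scalar values) can be illustrated with a small sketch. `expand_sweeps` and `check_constraints` are hypothetical helpers written for this example and do not reflect CloudAIGym's actual code; the divisibility rule is a simplified stand-in for the real constraints.

```python
from itertools import product
from typing import Any, Dict, Iterator


def expand_sweeps(cmd_args: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one scalar-only config per combination of list-valued fields."""
    keys = list(cmd_args)
    # Wrap scalars in a single-element list so every field can be iterated.
    choices = [v if isinstance(v, list) else [v] for v in cmd_args.values()]
    for combo in product(*choices):
        yield dict(zip(keys, combo))


def check_constraints(cfg: Dict[str, Any], num_gpus: int) -> bool:
    """Hypothetical constraint: the GPU count must be divisible by tp * pp."""
    return num_gpus % (int(cfg["tp"]) * int(cfg["pp"])) == 0


configs = list(expand_sweeps({"tp": [1, 2, 4], "pp": 1, "use_megatron_fsdp": [True, False]}))
valid = [cfg for cfg in configs if check_constraints(cfg, num_gpus=8)]
```

Because every config reaching `check_constraints` is already scalar, a cast such as `int(cfg["tp"])` never sees a list.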

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 871f460 and a054f4a.

📒 Files selected for processing (1)
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:11.471Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (7)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (7)

32-77: LGTM: Clean fixture setup.

The fixtures properly create a synthetic test environment with fake installation paths, avoiding dependencies on real installs. The approach is well-documented with inline comments.


79-86: LGTM: Proper validation test.

Correctly verifies that placeholder tokens are rejected at runtime with an appropriate error.


88-125: LGTM: Comprehensive default handling test.

Correctly verifies that optional fields not specified in TOML are omitted from the generated command. The intentional duplication of setup logic (rather than using fixtures) is appropriate for testing this minimal configuration.


127-139: LGTM: Container path handling validated.

Properly verifies that local container image paths are preserved verbatim and not overridden by cached paths.


141-145: LGTM: Normalization verified.

Correctly validates that cuda_graph_scope bracket notation is normalized in the generated wrapper.


165-170: LGTM: model_fields_set issue resolved.

The reconstruction approach using model_dump and model_validate correctly addresses the previous review concern. This ensures model_fields_set is properly populated based on which fields are present in the data dict, allowing the strategy code's "detach" in fields_set check to work correctly.


183-193: LGTM: Command file generation verified.

Properly validates that the generated command file is created with the expected content and format.

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 219 (link)

    style: check if self.system.account exists before using, as account may be optional

  2. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 237-243 (link)

    style: consider moving required field validation to Pydantic model instead of runtime checks

    If model_name and model_size are truly required, define them without defaults in MegatronBridgeCmdArgs so Pydantic validates them at construction time.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

12 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (3)

  1. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 201-205 (link)

    style: Constraint 2 bypassed (assumes num_layers not available, set to True). VP validation incomplete. Constraint 3 (dp != 0) is redundant with Constraint 1 since if num_gpus % (tp*pp*cp) == 0 and num_gpus > 0, then dp will always be non-zero.

  2. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 153-157 (link)

    style: Consider moving HF token validation (checking for empty/placeholder values) to a @field_validator in MegatronBridgeCmdArgs for earlier error detection rather than failing at command generation time.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  3. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 126-132 (link)

    style: Job ID extraction relies on grepping for "Job id: <num>" pattern. Verify this pattern matches Megatron-Bridge's output format across different versions (currently mapped to r0.2.0 for container 25.11).

12 files reviewed, 3 comments
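The redundancy claim in the first comment can be explored with a short sketch. It assumes that dp is computed as num_gpus // (tp * pp * cp), which the comment implies but the thread does not show; the function names are hypothetical.

```python
from itertools import product


def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int) -> int:
    """Assumed definition: GPU count divided by the model-parallel size."""
    return num_gpus // (tp * pp * cp)


def constraints_1_and_3(num_gpus: int, tp: int, pp: int, cp: int):
    c1 = num_gpus % (tp * pp * cp) == 0  # Constraint 1: divisibility
    c3 = data_parallel_size(num_gpus, tp, pp, cp) != 0  # Constraint 3: dp != 0
    return c1, c3


# Whenever Constraint 1 holds for a positive GPU count, Constraint 3 holds too.
for num_gpus, tp, pp, cp in product([1, 2, 4, 8, 16], [1, 2, 4, 8], [1, 2, 4], [1, 2]):
    c1, c3 = constraints_1_and_3(num_gpus, tp, pp, cp)
    if c1:
        assert c3
```

Under that assumption Constraint 3 never rejects a config that Constraint 1 accepts, although evaluating it separately still produces its own log line when it fails.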

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (3)

  1. tests/test_test_definitions.py, line 93 (link)

    style: Check should also handle None values to be more robust

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  2. src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 62-64 (link)

    style: Limiting to last 10 data points is hardcoded and may not be suitable for all scenarios - consider making this configurable or removing the limit. Is there a specific reason for limiting to 10 data points, or should this be configurable based on the test requirements?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  3. src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 98 (link)

    style: Using population standard deviation (pstdev) instead of sample standard deviation - verify this is the intended statistical measure. Should this be sample standard deviation (stdev) instead of population standard deviation (pstdev)?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

13 files reviewed, 3 comments
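The two statistical questions raised here (the last-10 window and pstdev versus stdev) can be compared directly with Python's `statistics` module. The helper name `summarize_tail` and the sample step times are hypothetical, chosen only to illustrate the difference.

```python
import statistics
from typing import Dict, List


def summarize_tail(step_times: List[float], tail: int = 10) -> Dict[str, float]:
    """Summarize only the last `tail` samples, skipping warm-up steps."""
    window = step_times[-tail:]
    return {
        "mean": statistics.mean(window),
        # Population form divides by n; sample form divides by n - 1.
        "pstdev": statistics.pstdev(window),
        "stdev": statistics.stdev(window),  # needs at least two samples
    }


# Illustrative data: two slow warm-up steps followed by ten steady steps.
step_times = [5.0, 4.0] + [1.0, 1.2] * 5
summary = summarize_tail(step_times)
```

With a fixed window of 10 samples the two measures differ only by a factor of sqrt(10 / 9), so the choice mostly affects how the number is interpreted rather than its magnitude.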

@srivatsankrishnan
Contributor Author

Constraint checks are independent of each other. You can ignore them. Moving the HF_token check to a field validator is addressed in this commit.
The Job ID extraction pattern will remain the same because it comes from Nemo Run and does not change across the versions we are using. Safe to ignore. @greptile

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (5)

  1. conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml, line 22 (link)

    syntax: num_nodes value should be an integer, not a string

  2. tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py, line 32 (link)

    style: The log content appears to have a formatting issue - 'ain_fp8_mx/0' looks like a truncated prefix that should probably be a complete experiment name or identifier. Is this truncated log line intentional for testing, or should it include the full experiment identifier?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  3. src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 62-64 (link)

    style: only last 10 samples used for statistics without explanation. Is this limitation intentional for performance reasons or to focus on steady-state metrics?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  4. src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 133 (link)

    style: Job ID parsing relies on hardcoded regex pattern 'Job id[: ]+[0-9]+' which could break if Megatron-Bridge changes its output format. Is there a more robust way to get the job ID from Megatron-Bridge, such as through its API or a structured output format?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  5. src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 137-141 (link)

    style: Version mapping only includes 25.11->r0.2.0, defaulting unknown versions to 'main' branch could cause compatibility issues

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

13 files reviewed, 5 comments
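As a sketch of the job ID parsing discussed in item 4, the quoted pattern can be applied in Python as shown below. The PR itself does this inside a generated shell wrapper; the function name `extract_job_id` and the sample log text are assumptions for illustration, with the job number taken from the test run in the PR description.

```python
import re
from typing import Optional

# Pattern quoted in the review comment, with a capture group added for the digits.
JOB_ID_RE = re.compile(r"Job id[: ]+([0-9]+)")


def extract_job_id(log_text: str) -> Optional[int]:
    """Return the first job ID found in the launcher log, or None if absent."""
    match = JOB_ID_RE.search(log_text)
    return int(match.group(1)) if match else None


sample_log = "Launching experiment...\nJob id: 586542\nDone.\n"
job_id = extract_job_id(sample_log)
```

Returning `None` instead of raising lets the caller decide how to report a launcher log that never printed a job ID.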

@srivatsankrishnan
Contributor Author

@greptile
Safe to ignore these comments. Using the last 10 data points is a requirement for stability in the training step. Using the normal standard deviation is fine.

Contributor

@greptile-apps greptile-apps bot left a comment

Got it! I've noted these preferences for ML training workloads:

  1. Last 10 data points limit: This is intentional for training stability metrics and shouldn't be flagged as hardcoded
  2. Population std deviation (pstdev): This is the correct statistical measure for training metrics, not sample std deviation

Here's a link to the new rule that will be applied in future reviews.

@srivatsankrishnan
Contributor Author

@greptile
These comments are repeats and were already addressed in this PR. Safe to ignore them.

Contributor

@greptile-apps greptile-apps bot left a comment

Perfect! I've created a rule to remember your team's preferences for MegatronBridge workload implementations. I won't flag these patterns as issues in future reviews. Thanks for the feedback!

@alexmanle alexmanle left a comment

Great work!

@srivatsankrishnan srivatsankrishnan merged commit 99f9158 into NVIDIA:main Dec 23, 2025
5 checks passed