-
Notifications
You must be signed in to change notification settings - Fork 41
Megatron Bridge in CloudAI #764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughAdds a Megatron-Bridge workload: new TOML configs and test scenario, a workload package (CmdArgs/TestDefinition), Slurm command generator, report generation strategy, registration updates, and unit tests for reporting and Slurm command generation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (4)
🧰 Additional context used🧠 Learnings (8)📓 Common learnings📚 Learning: 2025-12-17T22:24:51.805ZApplied to files:
📚 Learning: 2025-12-16T19:47:41.994ZApplied to files:
📚 Learning: 2025-12-23T00:23:11.471ZApplied to files:
📚 Learning: 2025-12-17T22:02:45.215ZApplied to files:
📚 Learning: 2025-12-23T00:28:50.788ZApplied to files:
📚 Learning: 2025-12-05T13:59:40.479ZApplied to files:
📚 Learning: 2025-12-05T13:58:27.113ZApplied to files:
🧬 Code graph analysis (3)tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (3)
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (3)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile Summary
Important Files Changed
Confidence score: 3/5
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 151-152 (link)logic:
_as_intacceptsList[int]but only casts without checking type, sotp=[1,2,4]sweeps will pass through incorrectly as list objects -
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 154-155 (link)logic:
_as_boolhas same issue -use_megatron_fsdp=[true, false]sweeps will incorrectly evaluate as truthy list
12 files reviewed, 2 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.tomlconf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.tomlsrc/cloudai/registration.pysrc/cloudai/workloads/megatron_bridge/__init__.pysrc/cloudai/workloads/megatron_bridge/megatron_bridge.pysrc/cloudai/workloads/megatron_bridge/report_generation_strategy.pysrc/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pytests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.pytests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.pytests/test_cloudaigym.pytests/test_init.pytests/test_test_scenario.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Applied to files:
tests/test_cloudaigym.pytests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.pytests/test_init.pysrc/cloudai/workloads/megatron_bridge/megatron_bridge.pysrc/cloudai/workloads/megatron_bridge/report_generation_strategy.pytests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.pysrc/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pytests/test_test_scenario.pysrc/cloudai/workloads/megatron_bridge/__init__.pysrc/cloudai/registration.py
📚 Learning: 2025-12-17T22:02:45.215Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 756
File: src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py:65-85
Timestamp: 2025-12-17T22:02:45.215Z
Learning: In CloudAI's DSE flow for the Aiconfigurator workload (src/cloudai/workloads/aiconfig/standalone_command_gen_strategy.py), list-valued parameters in AiconfiguratorCmdArgs (such as batch_size, ctx_tokens, tp, pp, dp, etc. in Agg and Disagg models) are scalarized by apply_params_set before gen_exec_command is called, so these fields are guaranteed to be scalar integers at command generation time.
Applied to files:
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
🧬 Code graph analysis (5)
tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py (2)
src/cloudai/_core/test_scenario.py (1)
TestRun(58-174)src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (3)
MegatronBridgeReportGenerationStrategy(28-163)can_handle_directory(47-48)generate_report(73-136)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
src/cloudai/_core/installables.py (4)
DockerImage(36-84)GitRepo(87-115)Installable(25-32)PythonExecutable(119-145)src/cloudai/models/workload.py (2)
CmdArgs(26-29)TestDefinition(89-141)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
src/cloudai/_core/report_generation_strategy.py (1)
ReportGenerationStrategy(24-40)
src/cloudai/workloads/megatron_bridge/__init__.py (3)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
MegatronBridgeCmdArgs(26-89)MegatronBridgeTestDefinition(92-431)src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
MegatronBridgeReportGenerationStrategy(28-163)src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
MegatronBridgeSlurmCommandGenStrategy(32-281)
src/cloudai/registration.py (4)
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py (1)
MegatronBridgeReportGenerationStrategy(28-163)src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
MegatronBridgeSlurmCommandGenStrategy(32-281)src/cloudai/workloads/megatron_bridge/megatron_bridge.py (1)
MegatronBridgeTestDefinition(92-431)src/cloudai/_core/registry.py (2)
add_command_gen_strategy(251-259)add_report(207-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🔇 Additional comments (21)
conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml (1)
17-22: LGTM!The test scenario configuration is minimal and correctly references the test definition. The structure aligns with other scenario files in the codebase.
src/cloudai/registration.py (1)
94-98: LGTM!The MegatronBridge registrations follow the established patterns in the codebase:
- Import grouped with other workload imports
- Command generation strategy registered for SlurmSystem
- Test definition and report strategy properly registered
Also applies to: 193-195, 245-245, 262-262
conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml (2)
17-35: LGTM for experimental configuration.The test configuration is well-structured for a Qwen 30B model run. The parameters align with the 2-node scenario (8 GPUs / 4 GPUs per node = 2 nodes).
35-35: No action needed — hf_token validation is already implemented.The placeholder
hf_token = "REPLACE_ME_WITH_HF_TOKEN"is already caught at runtime. The_build_launcher_partsmethod inMegatronBridgeSlurmCommandGenStrategyexplicitly validates this value at line 155 and raises aRuntimeErrorwith a clear message: "HuggingFace token is required. Please set cmd_args.hf_token to a real token string (not 'REPLACE_ME_WITH_HF_TOKEN') in your local test TOML."Likely an incorrect or invalid review comment.
tests/test_test_scenario.py (1)
52-52: LGTM!The test updates correctly integrate MegatronBridge:
- Import added in proper alphabetical position
- Reporter count incremented from 15 to 16
- Parametrized test case added for the new test definition and report strategy mapping
Also applies to: 475-475, 485-485
tests/test_init.py (1)
52-55: LGTM! MegatronBridge registration follows the established pattern.The import, command generation strategy mapping, test definition count update, and test definition entry are all consistent with the existing workload registrations.
Also applies to: 136-136, 220-220, 235-235
tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py (2)
27-42: LGTM! The fixture provides a realistic test setup.The log content accurately reflects the Megatron-Bridge output format with Step Time and GPU utilization metrics, which the report generation strategy parses.
45-57: LGTM! Tests cover the essential report generation behavior.The tests verify both
can_handle_directory()and the content of the generated report, aligning with the behavioral documentation approach mentioned in the learnings.src/cloudai/workloads/megatron_bridge/__init__.py (1)
17-26: LGTM! Clean package initialization.The
__all__exports are alphabetically ordered and include all necessary public entities for the Megatron Bridge workload.src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (4)
104-150: LGTM! Well-designed wrapper script for job ID extraction.The script properly:
- Sets strict mode (
set -euo pipefail)- Redirects launcher output to a log file
- Parses the job ID with graceful fallback (
|| true)- Emits CloudAI-compatible output or fails with diagnostic info
152-159: Good validation: Rejecting placeholder HF tokens prevents common misconfiguration.Clear error message guides users to set a real token in their TOML configuration.
191-204: LGTM! Clean helper functions for flag handling.The
add()andadd_field()helpers properly handleNone, booleans (converted to"true"/"false"), and thefields_setlogic to avoid emitting defaults.
271-275: LGTM! Detach flag handling is correct.The logic properly emits
--detachfor True,--no-detachfor False, and nothing when the field is not explicitly set, avoiding Megatron-Bridge's default override issues mentioned in the PR notes.tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (3)
88-126: LGTM! Good coverage of thefields_setlogic.By constructing
MegatronBridgeCmdArgswith only the required fields, this test correctly verifies that optional fields not infields_setare not emitted in the command.
127-145: LGTM! Container path and CUDA graph scope tests verify key normalization behavior.Both tests validate that user-provided values are correctly passed through or normalized in the generated wrapper script.
173-183: LGTM! Verifies the generated command file is written correctly.The test confirms that
generated_command.shis created and contains the expected wrapper script invocation.src/cloudai/workloads/megatron_bridge/megatron_bridge.py (5)
1-23: LGTM!License header and imports are appropriate for this module.
26-90: LGTM!The
MegatronBridgeCmdArgsclass is well-structured with appropriate field definitions supporting both scalar values and lists for sweep configurations. Thehf_tokenvalidator correctly sanitizes input.
413-431: LGTM!The constraint aggregation logic is correct. All 17 constraints are properly combined, and the approach of logging each failure individually before returning provides excellent debugging visibility.
232-235: Edge case:cuda_graph_scopewith only empty/whitespace values.The
cuda_graphsdetermination checkslen(scopes) > 0, but if_normalize_str_listfilters out all empty segments, an original non-empty input likecuda_graph_scope=" "would result inscopes = []. This is likely the intended behavior, but worth confirming the edge case is handled as expected.
246-251: Consider simplifying the nested ternary for readability.The constraint logic on line 247 is difficult to parse at a glance due to nested ternaries:
constraint6 = pp == 1 and cp == 1 and (vp == 1 if vp is not None else True) if fsdp else True🔎 Proposed refactor for clarity
- constraint6 = pp == 1 and cp == 1 and (vp == 1 if vp is not None else True) if fsdp else True + if fsdp: + constraint6 = pp == 1 and cp == 1 and (vp is None or vp == 1) + else: + constraint6 = True⛔ Skipped due to learnings
Learnt from: srivatsankrishnan Repo: NVIDIA/cloudai PR: 760 File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122 Timestamp: 2025-12-17T22:24:51.805Z Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py
Outdated
Show resolved
Hide resolved
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py
Outdated
Show resolved
Hide resolved
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py
Outdated
Show resolved
Hide resolved
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
Outdated
Show resolved
Hide resolved
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
Show resolved
Hide resolved
This is fine. The list resolution happens in CloudAIGym before calling constraint checks. So at this point there is no question of values coming in as sweeps (lists). You can ignore this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:11.471Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Applied to files:
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🔇 Additional comments (7)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (7)
32-77: LGTM: Clean fixture setup.The fixtures properly create a synthetic test environment with fake installation paths, avoiding dependencies on real installs. The approach is well-documented with inline comments.
79-86: LGTM: Proper validation test.Correctly verifies that placeholder tokens are rejected at runtime with an appropriate error.
88-125: LGTM: Comprehensive default handling test.Correctly verifies that optional fields not specified in TOML are omitted from the generated command. The intentional duplication of setup logic (rather than using fixtures) is appropriate for testing this minimal configuration.
127-139: LGTM: Container path handling validated.Properly verifies that local container image paths are preserved verbatim and not overridden by cached paths.
141-145: LGTM: Normalization verified.Correctly validates that cuda_graph_scope bracket notation is normalized in the generated wrapper.
165-170: LGTM: model_fields_set issue resolved.The reconstruction approach using
model_dumpandmodel_validatecorrectly addresses the previous review concern. This ensuresmodel_fields_setis properly populated based on which fields are present in the data dict, allowing the strategy code's"detach" in fields_setcheck to work correctly.
183-193: LGTM: Command file generation verified.Properly validates that the generated command file is created with the expected content and format.
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 219 (link)style: check if
self.system.accountexists before using, as account may be optional -
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 237-243 (link)style: consider moving required field validation to Pydantic model instead of runtime checks
If
model_nameandmodel_sizeare truly required, define them without defaults inMegatronBridgeCmdArgsso Pydantic validates them at construction time.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
12 files reviewed, 2 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 201-205 (link)style: Constraint 2 bypassed (assumes
num_layersnot available, set to True). VP validation incomplete. Constraint 3 (dp != 0) is redundant with Constraint 1 since ifnum_gpus % (tp*pp*cp) == 0andnum_gpus > 0, thendpwill always be non-zero. -
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 153-157 (link)style: Consider moving HF token validation (checking for empty/placeholder values) to a
@field_validatorinMegatronBridgeCmdArgsfor earlier error detection rather than failing at command generation time.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 126-132 (link)style: Job ID extraction relies on grepping for
"Job id: <num>"pattern. Verify this pattern matches Megatron-Bridge's output format across different versions (currently mapped to r0.2.0 for container 25.11).
12 files reviewed, 3 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
tests/test_test_definitions.py, line 93 (link)style: Check should also handle None values to be more robust
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 62-64 (link)style: Limiting to last 10 data points is hardcoded and may not be suitable for all scenarios - consider making this configurable or removing the limit. Is there a specific reason for limiting to 10 data points, or should this be configurable based on the test requirements?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 98 (link)style: Using population standard deviation (pstdev) instead of sample standard deviation - verify this is the intended statistical measure. Should this be sample standard deviation (stdev) instead of population standard deviation (pstdev)?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
13 files reviewed, 3 comments
Constraint checks are independent to each other. you can ignore them. HF_token to field validator is addresssed in this commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (5)
-
conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml, line 22 (link)syntax: num_nodes value should be an integer, not a string
-
tests/report_generation_strategy/test_megatron_bridge_report_generation_strategy.py, line 32 (link)style: The log content appears to have a formatting issue - 'ain_fp8_mx/0' looks like a truncated prefix that should probably be a complete experiment name or identifier. Is this truncated log line intentional for testing, or should it include the full experiment identifier?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/report_generation_strategy.py, line 62-64 (link)style: only last 10 samples used for statistics without explanation. Is this limitation intentional for performance reasons or to focus on steady-state metrics?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, line 133 (link)style: Job ID parsing relies on hardcoded regex pattern 'Job id[: ]+[0-9]+' which could break if Megatron-Bridge changes its output format. Is there a more robust way to get the job ID from Megatron-Bridge, such as through its API or a structured output format?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
src/cloudai/workloads/megatron_bridge/megatron_bridge.py, line 137-141 (link)style: Version mapping only includes 25.11->r0.2.0, defaulting unknown versions to 'main' branch could cause compatibility issues
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
13 files reviewed, 5 comments
@greptile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I've noted these preferences for ML training workloads:
- Last 10 data points limit: This is intentional for training stability metrics and shouldn't be flagged as hardcoded
- Population std deviation (pstdev): This is the correct statistical measure for training metrics, not sample std deviation
Here's a link to the new rule that will be applied in future reviews.
@greptile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! I've created a rule to remember your team's preferences for MegatronBridge workload implementations. I won't flag these patterns as issues in future reviews. Thanks for the feedback!
alexmanle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
Summary
Added MegatronBridge as a native CloudAI SlurmSystem workload with a SlurmCommandGenStrategy that runs Megatron-Bridge’s scripts/performance/setup_experiment.py using CloudAI-managed installs (Git clone + dedicated venv). Implemented Slurm job-id retrieval for Megatron-Bridge by generating a readable wrapper script (megatron_bridge_submit_and_parse_jobid.sh) that redirects launcher output to megatron_bridge_launcher.log
extracts Job id: from that log and CloudAI can track the Slurm job.
Known Issues: There is some issues with M-bridge overriding logic from passing the default values to it. Once the issue is root caused (on M-bridge side), we should follow it up with another PR.
Test Plan
Additional Notes
Include any other notes or comments about the pull request here. This can include challenges faced, future considerations, or context that reviewers might find helpful.