Release v1.4.0 · NVIDIA/cloudai

Highlights

1. GB300 Support for Common Configs

CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.

2. AI Dynamo with Kubernetes Support (Alpha)

Added Kubernetes SPCx support for the AI Dynamo workload
Enhanced container orchestration capabilities

3. New Model and Workload Support

Qwen Recipe Support: Added comprehensive Qwen model recipe support
SGLang Backend: Added SGLang backend support for DeepSeekR1 model
NIXL KVBench: New NIXL KVBench workload with full Slurm integration

4. Advanced AI Dynamo Capabilities

Multi-worker GPU Slicing: Multi-worker-per-node GPU slicing with dynamic allocation
Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
Shell Script Entry Point: Shell script-based entry point (replacing Python implementation)
Environment Validation: Environment validation during startup sequence
Error Handling: Error detection and retry mechanism for worker failures
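As an illustration of the error detection and retry item above, here is a minimal Python sketch of a bounded retry loop. It is not the CloudAI implementation (per the notes, the AI Dynamo entry point is a shell script). The function name `run_with_retry`, the `start_worker` callable, and the parameters `max_attempts` and `delay_s` are all hypothetical.

```python
import time
from typing import Callable


def run_with_retry(
    start_worker: Callable[[], int],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> bool:
    """Run a worker until it exits with code 0 or the attempts run out.

    `start_worker` is a hypothetical callable that launches one worker
    and returns its exit code. A non-zero code is treated as a failure.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code = start_worker()
        if exit_code == 0:
            return True
        print(f"worker failed with exit code {exit_code} "
              f"(attempt {attempt}/{max_attempts})")
        if attempt < max_attempts:
            # Optional pause before the next attempt.
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    # A stand-in worker that fails twice and then succeeds.
    exit_codes = iter([1, 1, 0])
    succeeded = run_with_retry(lambda: next(exit_codes), max_attempts=3)
    print("succeeded:", succeeded)
```

Bounding the number of attempts keeps a persistently failing worker from blocking the job indefinitely.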

5. Agent System Enhancements

Plugin System: Load agents from entrypoints for extensibility
Custom Reward Functions: Custom reward functions with latency & throughput metrics handling
Configurable Rewards: Configurable agent reward functions

6. Enhanced Documentation

Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
Interactive Features: Copy-to-clipboard support for code snippets

7. CLI Modernization

Click Framework: Complete CLI re-implementation using Click framework (replacing argparse)
Improved Usability: Made --tests-dir optional for better user experience
Simplified Commands: Removed unnecessary install/uninstall command options

8. NIXL Workload Improvements

Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
Multi-backend Support: Support for non-UCX backends with multiple backend options
Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
Container Management: etcd now managed from NIXL container

9. Reporting & Metrics

NCCL Comparison Reports: NCCL comparison reports with latency metrics
Reusable Framework: Reusable comparison report framework for NIXL
Multi-section CSV: Multi-section CSV format handling for AI Dynamo
Configurable Reports: Configurable reports via scenario configuration
DSE Trajectory Support: DSE trajectory support for single-sbatch mode
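To make the multi-section CSV item concrete, here is a minimal sketch of one way to parse a file that holds several CSV tables. It assumes that sections are separated by blank lines and that each section starts with its own header row. The actual AI Dynamo output layout is not described in these notes, so the format assumption, the function name `parse_multi_section_csv`, and the sample data are all hypothetical.

```python
import csv
from typing import Dict, List


def parse_multi_section_csv(text: str) -> List[List[Dict[str, str]]]:
    """Split text on blank lines and parse each block as its own CSV table.

    Assumption: a blank line ends a section, and the first line of each
    section is that section's header row.
    """
    sections: List[List[Dict[str, str]]] = []
    block: List[str] = []
    # The trailing "" flushes the last block if the text has no final blank line.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
        elif block:
            sections.append(list(csv.DictReader(block)))
            block = []
    return sections


if __name__ == "__main__":
    # Made-up sample data with two sections that use different headers.
    sample = (
        "metric,avg\n"
        "ttft_ms,12.5\n"
        "\n"
        "concurrency,throughput\n"
        "8,100\n"
        "16,180\n"
    )
    for index, rows in enumerate(parse_multi_section_csv(sample)):
        print(index, rows)
```

Parsing each section separately avoids forcing rows with different columns into a single header.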

10. Development & Maintenance

Environment Management: Added uv.lock for persistent environment management
Code Cleanup: Removed deprecated NemoLauncher-based configurations
Refactoring: NeMo recipes refactoring for better maintainability

11. Reliability & Error Handling

Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
Graceful Error Handling: Graceful handling of missing tests with MissingTestError
Environment Preservation: Better environment variable preservation (order maintained from system schema)
Debug Logging: Enhanced debug logging for system config parsing errors
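The output directory item above can be pictured with the following minimal sketch, which distinguishes permission errors from read-only filesystem errors when creating a directory. The function name matches the `prepare_output_dir` mentioned in the change list, but the signature, return type, and log messages here are assumptions and not the CloudAI code.

```python
import errno
import logging
from pathlib import Path
from typing import Optional


def prepare_output_dir(path: Path) -> Optional[Path]:
    """Create `path` (and parents) and return it, or return None on failure.

    Failures are logged with a specific message instead of raising, so the
    caller can stop cleanly without a backtrace.
    """
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError:
        logging.error("No permission to create output directory: %s", path)
        return None
    except OSError as exc:
        if exc.errno == errno.EROFS:
            logging.error("Output directory is on a read-only filesystem: %s", path)
        else:
            logging.error("Cannot create output directory %s: %s", path, exc)
        return None
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        print(prepare_output_dir(Path(tmp) / "results" / "run1"))
```

`PermissionError` is caught first because it is a subclass of `OSError`; the read-only case is identified through `errno.EROFS`.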

What's Changed (details)

Support custom matgen args and set valid ppn by @amaslenn in #612
Fix gres related directives for single sbatch mode by @amaslenn in #613
Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
Bump default version to v1.4 by @amaslenn in #622
Update USER_GUIDE.md by @TaekyungHeo in #623
Update doc/ai_dynamo.md by @TaekyungHeo in #624
Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
Update doc/ai_dynamo.md by @TaekyungHeo in #628
Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
Updates for SlurmContainer workload by @amaslenn in #638
Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
Preserve installables' state during apply_params_set() by @amaslenn in #643
Control which env vars dumped for per-rank evaluation by @amaslenn in #642
Align extra_env_vars definition in test and scenario by @amaslenn in #644
Update USER_GUIDE.md by @TaekyungHeo in #646
Add latency metric reporting for NCCL by @amaslenn in #645
Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
Add environment validation to startup sequence by @TaekyungHeo in #651
Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
Comparison report for NCCL workloads by @amaslenn in #656
Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
Configure reports via scenario config by @amaslenn in #661
Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
Small housekeeping updates by @amaslenn in #663
nemo recipes refactor by @malay-nagda in #633
Re-use comparison report for NIXL by @amaslenn in #664
Handle single-sbatch metadata layout in report by @amaslenn in #666
Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
Support two NIXL bench output formats by @amaslenn in #668
Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
Update NIXL bench command generation logic by @amaslenn in #673
Make agent_reward_function configurable by @TaekyungHeo in #675
Add custom reward functions with latency & throughput metrics handling by @TaekyungHeo in #674
Auto install missing components for workloads in run mode by @amaslenn in #676
Fix step idx in single-sbatch trajectory by @amaslenn in #678
Get rid of strict validation for configs by @amaslenn in #680
Update Nemo image reference to nvcr.io#nvidia/nemo:25.07 in all toml files by @TaekyungHeo in #681
Update NIXL perftest workload by @amaslenn in #679
Increase time limit to 60m in conf/common/test_scenario/dse_nemo_run_llama3_8b.toml by @TaekyungHeo in #682
Add support for NIXL kvbench by @amaslenn in #683
Add public documentation via GitHub pages by @amaslenn in #684
Update NIXL workloads by @amaslenn in #685
Add documentation for workloads by @amaslenn in #686
Explicitly forbid dependencies for scenarios with DSE jobs by @amaslenn in #687
Remove NemoLauncher-based configs by @amaslenn in #689
Fix NIXL bench output parsing in case of noisy output by @amaslenn in #690
Add more documentation by @amaslenn in #691
Re-work CLI implementation to use Click by @amaslenn in #677
Allow sweeps without reporter by @amaslenn in #696
Fix CLI arg for kvbench by @amaslenn in #695
Add copyright headers for files in doc/ by @amaslenn in #694
Make --tests-dir optional by @amaslenn in #693
Small formatting improvements by @amaslenn in #697
Remove useless options for install/uninstall commands by @amaslenn in #699
Removing VBOOST by @srivatsankrishnan in #698
Fix install/uninstall commands examples in doc by @amaslenn in #700
Add uv.lock for persistent env by @amaslenn in #701
Upgrade lock file version to make self version dynamic by @amaslenn in #702
Allow agents to be registered via entry points by @amaslenn in #703
Add Qwen Recipe (VER) by @srivatsankrishnan in #704
Add debug logging for system config parsing errors by @TaekyungHeo in #705
Enable AI Dynamo w/ K8S SPCx (Alpha) - Rel 2B by @TaekyungHeo in #667
Add llama4 Recipe by @aahouzi in #708
Update NemoRun configs by @amaslenn in #706
Updated Python executable path in NIXL KVBench workloads by @Bohatchuk in #709
Update docs by @amaslenn in #710
ucc_perftest_add_gen_and_a2av by @yaeliyac in #692

New Contributors

@malay-nagda made their first contribution in #633
@Bohatchuk made their first contribution in #709
@yaeliyac made their first contribution in #692

Full Changelog: v1.3.0...v1.4.0
