Highlights
1. GB300 Support for Common Configs
CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.
2. AI Dynamo with Kubernetes Support (Alpha)
- Added Kubernetes SPCx support for the AI Dynamo workload
- Enhanced container orchestration capabilities
3. New Model and Workload Support
- Qwen Recipe Support: Added comprehensive Qwen model recipe support
- SGLang Backend: Added SGLang backend support for DeepSeekR1 model
- NIXL KVBench: New NIXL KVBench workload with full Slurm integration
4. Advanced AI Dynamo Capabilities
- Multi-worker GPU Slicing: Multi-worker-per-node GPU slicing with dynamic allocation
- Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
- Shell Script Entry Point: Shell script-based entry point (replacing Python implementation)
- Environment Validation: Environment validation during startup sequence
- Error Handling: Error detection and retry mechanism for worker failures
5. Agent System Enhancements
- Plugin System: Agents can now be loaded from entry points for extensibility
- Custom Reward Functions: Custom reward functions with latency & throughput metrics handling
- Configurable Rewards: Configurable agent reward functions
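As an illustration of entry-point based agent loading, here is a minimal sketch built on the standard importlib.metadata API (the group= keyword needs Python 3.10+). The group name cloudai.agents and the helper load_plugins are hypothetical and are not taken from the CloudAI source; the demo loads a stdlib callable instead of a real agent.

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Any, Dict, Iterable, Optional

# Hypothetical group name; the group CloudAI actually uses may differ.
AGENT_GROUP = "cloudai.agents"


def load_plugins(eps: Optional[Iterable[EntryPoint]] = None) -> Dict[str, Any]:
    """Return a name -> object registry built from entry points."""
    if eps is None:
        # Discover entry points advertised by installed packages (Python 3.10+).
        eps = entry_points(group=AGENT_GROUP)
    registry: Dict[str, Any] = {}
    for ep in eps:
        try:
            # load() imports the module and resolves the referenced attribute.
            registry[ep.name] = ep.load()
        except Exception as exc:
            # One broken plugin should not prevent the others from loading.
            print(f"Skipping plugin {ep.name!r}: {exc}")
    return registry


# Demo with a hand-built entry point that points at a stdlib callable.
demo = EntryPoint(name="json_loads", value="json:loads", group=AGENT_GROUP)
plugins = load_plugins([demo])
print(sorted(plugins))
```

Under this assumption, a third-party package would advertise its agent in its own packaging metadata under the same group name, and it would be discovered without changes to the host application.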
6. Enhanced Documentation
- Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
- Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
- Interactive Features: Copy-to-clipboard support for code snippets
7. CLI Modernization
- Click Framework: Complete CLI re-implementation using Click framework (replacing argparse)
- Improved Usability: Made --tests-dir optional for better user experience
- Simplified Commands: Removed unnecessary install/uninstall command options
8. NIXL Workload Improvements
- Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
- Multi-backend Support: Support for non-UCX backends with multiple backend options
- Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
- Container Management: etcd now managed from NIXL container
9. Reporting & Metrics
- NCCL Comparison Reports: NCCL comparison reports with latency metrics
- Reusable Framework: Reusable comparison report framework for NIXL
- Multi-section CSV: Multi-section CSV format handling for AI Dynamo
- Configurable Reports: Configurable reports via scenario configuration
- DSE Trajectory Support: DSE trajectory support for single-sbatch mode
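To make the multi-section CSV item concrete, the sketch below parses a file that holds several CSV tables, each with its own header row. It assumes sections are separated by blank lines, which may not match the actual AI Dynamo output layout; the function name, column names, and sample values are all made up for illustration.

```python
import csv
import io
from typing import Dict, List


def parse_multi_section_csv(text: str) -> List[List[Dict[str, str]]]:
    """Split text on blank lines and parse each chunk as its own CSV table."""
    sections: List[List[Dict[str, str]]] = []
    chunk: List[str] = []
    # The extra "" at the end flushes the final chunk.
    for line in text.splitlines() + [""]:
        if line.strip():
            chunk.append(line)
        elif chunk:
            reader = csv.DictReader(io.StringIO("\n".join(chunk)))
            sections.append(list(reader))
            chunk = []
    return sections


# Illustrative input only; not real benchmark output.
sample = """metric,avg,p99
ttft_ms,120.5,310.0
itl_ms,11.2,25.4

metric,value
output_tokens_per_s,1843.7
"""

tables = parse_multi_section_csv(sample)
print(len(tables), tables[1][0]["value"])
```

Splitting on blank lines keeps each section's header attached to its own rows. A quoted field that contains an empty line would be split incorrectly, so this approach only fits simple numeric reports.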
10. Development & Maintenance
- Environment Management: Added uv.lock for persistent environment management
- Code Cleanup: Removed deprecated NemoLauncher-based configurations
- Refactoring: NeMo recipes refactoring for better maintainability
11. Reliability & Error Handling
- Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
- Graceful Error Handling: Graceful handling of missing tests with MissingTestError
- Environment Preservation: Better environment variable preservation (order maintained from system schema)
- Debug Logging: Enhanced debug logging for system config parsing errors
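The output directory handling above can be pictured with a short sketch. The function name prepare_output_dir appears in the change list below, but the body here is an assumption about how permission and read-only filesystem errors might be distinguished, not the CloudAI implementation.

```python
import errno
import logging
import os
import tempfile
from pathlib import Path
from typing import Optional


def prepare_output_dir(path: Path) -> Optional[Path]:
    """Create path if needed; return it when writable, else log and return None."""
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError as exc:
        logging.error("No permission to create output directory %s: %s", path, exc)
        return None
    except OSError as exc:
        if exc.errno == errno.EROFS:
            logging.error("Output directory %s is on a read-only filesystem", path)
        else:
            logging.error("Cannot create output directory %s: %s", path, exc)
        return None
    # The directory may already exist but still be unwritable.
    if not os.access(path, os.W_OK):
        logging.error("Output directory %s exists but is not writable", path)
        return None
    return path


# Demo inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    result = prepare_output_dir(Path(tmp) / "results" / "run_0")
    print(result is not None)
```

Returning None instead of raising lets the caller print a single clear message and exit without a backtrace.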
What's Changed (details)
- Support custom matgen args and set valid ppn by @amaslenn in #612
- Fix gres related directives for single sbatch mode by @amaslenn in #613
- Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
- Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
- Bump default version to v1.4 by @amaslenn in #622
- Update USER_GUIDE.md by @TaekyungHeo in #623
- Update doc/ai_dynamo.md by @TaekyungHeo in #624
- Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
- Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
- Update doc/ai_dynamo.md by @TaekyungHeo in #628
- Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
- Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
- Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
- Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
- Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
- Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
- Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
- Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
- Updates for SlurmContainer workload by @amaslenn in #638
- Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
- Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
- Preserve installables' state during apply_params_set() by @amaslenn in #643
- Control which env vars dumped for per-rank evaluation by @amaslenn in #642
- Align extra_env_vars definition in test and scenario by @amaslenn in #644
- Update USER_GUIDE.md by @TaekyungHeo in #646
- Add latency metric reporting for NCCL by @amaslenn in #645
- Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
- Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
- Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
- Add environment validation to startup sequence by @TaekyungHeo in #651
- Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
- Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
- Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
- Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
- Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
- Comparison report for NCCL workloads by @amaslenn in #656
- Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
- Configure reports via scenario config by @amaslenn in #661
- Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
- Small housekeeping updates by @amaslenn in #663
- nemo recipes refactor by @malay-nagda in #633
- Re-use comparison report for NIXL by @amaslenn in #664
- Handle single-sbatch metadata layout in report by @amaslenn in #666
- Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
- Support two NIXL bench output formats by @amaslenn in #668
- Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
- Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
- Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
- Update NIXL bench command generation logic by @amaslenn in #673
- Make agent_reward_function configurable by @TaekyungHeo in #675
- Add custom reward functions with latency & throughput metrics handling by @TaekyungHeo in #674
- Auto install missing components for workloads in run mode by @amaslenn in #676
- Fix step idx in single-sbatch trajectory by @amaslenn in #678
- Get rid of strict validation for configs by @amaslenn in #680
- Update Nemo image reference to nvcr.io#nvidia/nemo:25.07 in all toml files by @TaekyungHeo in #681
- Update NIXL perftest workload by @amaslenn in #679
- Increase time limit to 60m in conf/common/test_scenario/dse_nemo_run_llama3_8b.toml by @TaekyungHeo in #682
- Add support for NIXL kvbench by @amaslenn in #683
- Add public documentation via GitHub pages by @amaslenn in #684
- Update NIXL workloads by @amaslenn in #685
- Add documentation for workloads by @amaslenn in #686
- Explicitly forbid dependencies for scenarios with DSE jobs by @amaslenn in #687
- Remove NemoLauncher-based configs by @amaslenn in #689
- Fix NIXL bench output parsing in case of noisy output by @amaslenn in #690
- Add more documentation by @amaslenn in #691
- Re-work CLI implementation to use Click by @amaslenn in #677
- Allow sweeps without reporter by @amaslenn in #696
- Fix CLI arg for kvbench by @amaslenn in #695
- Add copyright headers for files in doc/ by @amaslenn in #694
- Make --tests-dir optional by @amaslenn in #693
- Small formatting improvements by @amaslenn in #697
- Remove useless options for install/uninstall commands by @amaslenn in #699
- Removing VBOOST by @srivatsankrishnan in #698
- Fix install/uninstall commands examples in doc by @amaslenn in #700
- Add uv.lock for persistent env by @amaslenn in #701
- Upgrade lock file version to make self version dynamic by @amaslenn in #702
- Allow agents to be registered via entry points by @amaslenn in #703
- Add Qwen Recipe (VER) by @srivatsankrishnan in #704
- Add debug logging for system config parsing errors by @TaekyungHeo in #705
- Enable AI Dynamo w/ K8S SPCx (Alpha) - Rel 2B by @TaekyungHeo in #667
- Add llama4 Recipe by @aahouzi in #708
- Update NemoRun configs by @amaslenn in #706
- Updated Python executable path in NIXL KVBench workloads by @Bohatchuk in #709
- Update docs by @amaslenn in #710
- ucc_perftest_add_gen_and_a2av by @yaeliyac in #692
New Contributors
- @malay-nagda made their first contribution in #633
- @Bohatchuk made their first contribution in #709
- @yaeliyac made their first contribution in #692
Full Changelog: v1.3.0...v1.4.0