v1.4.0

@amaslenn released this 28 Oct 13:38
71dfd89

Highlights

1. GB300 Support for Common Configs

CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.

2. AI Dynamo with Kubernetes Support (Alpha)

  • Added Kubernetes SPCx support for the AI Dynamo workload
  • Enhanced container orchestration capabilities

3. New Model and Workload Support

  • Qwen Recipe Support: Added recipes for Qwen models
  • SGLang Backend: Added SGLang backend support for the DeepSeekR1 model
  • NIXL KVBench: New NIXL KVBench workload with full Slurm integration

4. Advanced AI Dynamo Capabilities

  • Multi-worker GPU Slicing: Multiple workers per node with GPU slicing and dynamic allocation
  • Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
  • Shell Script Entry Point: Entry point is now a shell script (replacing the Python implementation)
  • Environment Validation: The environment is validated during startup
  • Error Handling: Worker failures are detected and retried

5. Agent System Enhancements

  • Plugin System: Agents can be loaded from entry points for extensibility
  • Custom Reward Functions: Custom reward functions that handle latency and throughput metrics
  • Configurable Rewards: Agent reward functions are now configurable
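
Loading plugins from entry points usually relies on Python's standard `importlib.metadata` module. The sketch below illustrates the general technique only; the group name `example.agents` and the helper `load_agents` are hypothetical and are not taken from CloudAI's code.

```python
from importlib.metadata import entry_points


def load_agents(group="example.agents"):
    """Return {name: loaded object} for every entry point in `group`.

    `group` is a hypothetical entry-point group name used for illustration.
    """
    try:
        # Python 3.10+ accepts selection keywords directly.
        eps = entry_points(group=group)
    except TypeError:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        eps = entry_points().get(group, [])

    agents = {}
    for ep in eps:
        # ep.load() imports the module and returns the referenced object.
        agents[ep.name] = ep.load()
    return agents
```

With this approach, a third-party package would advertise its agent under the chosen group in its packaging metadata, and the host application would discover it at runtime without any code change.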

6. Enhanced Documentation

  • Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
  • Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
  • Interactive Features: Copy-to-clipboard support for code snippets

7. CLI Modernization

  • Click Framework: CLI fully re-implemented with the Click framework (replacing argparse)
  • Improved Usability: --tests-dir is now optional
  • Simplified Commands: Removed unnecessary install/uninstall command options

8. NIXL Workload Improvements

  • Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
  • Multi-backend Support: Support for non-UCX backends with multiple backend options
  • Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
  • Container Management: etcd now managed from NIXL container

9. Reporting & Metrics

  • NCCL Comparison Reports: NCCL comparison reports with latency metrics
  • Reusable Framework: Reusable comparison report framework for NIXL
  • Multi-section CSV: Multi-section CSV format handling for AI Dynamo
  • Configurable Reports: Configurable reports via scenario configuration
  • DSE Trajectory Support: DSE trajectory support for single-sbatch mode
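
As an illustration of what multi-section CSV handling can involve, here is a minimal sketch. It assumes sections are separated by blank lines and that each section starts with its own header row; the actual AI Dynamo output format is not described in these notes, and the function name and sample metric names are hypothetical.

```python
import csv
import io


def parse_sections(text):
    """Split CSV text into sections on blank lines.

    Returns a list of sections; each section is a list of dicts
    keyed by that section's own header row.
    """
    sections = []
    block = []
    # The trailing "" flushes the last block even without a final blank line.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
        elif block:
            reader = csv.DictReader(io.StringIO("\n".join(block)))
            sections.append(list(reader))
            block = []
    return sections


# Illustrative input only; not real benchmark output.
SAMPLE = """metric,avg,p99
ttft_ms,120.5,310.0
itl_ms,8.2,15.9

metric,value
output_tokens_per_s,2450.0
"""

sections = parse_sections(SAMPLE)
```

Parsing each section with its own header keeps columns from different sections from being mixed, which a single `csv.DictReader` over the whole file would not do.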

10. Development & Maintenance

  • Environment Management: Added uv.lock for persistent environment management
  • Code Cleanup: Removed deprecated NemoLauncher-based configurations
  • Refactoring: Refactored NeMo recipes for better maintainability

11. Reliability & Error Handling

  • Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
  • Graceful Error Handling: Graceful handling of missing tests with MissingTestError
  • Environment Preservation: Better environment variable preservation (order maintained from system schema)
  • Debug Logging: Enhanced debug logging for system config parsing errors
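
The output directory cases named above (permissions, read-only filesystem) can be told apart with standard library calls. The following is a minimal sketch of that kind of check, not CloudAI's implementation; the function name `prepare_output_dir` and the message wording are hypothetical.

```python
import errno
import os
from pathlib import Path


def prepare_output_dir(path):
    """Create `path` if needed; return None on success or an error message."""
    path = Path(path)
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError:
        return f"No permission to create output directory: {path}"
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return f"Output directory is on a read-only filesystem: {path}"
        # Any other failure, e.g. a regular file sitting in the way.
        return f"Cannot create output directory {path}: {exc.strerror}"
    if not os.access(path, os.W_OK):
        return f"Output directory is not writable: {path}"
    return None
```

Returning a message instead of letting the exception propagate lets the caller report one clear line and exit cleanly rather than print a traceback.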

Full Changelog: v1.3.0...v1.4.0