Empty file modified .github/CODEOWNERS
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion .github/workflows/cla.yml
100644 → 100755
@@ -22,7 +22,7 @@ jobs:
path-to-signatures: 'cla-bot/v1/cla.json'
# branch should not be protected
branch: 'main'
- allowlist: user1,bot*
+ allowlist: user1,claude[bot],claude,Wes Vaske,bot*
remote-organization-name: mlcommons
remote-repository-name: systems

74 changes: 74 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,74 @@
name: Tests

on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ['3.10', '3.11', '3.12']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libopenmpi-dev openmpi-common
      - name: Install package and test dependencies
        run: |
          python -m pip install --upgrade pip
          # Install the package in editable mode without DLIO
          pip install -e ".[test]"
      - name: Run unit tests
        run: |
          pytest tests/unit -v --tb=short
      - name: Run unit tests with coverage
        run: |
          pytest tests/unit -v --cov=mlpstorage --cov-report=xml --cov-report=term-missing
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          files: ./coverage.xml
          fail_ci_if_error: false
          verbose: true
        env:
          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install lint dependencies
        run: |
          python -m pip install --upgrade pip
          pip install ruff
      - name: Run ruff check
        run: |
          ruff check mlpstorage/ --output-format=github || true
      - name: Run ruff format check
        run: |
          ruff format --check mlpstorage/ || true
31 changes: 31 additions & 0 deletions .gitignore
@@ -0,0 +1,31 @@
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so

# Distribution / packaging
dist/
build/
*.egg-info/

# Virtual environments
venv/
.venv/
env/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Test artifacts
.pytest_cache/
.coverage
htmlcov/
*.html

# OS files
.DS_Store
Thumbs.db
100 changes: 100 additions & 0 deletions .planning/PROJECT.md
@@ -0,0 +1,100 @@
# MLPerf Storage Benchmark Suite v3.0

## What This Is

A benchmark orchestration framework for the MLCommons MLPerf Storage working group. The suite runs storage benchmarks aligned with MLPerf rules and reports results with verification of rules compliance.

## Core Value

**The ONE thing that must work:** Orchestrate multiple benchmark types (training, checkpointing, kv-cache, vectordb) across distributed systems and produce verified, rules-compliant results.

## Context

### Current State
- v2.0 release with Claude Code enhancements
- Training and checkpointing benchmarks use DLIO as underlying engine
- KV cache benchmark exists in separate directory (`kv_cache_benchmark/`)
- VectorDB benchmark code exists in external branch
- MPI-based execution and host collection for DLIO benchmarks
- Existing error handling and validation pipeline

### Target State (v3.0)
- Fully integrated KV cache and VectorDB benchmarks as Benchmark subclasses
- New training models (dlrm, retinanet, flux)
- Package version management with lockfiles
- SSH-based host collection for non-MPI benchmarks
- Time-series /proc/ data collection during benchmark execution
- Improved error messaging and user guidance

### Timeline
- **Feature freeze:** 6 weeks
- **Bugfix period:** 6 weeks
- **Code freeze:** 12 weeks total

## Requirements

### Validated (Existing)

- ✓ Training benchmark orchestration via DLIO — existing
- ✓ Checkpointing benchmark orchestration via DLIO — existing
- ✓ MPI-based distributed execution — existing
- ✓ Rules validation pipeline — existing
- ✓ Report generation — existing
- ✓ CLI with nested subcommands — existing
- ✓ Benchmark registry pattern — existing

### Active

- [ ] Package version lockfile management
- [ ] Remove GPU package dependencies (not used)
- [ ] KV cache Benchmark class (wraps kv-cache.py)
- [ ] KV cache MPI execution across hosts
- [ ] VectorDB Benchmark class (wraps load_vdb.py, compact_and_watch.py, simple_bench.py)
- [ ] SSH-based host collection for non-MPI benchmarks
- [ ] New training models: dlrm, retinanet, flux
- [ ] Improved error messaging for missing commands/packages
- [ ] Clear user guidance for resolving dependency issues
- [ ] Time-series /proc/ collection (diskstats, vmstat, cpuinfo, etc.)
- [ ] Parallel collection process (10 sec intervals) without impacting benchmark

### Out of Scope

- GPU support — deliberately not supporting GPU execution
- Rewriting KV/VDB as native benchmarks — v3.0 wraps existing scripts
- Real-time monitoring UI — collection only, no visualization
- Cloud provider integrations — on-premise/bare-metal focus

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Lockfile for package versions | Reproducibility across systems, MPI version issues | Pending |
| Benchmark subclasses for KV/VDB | Minimal integration, reuse CLI and reporting infrastructure | Pending |
| SSH for non-MPI host collection | KV cache and VectorDB don't require MPI execution | Pending |
| Parallel process for time-series | Must not impact benchmark performance | Pending |
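
The "parallel process for time-series" decision could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (`read_snapshot`, `collect_loop`, `run_with_collection`) are hypothetical and not taken from the codebase, the `fork` start method is assumed because the target is Linux, and samples are kept in memory rather than written to disk.

```python
import multiprocessing as mp
import queue
import time


def read_snapshot(path):
    """Return the raw text of one /proc file, or None if it cannot be read."""
    try:
        with open(path, "r") as handle:
            return handle.read()
    except OSError:
        return None


def collect_loop(paths, interval_s, stop_event, out_queue):
    """Sample every path, then sleep until the interval elapses or a stop is requested."""
    while True:
        sample = {"timestamp": time.time()}
        for path in paths:
            sample[path] = read_snapshot(path)
        out_queue.put(sample)
        # Event.wait returns True as soon as the parent sets the event.
        if stop_event.wait(interval_s):
            break
    out_queue.put(None)  # sentinel: no more samples will follow


def run_with_collection(benchmark_fn, paths, interval_s=10.0):
    """Run benchmark_fn while a separate process samples the given paths."""
    ctx = mp.get_context("fork")  # assumption: Linux-only target
    stop_event = ctx.Event()
    out_queue = ctx.Queue()
    worker = ctx.Process(
        target=collect_loop,
        args=(list(paths), interval_s, stop_event, out_queue),
        daemon=True,
    )
    worker.start()
    try:
        result = benchmark_fn()
    finally:
        stop_event.set()
    samples = []
    while True:
        try:
            item = out_queue.get(timeout=interval_s + 5.0)
        except queue.Empty:
            break  # collector died or stalled; keep what was received
        if item is None:
            break
        samples.append(item)
    worker.join(timeout=5.0)
    return result, samples
```

Calling `run_with_collection(run_benchmark, ["/proc/vmstat", "/proc/diskstats"])` with a hypothetical `run_benchmark` callable would return the benchmark result together with one sample per interval. Keeping the sampler in its own process means its file reads do not compete with the benchmark for the interpreter lock, although whether the impact is measurable still has to be verified on real hardware.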

## Constraints

- **No GPU dependencies** — storage benchmark, not compute
- **MPI compatibility** — must work with various MPI implementations
- **Cross-platform** — Linux primarily, various distributions
- **Minimal dependencies** — reduce version conflict surface area

## External Code References

| Component | Location | Notes |
|-----------|----------|-------|
| KV cache benchmark | `kv_cache_benchmark/` (local) | Also: `mlcommons/storage/TF_KVCache` branch |
| VectorDB benchmark | `mlcommons/storage/TF_VDBBench` branch | Scripts: load_vdb.py, compact_and_watch.py, simple_bench.py |
| DLIO benchmark | External package | Upstream dependency for training/checkpointing |

## Success Metrics

- All 4 benchmark types (training, checkpointing, kv-cache, vectordb) runnable from unified CLI
- Package lockfile prevents version conflicts in CI
- Error messages guide users to resolution for common issues
- Host data collected for all benchmark types (MPI or SSH)
- Time-series collection runs without measurable benchmark impact

---
*Last updated: 2026-01-23 after initialization*
92 changes: 92 additions & 0 deletions .planning/REQUIREMENTS.md
@@ -0,0 +1,92 @@
# MLPerf Storage v3.0 Requirements

## v1 Requirements

### Package Management

- [x] **PKG-01**: Lockfile for Python dependencies with pinned versions
- [x] **PKG-02**: Remove GPU package dependencies from default install
- [x] **PKG-03**: Validate package versions match lockfile before benchmark execution
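
As a rough illustration of PKG-03, the check below compares installed package versions against pinned entries. It is a sketch under assumptions: the lockfile is taken to be a pip-style list of `name==version` lines (the actual lockfile format is not stated here), and `parse_pins` and `find_mismatches` are hypothetical names.

```python
import re
from importlib import metadata


def parse_pins(lock_text):
    """Extract {normalized_name: version} from 'name==version' lines."""
    pins = {}
    for raw in lock_text.splitlines():
        line = raw.split("#", 1)[0]                # drop trailing comments
        requirement = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" not in requirement:
            continue                               # skip blanks and unpinned entries
        name, version = requirement.split("==", 1)
        tokens = version.split()
        if not tokens:
            continue
        # Normalize names the way PEP 503 does: lowercase, runs of -_. become '-'
        key = re.sub(r"[-_.]+", "-", name.strip()).lower()
        pins[key] = tokens[0]
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return (name, wanted, found) for each pin that is missing or differs."""
    mismatches = []
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches.append((name, wanted, found))
    return mismatches
```

A caller would read the lockfile, pass its text to `parse_pins`, and refuse to start the benchmark when `find_mismatches` returns a non-empty list. The `installed_version` parameter exists only so the lookup can be replaced in tests.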

### Benchmark Integration

- [x] **BENCH-01**: KVCacheBenchmark class extending Benchmark base (wraps kv-cache.py)
- [x] **BENCH-02**: KV cache MPI execution across multiple hosts
- [x] **BENCH-03**: VectorDBBenchmark class extending Benchmark base (wraps VDB scripts)
- [x] **BENCH-04**: VectorDB CLI commands (run, datagen operations)
- [x] **BENCH-05**: Integration with existing validation/reporting pipeline
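
The wrapper approach in BENCH-01 and BENCH-03 can be pictured with the sketch below. Everything in it is an assumption made for illustration: the real `Benchmark` base class and registry are not shown in this diff, the script path simply joins the `kv_cache_benchmark/` directory with `kv-cache.py`, and the `--duration` flag is a made-up example rather than a documented option of that script.

```python
import subprocess
import sys
from abc import ABC, abstractmethod

# Hypothetical registry mapping a CLI name to a benchmark class.
BENCHMARK_REGISTRY = {}


def register(name):
    """Class decorator that records a benchmark under the given name."""
    def decorator(cls):
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return decorator


class Benchmark(ABC):
    """Stand-in for the real base class: builds a command and runs it."""

    def __init__(self, options):
        self.options = dict(options)

    @abstractmethod
    def build_command(self):
        """Return the argv list for the wrapped tool."""

    def run(self):
        # Return the exit code so the caller decides how to report failure.
        return subprocess.run(self.build_command(), check=False).returncode


@register("kvcache")
class KVCacheBenchmark(Benchmark):
    SCRIPT = "kv_cache_benchmark/kv-cache.py"  # assumed location

    def build_command(self):
        command = [sys.executable, self.SCRIPT]
        for key, value in sorted(self.options.items()):
            command += [f"--{key}", str(value)]  # flag names are hypothetical
        return command
```

With this shape, the CLI only needs to look up `BENCHMARK_REGISTRY["kvcache"]`, construct it with parsed options, and call `run()`, which is what lets a wrapped script reuse the existing validation and reporting path.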

### Training Updates

- [x] **TRAIN-01**: Add dlrm model configuration
- [x] **TRAIN-02**: Add retinanet model configuration
- [x] **TRAIN-03**: Add flux model configuration
- [x] **TRAIN-04**: Update DLIO to support parquet for data loaders, readers, data generation
- [x] **TRAIN-05**: Production-ready parquet reader with memory-efficient I/O
- [x] **TRAIN-06**: Update pyproject.toml to reference DLIO fork

### Host Collection

- [x] **HOST-01**: SSH-based host collection for non-MPI benchmarks
- [x] **HOST-02**: Collect /proc/ data (diskstats, vmstat, cpuinfo, filesystems, cgroups)
- [x] **HOST-03**: Collection at benchmark start and end
- [x] **HOST-04**: Time-series collection (10 sec intervals) during execution
- [x] **HOST-05**: Parallel collection process without benchmark performance impact
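
For HOST-02 and HOST-03, a start and end snapshot of `/proc/vmstat` can be turned into per-run deltas. The sketch below assumes the usual `name value` layout of that file; `parse_vmstat` and `counter_delta` are hypothetical helper names, and the delta is only meaningful for cumulative counters such as `pgpgin`, not for gauge-style entries.

```python
def parse_vmstat(text):
    """Parse 'name value' lines into a dict of ints, skipping malformed lines."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(start, end):
    """Return end minus start for every counter present in both snapshots."""
    return {name: end[name] - start[name] for name in end if name in start}


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live file; only works on Linux."""
    with open(path, "r") as handle:
        return parse_vmstat(handle.read())
```

A run would call `read_vmstat()` before and after the benchmark and store `counter_delta(before, after)` alongside the results.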

### Error Handling & UX

- [x] **UX-01**: Detect missing commands/packages with actionable error messages
- [x] **UX-02**: Suggest installation steps for missing dependencies
- [x] **UX-03**: Validate environment before benchmark execution (fail-fast)
- [x] **UX-04**: Clear progress indication during long operations
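
UX-01 through UX-03 describe a fail-fast environment check. One possible shape, using only the standard library, is sketched below. The hint strings, the `EnvironmentCheckError` class, and the function names are illustrative assumptions, not the project's actual messages or API.

```python
import shutil

# Hypothetical hints; real guidance would come from the project documentation.
INSTALL_HINTS = {
    "mpirun": "install an MPI implementation (for example Open MPI) and make sure mpirun is on PATH",
    "ssh": "install an OpenSSH client and make sure ssh is on PATH",
}


class EnvironmentCheckError(RuntimeError):
    """Raised before any benchmark work starts when a prerequisite is missing."""


def find_missing(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [command for command in commands if which(command) is None]


def format_report(missing, hints):
    """Build one actionable line per missing command."""
    lines = []
    for command in missing:
        hint = hints.get(command, "no hint available; check the project documentation")
        lines.append(f"Missing command '{command}': {hint}")
    return "\n".join(lines)


def check_environment(commands, hints=INSTALL_HINTS, which=shutil.which):
    """Fail fast with a readable message if any required command is absent."""
    missing = find_missing(commands, which)
    if missing:
        raise EnvironmentCheckError(format_report(missing, hints))
```

Calling `check_environment(["mpirun", "ssh"])` at the start of a run surfaces every missing command in one message instead of failing on the first one partway through execution.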

---

## v2 Requirements (Deferred)

- [ ] Deeper KV cache integration (native implementation vs wrapper)
- [ ] Deeper VectorDB integration (native implementation vs wrapper)
- [ ] Real-time monitoring dashboard for time-series data
- [ ] Cloud provider integrations (AWS, GCP, Azure)

---

## Out of Scope

- **GPU support** — Storage benchmark, deliberately not supporting GPU execution
- **Rewriting KV/VDB as native benchmarks** — v3.0 wraps existing scripts
- **Real-time visualization** — Collection only, no visualization in v3.0
- **Windows support** — Linux-only target

---

## Traceability

| Requirement | Phase | Status |
|-------------|-------|--------|
| PKG-01 | Phase 1 | Complete |
| PKG-02 | Phase 1 | Complete |
| PKG-03 | Phase 1 | Complete |
| UX-01 | Phase 2 | Complete |
| UX-02 | Phase 2 | Complete |
| UX-03 | Phase 2 | Complete |
| BENCH-01 | Phase 3 | Complete |
| BENCH-02 | Phase 3 | Complete |
| BENCH-03 | Phase 4 | Complete |
| BENCH-04 | Phase 4 | Complete |
| BENCH-05 | Phase 5 | Complete |
| HOST-01 | Phase 6 | Complete |
| HOST-02 | Phase 6 | Complete |
| HOST-03 | Phase 6 | Complete |
| HOST-04 | Phase 7 | Complete |
| HOST-05 | Phase 7 | Complete |
| TRAIN-01 | Phase 8 | Complete |
| TRAIN-02 | Phase 8 | Complete |
| TRAIN-03 | Phase 8 | Complete |
| TRAIN-04 | Phase 9 | Complete |
| UX-04 | Phase 10 | Complete |
| TRAIN-05 | Phase 11 | Complete |
| TRAIN-06 | Phase 11 | Complete |

---
*Last updated: 2026-01-25*