Attempt to resolve the conflicts for Hazemawadalla modular refactor #242
Closed
FileSystemGuy wants to merge 16 commits into main from
Conversation
- Add ConfigLoader class with YAML config file support and schema validation
- Add cfg() helper function for config-driven parameter access
- Add validate_args() with safety limits for protected system paths
- Rename all nvme_* metrics to storage_* for MLPerf terminology compliance
- Add extended QoS percentiles: P99.9 and P99.99 latency tracking
- Add per-tier bandwidth metrics (read/write GB/s per tier)
- Add per-tier KV bytes tracking for detailed storage analysis
- Fix GPU metadata desync bug via on_eviction_callback pattern
- Change eviction from single-shot to iterative loop until space freed
- Replace print statements with Python logging module
- Add waterfall LRU eviction with configurable high/low watermarks
- Add storage_health section with PASS/FAIL criteria
- Add storage_throughput_tokens_per_sec as primary MLPerf metric
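The waterfall eviction and the on_eviction_callback fix described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the PR's actual API: class and parameter names are invented, and the real cache tracks tiers and file-backed entries.

```python
from collections import OrderedDict

class WaterfallLRUCache:
    """Toy sketch of waterfall LRU eviction: once usage crosses the high
    watermark, evict least-recently-used entries iteratively until usage
    falls below the low watermark (not a single-shot eviction)."""

    def __init__(self, capacity_bytes, high_watermark, low_watermark,
                 on_eviction_callback=None):
        self.high = high_watermark * capacity_bytes
        self.low = low_watermark * capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # key -> size_bytes, oldest first
        self.on_eviction_callback = on_eviction_callback

    def put(self, key, size_bytes):
        if key in self.entries:
            self.used -= self.entries.pop(key)
        self.entries[key] = size_bytes
        self.used += size_bytes
        if self.used > self.high:
            self._evict_to_low_watermark()

    def _evict_to_low_watermark(self):
        # Iterative loop: keep evicting LRU entries until usage drops
        # below the low watermark or the cache is empty.
        while self.used > self.low and self.entries:
            key, size = self.entries.popitem(last=False)
            self.used -= size
            if self.on_eviction_callback:
                # Lets the caller (e.g. GPU-tier bookkeeping) stay in
                # sync with exactly what was evicted.
                self.on_eviction_callback(key)

evicted = []
cache = WaterfallLRUCache(1000, 0.9, 0.5, on_eviction_callback=evicted.append)
for i in range(10):
    cache.put(f"kv-{i}", 100)  # 10th put crosses the 900-byte high watermark
print(evicted)     # oldest five entries evicted until usage <= 500
print(cache.used)  # 500
```

The callback is what closes the metadata-desync gap: the tier owner learns about every eviction at the moment it happens, rather than inferring it later.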
- Add -c DIR option for custom config directory
- Generate and pass config.yaml to Python script via --config flag
- Add --xlsx-output support for Excel export
- Update jq queries for new storage_* metric names
- Add mlperf_submission workload with required trial parameters
- Enhance system detection for thread counts and memory limits
- Update metric parsing for storage_throughput primary metric
- Add 170+ tests covering all new functionality
- Add ConfigLoader tests: schema validation, defaults, file loading
- Add cfg() helper tests for config-driven parameters
- Add validate_args() tests for path safety and input validation
- Add extended QoS tests for P99.9 and P99.99 percentiles
- Add GPU eviction callback tests for metadata sync
- Add per-tier bandwidth and KV bytes metric tests
- Add storage_* metric naming tests for MLPerf compliance
- Add waterfall eviction tests with high/low watermarks
- Add storage_health PASS/FAIL criteria tests
- Add Configuration section with YAML parameter reference
- Add MLPerf Submission Guidelines with validated commands
- Add Excel metrics reference table with all output columns
- Add installation instructions including pyyaml dependency
- Add CLI arguments vs config file precedence documentation
- Add workload definitions and tier configuration examples
- Add troubleshooting section for common issues
- Add kv-cache-test-report.html with full test execution results
- All 170+ tests passing for v3.0 features
- Create unit_test_results directory for test artifacts
- Add P99.9 and P99.99 latency columns
- Add per-tier KV bytes columns (GPU, CPU, Storage)
- Add per-tier bandwidth columns (read/write GB/s)
- Add storage tier device vs host latency breakdown
- Rename nvme_entries to storage_entries for MLPerf compliance
- Add storage_throughput_tokens_per_sec as primary metric
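The extended tail percentiles above (P99.9, P99.99) can be computed with a plain nearest-rank method. This standalone sketch is not the benchmark's actual code; it just shows how fractional percentiles behave on a synthetic latency ramp:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile; handles fractional pct like 99.9 and 99.99."""
    s = sorted(samples)
    rank = min(len(s), max(1, math.ceil(pct / 100 * len(s))))  # 1-based rank
    return s[rank - 1]

# Hypothetical latency samples: 0.1 ms .. 10000.0 ms in 0.1 ms steps.
latencies_ms = [i / 10 for i in range(1, 100_001)]
for pct in (99, 99.9, 99.99):
    print(f"P{pct} = {percentile(latencies_ms, pct)} ms")
```

The point of tracking P99.9/P99.99 for storage QoS is visible here: each extra digit of the percentile isolates a ten-times-smaller slice of the worst-case tail.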
- Add pyyaml>=6.0 for YAML configuration file parsing
- Required for ConfigLoader and --config CLI argument
- Add user_templates section with conversation patterns
- Add qos_profiles with latency thresholds per tier
- Add eviction settings with waterfall LRU parameters
- Add storage_health criteria for PASS/FAIL determination
- Add cache_sizing defaults for GPU/CPU/Storage tiers
- Provides validated defaults for all tunable parameters
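A hypothetical fragment showing how such a config.yaml might be laid out. The section names follow the list above, but every key and value below is illustrative, not the shipped defaults:

```yaml
# Illustrative only -- keys and values are assumptions, not the PR's defaults.
qos_profiles:
  gpu:
    p99_latency_ms: 1.0
  storage:
    p99_latency_ms: 50.0
eviction:
  policy: waterfall_lru
  high_watermark: 0.90   # start evicting above 90% usage
  low_watermark: 0.70    # evict down to 70% usage
storage_health:
  min_throughput_tokens_per_sec: 1000
cache_sizing:
  gpu_gb: 16
  cpu_gb: 64
  storage_gb: 512
```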
Split the single ~3500-line kv-cache.py into a structured Python package (kv_cache/) with 12 modules. Added MLA attention support, NVMe capacity management, SSD preconditioning, disaggregated inference modes, and streaming BurstGPT trace replay. Updated proposal and README with corrected DeepSeek-V3 MLA calculations, capacity planning scope notes, and repo cleanup.

Structural changes:
- kv_cache/ package: __init__, _compat, config, models, backends, cache, conversation, prefix_cache, rag, monitoring, workload, benchmark, cli
- kv-cache.py is now a thin shim importing from kv_cache
- Added pyproject.toml for pip-installable package

New features:
- MLA attention support (DeepSeek-V3: 70,272 bytes/token vs 1.7M MHA)
- 4 new models: deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b
- NVMe capacity tracking with LRU eviction (prevents disk exhaustion)
- SSD preconditioning (--precondition)
- Disaggregated inference (--prefill-only, --decode-only)
- Streaming BurstGPT trace replay (--trace-speedup, --replay-cycles)
- Config-driven model definitions via config.yaml
- RAG retrieval distribution (zipfian/uniform), document eviction

Documentation:
- Corrected DeepSeek-V3 from MHA formula to MLA in all capacity tables
- Scoped capacity planning claims to storage throughput (no tier promotion)
- Restructured GDS section around production GPU-origin KV cache
- Added NVMe terminology note (benchmark works with any block device)
- Fixed stale class names and default ranges in README

Repo cleanup:
- Moved kv-cache-wrapper.sh to utils/
- Added utils/run_benchmarks_256gb.sh
- Removed kv-cache_sharegpt_replay.py (merged into package)
- Removed discovery_results_and_analysis/, lmcache_results_*, proposal PDF
README: Corrected DeepSeek-V3 KV cache from MHA formula (1,748,992 bytes/token, 1.7 MB) to MLA formula (70,272 bytes/token, 69 KB). Updated all derived tables: per-user RAM 13.4 GB -> 0.54 GB, removed from 128 GB exclusion list, fixed model reference table. Moved validate.sh to utils/ alongside other shell scripts.
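The corrected per-token figures can be reproduced from the published DeepSeek-V3 architecture (61 layers, hidden size 7168, MLA KV latent rank 512, decoupled RoPE key dim 64, fp16). The MHA expression below is the naive 2 × layers × hidden × bytes formula the README previously used; treat this as a worked check, not the benchmark's code:

```python
# Reproducing the README's per-token KV cache numbers for DeepSeek-V3.
LAYERS, HIDDEN = 61, 7168          # transformer layers, hidden size
KV_LORA_RANK, ROPE_DIM = 512, 64   # MLA compressed latent + decoupled RoPE key
BYTES = 2                          # fp16

# Naive MHA-style formula: K and V each cache a hidden-sized vector per layer.
mha_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES
# MLA caches only the compressed KV latent plus the shared RoPE key per layer.
mla_bytes_per_token = LAYERS * (KV_LORA_RANK + ROPE_DIM) * BYTES

print(mha_bytes_per_token)  # 1,748,992 (~1.7 MB)
print(mla_bytes_per_token)  # 70,272 (~69 KB)

# At an 8K-token context these scale to roughly 13.3 GiB vs 0.54 GiB per user,
# consistent with the 13.4 GB -> 0.54 GB table correction above.
print(round(mla_bytes_per_token * 8192 / 2**30, 2))
```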
The code reads decode_batch_size from config.yaml via cfg('decode', 'batch_size', default=32). Updated the proposal code snippet to match the actual implementation.
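A minimal sketch of the lookup semantics implied by that call, assuming cfg() walks nested keys of the parsed YAML and falls back to the supplied default. The real ConfigLoader additionally does file loading and schema validation; the config dict here is a stand-in:

```python
# Stand-in for the dict produced by parsing config.yaml.
_CONFIG = {"decode": {"batch_size": 64}}

def cfg(*keys, default=None):
    """Walk nested config keys; return `default` if any level is missing."""
    node = _CONFIG
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

print(cfg('decode', 'batch_size', default=32))   # 64: value found in config
print(cfg('decode', 'missing_key', default=32))  # 32: falls back to default
```

This shape lets every tunable parameter carry an inline default while config.yaml overrides win when present, which matches the CLI-vs-config precedence documented in the README.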
The "Two Separate Eviction Mechanisms" section now explicitly distinguishes metadata-only eviction (ConversationManager removes dict entries; .npy files remain on disk) from physical file deletion (MultiTierCache calls path.unlink(), permanently removing .npy files from the filesystem). Added actual code paths from backends.py and cache.py to replace the pseudocode.
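The distinction can be shown in a few lines. Names here are illustrative stand-ins for the actual ConversationManager / MultiTierCache code paths in backends.py and cache.py:

```python
from pathlib import Path
import tempfile

tmp = Path(tempfile.mkdtemp())
npy_path = tmp / "conv-123.npy"
npy_path.write_bytes(b"fake kv tensor")

# Metadata-only eviction: drop the dict entry, leave the .npy on disk.
conversation_index = {"conv-123": npy_path}
conversation_index.pop("conv-123")
survives_metadata_evict = npy_path.exists()
print(survives_metadata_evict)  # True: file still occupies storage

# Physical deletion: unlink permanently removes the file from the filesystem.
npy_path.unlink()
survives_unlink = npy_path.exists()
print(survives_unlink)  # False
```

The practical consequence is the one the section calls out: metadata-only eviction frees no disk space, so only the unlink path protects against storage-tier exhaustion.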
Removed optional dependencies and ShareGPT dataset loader from kv-cache.py.
Updated script for KV Cache Storage Benchmark to reflect new author attribution and modified test parameters for MLPerf submissions.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Attempt to resolve the conflicting files.