From 80fc279e27264a36ac1ea4b69f2ed34f3c7b80a7 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 4 Nov 2025 09:57:38 -0800 Subject: [PATCH 01/23] Clarify directory structure for submissions Expanded the directory structure requirements for CLOSED and OPEN submissions, detailing the necessary files and their organization within the submission structure. --- Rules.md | 901 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 901 insertions(+) create mode 100644 Rules.md diff --git a/Rules.md b/Rules.md new file mode 100644 index 00000000..993acb74 --- /dev/null +++ b/Rules.md @@ -0,0 +1,901 @@ +# MLPerf™ Storage V3.0 Benchmark Rules +—————————————————————————————————————————— + +- [MLPerf Storage Benchmark Submission Guidelines v2.0](#mlperf-storage-benchmark-submission-guidelines-v20) + - [1. Introduction](#1-introduction) + - [1.1 Timeline](#11-timeline) + - [2. Benchmark Overview](#2-benchmark-overview) + - [2.1 Training](#21-training) + - [2.2 Checkpointing](#22-checkpointing) + - [3 Definitions](#3-definitions) + - [4. Performance Metrics](#4-performance-metrics) + - [5. Benchmark Code](#5-benchmark-code) + - [6. General Rules](#6-general-rules) + - [6.1. Strive to be fair](#61-strive-to-be-fair) + - [6.2. System and framework must be available](#62-system-and-framework-must-be-available) + - [6.3 Non-determinism](#63-non-determinism) + - [6.4. Result rounding](#64-result-rounding) + - [6.5. Stable storage must be used](#65-stable-storage-must-be-used) + - [6.6. Caching](#66-caching) + - [6.7. Replicability is mandatory](#67-replicability-is-mandatory) + - [7. Dataset Generation](#7-dataset-generation) + - [8. Single-host Submissions](#8-single-host-submissions) + - [9. Distributed Training Submissions](#9-distributed-training-submissions) + - [10. CLOSED and OPEN Divisions](#10-closed-and-open-divisions) + - [10.1 CLOSED: virtually all changes are disallowed](#101-closed:-virtually-all-changes-are-disallowed) + - [10.2 OPEN: changes are allowed but must be disclosed](#102-open:-changes-are-allowed-but-must-be-disclosed) + - [11. Submission](#11-submission) + - [11.1 What to submit - CLOSED submissions](#111-what-to-submit---closed-submissions) + - [11.2 What to submit - OPEN submissions](#112-what-to-submit---open-submissions) + - [11.3 Directory Structure for CLOSED or OPEN Submissions](#113-directory-structure-for-closed-or-open-submissions) + - [11.4 System Description](#114-system-description) + - [11.4.1 System Description YAML](#1141-system-description-yaml) + - [11.4.2 System Description PDF](#1142-system-description-pdf) + - [12. Review](#12-review) + - [12.1 Visibility of results and code during review](#121-visibility-of-results-and-code-during-review) + - [12.2 Filing objections](#122-filing-objections) + - [12.3 Resolving objections](#123-resolving-objections) + - [12.4 Fixing objections](#124-fixing-objections) + - [12.5 Withdrawing results / changing division](#125-withdrawing-results-/-changing-division) + - [13. Roadmap for future MLPerf Storage releases](#13-roadmap-for-future-mlperf-storage-releases) + +## 1. Introduction + +MLPerf™ Storage is a benchmark suite to characterize the performance of storage systems that support machine learning workloads. The suite consists of 2 workload categories: + +1. Training +2. Checkpointing + +This benchmark attempts to balance two goals. First, we aim for **comparability** between benchmark submissions to enable decision making by the AI/ML Community. 
Second, we aim for **flexibility** to enable experimentation and to show off unique storage system features that will benefit the AI/ML Community. To that end we have defined two classes of submissions: CLOSED and OPEN.

Published results for the 3D-Unet, ResNet-50, and Cosmoflow Training workloads are comparable across v1.0 and v2.0 of the MLPerf Storage benchmark. A [full listing of comparability is available](https://github.com/mlcommons/policies/blob/master/MLPerf_Compatibility_Table.adoc).

The MLPerf name and logo are trademarks of the MLCommons® Association ("MLCommons"). In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. MLCommons reserves the right to solely determine if a use of its name or logos is acceptable.

### 1.1 Timeline

| Date | Description |
| ---- | ----------- |
| Jun 18, 2025 | Freeze rules & benchmark code. |
| Jun 24, 2025 | Open benchmark for submissions. |
| Jul 7, 2025 | **Submissions due.** |
| Jul 7, 2025 - Aug 4, 2025 | Review period. |
| Aug 4, 2025 | **Benchmark competition results are published.** |

## 2. Benchmark Overview

This version of the benchmark does not include offline or online data pre-processing. We are aware that data pre-processing is an important part of the ML data pipeline and we will include it in a future version of the benchmark.

Each benchmark setup must be executed a number of times (5 for training and 10 for checkpointing). All logs from every run must be submitted as part of a submission package. The final metrics are the average across the runs. Runs must be consecutive, with no failed runs between the submitted runs. Runs cannot be cherry-picked from a larger range of runs, except that the submitted runs may be a consecutive subset within a larger sequence of runs.

### 2.1 Training

MLPerf Storage emulates (or "simulates", the terms are used interchangeably in this document) accelerators for the training workloads with the tool DLIO, developed by Argonne National Laboratory. DLIO uses the standard AI frameworks (PyTorch, TensorFlow, NumPy, etc.) to load data from storage to memory at the same intensity as a given accelerator.

**This emulation means that submitters do not need to use hardware accelerators (e.g., GPUs, TPUs, and other ASICs) when running MLPerf Storage - Training.**

Instead, our benchmark tool replaces the training on the accelerator for a single batch of data with a ``sleep()`` call. The ``sleep()`` interval depends on the batch size and accelerator type and has been determined through measurement on a system running the real training workload. The rest of the data ingestion pipeline (data loading, caching, checkpointing) is unchanged and runs in the same way as when the actual training is performed.

There are two main advantages to accelerator emulation. First, MLPerf Storage allows testing different storage systems with different types of accelerators. To change the type of accelerator that the benchmark emulates (e.g., to switch to a system with NVIDIA H100 GPUs instead of A100 GPUs), it is enough to adjust the batch size and ``sleep()`` parameter. The second advantage is that MLPerf Storage can put a high load on the storage system simply by increasing the number of emulated accelerators. This allows for testing the behavior of the storage system in large-scale scenarios without purchasing/renting the AI compute infrastructure.
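
To make the emulation concrete, the toy sketch below shows the shape of an emulated training loop: read one batch of samples from storage, then sleep for the measured per-batch compute time. This is only an illustration of the idea, not the DLIO implementation, and the dataset path, batch size, and compute time are hypothetical values.

```bash
#!/usr/bin/env bash
# Toy sketch of sleep()-based accelerator emulation (illustration only, not DLIO).
# The values below are hypothetical; DLIO derives them from the workload configuration
# and from measurements of the real training job on the target accelerator.
DATASET=/mnt/dataset/unet3d    # hypothetical dataset location
BATCH_SIZE=7                   # hypothetical samples per batch
COMPUTE_TIME=0.323             # hypothetical seconds of accelerator compute per batch

mapfile -t samples < <(find "$DATASET" -type f | shuf)
for ((i = 0; i < ${#samples[@]}; i += BATCH_SIZE)); do
  cat "${samples[@]:i:BATCH_SIZE}" > /dev/null   # "load" one batch from storage into host memory
  sleep "$COMPUTE_TIME"                          # emulate the accelerator computing on that batch
done
```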
+ +The benchmark suite provides workload [configurations](https://github.com/mlcommons/storage/tree/main/storage-conf/workload) that simulate the I/O patterns of selected workloads listed in Table 1. The I/O patterns for each MLPerf Storage benchmark correspond to the I/O patterns of the MLPerf Training and MLPerf HPC benchmarks (i.e., the I/O generated by our tool for 3D U-Net closely follows the I/O generated by actually running the 3D U-Net training workload). The benchmark suite can also generate synthetic datasets which show the same I/O load as the actual datasets listed in Table 1. + +| Area | Problem | Model | Data Loader | Dataset seed | Minimum AU% | +| ---- | ------- | ----- | ----------- | ------------ | ----------- | +| Vision | Image segmentation (medical) | 3D U-Net | PyTorch | KiTS 19 (140MB/sample) | 90% | +| Vision | Image classification | ResNet-50 | TensorFlow | ImageNet (150KB/sample) | 90% | +| Scientific | Cosmology | parameter prediction | TensorFlow | CosmoFlow N-body simulation (2MB/sample) | 70% | + +Table 1: Benchmark description + +- Benchmark start point: The dataset is in **shared persistent storage**. +- Benchmark end point: The measurement ends after a predetermined number of epochs. *Note: data transfers from storage in this test terminate with the data in host DRAM; transfering data into the accelerator memory is not included in this benchmark.* +- Configuration files for the workloads and dataset content can be found [here](https://github.com/mlcommons/storage/tree/main/storage-conf/workload). + +### 2.2 Checkpointing +#### 2.2.1 models +Benchmark results may be submitted for the following four model configurations. The associated model architectures and parallelism settings are listed below. The number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models for CLOSED submission. + +For CLOSED submissions, participants are not permitted to change the total number of simulated accelerators. However, they may adjust the number of simulated accelerators per host, as long as each host uses more than 4 simulated accelerators. This allows the use of nodes with higher simulated accelerator density and fewer total nodes. Note: the aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. + +**Table 2 LLM models** + +| Model | 8B | 70B | 405B | 1T | +|------------------------|--------|--------|---------|--------| +| Hidden dimension | 4096 | 8192 | 16384 | 25872 | +| FFN size | 14336 | 28672 | 53248 | 98304 | +| num_attention_heads | 32 | 128 | 128 | 192 | +| num_kv_heads | 8 | 8 | 8 | 32 | +| Num layers | 32 | 80 | 126 | 128 | +| Parallelism (TPxPPxDP) | 1×1×8 | 8×1x8 | 8×32×2 | 8×64×2 | +| Total Processes | 8 | 64 | 512 | 1024 | +| ZeRO | 3 | 3 | 1 | 1 | +| Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB | +| Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB | + + +#### 2.2.2 Benchmark Execution +**Checkpoint Modes (global storage vs local storage)** + +There are two operational modes: + +* ``default``: Used for shared storage systems. In this mode, the benchmark runs on multiple hosts to write/read the entire checkpoint dataset. The total number of processes (emulated accelerators) must match the number listed in Table 2 (TP×PP×DP = Total Processes). + +* ``subset``: Intended for node local storage systems. 
In this mode, checkpointing is simulated on a single host by writing/reading only a fraction (``num_gpus/TP/PP/DP``) of the checkpoint data, where ``num_gpus`` is the number of simulated accelerators on the host. The only allowed value for the number of processes in a subset submission is 8 (the 8B model does not support subset mode as it is already set to 8 processes).

**Checkpoint write and (read) recovery**

For each submission, one must first perform the checkpoint writes, then clear the cache if required, and finally perform the checkpoint reads. The command-line flags required for each phase are shown below.
*Note: Clearing caches is done to ensure that no data for the read phase comes from the filesystem cache.*

For a submission, the sequence is the following:
1. Write 10x checkpoints
2. Clear filesystem caches if necessary
3. Read 10x checkpoints

The default options will run the read and write checkpoints in a single mlpstorage call. For example, the following command will execute a sequence of writing 10 checkpoints and reading those same 10 checkpoints.
```bash
mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test
```

If caches need to be cleared, use the following parameters for the WRITE and READ tests:

* WRITE: ``--num-checkpoints-read=0``
* READ: ``--num-checkpoints-write=0``

In the above example, the write tests would be executed first with this command, which will do the writes but no reads:
```bash
mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test --num-checkpoints-read=0
```

After the write tests complete, clear the caches on your hosts. A standard Linux system would use a command like this:
```bash
echo 3 > /proc/sys/vm/drop_caches
```
The end result of "clearing caches" is that 100% of the data for the read phase should come from the storage system under test and not from the client's filesystem cache.

Finally, with the same example, the read tests would be executed with the following command, which indicates that no writes occur during this phase:
```bash
mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test --num-checkpoints-write=0
```

Caches need to be cleared by the user outside of the mlpstorage tool.
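
For multi-client runs, the caches must be dropped on every ``host node`` between the write and read phases. A minimal sketch using ``ssh`` is shown below; the host names are placeholders, passwordless root access is assumed, and any equivalent mechanism (e.g., ``pdsh`` or an orchestration tool) is acceptable as long as every client's cache is cleared.

```bash
# Drop the page cache on every client host before starting the checkpoint read phase.
# Host names are placeholders; passwordless root ssh is assumed.
for host in client01 client02 client03 client04; do
  ssh root@"$host" 'sync; echo 3 > /proc/sys/vm/drop_caches'
done
```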

##### 2.2.2.1 Clearing Caches

The checkpoints that are written are quite large. **If 3x the checkpoint size per client node is less than the client node's memory capacity, then the filesystem cache needs to be cleared between the write and read phases.**

Examples:

| Model (Total Size) | Num Clients & Memory | Size for ranks | Size for 1st and Last Client | Need to Clear Caches? |
|---------------------|-------------------------------------------|----------------------------|----------------------------------------------------------|------------------------------------------------------------------|
| Llama3 405b (5.2TB) | 8x (64 Ranks / Node)<br>1024GB per Client | 256x 11.8GB
256x 8.85GB | First: 755GB (64x 11.8GB)
Last: 566.4GB (64x 8.85GB) | No (566.4GB x 3 = 1,699GB which is greater than the client memory) |
| Llama3 70b (912GB) | 8x (8 Ranks / Node)
1024GB per Client | 64x 11.23GB | First: 89.8GB (8x 11.23GB)
Last: Same as First (DP=1) | Yes (89.8 x 3 = 269.5GB which is less than the client memory) | + +In the first case, after 2x checkpoints data that has been written is being flushed from the filesystem cache. This means that after 10x checkpoints a standard Linux system will not have any data in the filesystem cache that would be read for a checkpoint recovery starting back at the first written checkpoint. + +In the second case, after 10x checkpoints, 898GB of data will have been written per client with each client having 1024GB of memory. Without clearing caches this data would be read from the filesystem cache + +**fsync** + +We enforce ``fsync`` to be applied during checkpoint writes to ensure data is flushed to persistent storage. ``fsync`` is enabled by default in all workload configuration files. + +**Example Execution Commands** + +* ``default`` mode (``WORLD_SIZE = TP*PP*DP`` as listed in Table 2): + ```bash + # Perform checkpoint writes (make sure the number of hosts is WORLD_SIZE/num_processes_per_host) + mlpstorage checkpointing run --model llama3-405b \ + --hosts ip1 ip2 .... \ + --num-processes 512 \ + --num-checkpoints-read 0 \ + --checkpoint-folder ./checkpoint_data1 \ + --results-dir ./mlpstorage_results \ + --client-host-memory-in-gb 64 + + # Clear the cache (This might require admin access to the system) + ... + + # perform checkpoint reads + mlpstorage checkpointing run --model llama3-405b \ + --hosts ip1 ip2 .... \ + --num-processes 512 \ + --num-checkpoints-write 0 \ + --checkpoint-folder ./checkpoint_data1 \ + --results-dir ./mlpstorage_results \ + --client-host-memory-in-gb 64 + ``` +* ``subset`` mode (on a single host with **8 simulated accelerators**) + ```bash + # Perform checkpoint writes (data parallelism must match Table 2) + mlpstorage checkpointing run --model llama3-405b \ + --hosts ip1 \ + --num-processes 8 \ + --num-checkpoints-read 0 \ + --checkpoint-folder ./checkpoint_data1 \ + --results-dir ./mlpstorage_results \ + --client-host-memory-in-gb 64 + # Clear the cache + ... + # Perform checkpoint read (data parallelism must match Table 2) + mlpstorage checkpointing run --model llama3-405b \ + --hosts ip1 \ + --num-processes 8 \ + --num-checkpoints-write 0 \ + --checkpoint-folder ./checkpoint_data1 \ + --results-dir ./mlpstorage_results \ + --client-host-memory-in-gb 64 + ``` + +#### 2.2.3 Metrics and Results Reporting +We report the checkpoint time per write / read and I/O throughput from each rank. For each run: + + * The metric for duration is the maximum time across all processes. + * The metric for throughput is the minimum across all processes. + +A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes as outlined in section 2.2.5 (clearing caches, storage remapping, etc) + +#### 2.2.4 Requirements for Simultaneously Readable and Writable + +Checkpoint recovery is intended to mimic an environment where a failure has occurred and the data needs to be read by different hosts than wrote the data. + +For storage systems where all hosts can read and write all data simultaneously, the process described above satisfies the requirements. + +For storage systems where 1 host has write access to a volume but all hosts have read access, the above process also satisfies the requirements so long as reads can be fulfilled immediately following a write. 

For storage systems where 1 host has write access to a volume and a "remapping" process is required for other hosts to read the same data, the time to remap must be measured and included in the submission.

When a checkpoint is taken/written, it must be written to stable storage, but that checkpoint does not need to be readable by other hosts yet. If the checkpoint is not readable by other hosts immediately after the write is complete, i.e., if it requires some additional processing or reconfiguration before it can be read by other hosts, then the duration between the checkpoint write completing and the earliest time that checkpoint could be read by a different ``host node`` must be reported in the SystemDescription.yaml file. That duration between write completion and availability for reading will be added to the time to read/recover from the benchmark.

**Any processes between the write and read phases of checkpointing that are required before data can be read by a different host than the one that wrote the data must be measured and included in the submission. The time for these processes will be added to the recovery time and throughput calculation for submitted scores.**

The system_configuration.yaml document must list whether the solution supports simultaneous reads and/or writes, as follows:
```yaml
System:
  shared_capabilities:
    multi_host_support: True # False is used for local storage
    simultaneous_write_support: False # Are simultaneous writes by multiple hosts supported in the submitted configuration?
    simultaneous_read_support: True # Are simultaneous reads by multiple hosts supported in the submitted configuration?
```

#### 2.2.5 OPEN vs CLOSED submissions
For CLOSED submissions, the total number of processes must be fixed according to Table 2.

For OPEN submissions, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution.

**Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions**

| Parameter | Meaning | Default value | Changeable in CLOSED | Changeable in OPEN |
|------------------------------------|----------------------------------------------|------------------------------------------------|----------------------|--------------------|
| --ppn **(USE HOST:SLOTS INSTEAD)** | Number of processes per node | N/A | YES (minimum of 4) | YES (minimum of 4) |
| --num-processes | Total number of processes | Node local: 8<br>Global: the value in Table 2 | NO | YES |
| --checkpoint-folder | The folder to save the checkpoint data | checkpoint/{workload} | YES | YES |
| --num-checkpoints-write | Number of write checkpoints | 10 or 0** | NO | NO |
| --num-checkpoints-read | Number of read checkpoints | 10 or 0** | NO | NO |

**The ``--ppn`` syntax above was incorrect for the MPI package the benchmark uses; please use the syntax ``hostname:slotcount`` for the hosts listed in the ``--hosts`` argument. The ``slotcount`` value has the same meaning as the ``ppn`` value: the number of processes per node to run.**

** By default, ``--num-checkpoints-read`` and ``--num-checkpoints-write`` are set to 10. To perform writes only, turn off reads by explicitly setting ``--num-checkpoints-read=0``; to perform reads only, turn off writes by explicitly setting ``--num-checkpoints-write=0``.

For an OPEN or CLOSED submission, the process must follow:
1. Write 10 checkpoints
2. Clear caches or remap volumes if required
3. Read 10 checkpoints

DLIO and mlpstorage both support options to run 10 checkpoints with a single call or run 10 checkpoints as separate invocations of the tools. So long as this process is followed, checkpoints can be executed as a 10-checkpoint batch or individually.
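
For example, a submitter who prefers separate invocations could drive the write phase with a loop like the sketch below, assuming the tool accepts per-call checkpoint counts of 1; the memory size and checkpoint folder are placeholders. After clearing caches (or remapping volumes, if required), an equivalent loop with ``--num-checkpoints-write 0 --num-checkpoints-read 1`` performs the read phase.

```bash
# Sketch: the write phase of a subset-mode submission run as 10 separate invocations
# (assumes per-call counts of 1 are accepted; memory size and folder are placeholders).
for i in $(seq 1 10); do
  mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-70b \
    --num-processes 8 --checkpoint-folder /mnt/checkpoint_test \
    --num-checkpoints-write 1 --num-checkpoints-read 0
done
```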

### 2.3 Vector Database

## 3 Definitions
The following definitions are used throughout this document:

- A **sample** is the unit of data on which training is run, e.g., an image or a sentence.
- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator.
- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.
- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational.
- A **division** is a set of rules for implementing benchmarks from a suite to produce a class of comparable results. MLPerf Storage allows CLOSED and OPEN divisions, detailed in Section 10.
- **DLIO ([code link](https://github.com/argonne-lcf/dlio_benchmark), [paper link](https://ieeexplore.ieee.org/document/9499416))** is a benchmarking tool for deep learning applications. DLIO is the core of the MLPerf Storage benchmark and, with the specified configurations, will emulate the I/O pattern for the workloads listed in Table 1. MLPerf Storage provides wrapper scripts to launch DLIO. There is no need to know the internals of DLIO to do a CLOSED submission, as the wrapper scripts provided by MLPerf Storage will suffice. However, for OPEN submissions changes to the DLIO code might be required (e.g., to add custom data loaders).
- **Dataset content** refers to the data and the total capacity of the data, not the format of how the data is stored. Specific information on dataset content can be found [here](https://github.com/mlcommons/storage/tree/main/storage-conf/workload).
- **Dataset format** refers to the format in which the training data is stored (e.g., npz, hdf5, csv, png, tfrecord, etc.), not the content or total capacity of the dataset.
+ + *NOTE: we plan to add support for Object storage in a future version of the benchmark, so OPEN submissions that include benchmark application changes and a description of how the original MLPerf Training benchmark dataset was mapped into Objects will be appreciated.* +- A **storage system** consists of a defined set of hardware and software resources that provide storage services to one or more ``host nodes``. Storage systems can be hardware based, software-defined, virtualized, hyperconverged, or cloud based, and must be capable of providing the minimum storage services required to run the benchmark. If the storage system requires a dedicated network, then the hardware required for that network must be included in the ``storage system``. If the storage system is hyperconverged, then it will probably share hardware (eg: CPU and/or networking) with the ``host nodes``. +- A **storage scaling unit** is defined as the minimum unit by which the performance and scale of a storage system can be increased. Examples of storage scaling units are “nodes”, “controllers”, “virtual machines” or “shelves”. Benchmark runs with different numbers of storage scaling units allow a reviewer to evaluate how well a given storage solution is able to scale as more scaling units are added. +- A **host node** is defined as the minimum unit by which the load upon the storage system under test can be increased. Every ``host node`` must run the same number of simulated accelerators. A ``host node`` can be instantiated by running the MLPerf Storage benchmark code within a Container or within a VM guest image or natively within an entire physical system. The number of Containers or VM guest images per physical system and the CPU resources per ``host node`` is up to the submitter. Note that the maximum DRAM available to any ``host node`` must be used when calculating the dataset size to be generated for the test. +- An **ML framework** is a specific version of a software library or set of related libraries for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow. +- A **benchmark** is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level. +- A **reference implementation** is a specific implementation of a benchmark provided by the MLPerf organization. +- A **benchmark implementation** is an implementation of a benchmark in a particular framework by a user under the rules of a specific division. +- A **run** is a complete execution of a benchmark implementation on a system. +- A **benchmark result** is the mean of 5 run results, executed consecutively. The dataset is generated only once for the 5 runs, prior to those runs. The 5 runs must be done on the same machine(s). +- **Nameplate rated power** is defined as the maximum power capacity that can be provided by a power supply unit (PSU), as declared to a certification authority. The nameplate rated power can typically be obtained from the PSU datasheet. +- A **Power Supply Unit (PSU)** is a component which converts an AC or DC voltage input to one or more DC voltage outputs for the purpose of powering a system or subsystem. Power supply units may be redundant and hot swappable. 
+- **SPEC PTDaemon® Interface (PTDaemon®)** is a software component created by the Standard Performance Evaluation Corporation (SPEC) designed to simplify the measurement of power consumption by abstracting the interface between benchmarking software and supported power analyzers. +- A **Supported power analyzer** is a test device supported by the PTDaemon® software that measures the instantaneous voltage and multiplies it by the instantaneous current, then accumulates these values over a specific time period to provide a cumulative measurement of consumed electrical power. For a listing of supported power analyzers, see https://www.spec.org/power/docs/SPECpower-Device_List.html +- A **System Under Test (SUT)** is the storage system being benchmarked. + + +- The storage system under test must be described via one of the following **storage system access types**. The overall solution might support more than one of the below types, but any given benchmark submission must be described by the access type that was actually used during that submission. An optional vendor-specified qualifier may be specified. This will be displayed in the results table after the storage system access type, for example, “NAS - RDMA”. + - **Direct-attached media** – any solution using local media on the ``host node``(s); eg: NVMe-attached storage with a local filesystem layered over it. This will be abbreviated “**Local**” in the results table. + - **Remotely-attached block device** – any solution using remote block storage; eg: a SAN using FibreChannel, iSCSI, NVMeoF, NVMeoF over RDMA, etc, with a local filesystem implementation layered over it. This will be abbreviated “**Remote Block**” in the results table. + - **Shared filesystem using a standards-defined access protocol** – any solution using a version of standard NFS or CIFS/SMB to access storage. This will be abbreviated “**NAS**” in the results table. + - **Shared filesystem using a proprietary access protocol** – any network-shared filesystem solution that requires a unique/proprietary protocol implementation to be installed on the ``host node``(s) to access storage; eg: an HPC parallel filesystem. This will be abbreviated “**Proprietary**” in the results table. + - **Object** – any solution accessed using an object protocol such as S3, RADOS, etc. This will be abbreviated “**Object**” in the results table. + - **Other** – any solution whose access is not sufficiently described by the above categories. This will be abbreviated “**Other**” in the results table. + +## 4. Performance Metrics + +The metrics reported by the benchmark are different for different types of workloads. They are broken out below. + +### 4.1. Training Workloads + +The benchmark performance metric for Training workloads (3D-Unet, ResNet-50, and Cosmflow) is **samples per second, subject to a minimum accelerator utilization (AU) defined for that workload**. Higher samples per second is better. + +To pass a benchmark run, the AU should be equal to or greater than the minimum value, and is computed as follows: +``` +AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100 +``` + +All the I/O operations from the first **step** are excluded from the AU calculation in order to avoid the disturbance in the averages caused by the startup costs of the data processing pipeline, allowing the AU to more-quickly converge on the steady-state performance of the pipeline. 
The I/O operations that are excluded from the AU calculation **are** included in the samples/second reported by the benchmark, however.

If all I/O operations are hidden by compute time, then the `total_compute_time` will equal the `total_benchmark_running_time` and the AU will be 100%.

The total compute time can be derived from the batch size, total dataset size, number of simulated accelerators, and sleep time:
```
total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs
```

*NOTE: The sleep time has been determined by running the actual MLPerf training workloads including the compute step on real hardware and is dependent on the accelerator type. In this version of the benchmark we include sleep times for **NVIDIA A100 and H100 GPUs**. We plan on expanding the measurements to different accelerator types in future releases.*

### 4.2. Checkpoint Workloads

The benchmark performance metrics for Checkpoint workloads (write/take, and read/recover) are **bandwidth while writing, and bandwidth while reading**, plus an additional data point: the amount of time required, if any, between the completion of writing a checkpoint and the first point at which that checkpoint can be read from a different ``host node``. That duration between write completion and availability for reading will be added to the time to read/recover from the benchmark.

**Submitters do not need to use hardware accelerators (e.g., GPUs, TPUs, and other ASICs) when running MLPerf Storage - Checkpointing.**

## 5. Benchmark Code

The MLPerf Storage working group provides a benchmark implementation which includes:
- Scripts to determine the minimum dataset size required for your system, for a given benchmark.
- Scripts for data generation.
- Benchmark tool, based on DLIO, with configuration files for the benchmarks.
- A script for running the benchmark on one host (additional setup is required if you are running a distributed training benchmark – see Section 9).
- A script for generating the results report (additional scripting and setup may be required if you are running a distributed training benchmark – see Section 9), and potentially additional supporting scripts.

More details on installation and running the benchmark can be found in the [GitHub repo](https://github.com/mlcommons/storage).

## 6. General Rules

The following apply to all results submitted for this benchmark.

### 6.1. Strive to be fair

Benchmarking should be conducted to measure the framework and storage system performance as fairly as possible. Ethics and reputation matter.

### 6.2. System and framework must be available

- **Available Systems**. To be called an ``available system``, all components of the system must be publicly available. If any components of the system are not available at the time of the benchmark results submission, those components must be included in an ``available system`` submission that is submitted in the next round of MLPerf Storage benchmark submissions. Otherwise, the results for that submission may be retracted from the MLCommons results dashboard.
- **RDI Systems**. If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication by MLCommons. This class of systems will be called RDI (research, development, internal).
+ +### 6.3 Non-determinism +The data generator in DLIO uses a fixed random seed that must not be changed, to ensure that all submissions are working with the same dataset. Random number generators may be seeded from the following sources: +- Clock +- System source of randomness, e.g. /dev/random or /dev/urandom +- Another random number generator initialized with an allowed seed +Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads. + +The storage system must not be informed of the random seed or the source of randomness. This is intended to disallow submissions where the storage systen can predict the access pattern of the data samples. + +### 6.4. Result rounding +Public results should be rounded normally, to two decimal places. + +### 6.5. Stable storage must be used + +For all workloads stable storage must be used, but there are some differences in the specifics. + +#### 6.5.1. Training Workloads + +The MLPerf Storage benchmark will create the dataset on the storage system, in the desired ``dataset format``, before the start of the benchmark run. The data must reside on stable storage before the actual benchmark testing can run. + +#### 6.5.2. Checkpoint Workloads + +See section "2.2.3 Metrics and Results Reporting" for more details. + +### 6.6. Caching +Caching of training data on ``host nodes`` running MLPerf Storage is controlled via a warm up run, dataset size to memory ratios, and changing random seeds between runs. +1. All runs must use a warm-up run before the 5 test runs. +2. For Training benchmarks, the dataset size must be at least 5x larger than the sum of memory across all of the MLPerf Storage nodes +3. The random seed must change for each run as controlled by the benchmark.py script + +### 6.7. Replicability is mandatory +Results that cannot be replicated are not valid results. Replicated results should be within 5% within 5 tries. + +### 6.8 Consecutive Runs Requirement +Each of the benchmarks described in this document have a requirement for multiple runs. This is to ensure consistency of operation of the system under test as well as ensure statistical significance of the measurements. + +Unless otherwise noted, the multiple runs for a workload need to be run consecutively. To ensure this requirement is met, the time between runs (from the stop time of one run and the start time to the next run) needs to be less than the time to execute a single run. This is to discourage cherry-picking of results which is expressly forbidden and against the spirit of the rules. + +## 7. Dataset Generation + +This section only describes the dataset generation methodology and requirements for Training workloads, the equivalent topic is covered in section 2.2, Checkpointing. + +MLPerf Storage uses DLIO to generate synthetic data. Instructions on how to generate the datasets for each benchmark are available [here](https://github.com/mlcommons/storage). The datasets are generated following the sample size distribution and structure of the dataset seeds (see Table 1) for each of the benchmarks. + +**Minimum dataset size**. The MLPerf Storage benchmark script **must be used** to run the benchmarks since it calculates the minimum dataset size for each benchmark. It does so using the provided number of simulated accelerators and the size of all of the ``host node``’s memory in GB. 
The minimum dataset size computation is as follows: + +- Calculate required minimum samples given number of steps per epoch *(NB: num_steps_per_epoch is a minimum of 500)*: +``` + min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes +``` +- Calculate required minimum samples given host memory to eliminate client-side caching effects; *(NB: HOST_MEMORY_MULTIPLIER = 5)*: +``` + min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length +``` +- Ensure we meet both constraints: +``` + min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes) +``` +- Calculate minimum files to generate +``` + min_total_files= min_samples / num_samples_per_file + min_files_size = min_samples * record_length / 1024 / 1024 / 1024 +``` + +A minimum of ``min_total_files`` files are required which will consume ``min_files_size`` GB of storage. + +**Running the benchmark on a subset of a larger dataset**. We support running the benchmark on a subset of the synthetically generated dataset. One can generate a large dataset and then run the benchmark on a subset of that dataset by setting ``num_files_train`` or ``num_files_eval`` smaller than the number of files available in the dataset folder. Note that if the dataset is stored in multiple subfolders, the subset actually used by this run will be evenly selected from all the subfolders. In this case, ``num_subfolders_train`` and ``num_subfolders_eval`` need to be equal to the actual number of subfolders inside the dataset folder in order to generate valid results. + +Please note that the log file(s) output during the generation step needs to be included in the benchmark results submission package. + +## 8. Single-host Submissions + +This section only applies to Training workloads, the equivalent topic is covered in section 2.2.2, "subset mode". + +Submitters can add load to the storage system in two orthogonal ways: (1) increase the number of simulated accelerators inside one ``host node`` (i.e., one machine), and/or (2) increase the number of ``host nodes`` connected to the storage system. + +For single-host submissions, increase the number of simulated accelerators by changing the ``--num-accelerators`` parameter to the ``benchmark.sh script``. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. + +For **single-host submissions**, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. + +## 9. Distributed Training Submissions + +This setup simulates distributed training of a single training task, spread across multiple ``host nodes``, on a shared dataset. The current version of the benchmark only supports data parallelism, not model parallelism. + +Submitters must respect the following for multi-host node submissions: +- All the data must be accessible to all the ``host nodes``. +- The number of simulated accelerators in each ``host node`` must be identical. + +While it is recommended that all ``host nodes`` be as close as possible to identical, that is not required by these Rules. 
The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all ``host nodes`` results in the overall performance of the cluster being determined by the slowest ``host node``. + +Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. + +For **distributed training submissions**, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators across all ``host nodes`` that can be run in the distributed training setup, without going below the 90% accelerator utilization threshold. Each ``host node`` must run the same number of simulated accelerators for the submission to be valid. + +## 10. CLOSED and OPEN Divisions + +### 10.1 CLOSED: virtually all changes are disallowed +CLOSED represents a level playing field where all results are **comparable** across submissions. CLOSED explicitly forfeits flexibility in order to enable easy comparability. + +In order to accomplish that, most of the optimizations and customizations to the AI/ML algorithms and framework that might typically be applied during benchmarking or even during production use must be disallowed. Optimizations and customizations to the storage system are allowed in CLOSED. + +For CLOSED submissions of this benchmark, the MLPerf Storage codebase takes the place of the AI/ML algorithms and framework, and therefore cannot be changed. The sole exception to this rule is if the submitter decides to apply the code change identified in PR#299 of the DLIO repo in github, the resulting codebase will be considered "unchanged" for the purposes of this rule. + +A small number of parameters can be configured in CLOSED submissions; listed in the tables below. + +**Table: Training Workload Tunable Parameters for CLOSED** + +| Parameter | Description | Default | +|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|----------| +| *Dataset parameters* | | | +| dataset.num_files_train | Number of files for the training set | -- | +| dataset.num_subfolders_train | Number of subfolders that the training set is stored | 0 | +| dataset.data_folder | The path where dataset is stored | -- | +| | | | +| *Reader parameters* | | | +| reader.read_threads | Number of threads to load the data | -- | +| reader.computation_threads | Number of threads to preprocess the data (only for resnet) | -- | +| reader.transfer_size | An int64 scalar representing the number of bytes in the read buffer. (only supported for Tensorflow models -- Resnet and Cosmoflow) | | +| reader.prefetch_size | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2. 
| | +| reader.odirect | Enable ODIRECT mode for Unet3D Training | False | +| | | | +| *Checkpoint parameters* | | | +| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- | +| | | | +| *Storage parameters* | | | +| storage.storage_root | The storage root directory | ./ | +| storage.storage_type | The storage type | local_fs | + +**Table: Checkpoint Workload Tunable Parameters for CLOSED** + +| Parameter | Description | Default | +|----------------------------------|-------------------------------------------------------------|-----------------------| +| checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ | +| checkpoint.num_checkpoints_write | The number of checkpoint writes to do in a single dlio call | 10 | +| checkpoint.num_checkpoints_read | The number of checkpoint reads to do in a single dlio call | 10 | + + +CLOSED division benchmarks must be referred to using the benchmark name plus the term CLOSED, e.g. “The system was able to support *N ACME X100* accelerators running a CLOSED division 3D U-Net workload at only 8% less than optimal performance.” + +### 10.2 OPEN: changes are allowed but must be disclosed + +OPEN allows more **flexibility** to tune and change both the benchmark and the storage system configuration to show off new approaches or new features that will benefit the AI/ML Community. OPEN explicitly forfeits comparability to allow showcasing innovation. + +The essence of OPEN division results is that for a given benchmark area, they are “best case” results if optimizations and customizations are allowed. The submitter has the opportunity to show the performance of the storage system if an arbitrary, but documented, set of changes are made to the data storage environment or algorithms. + +Changes to DLIO itself are allowed in OPEN division submissions. Any changes to DLIO code or command line options must be disclosed. + +While changes to DLIO are allowed, changing the workload itself is not. Ie: how the workload is processed can be changed, but those changes cannot fundamentally change the purpose and result of the training. For example, changing the workload imposed upon storage by a ResNet-50 training task into 3D-Unet training task is not allowed. + +In addition to what can be changed in the CLOSED submission, the following parameters can be changed in the benchmark.sh script: + +| Parameter | Description | Default | +|------------------------------|--------------------------------------------|---------------------------------------------------------------------| +| framework | The machine learning framework. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | +| | | | +| *Dataset parameters* | | | +| dataset.format | Format of the dataset. | 3D U-Net: .npz
ResNet-50: .tfrecord
Cosmoflow: .tfrecord |
| dataset.num_samples_per_file | Number of samples contained in each file of the dataset. | 3D U-Net: 1
ResNet-50: 1251
Cosmoflow: 1 | +| | | | +| *Reader parameters* | | | +| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | + + +#### 10.2.1 OPEN: num_samples_per_file +Changing this parameter is supported only with Tensorflow, using tfrecord datasets. Currently, the benchmark code only supports num_samples_per_file = 1 for Pytorch data loader. To support other values, the data loader needs to be adjusted. + +#### 10.2.2 OPEN: data_loader +OPEN submissions can have custom data loaders. If a new data loader is added, or an existing data loader is changed, the DLIO code will need to be modified. + +#### 10.2.3 Execution of OPEN submissions +OPEN division benchmarks must be referred to using the benchmark name plus the term OPEN, e.g. “The system was able to support N ACME X100 accelerators running an OPEN division 3D U-Net workload at only 8% less than optimal performance.” + +## 11. Submission + +A successful run result consists of a directory tree structure containing the set of files produced by the benchmark as the result, plus the manually created SystemDescription files (both PDF and yaml) that describe the storage solution under test and the environment the test was run in. + +The whole package must be uploaded to MLCommons via the UI provided to submitters. + +It will be possible to upload your results many times, not just once, but each upload completely replaces the prior upload before the submission deadline. + +At least your final upload, if not all of them, should include all of the individual result submissions that you want to be included. Eg: if you want to submit results for A100 and H100, that would be two submissions but only one upload operation. + +The following is not a requirement of these rules, but a possibly valuable risk management strategy. Consider uploading whatever results you have every day or two. Each new upload replaces the last one. If some disaster happened and you were not able to continue tuning your submission, you would at least have the prior submission package available as a backup. + +### 11.1 What to submit - CLOSED submissions + +A complete submission for one workload (3D-Unet, ResNet, or Cosmoflow) contains 3 folders: +1. **results** folder, containing, for each system: + - The entire output folder generated by running MLPerf Storage. + - Final submission JSON summary files ``results.json``. The JSON file must be generated using the ``mlpstorage reportgen`` script. The ``mlpstorage reportgen`` command must be run on the rank0 machine in order to collect the correct set of files for the submission. + - The logs from the benchmark runs, but only from the rank0 systems not all of the systems. + - The logs from the dataset generation step that built the files that this benchmark run read from. +2. **systems** folder, containing: + - ``.yaml`` + - ``.pdf`` + - For system naming examples look [here](https://github.com/mlcommons/storage_results_v0.5/tree/main/closed) in the ``results/closed`` subdirectory below each submitter's directory. +3. **code** folder, containing: + - Source code of the benchmark implementation. The submission source code and logs must be made available to other submitters for auditing purposes during the review period. + +### 11.2 What to submit - OPEN submissions + +- Everything that is required for a CLOSED submission, following the same structure. +- Additionally, the source code used for the OPEN Submission benchmark implementations must be available under a license that permits MLCommon to use the implementation for benchmarking. + +### 11.3 Directory Structure for CLOSED or OPEN Submissions + +11.3.1. 
The submission structure must start from a single directory whose name is the name of the submitter. + +11.3.2. Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. + +11.3.3. Within the Closed or Open directories there must be a single directory whose name if the name of the submitter (the same as the top-level directory). + +11.3.4. Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". + +11.3.5. The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. +If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. +Note that in both cases this must be the code that was actually run to generate those results. + +11.3.6. The "results" directory must include one or more directories that are the names of the "systems under test". Eg: a system name could be "Big_and_Fast_4000". +This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given +configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. +Note that only results from a given set of configuration parameters and hardware and software components of the system-under-test can be part of a given "system name", +any change to the configuration parameters or hardware or software will force the results that come from those runs to be held in a different "system name". + +11.3.7. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file. Each of those files must be named with the "system name". +Eg: for a system-under-test named "Big_and_Fast_4000", there must be a "Big_and_Fast_4000.yaml" and a "Big_and_Fast_4000.pdf" file. + + +``` +root_folder (or any name you prefer) +├── Closed +│ └── +│ ├── code +│ ├── results +│ │ └──system-name-1 +│ │ ├── training +│ │ │ ├── unet3d +│ │ │ │ ├── datagen +│ │ │ │ │ └── YYYYMMDD_HHmmss +│ │ │ │ │ └── dlio_log_files +│ │ │ │ └── run +│ │ │ │ ├──results.json +│ │ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ │ └── dlio_log_files +│ │ │ │ ... (5x Runs per Emulated Accelerator Type) +│ │ │ │ └── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ │ ├── resnet50 +│ │ │ │ ├── datagen +│ │ │ │ │ └── YYYYMMDD_HHmmss +│ │ │ │ │ └── dlio_log_files +│ │ │ │ └── run +│ │ │ │ ├──results.json +│ │ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ │ └── dlio_log_files +│ │ │ │ ... (5x Runs per Emulated Accelerator Type) +│ │ │ │ └── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ │ └── cosmoflow +│ │ │ ├── datagen +│ │ │ │ └── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ │ └── run +│ │ │ ├──results.json +│ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ │ ... (5x Runs per Emulated Accelerator Type) +│ │ │ └── YYYYMMDD_HHmmss +│ │ │ └── dlio_log_files +│ │ └── checkpointing +│ │ ├── llama3-8b +│ │ │ ├──results.json +│ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ ... (10x Runs for Read and Write. May be combined in a single run) +│ │ │ └── YYYYMMDD_HHmmss +│ │ │ └── dlio_log_files +│ │ ├── llama3-70b +│ │ │ ├──results.json +│ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ ... (10x Runs for Read and Write. 
May be combined in a single run) +│ │ │ └── YYYYMMDD_HHmmss +│ │ │ └── dlio_log_files +│ │ ├── llama3-405b +│ │ │ ├──results.json +│ │ │ ├── YYYYMMDD_HHmmss +│ │ │ │ └── dlio_log_files +│ │ ... (10x Runs for Read and Write. May be combined in a single run) +│ │ │ └── YYYYMMDD_HHmmss +│ │ │ └── dlio_log_files +│ │ └── llama3-1t +│ │ ├──results.json +│ │ ├── YYYYMMDD_HHmmss +│ │ │ └── dlio_log_files +│ │ ... (10x Runs for Read and Write. May be combined in a single run) +│ │ └── YYYYMMDD_HHmmss +│ │ └── dlio_log_files +│ └── systems +│ ├──system-name-1.yaml +│ ├──system-name-1.pdf +│ ├──system-name-2.yaml +│ └──system-name-2.pdf +│ +└── Open + └── + ├── code + ├── results + │ └──system-name-1 + │ ├── training + │ │ ├── unet3d + │ │ │ ├── datagen + │ │ │ │ └── YYYYMMDD_HHmmss + │ │ │ │ └── dlio_log_files + │ │ │ └── run + │ │ | ├──results.json + │ │ │ ├── YYYYMMDD_HHmmss + │ │ │ │ └── dlio_log_files + │ │ │ ... (5x Runs per Emulated Accelerator Type) + │ │ │ └── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ ├── resnet50 + │ │ │ ├── datagen + │ │ │ │ └── YYYYMMDD_HHmmss + │ │ │ │ └── dlio_log_files + │ │ │ └── run + │ │ | ├──results.json + │ │ │ ├── YYYYMMDD_HHmmss + │ │ │ │ └── dlio_log_files + │ │ │ ... (5x Runs per Emulated Accelerator Type) + │ │ │ └── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ └── cosmoflow + │ │ ├── datagen + │ │ │ └── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ └── run + │ │ ├──results.json + │ │ ├── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ ... (5x Runs per Emulated Accelerator Type) + │ │ └── YYYYMMDD_HHmmss + │ │ └── dlio_log_files + │ └── checkpointing + │ ├── llama3-8b + │ | ├──results.json + │ │ ├── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ ... (10x Runs for Read and Write. May be combined in a single run) + │ │ └── YYYYMMDD_HHmmss + │ │ └── dlio_log_files + │ ├── llama3-70b + │ | ├──results.json + │ │ ├── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ ... (10x Runs for Read and Write. May be combined in a single run) + │ │ └── YYYYMMDD_HHmmss + │ │ └── dlio_log_files + │ ├── llama3-405b + │ | ├──results.json + │ │ ├── YYYYMMDD_HHmmss + │ │ │ └── dlio_log_files + │ │ ... (10x Runs for Read and Write. May be combined in a single run) + │ │ └── YYYYMMDD_HHmmss + │ │ └── dlio_log_files + │ └── llama3-1t + │ ├──results.json + │ ├── YYYYMMDD_HHmmss + │ │ └── dlio_log_files + │ ... (10x Runs for Read and Write. May be combined in a single run) + │ └── YYYYMMDD_HHmmss + │ └── dlio_log_files + └── systems + ├──system-name-1.yaml + ├──system-name-1.pdf + ├──system-name-2.yaml + └──system-name-2.pdf +``` + +#### 11.3.1 DLIO Log Files Required +The Training and Checkpointing workloads both use DLIO to execute the test. The following files are required for every run in a submission: +``` +YYYYMMDD_HHmmss +├── [training|checkpointing]_[datagen|run].stdout.log # Captured manually if running DLIO directly. mlpstorage captures this automatically +├── [training|checkpointing]_[datagen|run].stderr.log # Captured manually if running DLIO directly. mlpstorage captures this automatically +├── *[output|per_epoch_stats|summary].json # Captured manually if running DLIO directly. mlpstorage captures this automatically +├── dlio.log +└── dlio_config | .hydra_config # Running DLIO manually creates a ".hydra_config" directory. 
mlpstorage names this "dlio_config" + ├── config.yaml + ├── hydra.yaml + └── overrides.yaml + +``` + +### 11.4 System Description + +The purpose of the system description is to provide sufficient detail on the storage system under test, and the ``host nodes`` running the test, plus the network connecting them, to enable full reproduction of the benchmark results by a third party. + +Each submission must contain a ``.yaml`` file and a ``.pdf`` file. If you submit more than one benchmark result, each submission must have a unique ``.yaml`` file and a ``.pdf`` file that documents the system under test and the environment that generated that result, including any configuration options in effect. + +Note that, during the review period, submitters may be asked to include additional details in the yaml and pdf to enable reproducibility by a third party. + +#### 11.4.1 System Description YAML +The system description yaml is a hybrid human-readable and machine-readable description of the total system under test. It contains fields for the System overall, the Nodes that make up the solution (clients and storage), as well as Power information of the nodes. + +An example can be found [HERE](https://github.com/mlcommons/storage/blob/main/system_configuration.yaml) + +The fields in the example document are required unless otherwise called out. Of particular note are the following: + + - **System.type** + - Can choose from local-storage, hyper-converged, shared-[file|block|object], cloud-deployment + - **System.required_rack_units** + - This is the total rackspace required by the solution as deployed including any required backend networking (but not including the client network) + + +#### 11.4.2 System Description PDF + +The goal of the pdf is to complement the YAML file, providing additional detail on the system to enable full reproduction by a third party. We encourage submitters to add details that are more easily captured by diagrams and text description, rather than a YAML. + +This file is should include everything that a third party would need in order to recreate the results in the submission, including product model numbers or hardware config details, unit counts of drives and/or components, system and network topologies, software used with version numbers, and any non-default configuration options used by any of the above. + +A great example of a system description pdf can be found [here](https://github.com/mlcommons/storage_results_v0.5/tree/main/closed/DDN/systems). + + +**Cover page** + +The following information is required to be included in the system description PDF: + +- System name of the submission +- Submitter name +- Submission date +- Version of the benchmark +- Solution type of the submission +- Submission division (OPEN or CLOSED) +- Power Requirements +- System Topology + +**Mandatory Power requirements** + +Systems that require customer provisioning of power (for example, systems intended to be deployed in on-premises data centers or in co-located data centers) shall include a “Power Requirements Table”. Systems designed to only run in a cloud or hyper-converged environment do not have to include this table. + +The power requirements table shall list all hardware devices required to operate the storage system. Shared network equipment also used for client network communication and optional storage management systems do not need to be included. The power requirements table shall include: + +1. Every component in the system that requires electrical power. +2. 
For each component, every PSU for each system component. +3. For each PSU, the PSU nameplate rated power. +4. For each PSU (or redundant groups of PSUs0, the design power. + +Two examples of a power requirements tables are shown below: + +**Power Requirements Table** (Large system example) + +| System component | Power supply unit | Nameplate rated power | Design power | +| -------------------- | ----------------- | --------------------- | -------------- | +| Storage controller 1 | Power supply 1 | 1200 watts | 3600 watts | +| | Power supply 2 | 1200 watts | | +| | Power supply 3 | 1200 watts | | +| | Power supply 4 | 1200 watts | | +| Storage shelf 1 | Power supply 1 | 1000 watts | 1000 watts | +| | Power supply 2 | 1000 watts | | +| Network switch 1 | Power supply 1 | 1200 watts | 1200 watts | +| | Power supply 2 | 1200 watts | | +| **Totals** | | **9200 watts** | **5800 watts** | + +**Power Requirements Table** (Direct-attached media system example) + +| System component | Power supply unit | Nameplate rated power | Design power | +| -------------------- | ----------------- | --------------------- | -------------- | +| NVMe SSD 1 | 12VDC supply | 10 watts | 10 watts | +| | 3.3VDC supply | 2 watts | 2 watts | +| **Totals** | | **12 watts** | **12 watts** | + +System component and power supply unit names in the above tables are examples. Consistent names should be used in bill-of-material documentation, system diagrams and descriptive text. + +**System Topology** +The system topology needs to show logical connections between the nodes and network devices listed in the system-description.yaml. The simplest form is made up of squares and lines with a square for each node and a line for each connection between the nodes. Every node listed in the system-description.yaml needs to have a representative visual in the topology diagram. For large deployments (larger than 4 nodes), use an appropriate scaling notation. For example, in a solution of 16 identical client nodes, show squares for the first and last nodes (with node names and numbers in the nodes) separated by "...". + +**Mandatory Rack Units Requirements** + +If the system requires the physical deployment of dedicated hardware, ie: is not a cloud-based deployment or a hyperconverged deployment, you will need to include the total number of rack units that will be consumed by the storage system under test in the SystemDescription file(s), plus any supporting gear that is required for the configuration being tested. That supporting gear could include, for example, network switches for a "backend" or private network that is required for the storage system to operate. The rack units measure does not need to include any of the gear that connects the storage system to the ``host nodes``. + +**Optional information** + +The following *recommended* structure of systems.pdf provides a starting point for additional optional information. Submitters are free to adjust this structure as they see fit. + +If the submission is for a commercial system, a pdf of the product spec document can add significant value. 
If it is a system that does not have a spec document (e.g., a research system, HPC etc), or the product spec pdf doesn’t include all the required detail, the document can contain (all these are optional): + +- Recommended: High-level system diagram e.g., showing the ``host node``(s), storage system main components, and network topology used when connecting everything (e.g., spine-and-leaf, butterfly, etc.), and any non-default configuration options that were set during the benchmark run. +- Optional: Additional text description of the system, if the information is not captured in the YAML, e.g., the storage system’s components (make and model, optional features, capabilities, etc) and all configuration settings that are relevant to ML/AI benchmarks. If the make/model doesn’t specify all the components of the hardware platform it is running on, eg: it’s an Software-Defined-Storage product, then those should be included here (just like the client component list). +- Optional: We recommended the following three categories for the text description: + 1. Software, + 2. Hardware, and + 3. Settings. + +## 12. Review + +### 12.1 Visibility of results and code during review + +During the review process, only certain groups are allowed to inspect results and code. +| Group | Can Inspect | +| --- | --- | +| Review committee | All results, all code | +| Submitters | All results, all code | +| Public | No results, no code | + +### 12.2 Filing objections + +Submitters must officially file objections to other submitter’s code by creating a GitHub issue prior to the “Filing objections” deadline that cites the offending lines, the rules section violated, and, if pertinent, corresponding lines of the reference implementation that are not equivalent. Each submitter must file objections with a “by ” tag and a “against ” tag. Multiple organizations may append their “by ” to an existing objection if desired. If an objector comes to believe the objection is in error they may remove their “by ” tag. All objections with no “by ” tags at the end of the filing deadline will be closed. Submitters should file an objection, then discuss with the submitter to verify if the objection is correct. Following filing of an issue but before resolution, both objecting submitter and owning submitter may add comments to help the review committee understand the problem. If the owning submitter acknowledges the problem, they may append the “fix_required” tag and begin to fix the issue. + +### 12.3 Resolving objections + +The review committee will review each objection, and either establish consensus or vote. If the committee votes to support an objection, it will provide some basic guidance on an acceptable fix and append the “fix_required” tag. If the committee votes against an objection, it will close the issue. + +### 12.4 Fixing objections + +Code should be updated via a pull request prior to the “fixing objections” deadline. Following submission of all fixes, the objecting submitter should confirm that the objection has been addressed with the objector(s) and ask them to remove their “by tags. If the objector is not satisfied by the fix, then the review committee will decide the issue at its final review meeting. The review committee may vote to accept a fix and close the issue, or reject a fix and request the submission be moved to open or withdrawn. 
+ +### 12.5 Withdrawing results / changing division + +Anytime up until the final human readable deadline (typically within 2-3 business days before the press call, so July 28th, 2025, in this case), an entry may be withdrawn by amending the pull request. Alternatively, an entry may be voluntarily moved from the closed division to the open division. Each benchmark results submission is treated separately for reporting in the results table and in terms of withdrawing it. For example, submitting a 3D-Unet run with 20 clients and 80 A100 accelerators is separate from submitting a 3D-Unet run with 19 clients and 76 accelerators. From 25ed7da90fc23c3e0d32160d4ad1192aea41a0e0 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 4 Nov 2025 11:36:20 -0800 Subject: [PATCH 02/23] Clarify directory structure requirements in Rules.md --- Rules.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/Rules.md b/Rules.md index 993acb74..c3aeb6f8 100644 --- a/Rules.md +++ b/Rules.md @@ -589,7 +589,7 @@ A complete submission for one workload (3D-Unet, ResNet, or Cosmoflow) contains 11.3.2. Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. -11.3.3. Within the Closed or Open directories there must be a single directory whose name if the name of the submitter (the same as the top-level directory). +11.3.3. Within the Closed or Open directories there must be a single directory whose name is the name of the submitter (the same as the top-level directory). 11.3.4. Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". @@ -597,13 +597,19 @@ A complete submission for one workload (3D-Unet, ResNet, or Cosmoflow) contains If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. -11.3.6. The "results" directory must include one or more directories that are the names of the "systems under test". Eg: a system name could be "Big_and_Fast_4000". +11.3.6. The "results" directory, whether it is wihin the "closed' or "open" hierarchies, must include one or more directories that are the names of the "systems under test". Eg: a system name could be "Big_and_Fast_4000". This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. Note that only results from a given set of configuration parameters and hardware and software components of the system-under-test can be part of a given "system name", any change to the configuration parameters or hardware or software will force the results that come from those runs to be held in a different "system name". -11.3.7. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file. Each of those files must be named with the "system name". +11.3.7. Within a "system name" directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". + +11.3.8. 
Within the "training" directory, there must be one or more of the following directories, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". + +11.3.9. Within the "checkpointing" directory, there must be one or more of the following directories, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". + +11.3.10. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". Eg: for a system-under-test named "Big_and_Fast_4000", there must be a "Big_and_Fast_4000.yaml" and a "Big_and_Fast_4000.pdf" file. From 39ae0048ce984589d65f5528834b7991e566ecc6 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 2 Dec 2025 15:41:49 -0800 Subject: [PATCH 03/23] Fill out more detail on the directory structure Updated submission guidelines and directory structure requirements for OPEN and CLOSED benchmarks, including detailed validation checks and file requirements. --- Rules.md | 179 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 95 insertions(+), 84 deletions(-) diff --git a/Rules.md b/Rules.md index c3aeb6f8..0773897c 100644 --- a/Rules.md +++ b/Rules.md @@ -553,6 +553,8 @@ OPEN division benchmarks must be referred to using the benchmark name plus the t ## 11. Submission +11.1. + A successful run result consists of a directory tree structure containing the set of files produced by the benchmark as the result, plus the manually created SystemDescription files (both PDF and yaml) that describe the storage solution under test and the environment the test was run in. The whole package must be uploaded to MLCommons via the UI provided to submitters. @@ -585,34 +587,67 @@ A complete submission for one workload (3D-Unet, ResNet, or Cosmoflow) contains ### 11.3 Directory Structure for CLOSED or OPEN Submissions +The output directory hierarchy and the files that populate it should be automatically created and filled in by the `mplstorage` command, +but it is documented here to ensure that the `mlpstorage` command the the submission validation checker command are operating upon a single definition for that structure. + +The submission validation checker should check that the tested directory hierarachy matches the below requirements and output messages for all cases where it does not match. +The tool should make it's best effort to continue testing all the other aspects of the directory hierarchy after any given failure. +If the tested directory hierarchy does not meet all of the below requirements, then it should be labelled as invalid and tghe validation check should fail. + 11.3.1. The submission structure must start from a single directory whose name is the name of the submitter. -11.3.2. Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. +11.3.2. Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. + +11.3.3. The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. -11.3.3. Within the Closed or Open directories there must be a single directory whose name is the name of the submitter (the same as the top-level directory). +11.3.4. 
Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). -11.3.4. Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". +11.3.5. Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. -11.3.5. The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. +11.3.6. The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. -11.3.6. The "results" directory, whether it is wihin the "closed' or "open" hierarchies, must include one or more directories that are the names of the "systems under test". Eg: a system name could be "Big_and_Fast_4000". +11.3.7. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". +Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. + +11.3.8. The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. -Note that only results from a given set of configuration parameters and hardware and software components of the system-under-test can be part of a given "system name", -any change to the configuration parameters or hardware or software will force the results that come from those runs to be held in a different "system name". -11.3.7. Within a "system name" directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". +11.3.9. All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. + +11.3.10. Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. + +11.3.11. Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. + +11.3.12. 
Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. + +11.3.13. Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. + +11.3.14. Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. + +11.3.15. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. + +11.3.16. Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. + +11.3.17. Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. + +11.3.18. Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -11.3.8. Within the "training" directory, there must be one or more of the following directories, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". +11.3.19. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -11.3.9. Within the "checkpointing" directory, there must be one or more of the following directories, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". +11.3.20. Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. -11.3.10. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". -Eg: for a system-under-test named "Big_and_Fast_4000", there must be a "Big_and_Fast_4000.yaml" and a "Big_and_Fast_4000.pdf" file. +11.3.21. Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +11.3.22. 
Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +11.3.23. Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. + +11.3.24. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. + +11.3.25. Pictorially, here is what this looks like: ``` root_folder (or any name you prefer) ├── Closed @@ -624,65 +659,65 @@ root_folder (or any name you prefer) │ │ │ ├── unet3d │ │ │ │ ├── datagen │ │ │ │ │ └── YYYYMMDD_HHmmss -│ │ │ │ │ └── dlio_log_files +│ │ │ │ │ └── dlio_config │ │ │ │ └── run │ │ │ │ ├──results.json │ │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ │ └── dlio_log_files +│ │ │ │ │ └── dlio_config │ │ │ │ ... (5x Runs per Emulated Accelerator Type) │ │ │ │ └── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ │ ├── resnet50 │ │ │ │ ├── datagen │ │ │ │ │ └── YYYYMMDD_HHmmss -│ │ │ │ │ └── dlio_log_files +│ │ │ │ │ └── dlio_config │ │ │ │ └── run │ │ │ │ ├──results.json │ │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ │ └── dlio_log_files +│ │ │ │ │ └── dlio_config │ │ │ │ ... (5x Runs per Emulated Accelerator Type) │ │ │ │ └── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ │ └── cosmoflow │ │ │ ├── datagen │ │ │ │ └── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ │ └── run │ │ │ ├──results.json │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ │ ... (5x Runs per Emulated Accelerator Type) │ │ │ └── YYYYMMDD_HHmmss -│ │ │ └── dlio_log_files +│ │ │ └── dlio_config │ │ └── checkpointing │ │ ├── llama3-8b │ │ │ ├──results.json │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ │ └── YYYYMMDD_HHmmss -│ │ │ └── dlio_log_files +│ │ │ └── dlio_config │ │ ├── llama3-70b │ │ │ ├──results.json │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ │ └── YYYYMMDD_HHmmss -│ │ │ └── dlio_log_files +│ │ │ └── dlio_config │ │ ├── llama3-405b │ │ │ ├──results.json │ │ │ ├── YYYYMMDD_HHmmss -│ │ │ │ └── dlio_log_files +│ │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ │ └── YYYYMMDD_HHmmss -│ │ │ └── dlio_log_files +│ │ │ └── dlio_config │ │ └── llama3-1t │ │ ├──results.json │ │ ├── YYYYMMDD_HHmmss -│ │ │ └── dlio_log_files +│ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. 
May be combined in a single run) │ │ └── YYYYMMDD_HHmmss -│ │ └── dlio_log_files +│ │ └── dlio_config │ └── systems │ ├──system-name-1.yaml │ ├──system-name-1.pdf @@ -698,116 +733,92 @@ root_folder (or any name you prefer) │ │ ├── unet3d │ │ │ ├── datagen │ │ │ │ └── YYYYMMDD_HHmmss - │ │ │ │ └── dlio_log_files + │ │ │ │ └── dlio_config │ │ │ └── run │ │ | ├──results.json │ │ │ ├── YYYYMMDD_HHmmss - │ │ │ │ └── dlio_log_files + │ │ │ │ └── dlio_config │ │ │ ... (5x Runs per Emulated Accelerator Type) │ │ │ └── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ ├── resnet50 │ │ │ ├── datagen │ │ │ │ └── YYYYMMDD_HHmmss - │ │ │ │ └── dlio_log_files + │ │ │ │ └── dlio_config │ │ │ └── run │ │ | ├──results.json │ │ │ ├── YYYYMMDD_HHmmss - │ │ │ │ └── dlio_log_files + │ │ │ │ └── dlio_config │ │ │ ... (5x Runs per Emulated Accelerator Type) │ │ │ └── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ └── cosmoflow │ │ ├── datagen │ │ │ └── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ └── run │ │ ├──results.json │ │ ├── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ ... (5x Runs per Emulated Accelerator Type) │ │ └── YYYYMMDD_HHmmss - │ │ └── dlio_log_files + │ │ └── dlio_config │ └── checkpointing │ ├── llama3-8b │ | ├──results.json │ │ ├── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ └── YYYYMMDD_HHmmss - │ │ └── dlio_log_files + │ │ └── dlio_config │ ├── llama3-70b │ | ├──results.json │ │ ├── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ └── YYYYMMDD_HHmmss - │ │ └── dlio_log_files + │ │ └── dlio_config │ ├── llama3-405b │ | ├──results.json │ │ ├── YYYYMMDD_HHmmss - │ │ │ └── dlio_log_files + │ │ │ └── dlio_config │ │ ... (10x Runs for Read and Write. May be combined in a single run) │ │ └── YYYYMMDD_HHmmss - │ │ └── dlio_log_files + │ │ └── dlio_config │ └── llama3-1t │ ├──results.json │ ├── YYYYMMDD_HHmmss - │ │ └── dlio_log_files + │ │ └── dlio_config │ ... (10x Runs for Read and Write. May be combined in a single run) │ └── YYYYMMDD_HHmmss - │ └── dlio_log_files + │ └── dlio_config └── systems ├──system-name-1.yaml ├──system-name-1.pdf ├──system-name-2.yaml └──system-name-2.pdf ``` - -#### 11.3.1 DLIO Log Files Required -The Training and Checkpointing workloads both use DLIO to execute the test. The following files are required for every run in a submission: +11.3.26. Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: ``` -YYYYMMDD_HHmmss -├── [training|checkpointing]_[datagen|run].stdout.log # Captured manually if running DLIO directly. mlpstorage captures this automatically -├── [training|checkpointing]_[datagen|run].stderr.log # Captured manually if running DLIO directly. mlpstorage captures this automatically -├── *[output|per_epoch_stats|summary].json # Captured manually if running DLIO directly. mlpstorage captures this automatically -├── dlio.log -└── dlio_config | .hydra_config # Running DLIO manually creates a ".hydra_config" directory. 
mlpstorage names this "dlio_config" - ├── config.yaml - ├── hydra.yaml - └── overrides.yaml - +└── YYYYMMDD_HHmmss + ├── [training|checkpointing]_[datagen|run].stdout.log + ├── [training|checkpointing]_[datagen|run].stderr.log + ├── *[output|per_epoch_stats|summary].json + ├── dlio.log + └── dlio_config + ├── config.yaml + ├── hydra.yaml + └── overrides.yaml ``` ### 11.4 System Description -The purpose of the system description is to provide sufficient detail on the storage system under test, and the ``host nodes`` running the test, plus the network connecting them, to enable full reproduction of the benchmark results by a third party. - -Each submission must contain a ``.yaml`` file and a ``.pdf`` file. If you submit more than one benchmark result, each submission must have a unique ``.yaml`` file and a ``.pdf`` file that documents the system under test and the environment that generated that result, including any configuration options in effect. - -Note that, during the review period, submitters may be asked to include additional details in the yaml and pdf to enable reproducibility by a third party. - -#### 11.4.1 System Description YAML -The system description yaml is a hybrid human-readable and machine-readable description of the total system under test. It contains fields for the System overall, the Nodes that make up the solution (clients and storage), as well as Power information of the nodes. - -An example can be found [HERE](https://github.com/mlcommons/storage/blob/main/system_configuration.yaml) - -The fields in the example document are required unless otherwise called out. Of particular note are the following: - - - **System.type** - - Can choose from local-storage, hyper-converged, shared-[file|block|object], cloud-deployment - - **System.required_rack_units** - - This is the total rackspace required by the solution as deployed including any required backend networking (but not including the client network) - - -#### 11.4.2 System Description PDF - -The goal of the pdf is to complement the YAML file, providing additional detail on the system to enable full reproduction by a third party. We encourage submitters to add details that are more easily captured by diagrams and text description, rather than a YAML. - -This file is should include everything that a third party would need in order to recreate the results in the submission, including product model numbers or hardware config details, unit counts of drives and/or components, system and network topologies, software used with version numbers, and any non-default configuration options used by any of the above. +The purpose of the two system description files is to provide sufficient detail on the storage system under test, and the ``host nodes`` running the test, plus the network connecting them, to enable full reproduction of the benchmark results by a third party. -A great example of a system description pdf can be found [here](https://github.com/mlcommons/storage_results_v0.5/tree/main/closed/DDN/systems). +The *SystemDescription.yaml* file is a machine-readable file providing additional detail on the system, while the *SystemDescription.pdf* complements that with diagrams and human-readable text. +11.4.1. The *SystemDescription.yaml* file must be validated by a tool that will compare it's internal YAML structure to that of a schema, and output messages describing how that file does not match the schema. 
+If any schema violations are found, then validation checker should continue looking for more mistakes but should overall fail the validation check. **Cover page** From 55617148b2d4ba49d05c885a390cbb843f66a34d Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 2 Dec 2025 15:49:37 -0800 Subject: [PATCH 04/23] Format rules for submission structure requirements --- Rules.md | 54 +++++++++++++++++++++++++++--------------------------- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/Rules.md b/Rules.md index 0773897c..b82f92da 100644 --- a/Rules.md +++ b/Rules.md @@ -594,60 +594,60 @@ The submission validation checker should check that the tested directory hierara The tool should make it's best effort to continue testing all the other aspects of the directory hierarchy after any given failure. If the tested directory hierarchy does not meet all of the below requirements, then it should be labelled as invalid and tghe validation check should fail. -11.3.1. The submission structure must start from a single directory whose name is the name of the submitter. +**11.3.1.** The submission structure must start from a single directory whose name is the name of the submitter. -11.3.2. Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. +**11.3.2.** Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. -11.3.3. The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. +**11.3.3.** The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. -11.3.4. Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). +**11.3.4.** Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). -11.3.5. Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. +**11.3.5.** Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. -11.3.6. The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. +**11.3.6.** The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. -11.3.7. The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". +**11.3.7.** The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. 
Each of those files must be named with the "system name". Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. -11.3.8. The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". +**11.3.8.** The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. -11.3.9. All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. +**11.3.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. -11.3.10. Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. +**11.3.10.** Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. -11.3.11. Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. +**11.3.11.** Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. -11.3.12. Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. +**11.3.12.** Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. -11.3.13. Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. 
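The timestamp-directory naming rule can be checked mechanically. The sketch below is illustrative only (the function name is invented); it accepts exactly the `YYYYMMDD_HHmmss` form and rejects names that match the pattern but are not real dates or times.

```python
# Sketch only: check that a directory name is a valid "YYYYMMDD_HHmmss" timestamp.
import re
from datetime import datetime

_TIMESTAMP_RE = re.compile(r"^\d{8}_\d{6}$")

def is_valid_timestamp_dir_name(name: str) -> bool:
    if not _TIMESTAMP_RE.match(name):
        return False
    try:
        # strptime also rejects impossible values such as month 13 or hour 25.
        datetime.strptime(name, "%Y%m%d_%H%M%S")
    except ValueError:
        return False
    return True

assert is_valid_timestamp_dir_name("20250707_134502")
assert not is_valid_timestamp_dir_name("2025-07-07_134502")
assert not is_valid_timestamp_dir_name("20250732_134502")
```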
+**11.3.13.** Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -11.3.14. Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**11.3.14.** Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -11.3.15. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**11.3.15.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -11.3.16. Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +**11.3.16.** Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -11.3.17. Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**11.3.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -11.3.18. Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. 
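A sketch of how the file-presence part of these rules could be checked for one training "run" *timestamp directory* is shown below; it is illustrative only, and the helper name is invented. The three `*.json` entries are treated as wildcard patterns, matching whatever prefixed names the run actually produced.

```python
# Sketch only: report missing required entries in one training "run"
# timestamp directory.
from pathlib import Path

EXACT_NAMES = ("training_run.stdout.log", "training_run.stderr.log", "dlio.log")
WILDCARD_PATTERNS = ("*output.json", "*per_epoch_stats.json", "*summary.json")

def check_run_timestamp_dir(ts_dir: Path) -> list:
    problems = []
    for name in EXACT_NAMES:
        if not (ts_dir / name).is_file():
            problems.append(f"{ts_dir}: missing required file {name!r}")
    for pattern in WILDCARD_PATTERNS:
        if not any(ts_dir.glob(pattern)):
            problems.append(f"{ts_dir}: no file matches pattern {pattern!r}")
    if not (ts_dir / "dlio_config").is_dir():
        problems.append(f"{ts_dir}: missing required subdirectory 'dlio_config'")
    return problems   # an empty list means this directory passed
```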
+**11.3.18.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -11.3.19. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**11.3.19.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -11.3.20. Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. +**11.3.20.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. -11.3.21. Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +**11.3.21.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -11.3.22. Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**11.3.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -11.3.23. Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**11.3.23.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -11.3.24. The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. 
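The exact-contents rule for "dlio_config" and the required counts of timestamp directories (exactly one for a training "datagen" phase, exactly 5 for a training "run" phase, exactly ten per checkpointing workload) lend themselves to the same style of check. The following is a hedged sketch with invented helper names, not the official checker:

```python
# Sketch only: check the exact contents of a "dlio_config" directory and the
# expected number of timestamp directories under a given parent directory.
from pathlib import Path

REQUIRED_DLIO_CONFIG = {"config.yaml", "hydra.yaml", "overrides.yaml"}

def check_dlio_config(dlio_config_dir: Path) -> list:
    found = {entry.name for entry in dlio_config_dir.iterdir()}
    problems = [f"unexpected entry in dlio_config: {name}"
                for name in sorted(found - REQUIRED_DLIO_CONFIG)]
    problems += [f"missing file in dlio_config: {name}"
                 for name in sorted(REQUIRED_DLIO_CONFIG - found)]
    return problems

def check_timestamp_dir_count(parent: Path, expected: int) -> list:
    # A fuller checker would also validate each name against YYYYMMDD_HHmmss.
    found = sum(1 for entry in parent.iterdir() if entry.is_dir())
    if found != expected:
        return [f"{parent}: expected {expected} timestamp directories, found {found}"]
    return []

# Example usage (paths are placeholders):
# problems  = check_timestamp_dir_count(Path("results/sut/training/unet3d/run"), expected=5)
# problems += check_dlio_config(Path("results/sut/training/unet3d/run/20250707_134502/dlio_config"))
```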
+**11.3.24.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -11.3.25. Pictorially, here is what this looks like: +**11.3.25.** Pictorially, here is what this looks like: ``` root_folder (or any name you prefer) ├── Closed @@ -798,7 +798,7 @@ root_folder (or any name you prefer) ├──system-name-2.yaml └──system-name-2.pdf ``` -11.3.26. Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: +**11.3.26.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: ``` └── YYYYMMDD_HHmmss ├── [training|checkpointing]_[datagen|run].stdout.log @@ -817,7 +817,7 @@ The purpose of the two system description files is to provide sufficient detail The *SystemDescription.yaml* file is a machine-readable file providing additional detail on the system, while the *SystemDescription.pdf* complements that with diagrams and human-readable text. -11.4.1. The *SystemDescription.yaml* file must be validated by a tool that will compare it's internal YAML structure to that of a schema, and output messages describing how that file does not match the schema. +**11.4.1.** The *SystemDescription.yaml* file must be validated by a tool that will compare it's internal YAML structure to that of a schema, and output messages describing how that file does not match the schema. If any schema violations are found, then validation checker should continue looking for more mistakes but should overall fail the validation check. **Cover page** From 85df4a2f4c02c7bb39ad9f7b0ed4769abda3b7b6 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 12:44:02 -0800 Subject: [PATCH 05/23] Focus the doc on the new text Remove all the text that carried over from the Submission_guidelines.md file, and start filling out the main structure of the sections of the new document. --- Rules.md | 752 ++++++------------------------------------------------- 1 file changed, 70 insertions(+), 682 deletions(-) diff --git a/Rules.md b/Rules.md index b82f92da..fe7ebd4a 100644 --- a/Rules.md +++ b/Rules.md @@ -1,653 +1,81 @@ -# MLPerf™ Storage V3.0 Benchmark Rules +# MLPerf™ Storage V2.0 Benchmark Validation —————————————————————————————————————————— - [MLPerf Storage Benchmark Submission Guidelines v2.0](#mlperf-storage-benchmark-submission-guidelines-v20) - [1. Introduction](#1-introduction) - - [1.1 Timeline](#11-timeline) - - [2. Benchmark Overview](#2-benchmark-overview) - - [2.1 Training](#21-training) - - [2.2 Checkpointing](#22-checkpointing) - - [3 Definitions](#3-definitions) - - [4. Performance Metrics](#4-performance-metrics) - - [5. Benchmark Code](#5-benchmark-code) - - [6. General Rules](#6-general-rules) - - [6.1. Strive to be fair](#61-strive-to-be-fair) - - [6.2. System and framework must be available](#62-system-and-framework-must-be-available) - - [6.3 Non-determinism](#63-non-determinism) - - [6.4. Result rounding](#64-result-rounding) - - [6.5. Stable storage must be used](#65-stable-storage-must-be-used) - - [6.6. Caching](#66-caching) - - [6.7. Replicability is mandatory](#67-replicability-is-mandatory) - - [7. Dataset Generation](#7-dataset-generation) - - [8. Single-host Submissions](#8-single-host-submissions) - - [9. Distributed Training Submissions](#9-distributed-training-submissions) - - [10. 
CLOSED and OPEN Divisions](#10-closed-and-open-divisions) - - [10.1 CLOSED: virtually all changes are disallowed](#101-closed:-virtually-all-changes-are-disallowed) - - [10.2 OPEN: changes are allowed but must be disclosed](#102-open:-changes-are-allowed-but-must-be-disclosed) - - [11. Submission](#11-submission) - - [11.1 What to submit - CLOSED submissions](#111-what-to-submit---closed-submissions) - - [11.2 What to submit - OPEN submissions](#112-what-to-submit---open-submissions) - - [11.3 Directory Structure for CLOSED or OPEN Submissions](#113-directory-structure-for-closed-or-open-submissions) - - [11.4 System Description](#114-system-description) - - [11.4.1 System Description YAML](#1141-system-description-yaml) - - [11.4.2 System Description PDF](#1142-system-description-pdf) - - [12. Review](#12-review) - - [12.1 Visibility of results and code during review](#121-visibility-of-results-and-code-during-review) - - [12.2 Filing objections](#122-filing-objections) - - [12.3 Resolving objections](#123-resolving-objections) - - [12.4 Fixing objections](#124-fixing-objections) - - [12.5 Withdrawing results / changing division](#125-withdrawing-results-/-changing-division) - - [13. Roadmap for future MLPerf Storage releases](#13-roadmap-for-future-mlperf-storage-releases) + - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) + - [3. Sanity Checking the Training Options](#3-sanity-checking-the-training-options) + - [4. Sanity Checking the Checkpointing Options](#3-sanity-checking-the-checkpointing-options) ## 1. Introduction -MLPerf™ Storage is a benchmark suite to characterize the performance of storage systems that support machine learning workloads. The suite consists of 2 workload categories: +These are the requirements for the *submission validation checker* for version 2.0 of the MLPerf™ Storage benchmark, +but since the `mlpstorage` tool will be responsible for generating the vast majority (if not all) of the contents of a submission, it is also a spec for what `mlpstorage` should generate. -1. Training -2. Checkpointing - -This benchmark attempts to balance two goals. First, we aim for **comparability** between benchmark submissions to enable decision making by the AI/ML Community. Second, we aim for **flexibility** to enable experimentation and to show off unique storage system features that will benefit the AI/ML Community. To that end we have defined two classes of submissions: CLOSED and OPEN. - -Published results for the 3D-Unet, ResNet-50, and Cosmoflow Training workloads are comparable across v1.0 and v2.0 of the MLPerf Storage benchmark. A [full listing of comparability is available](https://github.com/mlcommons/policies/blob/master/MLPerf_Compatibility_Table.adoc). - -The MLPerf name and logo are trademarks of the MLCommons® Association ("MLCommons"). In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. MLCommons reserves the right to solely determine if a use of its name or logos is acceptable. - -### 1.1 Timeline - -| Date | Description | -| ---- | ----------- | -| Jun 18, 2025 | Freeze rules & benchmark code. | -| Jun 24, 2025 | Open benchmark for submissions. | -| Jul 7, 2025 | **Submissions due.** | -| Jul 7, 2025 - Aug 4, 2025 | Review period. | -| Aug 4, 2025 | **Benchmark competition results are published.** | - - -## 2. Benchmark Overview - -This version of the benchmark does not include offline or online data pre-processing. 
We are aware that data pre-processing is an important part of the ML data pipeline and we will include it in a future version of the benchmark. - -Each benchmark setup must be executed a number of times (5 for training and 10 for checkpointing). All logs from every run must be submitted as part of a submission package. The final metrics are the average across the runs. Runs must be consecutive with no failed runs between the submitted runs. Runs can not be cherry-picked from a range of runs excepting that all five runs are consecutive within the large sequence of runs. - -### 2.1 Training - -MLPerf Storage emulates (or "simulates", the terms are used interchangably in this document) accelerators for the training workloads with the tool DLIO developed by Argonne National Labs. DLIO uses the standard AI frameworks (PyTorch, Tensorflow, Numpy, etc) to load data from storage to memory at the same intensity as a given accelerator. - -**This emulation means that submitters do not need to use hardware accelerators (e.g., GPUs, TPUs, and other ASICs) when running MLPerf Storage - Training.** - -Instead, our benchmark tool replaces the training on the accelerator for a single batch of data with a ``sleep()`` call. The ``sleep()`` interval depends on the batch size and accelerator type and has been determined through measurement on a system running the real training workload. The rest of the data ingestion pipeline (data loading, caching, checkpointing) is unchanged and runs in the same way as when the actual training is performed. - -There are two main advantages to accelerator emulation. First, MLPerf Storage allows testing different storage systems with different types of accelerators. To change the type of accelerator that the benchmark emulates (e.g., to switch to a system with NVIDIA H100 GPUs instead of A100 GPUs), it is enough to adjust the batch size and ``sleep()`` parameter. The second advantage is that MLPerf Storage can put a high load on the storage system simply by increasing the number of emulated accelerators. This allows for testing the behavior of the storage system in large-scale scenarios without purchasing/renting the AI compute infrastructure. - -The benchmark suite provides workload [configurations](https://github.com/mlcommons/storage/tree/main/storage-conf/workload) that simulate the I/O patterns of selected workloads listed in Table 1. The I/O patterns for each MLPerf Storage benchmark correspond to the I/O patterns of the MLPerf Training and MLPerf HPC benchmarks (i.e., the I/O generated by our tool for 3D U-Net closely follows the I/O generated by actually running the 3D U-Net training workload). The benchmark suite can also generate synthetic datasets which show the same I/O load as the actual datasets listed in Table 1. - -| Area | Problem | Model | Data Loader | Dataset seed | Minimum AU% | -| ---- | ------- | ----- | ----------- | ------------ | ----------- | -| Vision | Image segmentation (medical) | 3D U-Net | PyTorch | KiTS 19 (140MB/sample) | 90% | -| Vision | Image classification | ResNet-50 | TensorFlow | ImageNet (150KB/sample) | 90% | -| Scientific | Cosmology | parameter prediction | TensorFlow | CosmoFlow N-body simulation (2MB/sample) | 70% | - -Table 1: Benchmark description - -- Benchmark start point: The dataset is in **shared persistent storage**. -- Benchmark end point: The measurement ends after a predetermined number of epochs. 
*Note: data transfers from storage in this test terminate with the data in host DRAM; transfering data into the accelerator memory is not included in this benchmark.* -- Configuration files for the workloads and dataset content can be found [here](https://github.com/mlcommons/storage/tree/main/storage-conf/workload). - -### 2.2 Checkpointing -#### 2.2.1 models -Benchmark results may be submitted for the following four model configurations. The associated model architectures and parallelism settings are listed below. The number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models for CLOSED submission. - -For CLOSED submissions, participants are not permitted to change the total number of simulated accelerators. However, they may adjust the number of simulated accelerators per host, as long as each host uses more than 4 simulated accelerators. This allows the use of nodes with higher simulated accelerator density and fewer total nodes. Note: the aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. - -**Table 2 LLM models** - -| Model | 8B | 70B | 405B | 1T | -|------------------------|--------|--------|---------|--------| -| Hidden dimension | 4096 | 8192 | 16384 | 25872 | -| FFN size | 14336 | 28672 | 53248 | 98304 | -| num_attention_heads | 32 | 128 | 128 | 192 | -| num_kv_heads | 8 | 8 | 8 | 32 | -| Num layers | 32 | 80 | 126 | 128 | -| Parallelism (TPxPPxDP) | 1×1×8 | 8×1x8 | 8×32×2 | 8×64×2 | -| Total Processes | 8 | 64 | 512 | 1024 | -| ZeRO | 3 | 3 | 1 | 1 | -| Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB | -| Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB | - - -#### 2.2.2 Benchmark Execution -**Checkpoint Modes (global storage vs local storage)** - -There are two operational modes: - -* ``default``: Used for shared storage systems. In this mode, the benchmark runs on multiple hosts to write/read the entire checkpoint dataset. The total number of processes (emulated accelerators) must match the number listed in Table 2 (TP×PP×DP = Total Processes). - -* ``subset``: Intended for node local storage systems. In this mode, checkpointing is simulated on a single host by writing/reading only a fraction (``num_gpus/TP/PP/DP``) of the checkpoint data, where ``num_gpus`` is the number of simulated accelerators on the host. The only allowed value for number of processes in a subset submission is 8 (the 8B model does not support subset mode as it is already set to 8 processes). - -**Checkpoint write and (read) recovery** - -For each submission, one must first perform the checkpoint write, then clear the cache if required, and finally perform the checkpoint read. The required command-line flags are: -*Note: Clearing caches is done to ensure that no data for the read phase comes from the filesystem cache* - -For a submission, the sequence is the following: -1. Write 10x checkpoints -2. Clear filesystem caches if necessary -3. Read 10x checkpoints - -The default options will run the read and write checkpoints in a single mlpstorage call. For example, the following command will execute a sequence of writing 10 checkpoints and reading those same 10 checkpoints. -```bash -mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test -``` - -If caches need to be cleared use the following parameters for the WRITE and READ tests. 
-
-* WRITE: ``--num-checkpoints-read=0``
-* READ: ``--num-checkpoints-write=0``
-
-In the above example, the write tests would be executed first with the following command, which performs the writes but no reads:
-```bash
-mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test --num-checkpoints-read=0
-```
-
-After the write tests complete, clear the caches on your hosts. A standard Linux system would use a command like this:
-```bash
-echo 3 > /proc/sys/vm/drop_caches
-```
-The end result of "clearing caches" is that 100% of the data for the read phase should come from the storage system under test and not from the client's filesystem cache.
-
-Finally, with the same example, the read tests would be executed with the following command, which performs no writes during this phase:
-```bash
-mlpstorage checkpointing run --client-host-memory-in-gb 512 --model llama3-8b --num-processes 8 --checkpoint-folder /mnt/checkpoint_test --num-checkpoints-write=0
-```
-
-Caches need to be cleared by the user outside of the mlpstorage tool.
-
-##### 2.2.2.1 Clearing Caches
-
-The checkpoints that are written are quite large. **If 3x the checkpoint data written per client node is less than the client node's memory capacity, then the filesystem cache needs to be cleared between the write and read phases.**
-
-Examples:
-
-| Model (Total Size) | Num Clients & Memory | Size for ranks | Size for 1st and Last Client | Need to Clear Caches? |
-|---------------------|----------------------|----------------|------------------------------|-----------------------|
-| Llama3 405b (5.2TB) | 8x (64 Ranks / Node)
1024GB per Client | 256x 11.8GB
256x 8.85GB | First: 755GB (64x 11.8GB)
Last: 566.4GB (64x 8.85GB) | No (566.4GB x 3 = 1,699GB, which is greater than the client memory) | -| Llama3 70b (912GB) | 8x (8 Ranks / Node)
1024GB per Client | 64x 11.23GB | First: 89.8GB (8x 11.23GB)
Last: Same as First (DP=1) | Yes (89.8 x 3 = 269.5GB which is less than the client memory) | - -In the first case, after 2x checkpoints data that has been written is being flushed from the filesystem cache. This means that after 10x checkpoints a standard Linux system will not have any data in the filesystem cache that would be read for a checkpoint recovery starting back at the first written checkpoint. - -In the second case, after 10x checkpoints, 898GB of data will have been written per client with each client having 1024GB of memory. Without clearing caches this data would be read from the filesystem cache - -**fsync** - -We enforce ``fsync`` to be applied during checkpoint writes to ensure data is flushed to persistent storage. ``fsync`` is enabled by default in all workload configuration files. - -**Example Execution Commands** - -* ``default`` mode (``WORLD_SIZE = TP*PP*DP`` as listed in Table 2): - ```bash - # Perform checkpoint writes (make sure the number of hosts is WORLD_SIZE/num_processes_per_host) - mlpstorage checkpointing run --model llama3-405b \ - --hosts ip1 ip2 .... \ - --num-processes 512 \ - --num-checkpoints-read 0 \ - --checkpoint-folder ./checkpoint_data1 \ - --results-dir ./mlpstorage_results \ - --client-host-memory-in-gb 64 - - # Clear the cache (This might require admin access to the system) - ... - - # perform checkpoint reads - mlpstorage checkpointing run --model llama3-405b \ - --hosts ip1 ip2 .... \ - --num-processes 512 \ - --num-checkpoints-write 0 \ - --checkpoint-folder ./checkpoint_data1 \ - --results-dir ./mlpstorage_results \ - --client-host-memory-in-gb 64 - ``` -* ``subset`` mode (on a single host with **8 simulated accelerators**) - ```bash - # Perform checkpoint writes (data parallelism must match Table 2) - mlpstorage checkpointing run --model llama3-405b \ - --hosts ip1 \ - --num-processes 8 \ - --num-checkpoints-read 0 \ - --checkpoint-folder ./checkpoint_data1 \ - --results-dir ./mlpstorage_results \ - --client-host-memory-in-gb 64 - # Clear the cache - ... - # Perform checkpoint read (data parallelism must match Table 2) - mlpstorage checkpointing run --model llama3-405b \ - --hosts ip1 \ - --num-processes 8 \ - --num-checkpoints-write 0 \ - --checkpoint-folder ./checkpoint_data1 \ - --results-dir ./mlpstorage_results \ - --client-host-memory-in-gb 64 - ``` - -#### 2.2.3 Metrics and Results Reporting -We report the checkpoint time per write / read and I/O throughput from each rank. For each run: - - * The metric for duration is the maximum time across all processes. - * The metric for throughput is the minimum across all processes. - -A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes as outlined in section 2.2.5 (clearing caches, storage remapping, etc) - -#### 2.2.4 Requirements for Simultaneously Readable and Writable - -Checkpoint recovery is intended to mimic an environment where a failure has occurred and the data needs to be read by different hosts than wrote the data. - -For storage systems where all hosts can read and write all data simultaneously, the process described above satisfies the requirements. - -For storage systems where 1 host has write access to a volume but all hosts have read access, the above process also satisfies the requirements so long as reads can be fulfilled immediately following a write. 
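The reduction of per-rank measurements into the per-run metrics of Section 2.2.3, and the handling of any delay between write completion and readability described in the rest of this section, can be illustrated with the following minimal sketch. It is not part of the benchmark code; the function name and the example numbers are hypothetical.

```python
# Illustrative only -- not part of mlpstorage or DLIO.
# Section 2.2.3: per-run duration is the maximum across all processes,
# per-run throughput is the minimum across all processes.
def reduce_checkpoint_run(durations_s, throughputs_gibps, availability_delay_s=0.0):
    """durations_s / throughputs_gibps hold one value per MPI process.

    availability_delay_s is the reported time between checkpoint-write
    completion and the earliest moment another host could read that
    checkpoint; it is 0 for write runs and for systems that support
    simultaneous reads and writes, and it is added to the recovery time.
    """
    run_duration_s = max(durations_s) + availability_delay_s
    run_throughput_gibps = min(throughputs_gibps)
    return run_duration_s, run_throughput_gibps

# Hypothetical 8-process read/recovery run with a 2-second remap delay:
duration, throughput = reduce_checkpoint_run(
    durations_s=[41.2, 40.8, 42.0, 41.5, 40.9, 41.1, 41.7, 41.3],
    throughputs_gibps=[2.55, 2.61, 2.48, 2.52, 2.60, 2.58, 2.50, 2.54],
    availability_delay_s=2.0)
```

The submitted metrics are then the averages of these per-run values across the 10 checkpoint writes and the 10 checkpoint reads.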
- -For storage systems where 1 host has write access to a volume and a "remapping" process is required for other hosts to read the same data, the time to remap must be measured and included in the submission. - -When a checkpoint is taken/written, it must be written to stable storage, but that checkpoint does not need to be readable by other other hosts yet. If it is not readable by other hosts immediately after the checkpoint write is complete, if it requires some additional processing or reconfiguration before the checkpoint is readable by other hosts, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different ``host node`` must be reported in the SystemDescription.yaml file. That duration between write completion and availability for reading will be added to the time to read/recover from the benchmark. - -**Any processes between the write and read phases of checkpointing that are required before data can be read by a different host than wrote the data must be measured and included in the submission. The time for these processes will be added to the recovery time and throughput calculation for submitted scores** - -The system_configuration.yaml document must list whether the solution support simultaneous reads and/or writes as such: -```yaml -System: - shared_capabilities: - multi_host_support: True # False is used for local storage - simultaneous_write_support: False # Are simultaneous writes by multiple hosts supported in the submitted configuration - simultaneous_read__support: True # Are simultaneous reads by multiple hosts supported in the submitted configuration -``` - -#### 2.2.5 OPEN vs CLOSED submissions -For CLOSED submissions, the total number of processes must be fixed according to Table 2. - -For OPEN submissions, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. - -**Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions** - -| Parameter | Meaning | Default value | Changeable in CLOSED | Changeable in OPEN | -|------------------------------------|----------------------------------------------|-----------------------------------------------|----------------------|--------------------| -| --ppn **(USE HOST:SLOTS INSTEAD)** | Number of processes per node | N/A | YES (minimal 4) | YES (minimal 4) | -| --num-processes | Total number of processes | Node local: 8
Global: the value in Table 1 | NO | YES | -| --checkpoint-folder | The folder to save the checkpoint data | checkpoint/{workload} | YES | YES | -| --num-checkpoints-write | Number of write checkpoints | 10 or 0** | NO | NO | -| --num-checkpoints-read | Number of write checkpoints | 10 or 0** | NO | NO | - -**The ``--ppn`` syntax above was incorrect for the MPI package the benchmark uses, please use the syntax ``hostname:slotcount`` for the hosts listed in the ``--hosts`` argument. The ``slotcount`` value has the same meaning as the ``ppn`` value, the number of processes per node to run.** - -** By default, --num-checkpoints-read and --num-checkpoints-write are set to be 10. To perform write only, one has to turn off read by explicitly setting ``--num-checkpoints-read=0``; to perform read only, one has to turn off write by explicitly set ``--num-checkpoints-write=0`` - -For an OPEN or CLOSED submission, the process must follow: -1. Write 10 checkpoints -2. Clearing Caches or Remapping Volumes if required -3. Read 10 checkpoint - -DLIO and mlpstorage both support options to run 10 checkpoints with a single call or run 10 checkpoints as separate invokations of the tools. So long as the process is followed, checkpoints can be executed as a 10 checkpoint batch or individually. - -### 2.3 Vector Database - -## 3 Definitions -The following definitions are used throughout this document: - -- A **sample** is the unit of data on which training is run, e.g., an image, or a sentence. -- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator. -- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better. -- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational. -- A **division** is a set of rules for implementing benchmarks from a suite to produce a class of comparable results. MLPerf Storage allows CLOSED and OPEN divisions, detailed in Section 6. -- **DLIO ([code link](https://github.com/argonne-lcf/dlio_benchmark), [paper link](https://ieeexplore.ieee.org/document/9499416))** is a benchmarking tool for deep learning applications. DLIO is the core of the MLPerf Storage benchmark and with specified configurations will emulate the I/O pattern for the workloads listed in Table 1. MLPerf Storage provides wrapper scripts to launch DLIO. There is no need to know the internals of DLIO to do a CLOSED submission, as the wrapper scripts provided by MLPerf Storage will suffice. However, for OPEN submissions changes to the DLIO code might be required (e.g., to add custom data loaders). -- **Dataset content** refers to the data and the total capacity of the data, not the format of how the data is stored. Specific information on dataset content can be found [here](https://github.com/mlcommons/storage/tree/main/storage-conf/workload). -- **Dataset format** refers to the format in which the training data is stored (e.g., npz, hdf5, csv, png, tfrecord, etc.), not the content or total capacity of the dataset. 
- - *NOTE: we plan to add support for Object storage in a future version of the benchmark, so OPEN submissions that include benchmark application changes and a description of how the original MLPerf Training benchmark dataset was mapped into Objects will be appreciated.* -- A **storage system** consists of a defined set of hardware and software resources that provide storage services to one or more ``host nodes``. Storage systems can be hardware based, software-defined, virtualized, hyperconverged, or cloud based, and must be capable of providing the minimum storage services required to run the benchmark. If the storage system requires a dedicated network, then the hardware required for that network must be included in the ``storage system``. If the storage system is hyperconverged, then it will probably share hardware (eg: CPU and/or networking) with the ``host nodes``. -- A **storage scaling unit** is defined as the minimum unit by which the performance and scale of a storage system can be increased. Examples of storage scaling units are “nodes”, “controllers”, “virtual machines” or “shelves”. Benchmark runs with different numbers of storage scaling units allow a reviewer to evaluate how well a given storage solution is able to scale as more scaling units are added. -- A **host node** is defined as the minimum unit by which the load upon the storage system under test can be increased. Every ``host node`` must run the same number of simulated accelerators. A ``host node`` can be instantiated by running the MLPerf Storage benchmark code within a Container or within a VM guest image or natively within an entire physical system. The number of Containers or VM guest images per physical system and the CPU resources per ``host node`` is up to the submitter. Note that the maximum DRAM available to any ``host node`` must be used when calculating the dataset size to be generated for the test. -- An **ML framework** is a specific version of a software library or set of related libraries for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow. -- A **benchmark** is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level. -- A **reference implementation** is a specific implementation of a benchmark provided by the MLPerf organization. -- A **benchmark implementation** is an implementation of a benchmark in a particular framework by a user under the rules of a specific division. -- A **run** is a complete execution of a benchmark implementation on a system. -- A **benchmark result** is the mean of 5 run results, executed consecutively. The dataset is generated only once for the 5 runs, prior to those runs. The 5 runs must be done on the same machine(s). -- **Nameplate rated power** is defined as the maximum power capacity that can be provided by a power supply unit (PSU), as declared to a certification authority. The nameplate rated power can typically be obtained from the PSU datasheet. -- A **Power Supply Unit (PSU)** is a component which converts an AC or DC voltage input to one or more DC voltage outputs for the purpose of powering a system or subsystem. Power supply units may be redundant and hot swappable. 
-- **SPEC PTDaemon® Interface (PTDaemon®)** is a software component created by the Standard Performance Evaluation Corporation (SPEC) designed to simplify the measurement of power consumption by abstracting the interface between benchmarking software and supported power analyzers. -- A **Supported power analyzer** is a test device supported by the PTDaemon® software that measures the instantaneous voltage and multiplies it by the instantaneous current, then accumulates these values over a specific time period to provide a cumulative measurement of consumed electrical power. For a listing of supported power analyzers, see https://www.spec.org/power/docs/SPECpower-Device_List.html -- A **System Under Test (SUT)** is the storage system being benchmarked. - - -- The storage system under test must be described via one of the following **storage system access types**. The overall solution might support more than one of the below types, but any given benchmark submission must be described by the access type that was actually used during that submission. An optional vendor-specified qualifier may be specified. This will be displayed in the results table after the storage system access type, for example, “NAS - RDMA”. - - **Direct-attached media** – any solution using local media on the ``host node``(s); eg: NVMe-attached storage with a local filesystem layered over it. This will be abbreviated “**Local**” in the results table. - - **Remotely-attached block device** – any solution using remote block storage; eg: a SAN using FibreChannel, iSCSI, NVMeoF, NVMeoF over RDMA, etc, with a local filesystem implementation layered over it. This will be abbreviated “**Remote Block**” in the results table. - - **Shared filesystem using a standards-defined access protocol** – any solution using a version of standard NFS or CIFS/SMB to access storage. This will be abbreviated “**NAS**” in the results table. - - **Shared filesystem using a proprietary access protocol** – any network-shared filesystem solution that requires a unique/proprietary protocol implementation to be installed on the ``host node``(s) to access storage; eg: an HPC parallel filesystem. This will be abbreviated “**Proprietary**” in the results table. - - **Object** – any solution accessed using an object protocol such as S3, RADOS, etc. This will be abbreviated “**Object**” in the results table. - - **Other** – any solution whose access is not sufficiently described by the above categories. This will be abbreviated “**Other**” in the results table. - -## 4. Performance Metrics - -The metrics reported by the benchmark are different for different types of workloads. They are broken out below. - -### 4.1. Training Workloads - -The benchmark performance metric for Training workloads (3D-Unet, ResNet-50, and Cosmflow) is **samples per second, subject to a minimum accelerator utilization (AU) defined for that workload**. Higher samples per second is better. - -To pass a benchmark run, the AU should be equal to or greater than the minimum value, and is computed as follows: -``` -AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100 -``` - -All the I/O operations from the first **step** are excluded from the AU calculation in order to avoid the disturbance in the averages caused by the startup costs of the data processing pipeline, allowing the AU to more-quickly converge on the steady-state performance of the pipeline. 
The I/O operations that are excluded from the AU calculation **are** included in the samples/second reported by the benchmark, however. - -If all I/O operations are hidden by compute time, then the `total_compute_time` will equal the `total_benchmark_running_time` and the AU will be 100%. - -The total compute time can be derived from the batch size, total dataset size, number of simulated accelerators, and sleep time: -``` -total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs. -``` - -*NOTE: The sleep time has been determined by running the actual MLPerf training workloads including the compute step on real hardware and is dependent on the accelerator type. In this version of the benchmark we include sleep times for **NVIDIA A100 and H100 GPUs**. We plan on expanding the measurements to different accelerator types in future releases.* - -### 4.2. Checkpoint Workloads - -The benchmark performance metrics for Checkpoint workloads (write/take, and read/recover) are **bandwidth while writing, and bandwidth while reading**, plus an additional data point which is the amount of time required, if any, between the completion of writing a checkpoint and the first point at which that checkpoint can be read from a different ``host node``. That duration between write completeion and availability for reading will be added to the time to read/recover from the benchmark. - -**Submitters do not need to use hardware accelerators (e.g., GPUs, TPUs, and other ASICs) when running MLPerf Storage - Checkpointing.** - -## 5. Benchmark Code - -The MLPerf Storage working group provides a benchmark implementation which includes: -- Scripts to determine the minimum dataset size required for your system, for a given benchmark. -- Scripts for data generation. -- Benchmark tool, based on DLIO, with configuration files for the benchmarks. -- A script for running the benchmark on one host (additional setup is required if you are running a distributed training benchmark – see Section 5). -- A script for generating the results report (additional scripting and setup may be required if you are running a distributed training benchmark – see Section 5), and potentially additional supporting scripts. - -More details on installation and running the benchmark can be found in the [Github repo](https://github.com/mlcommons/storage) - -## 6. General Rules - -The following apply to all results submitted for this benchmark. - -### 6.1. Strive to be fair - -Benchmarking should be conducted to measure the framework and storage system performance as fairly as possible. Ethics and reputation matter. - -### 6.2. System and framework must be available - -- **Available Systems**. To be called an ``available system`` all components of the system must be publicly available. If any components of the system are not available at the time of the benchmark results submission, those components must be included in an ``available system`` submission that is submitted in the next round of MLPerf Storage benchmark submissions. Otherwise, the results for that submission may be retracted from the MLCommons results dashboard. -- **RDI Systems**. If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication by MLCommons. This class of systems will be called RDI (research, development, internal). 
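For reference, the AU pass/fail check defined in Section 4.1 can be sketched as follows. This is illustrative only (DLIO performs the calculation internally), and the example numbers are hypothetical.

```python
# Illustrative only -- DLIO computes AU internally.
# Variable names follow the formulas in Section 4.1.
def au_percentage(records_per_file, total_files, simulated_accelerators,
                  batch_size, computation_time, epochs,
                  total_benchmark_running_time):
    # Emulated compute time: batches per accelerator per epoch, times the
    # per-batch sleep() time, over all epochs.
    total_compute_time = ((records_per_file * total_files)
                          / simulated_accelerators / batch_size
                          * computation_time * epochs)
    return 100.0 * total_compute_time / total_benchmark_running_time

# Hypothetical run; it passes a 90% minimum AU workload if au >= 90.0.
au = au_percentage(records_per_file=1, total_files=42000,
                   simulated_accelerators=16, batch_size=7,
                   computation_time=1.36, epochs=5,
                   total_benchmark_running_time=2800.0)
passed = au >= 90.0
```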
- -### 6.3 Non-determinism -The data generator in DLIO uses a fixed random seed that must not be changed, to ensure that all submissions are working with the same dataset. Random number generators may be seeded from the following sources: -- Clock -- System source of randomness, e.g. /dev/random or /dev/urandom -- Another random number generator initialized with an allowed seed -Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads. - -The storage system must not be informed of the random seed or the source of randomness. This is intended to disallow submissions where the storage systen can predict the access pattern of the data samples. - -### 6.4. Result rounding -Public results should be rounded normally, to two decimal places. - -### 6.5. Stable storage must be used - -For all workloads stable storage must be used, but there are some differences in the specifics. - -#### 6.5.1. Training Workloads - -The MLPerf Storage benchmark will create the dataset on the storage system, in the desired ``dataset format``, before the start of the benchmark run. The data must reside on stable storage before the actual benchmark testing can run. - -#### 6.5.2. Checkpoint Workloads - -See section "2.2.3 Metrics and Results Reporting" for more details. - -### 6.6. Caching -Caching of training data on ``host nodes`` running MLPerf Storage is controlled via a warm up run, dataset size to memory ratios, and changing random seeds between runs. -1. All runs must use a warm-up run before the 5 test runs. -2. For Training benchmarks, the dataset size must be at least 5x larger than the sum of memory across all of the MLPerf Storage nodes -3. The random seed must change for each run as controlled by the benchmark.py script - -### 6.7. Replicability is mandatory -Results that cannot be replicated are not valid results. Replicated results should be within 5% within 5 tries. - -### 6.8 Consecutive Runs Requirement -Each of the benchmarks described in this document have a requirement for multiple runs. This is to ensure consistency of operation of the system under test as well as ensure statistical significance of the measurements. - -Unless otherwise noted, the multiple runs for a workload need to be run consecutively. To ensure this requirement is met, the time between runs (from the stop time of one run and the start time to the next run) needs to be less than the time to execute a single run. This is to discourage cherry-picking of results which is expressly forbidden and against the spirit of the rules. - -## 7. Dataset Generation - -This section only describes the dataset generation methodology and requirements for Training workloads, the equivalent topic is covered in section 2.2, Checkpointing. - -MLPerf Storage uses DLIO to generate synthetic data. Instructions on how to generate the datasets for each benchmark are available [here](https://github.com/mlcommons/storage). The datasets are generated following the sample size distribution and structure of the dataset seeds (see Table 1) for each of the benchmarks. - -**Minimum dataset size**. The MLPerf Storage benchmark script **must be used** to run the benchmarks since it calculates the minimum dataset size for each benchmark. It does so using the provided number of simulated accelerators and the size of all of the ``host node``’s memory in GB. 
The minimum dataset size computation is as follows: - -- Calculate required minimum samples given number of steps per epoch *(NB: num_steps_per_epoch is a minimum of 500)*: -``` - min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes -``` -- Calculate required minimum samples given host memory to eliminate client-side caching effects; *(NB: HOST_MEMORY_MULTIPLIER = 5)*: -``` - min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length -``` -- Ensure we meet both constraints: -``` - min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes) -``` -- Calculate minimum files to generate -``` - min_total_files= min_samples / num_samples_per_file - min_files_size = min_samples * record_length / 1024 / 1024 / 1024 -``` - -A minimum of ``min_total_files`` files are required which will consume ``min_files_size`` GB of storage. - -**Running the benchmark on a subset of a larger dataset**. We support running the benchmark on a subset of the synthetically generated dataset. One can generate a large dataset and then run the benchmark on a subset of that dataset by setting ``num_files_train`` or ``num_files_eval`` smaller than the number of files available in the dataset folder. Note that if the dataset is stored in multiple subfolders, the subset actually used by this run will be evenly selected from all the subfolders. In this case, ``num_subfolders_train`` and ``num_subfolders_eval`` need to be equal to the actual number of subfolders inside the dataset folder in order to generate valid results. - -Please note that the log file(s) output during the generation step needs to be included in the benchmark results submission package. - -## 8. Single-host Submissions - -This section only applies to Training workloads, the equivalent topic is covered in section 2.2.2, "subset mode". - -Submitters can add load to the storage system in two orthogonal ways: (1) increase the number of simulated accelerators inside one ``host node`` (i.e., one machine), and/or (2) increase the number of ``host nodes`` connected to the storage system. - -For single-host submissions, increase the number of simulated accelerators by changing the ``--num-accelerators`` parameter to the ``benchmark.sh script``. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. - -For **single-host submissions**, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. - -## 9. Distributed Training Submissions - -This setup simulates distributed training of a single training task, spread across multiple ``host nodes``, on a shared dataset. The current version of the benchmark only supports data parallelism, not model parallelism. - -Submitters must respect the following for multi-host node submissions: -- All the data must be accessible to all the ``host nodes``. -- The number of simulated accelerators in each ``host node`` must be identical. - -While it is recommended that all ``host nodes`` be as close as possible to identical, that is not required by these Rules. 
The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all ``host nodes`` results in the overall performance of the cluster being determined by the slowest ``host node``. - -Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. - -For **distributed training submissions**, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators across all ``host nodes`` that can be run in the distributed training setup, without going below the 90% accelerator utilization threshold. Each ``host node`` must run the same number of simulated accelerators for the submission to be valid. - -## 10. CLOSED and OPEN Divisions - -### 10.1 CLOSED: virtually all changes are disallowed -CLOSED represents a level playing field where all results are **comparable** across submissions. CLOSED explicitly forfeits flexibility in order to enable easy comparability. - -In order to accomplish that, most of the optimizations and customizations to the AI/ML algorithms and framework that might typically be applied during benchmarking or even during production use must be disallowed. Optimizations and customizations to the storage system are allowed in CLOSED. - -For CLOSED submissions of this benchmark, the MLPerf Storage codebase takes the place of the AI/ML algorithms and framework, and therefore cannot be changed. The sole exception to this rule is if the submitter decides to apply the code change identified in PR#299 of the DLIO repo in github, the resulting codebase will be considered "unchanged" for the purposes of this rule. - -A small number of parameters can be configured in CLOSED submissions; listed in the tables below. - -**Table: Training Workload Tunable Parameters for CLOSED** - -| Parameter | Description | Default | -|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|----------| -| *Dataset parameters* | | | -| dataset.num_files_train | Number of files for the training set | -- | -| dataset.num_subfolders_train | Number of subfolders that the training set is stored | 0 | -| dataset.data_folder | The path where dataset is stored | -- | -| | | | -| *Reader parameters* | | | -| reader.read_threads | Number of threads to load the data | -- | -| reader.computation_threads | Number of threads to preprocess the data (only for resnet) | -- | -| reader.transfer_size | An int64 scalar representing the number of bytes in the read buffer. (only supported for Tensorflow models -- Resnet and Cosmoflow) | | -| reader.prefetch_size | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2. 
| | -| reader.odirect | Enable ODIRECT mode for Unet3D Training | False | -| | | | -| *Checkpoint parameters* | | | -| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- | -| | | | -| *Storage parameters* | | | -| storage.storage_root | The storage root directory | ./ | -| storage.storage_type | The storage type | local_fs | - -**Table: Checkpoint Workload Tunable Parameters for CLOSED** - -| Parameter | Description | Default | -|----------------------------------|-------------------------------------------------------------|-----------------------| -| checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ | -| checkpoint.num_checkpoints_write | The number of checkpoint writes to do in a single dlio call | 10 | -| checkpoint.num_checkpoints_read | The number of checkpoint reads to do in a single dlio call | 10 | - - -CLOSED division benchmarks must be referred to using the benchmark name plus the term CLOSED, e.g. “The system was able to support *N ACME X100* accelerators running a CLOSED division 3D U-Net workload at only 8% less than optimal performance.” - -### 10.2 OPEN: changes are allowed but must be disclosed - -OPEN allows more **flexibility** to tune and change both the benchmark and the storage system configuration to show off new approaches or new features that will benefit the AI/ML Community. OPEN explicitly forfeits comparability to allow showcasing innovation. - -The essence of OPEN division results is that for a given benchmark area, they are “best case” results if optimizations and customizations are allowed. The submitter has the opportunity to show the performance of the storage system if an arbitrary, but documented, set of changes are made to the data storage environment or algorithms. - -Changes to DLIO itself are allowed in OPEN division submissions. Any changes to DLIO code or command line options must be disclosed. - -While changes to DLIO are allowed, changing the workload itself is not. Ie: how the workload is processed can be changed, but those changes cannot fundamentally change the purpose and result of the training. For example, changing the workload imposed upon storage by a ResNet-50 training task into 3D-Unet training task is not allowed. - -In addition to what can be changed in the CLOSED submission, the following parameters can be changed in the benchmark.sh script: - -| Parameter | Description | Default | -|------------------------------|--------------------------------------------|---------------------------------------------------------------------| -| framework | The machine learning framework. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -| | | | -| *Dataset parameters* | | | -| dataset.format | Format of the dataset. | 3D U-Net: .npz
ResNet-50: .tfrecord
Cosmoflow: .tfrecord | -| dataset.num_samples_per_file | Number of samples stored in each dataset file | 3D U-Net: 1
ResNet-50: 1251
Cosmoflow: 1 | -| | | | -| *Reader parameters* | | | -| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | - - -#### 10.2.1 OPEN: num_samples_per_file -Changing this parameter is supported only with Tensorflow, using tfrecord datasets. Currently, the benchmark code only supports num_samples_per_file = 1 for Pytorch data loader. To support other values, the data loader needs to be adjusted. - -#### 10.2.2 OPEN: data_loader -OPEN submissions can have custom data loaders. If a new data loader is added, or an existing data loader is changed, the DLIO code will need to be modified. - -#### 10.2.3 Execution of OPEN submissions -OPEN division benchmarks must be referred to using the benchmark name plus the term OPEN, e.g. “The system was able to support N ACME X100 accelerators running an OPEN division 3D U-Net workload at only 8% less than optimal performance.” - -## 11. Submission - -11.1. - -A successful run result consists of a directory tree structure containing the set of files produced by the benchmark as the result, plus the manually created SystemDescription files (both PDF and yaml) that describe the storage solution under test and the environment the test was run in. - -The whole package must be uploaded to MLCommons via the UI provided to submitters. - -It will be possible to upload your results many times, not just once, but each upload completely replaces the prior upload before the submission deadline. - -At least your final upload, if not all of them, should include all of the individual result submissions that you want to be included. Eg: if you want to submit results for A100 and H100, that would be two submissions but only one upload operation. - -The following is not a requirement of these rules, but a possibly valuable risk management strategy. Consider uploading whatever results you have every day or two. Each new upload replaces the last one. If some disaster happened and you were not able to continue tuning your submission, you would at least have the prior submission package available as a backup. - -### 11.1 What to submit - CLOSED submissions - -A complete submission for one workload (3D-Unet, ResNet, or Cosmoflow) contains 3 folders: -1. **results** folder, containing, for each system: - - The entire output folder generated by running MLPerf Storage. - - Final submission JSON summary files ``results.json``. The JSON file must be generated using the ``mlpstorage reportgen`` script. The ``mlpstorage reportgen`` command must be run on the rank0 machine in order to collect the correct set of files for the submission. - - The logs from the benchmark runs, but only from the rank0 systems not all of the systems. - - The logs from the dataset generation step that built the files that this benchmark run read from. -2. **systems** folder, containing: - - ``.yaml`` - - ``.pdf`` - - For system naming examples look [here](https://github.com/mlcommons/storage_results_v0.5/tree/main/closed) in the ``results/closed`` subdirectory below each submitter's directory. -3. **code** folder, containing: - - Source code of the benchmark implementation. The submission source code and logs must be made available to other submitters for auditing purposes during the review period. - -### 11.2 What to submit - OPEN submissions - -- Everything that is required for a CLOSED submission, following the same structure. -- Additionally, the source code used for the OPEN Submission benchmark implementations must be available under a license that permits MLCommon to use the implementation for benchmarking. 
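A quick pre-upload sanity check of the three required folders, and of the paired .yaml/.pdf system description files, could look like the sketch below. It is illustrative only, not the official validation tooling; the function name and the example path are hypothetical.

```python
# Illustrative pre-upload sanity check; not the official validation tool.
from pathlib import Path

REQUIRED_FOLDERS = ("results", "systems", "code")

def check_submission(workload_dir: str) -> list:
    """Reports obvious packaging problems for one CLOSED or OPEN submission."""
    root = Path(workload_dir)
    problems = [f"missing required folder: {name}"
                for name in REQUIRED_FOLDERS if not (root / name).is_dir()]
    systems = root / "systems"
    if systems.is_dir():
        # Each system name needs both a <name>.yaml and a <name>.pdf file.
        for yaml_file in systems.glob("*.yaml"):
            if not yaml_file.with_suffix(".pdf").exists():
                problems.append(f"no .pdf description for system '{yaml_file.stem}'")
    return problems

print(check_submission("./my_submission"))
```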
- -### 11.3 Directory Structure for CLOSED or OPEN Submissions +The *submission validation checker* should check that the tested directory hierarachy matches the below requirements and output messages for all cases where it does not match. +The tool should make it's best effort to continue testing all the other aspects of the directory hierarchy after any given failure. +If the tested directory hierarchy does not meet all of the below requirements, then it should be labelled as invalid and the validation check should fail. -The output directory hierarchy and the files that populate it should be automatically created and filled in by the `mplstorage` command, -but it is documented here to ensure that the `mlpstorage` command the the submission validation checker command are operating upon a single definition for that structure. +Even if the structure of a submission package matches the spec, the options that were used to run the benchmark may not fall within acceptable bounds, +so we need the *submission validation checker* to check for illegal/inapproriate option settings, +and for semantic mismatches between different options that were used. -The submission validation checker should check that the tested directory hierarachy matches the below requirements and output messages for all cases where it does not match. -The tool should make it's best effort to continue testing all the other aspects of the directory hierarchy after any given failure. -If the tested directory hierarchy does not meet all of the below requirements, then it should be labelled as invalid and tghe validation check should fail. +### 2. Directory Structure for All Submissions -**11.3.1.** The submission structure must start from a single directory whose name is the name of the submitter. +**2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, possibly including blanks. -**11.3.2.** Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. +**2.2.** Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. -**11.3.3.** The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. +**2.3.** The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. -**11.3.4.** Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). +**2.4.** Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). -**11.3.5.** Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. +**2.5.** Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. -**11.3.6.** The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. 
+**2.6.** The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. -**11.3.7.** The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". +**2.7.** The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. -**11.3.8.** The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". +**2.8.** The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. -**11.3.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. +**2.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. -**11.3.10.** Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. +**2.10.** Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. -**11.3.11.** Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. +**2.11.** Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. -**11.3.12.** Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. 
+**2.12.** Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. -**11.3.13.** Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**2.13.** Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**11.3.14.** Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**2.14.** Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -**11.3.15.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**2.15.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -**11.3.16.** Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +**2.16.** Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -**11.3.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**2.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. 
Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**11.3.18.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**2.18.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -**11.3.19.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**2.19.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -**11.3.20.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. +**2.20.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. -**11.3.21.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +**2.21.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -**11.3.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**2.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**11.3.23.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". 
These names are case-sensitive. +**2.23.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -**11.3.24.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**2.24.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -**11.3.25.** Pictorially, here is what this looks like: +**2.25.** Pictorially, here is what this looks like: ``` root_folder (or any name you prefer) ├── Closed @@ -798,7 +226,7 @@ root_folder (or any name you prefer) ├──system-name-2.yaml └──system-name-2.pdf ``` -**11.3.26.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: +**2.26.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: ``` └── YYYYMMDD_HHmmss ├── [training|checkpointing]_[datagen|run].stdout.log @@ -811,108 +239,68 @@ root_folder (or any name you prefer) └── overrides.yaml ``` -### 11.4 System Description +### 3. Sanity Checking the Training Options + +dfg + +#### 3.1. CLOSED Versus OPEN Options + +dfg + +#### 3.2. Dataset Generation Options -The purpose of the two system description files is to provide sufficient detail on the storage system under test, and the ``host nodes`` running the test, plus the network connecting them, to enable full reproduction of the benchmark results by a third party. +dfh -The *SystemDescription.yaml* file is a machine-readable file providing additional detail on the system, while the *SystemDescription.pdf* complements that with diagrams and human-readable text. +#### 3.3. Benchmark Run Options -**11.4.1.** The *SystemDescription.yaml* file must be validated by a tool that will compare it's internal YAML structure to that of a schema, and output messages describing how that file does not match the schema. -If any schema violations are found, then validation checker should continue looking for more mistakes but should overall fail the validation check. +dfg -**Cover page** +### 4. Sanity Checking the Checkpointing Options -The following information is required to be included in the system description PDF: +dgh -- System name of the submission -- Submitter name -- Submission date -- Version of the benchmark -- Solution type of the submission -- Submission division (OPEN or CLOSED) -- Power Requirements -- System Topology +#### 4.1. CLOSED Versus OPEN Options -**Mandatory Power requirements** +dgh -Systems that require customer provisioning of power (for example, systems intended to be deployed in on-premises data centers or in co-located data centers) shall include a “Power Requirements Table”. Systems designed to only run in a cloud or hyper-converged environment do not have to include this table. +#### 4.2. Benchmark Run Options -The power requirements table shall list all hardware devices required to operate the storage system. Shared network equipment also used for client network communication and optional storage management systems do not need to be included. 
The power requirements table shall include: +For OPEN submissions, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. + +**Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions** + +| Parameter | Meaning | Default value | Changeable in CLOSED | Changeable in OPEN | +|------------------------------------|----------------------------------------------|-----------------------------------------------|----------------------|--------------------| +| --ppn hostname:slotcount | Number of processes per node | N/A | YES (minimal 4) | YES (minimal 4) | +| --num-processes | Total number of processes | Node local: 8
Global: the value in Table 1 | NO | YES | +| --checkpoint-folder | The folder to save the checkpoint data | checkpoint/{workload} | YES | YES | +| --num-checkpoints-write | Number of write checkpoints | 10 or 0** | NO | NO | +| --num-checkpoints-read | Number of write checkpoints | 10 or 0** | NO | NO | + +**In the ``--ppn`` syntax above, the ``slotcount`` value has the same meaning as the ``ppn`` value, the number of processes per node to run.** -1. Every component in the system that requires electrical power. -2. For each component, every PSU for each system component. -3. For each PSU, the PSU nameplate rated power. -4. For each PSU (or redundant groups of PSUs0, the design power. +** By default, --num-checkpoints-read and --num-checkpoints-write are set to be 10. To perform write only, one has to turn off read by explicitly setting ``--num-checkpoints-read=0``; to perform read only, one has to turn off write by explicitly set ``--num-checkpoints-write=0`` -Two examples of a power requirements tables are shown below: -**Power Requirements Table** (Large system example) -| System component | Power supply unit | Nameplate rated power | Design power | -| -------------------- | ----------------- | --------------------- | -------------- | -| Storage controller 1 | Power supply 1 | 1200 watts | 3600 watts | -| | Power supply 2 | 1200 watts | | -| | Power supply 3 | 1200 watts | | -| | Power supply 4 | 1200 watts | | -| Storage shelf 1 | Power supply 1 | 1000 watts | 1000 watts | -| | Power supply 2 | 1000 watts | | -| Network switch 1 | Power supply 1 | 1200 watts | 1200 watts | -| | Power supply 2 | 1200 watts | | -| **Totals** | | **9200 watts** | **5800 watts** | -**Power Requirements Table** (Direct-attached media system example) -| System component | Power supply unit | Nameplate rated power | Design power | -| -------------------- | ----------------- | --------------------- | -------------- | -| NVMe SSD 1 | 12VDC supply | 10 watts | 10 watts | -| | 3.3VDC supply | 2 watts | 2 watts | -| **Totals** | | **12 watts** | **12 watts** | -System component and power supply unit names in the above tables are examples. Consistent names should be used in bill-of-material documentation, system diagrams and descriptive text. -**System Topology** -The system topology needs to show logical connections between the nodes and network devices listed in the system-description.yaml. The simplest form is made up of squares and lines with a square for each node and a line for each connection between the nodes. Every node listed in the system-description.yaml needs to have a representative visual in the topology diagram. For large deployments (larger than 4 nodes), use an appropriate scaling notation. For example, in a solution of 16 identical client nodes, show squares for the first and last nodes (with node names and numbers in the nodes) separated by "...". -**Mandatory Rack Units Requirements** -If the system requires the physical deployment of dedicated hardware, ie: is not a cloud-based deployment or a hyperconverged deployment, you will need to include the total number of rack units that will be consumed by the storage system under test in the SystemDescription file(s), plus any supporting gear that is required for the configuration being tested. That supporting gear could include, for example, network switches for a "backend" or private network that is required for the storage system to operate. 
The rack units measure does not need to include any of the gear that connects the storage system to the ``host nodes``. -**Optional information** -The following *recommended* structure of systems.pdf provides a starting point for additional optional information. Submitters are free to adjust this structure as they see fit. -If the submission is for a commercial system, a pdf of the product spec document can add significant value. If it is a system that does not have a spec document (e.g., a research system, HPC etc), or the product spec pdf doesn’t include all the required detail, the document can contain (all these are optional): -- Recommended: High-level system diagram e.g., showing the ``host node``(s), storage system main components, and network topology used when connecting everything (e.g., spine-and-leaf, butterfly, etc.), and any non-default configuration options that were set during the benchmark run. -- Optional: Additional text description of the system, if the information is not captured in the YAML, e.g., the storage system’s components (make and model, optional features, capabilities, etc) and all configuration settings that are relevant to ML/AI benchmarks. If the make/model doesn’t specify all the components of the hardware platform it is running on, eg: it’s an Software-Defined-Storage product, then those should be included here (just like the client component list). -- Optional: We recommended the following three categories for the text description: - 1. Software, - 2. Hardware, and - 3. Settings. -## 12. Review -### 12.1 Visibility of results and code during review -During the review process, only certain groups are allowed to inspect results and code. -| Group | Can Inspect | -| --- | --- | -| Review committee | All results, all code | -| Submitters | All results, all code | -| Public | No results, no code | -### 12.2 Filing objections -Submitters must officially file objections to other submitter’s code by creating a GitHub issue prior to the “Filing objections” deadline that cites the offending lines, the rules section violated, and, if pertinent, corresponding lines of the reference implementation that are not equivalent. Each submitter must file objections with a “by ” tag and a “against ” tag. Multiple organizations may append their “by ” to an existing objection if desired. If an objector comes to believe the objection is in error they may remove their “by ” tag. All objections with no “by ” tags at the end of the filing deadline will be closed. Submitters should file an objection, then discuss with the submitter to verify if the objection is correct. Following filing of an issue but before resolution, both objecting submitter and owning submitter may add comments to help the review committee understand the problem. If the owning submitter acknowledges the problem, they may append the “fix_required” tag and begin to fix the issue. -### 12.3 Resolving objections -The review committee will review each objection, and either establish consensus or vote. If the committee votes to support an objection, it will provide some basic guidance on an acceptable fix and append the “fix_required” tag. If the committee votes against an objection, it will close the issue. -### 12.4 Fixing objections -Code should be updated via a pull request prior to the “fixing objections” deadline. Following submission of all fixes, the objecting submitter should confirm that the objection has been addressed with the objector(s) and ask them to remove their “by tags. 
If the objector is not satisfied by the fix, then the review committee will decide the issue at its final review meeting. The review committee may vote to accept a fix and close the issue, or reject a fix and request the submission be moved to open or withdrawn. -### 12.5 Withdrawing results / changing division -Anytime up until the final human readable deadline (typically within 2-3 business days before the press call, so July 28th, 2025, in this case), an entry may be withdrawn by amending the pull request. Alternatively, an entry may be voluntarily moved from the closed division to the open division. Each benchmark results submission is treated separately for reporting in the results table and in terms of withdrawing it. For example, submitting a 3D-Unet run with 20 clients and 80 A100 accelerators is separate from submitting a 3D-Unet run with 19 clients and 76 accelerators. From b7350b2b6d57eac4186b44b7dc58d87ef2023ece Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 13:26:58 -0800 Subject: [PATCH 06/23] Update Rules.md --- Rules.md | 181 +++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 168 insertions(+), 13 deletions(-) diff --git a/Rules.md b/Rules.md index fe7ebd4a..f6c39b25 100644 --- a/Rules.md +++ b/Rules.md @@ -5,9 +5,14 @@ - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - [3. Sanity Checking the Training Options](#3-sanity-checking-the-training-options) + - [3.1. CLOSED Versus OPEN Options](#31-closed-versus-open-options) + - [3.2. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) + - [3.3. Benchmark Run Options](#33-benchmark-run-options) - [4. Sanity Checking the Checkpointing Options](#3-sanity-checking-the-checkpointing-options) + - [4.1. CLOSED Versus OPEN Options](#41-closed-versus-open-options) + - [4.2. Benchmark Run Options](#42-benchmark-run-options) -## 1. Introduction +# 1. Introduction These are the requirements for the *submission validation checker* for version 2.0 of the MLPerf™ Storage benchmark, but since the `mlpstorage` tool will be responsible for generating the vast majority (if not all) of the contents of a submission, it is also a spec for what `mlpstorage` should generate. @@ -20,7 +25,7 @@ Even if the structure of a submission package matches the spec, the options that so we need the *submission validation checker* to check for illegal/inapproriate option settings, and for semantic mismatches between different options that were used. -### 2. Directory Structure for All Submissions +# 2. Directory Structure for All Submissions **2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, possibly including blanks. @@ -40,10 +45,10 @@ Note that in both cases this must be the code that was actually run to generate Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. **2.8.** The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". 
-This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given +This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. -**2.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. These names are case-sensitive. +**2.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*, so we should compare the configuration parameters and hardware and software components to verify that they're the same across all the tests and runs within the given *system name* directory hierarchy, to the extent that we can. The *system names* are case-sensitive. **2.10.** Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. @@ -239,31 +244,155 @@ root_folder (or any name you prefer) └── overrides.yaml ``` -### 3. Sanity Checking the Training Options +# 3. Sanity Checking the Training Options dfg -#### 3.1. CLOSED Versus OPEN Options +## 3.1. CLOSED Versus OPEN Options dfg -#### 3.2. Dataset Generation Options +## 3.2. Dataset Generation Options -dfh +Minimum dataset size. The MLPerf Storage benchmark script must be used to run the benchmarks since it calculates the minimum dataset size for each benchmark. It does so using the provided number of simulated accelerators and the size of all of the host node’s memory in GB. The minimum dataset size computation is as follows: -#### 3.3. Benchmark Run Options +Calculate required minimum samples given number of steps per epoch (NB: num_steps_per_epoch is a minimum of 500): + min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes +Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): + min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length +Ensure we meet both constraints: + min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes) +Calculate minimum files to generate + min_total_files= min_samples / num_samples_per_file + min_files_size = min_samples * record_length / 1024 / 1024 / 1024 +A minimum of min_total_files files are required which will consume min_files_size GB of storage. -dfg +## 3.3. Benchmark Run Options + +The benchmark performance metric for Training workloads (3D-Unet, ResNet-50, and Cosmflow) is samples per second, subject to a minimum accelerator utilization (AU) defined for that workload. Higher samples per second is better. 
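As a concrete illustration of the minimum dataset size computation in 3.2 above, the short script below reproduces the arithmetic for one assumed configuration. All input values (accelerator count, host memory, record length, batch size) are example figures chosen for illustration only; they are not defaults required by these rules.

```python
# Illustrative check of the Section 3.2 minimum dataset size arithmetic.
# Every input value below is an assumed example, not a value mandated by the rules.
import math

num_steps_per_epoch = 500            # rule-defined minimum number of steps per epoch
batch_size = 4                        # assumed per-accelerator batch size
num_accelerators_across_all_nodes = 32
number_of_hosts = 2
memory_per_host_in_GB = 512
record_length = 146_600_628           # assumed average bytes per sample
num_samples_per_file = 1
HOST_MEMORY_MULTIPLIER = 5

# Constraint 1: enough samples to run the required number of steps per epoch.
min_samples_steps_per_epoch = (
    num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes
)

# Constraint 2: enough data to defeat client-side caching (5x aggregate host memory).
min_samples_host_memory_across_all_nodes = (
    number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER
    * 1024 * 1024 * 1024 / record_length
)

# Both constraints must be satisfied simultaneously.
min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)

# Convert samples into a file count and the capacity those files will consume.
min_total_files = math.ceil(min_samples / num_samples_per_file)
min_files_size_GB = min_samples * record_length / 1024 / 1024 / 1024

print(f"minimum files to generate: {min_total_files}")
print(f"approximate capacity consumed: {min_files_size_GB:.1f} GB")
```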
+ +To pass a benchmark run, the AU should be equal to or greater than the minimum value, and is computed as follows: + +AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100 +All the I/O operations from the first step are excluded from the AU calculation in order to avoid the disturbance in the averages caused by the startup costs of the data processing pipeline, allowing the AU to more-quickly converge on the steady-state performance of the pipeline. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however. + +If all I/O operations are hidden by compute time, then the total_compute_time will equal the total_benchmark_running_time and the AU will be 100%. + +The total compute time can be derived from the batch size, total dataset size, number of simulated accelerators, and sleep time: + +total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs. + + + + +8. Single-host Submissions +This section only applies to Training workloads, the equivalent topic is covered in section 2.2.2, "subset mode". + +Submitters can add load to the storage system in two orthogonal ways: (1) increase the number of simulated accelerators inside one host node (i.e., one machine), and/or (2) increase the number of host nodes connected to the storage system. + +For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. + +For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. + +9. Distributed Training Submissions +This setup simulates distributed training of a single training task, spread across multiple host nodes, on a shared dataset. The current version of the benchmark only supports data parallelism, not model parallelism. + +Submitters must respect the following for multi-host node submissions: + +All the data must be accessible to all the host nodes. +The number of simulated accelerators in each host node must be identical. +While it is recommended that all host nodes be as close as possible to identical, that is not required by these Rules. The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all host nodes results in the overall performance of the cluster being determined by the slowest host node. + +Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. + +For distributed training submissions, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators across all host nodes that can be run in the distributed training setup, without going below the 90% accelerator utilization threshold. 
Each host node must run the same number of simulated accelerators for the submission to be valid. + + + +For CLOSED submissions of this benchmark, the MLPerf Storage codebase takes the place of the AI/ML algorithms and framework, and therefore cannot be changed. The sole exception to this rule is if the submitter decides to apply the code change identified in PR#299 of the DLIO repo in github, the resulting codebase will be considered "unchanged" for the purposes of this rule. + +A small number of parameters can be configured in CLOSED submissions; listed in the tables below. -### 4. Sanity Checking the Checkpointing Options +**Table: Training Workload Tunable Parameters for CLOSED** + +| Parameter | Description | Default | +|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|----------| +| *Dataset parameters* | | | +| dataset.num_files_train | Number of files for the training set | -- | +| dataset.num_subfolders_train | Number of subfolders that the training set is stored | 0 | +| dataset.data_folder | The path where dataset is stored | -- | +| | | | +| *Reader parameters* | | | +| reader.read_threads | Number of threads to load the data | -- | +| reader.computation_threads | Number of threads to preprocess the data (only for resnet) | -- | +| reader.transfer_size | An int64 scalar representing the number of bytes in the read buffer. (only supported for Tensorflow models -- Resnet and Cosmoflow) | | +| reader.prefetch_size | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2. | | +| reader.odirect | Enable ODIRECT mode for Unet3D Training | False | +| | | | +| *Checkpoint parameters* | | | +| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- | +| | | | +| *Storage parameters* | | | +| storage.storage_root | The storage root directory | ./ | +| storage.storage_type | The storage type | local_fs | + +In addition to what can be changed in the CLOSED submission, the following parameters can be changed in the benchmark.sh script: + +| Parameter | Description | Default | +|------------------------------|--------------------------------------------|---------------------------------------------------------------------| +| framework | The machine learning framework. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | +| | | | +| *Dataset parameters* | | | +| dataset.format | Format of the dataset. | 3D U-Net: .npz
ResNet-50: .tfrecord
Cosmoflow: .tfrecord | +| dataset.num_samples_per_file | | 3D U-Net: 1
ResNet-50: 1251
Cosmoflow: 1 | +| | | | +| *Reader parameters* | | | +| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | + +# 4. Sanity Checking the Checkpointing Options dgh -#### 4.1. CLOSED Versus OPEN Options +## 4.1. CLOSED Versus OPEN Options dgh -#### 4.2. Benchmark Run Options +## 4.2. Benchmark Run Options + +The checkpoints that are written are quite large. If the checkpoint size per client node is less than 3x the client node's memory capacity, then the filesystem cache needs to be cleared between the write and read phases. + +We enforce fsync to be applied during checkpoint writes to ensure data is flushed to persistent storage. fsync is enabled by default in all workload configuration files. + +A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes as outlined in section 2.2.5 (clearing caches, storage remapping, etc) + +Benchmark results may be submitted for the following four model configurations. The associated model architectures and parallelism settings are listed below. The number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models for CLOSED submission. + +For CLOSED submissions, participants are not permitted to change the total number of simulated accelerators. However, they may adjust the number of simulated accelerators per host, as long as each host uses more than 4 simulated accelerators. This allows the use of nodes with higher simulated accelerator density and fewer total nodes. Note: the aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. + +**Table 2 LLM models** + +| Model | 8B | 70B | 405B | 1T | +|------------------------|--------|--------|---------|--------| +| Hidden dimension | 4096 | 8192 | 16384 | 25872 | +| FFN size | 14336 | 28672 | 53248 | 98304 | +| num_attention_heads | 32 | 128 | 128 | 192 | +| num_kv_heads | 8 | 8 | 8 | 32 | +| Num layers | 32 | 80 | 126 | 128 | +| Parallelism (TPxPPxDP) | 1×1×8 | 8×1x8 | 8×32×2 | 8×64×2 | +| Total Processes | 8 | 64 | 512 | 1024 | +| ZeRO | 3 | 3 | 1 | 1 | +| Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB | +| Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB | + + +**Table: Checkpoint Workload Tunable Parameters for CLOSED** + +| Parameter | Description | Default | +|----------------------------------|-------------------------------------------------------------|-----------------------| +| checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ | +| checkpoint.num_checkpoints_write | The number of checkpoint writes to do in a single dlio call | 10 | +| checkpoint.num_checkpoints_read | The number of checkpoint reads to do in a single dlio call | 10 | + For OPEN submissions, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. @@ -281,10 +410,36 @@ For OPEN submissions, the total number of processes may be increased in multiple ** By default, --num-checkpoints-read and --num-checkpoints-write are set to be 10. To perform write only, one has to turn off read by explicitly setting ``--num-checkpoints-read=0``; to perform read only, one has to turn off write by explicitly set ``--num-checkpoints-write=0`` +### 4.2. Storage System Must Be Simultaneously R/W or _Remappable_ + +For storage systems where 1 host has write access to a volume but all hosts have read access, the above process also satisfies the requirements so long as reads can be fulfilled immediately following a write. 
+ +For storage systems where 1 host has write access to a volume and a "remapping" process is required for other hosts to read the same data, the time to remap must be measured and included in the submission. + +When a checkpoint is taken/written, it must be written to stable storage, but that checkpoint does not need to be readable by other other hosts yet. If it is not readable by other hosts immediately after the checkpoint write is complete, if it requires some additional processing or reconfiguration before the checkpoint is readable by other hosts, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different host node must be reported in the SystemDescription.yaml file. That duration between write completion and availability for reading will be added to the time to read/recover from the benchmark. + +Any processes between the write and read phases of checkpointing that are required before data can be read by a different host than wrote the data must be measured and included in the submission. The time for these processes will be added to the recovery time and throughput calculation for submitted scores + +The system_configuration.yaml document must list whether the solution support simultaneous reads and/or writes as such: + +System: + shared_capabilities: + multi_host_support: True # False is used for local storage + simultaneous_write_support: False # Are simultaneous writes by multiple hosts supported in the submitted configuration + simultaneous_read__support: True # Are simultaneous reads by multiple hosts supported in the submitted configuration + +## 5. Validating The Phases +The MLPerf Storage working group provides a benchmark implementation which includes: +* Scripts to determine the minimum dataset size required for your system, for a given benchmark. +* Scripts for data generation. +* Benchmark tool, based on DLIO, with configuration files for the benchmarks. +* A script for running the benchmark on one host (additional setup is required if you are running a distributed training benchmark – see Section 5). +* A script for generating the results report (additional scripting and setup may be required if you are running a distributed training benchmark – see Section 5), and potentially additional supporting scripts. +Each of the benchmarks described in this document have a requirement for multiple runs. This is to ensure consistency of operation of the system under test as well as ensure statistical significance of the measurements. Unless otherwise noted, the multiple runs for a workload need to be run consecutively. To ensure this requirement is met, the time between runs (from the stop time of one run and the start time to the next run) needs to be less than the time to execute a single run. This is to discourage cherry-picking of results which is expressly forbidden and against the spirit of the rules. 
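The consecutive-runs requirement above can be checked mechanically from per-run timestamps. The sketch below is illustrative only: the list of (start, stop) pairs is assumed to have been extracted from the run logs, and "the time to execute a single run" is read as the duration of the immediately preceding run.

```python
# Illustrative check of the consecutive-runs requirement described above.
# The (start, stop) timestamps are assumed to come from the per-run logs;
# the rule is read as "gap to the next run < duration of the preceding run".
from datetime import datetime, timedelta
from typing import List, Tuple

def runs_are_consecutive(runs: List[Tuple[datetime, datetime]]) -> bool:
    """Return True when every gap between adjacent runs is shorter than the
    duration of the run that precedes it."""
    ordered = sorted(runs, key=lambda r: r[0])
    for (start_a, stop_a), (start_b, _stop_b) in zip(ordered, ordered[1:]):
        if (start_b - stop_a) >= (stop_a - start_a):
            return False
    return True

# Example: three 30-minute runs separated by 5-minute gaps pass the check.
t0 = datetime(2025, 7, 1, 9, 0, 0)
example = [
    (t0, t0 + timedelta(minutes=30)),
    (t0 + timedelta(minutes=35), t0 + timedelta(minutes=65)),
    (t0 + timedelta(minutes=70), t0 + timedelta(minutes=100)),
]
assert runs_are_consecutive(example)
```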
From 6430ce87bdfe7ab67df4ac10d90c1a6fb57ff715 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 13:38:59 -0800 Subject: [PATCH 07/23] Update section titles for clarity and consistency --- Rules.md | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/Rules.md b/Rules.md index f6c39b25..8fb6113a 100644 --- a/Rules.md +++ b/Rules.md @@ -1,18 +1,21 @@ # MLPerf™ Storage V2.0 Benchmark Validation —————————————————————————————————————————— -- [MLPerf Storage Benchmark Submission Guidelines v2.0](#mlperf-storage-benchmark-submission-guidelines-v20) +- [MLPerf Storage V2.0 Benchmark Validation](#mlperf-storage-v20-benchmark-validation) - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - - [3. Sanity Checking the Training Options](#3-sanity-checking-the-training-options) + - [3. Checking the Training Options](#3-checking-the-training-options) - [3.1. CLOSED Versus OPEN Options](#31-closed-versus-open-options) - [3.2. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) - [3.3. Benchmark Run Options](#33-benchmark-run-options) - - [4. Sanity Checking the Checkpointing Options](#3-sanity-checking-the-checkpointing-options) + - [4. Checking the Checkpointing Options](#3-checking-the-checkpointing-options) - [4.1. CLOSED Versus OPEN Options](#41-closed-versus-open-options) - [4.2. Benchmark Run Options](#42-benchmark-run-options) + - [4.3. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) + - [5. Validating The Phases](#5-validating-the-phases) -# 1. Introduction + +# 1. Introduction These are the requirements for the *submission validation checker* for version 2.0 of the MLPerf™ Storage benchmark, but since the `mlpstorage` tool will be responsible for generating the vast majority (if not all) of the contents of a submission, it is also a spec for what `mlpstorage` should generate. @@ -25,7 +28,7 @@ Even if the structure of a submission package matches the spec, the options that so we need the *submission validation checker* to check for illegal/inapproriate option settings, and for semantic mismatches between different options that were used. -# 2. Directory Structure for All Submissions +# 2. Directory Structure for All Submissions **2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, possibly including blanks. @@ -244,11 +247,7 @@ root_folder (or any name you prefer) └── overrides.yaml ``` -# 3. Sanity Checking the Training Options - -dfg - -## 3.1. CLOSED Versus OPEN Options +# 3. 
Checking the Training Options dfg @@ -336,7 +335,7 @@ A small number of parameters can be configured in CLOSED submissions; listed in | storage.storage_root | The storage root directory | ./ | | storage.storage_type | The storage type | local_fs | -In addition to what can be changed in the CLOSED submission, the following parameters can be changed in the benchmark.sh script: +In addition to what can be changed in the CLOSED submission, the following parameters can be changed in OPEN submissions: | Parameter | Description | Default | |------------------------------|--------------------------------------------|---------------------------------------------------------------------| @@ -349,7 +348,7 @@ In addition to what can be changed in the CLOSED submission, the following param | *Reader parameters* | | | | reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -# 4. Sanity Checking the Checkpointing Options +# 4. Checking the Checkpointing Options dgh @@ -410,7 +409,7 @@ For OPEN submissions, the total number of processes may be increased in multiple ** By default, --num-checkpoints-read and --num-checkpoints-write are set to be 10. To perform write only, one has to turn off read by explicitly setting ``--num-checkpoints-read=0``; to perform read only, one has to turn off write by explicitly set ``--num-checkpoints-write=0`` -### 4.2. Storage System Must Be Simultaneously R/W or _Remappable_ +## 4.3. Storage System Must Be Simultaneously R/W or _Remappable_ For storage systems where 1 host has write access to a volume but all hosts have read access, the above process also satisfies the requirements so long as reads can be fulfilled immediately following a write. From 8381dc4225a575419a43a22ef530c66765ea4f70 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:41:17 -0800 Subject: [PATCH 08/23] Revise training and checkpointing options sections Updated sections for validating training options and checkpointing options in the rules document. --- Rules.md | 160 +++++++++++++++++++++++-------------------------------- 1 file changed, 66 insertions(+), 94 deletions(-) diff --git a/Rules.md b/Rules.md index 8fb6113a..43eb5ccf 100644 --- a/Rules.md +++ b/Rules.md @@ -4,11 +4,10 @@ - [MLPerf Storage V2.0 Benchmark Validation](#mlperf-storage-v20-benchmark-validation) - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - - [3. Checking the Training Options](#3-checking-the-training-options) - - [3.1. CLOSED Versus OPEN Options](#31-closed-versus-open-options) - - [3.2. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) - - [3.3. Benchmark Run Options](#33-benchmark-run-options) - - [4. Checking the Checkpointing Options](#3-checking-the-checkpointing-options) + - [3. Validating the Training Options](#3-validating-the-training-options) + - [3.1. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) + - [3.2. Benchmark Run Options](#33-benchmark-run-options) + - [4. Validating the Checkpointing Options](#3-validating-the-checkpointing-options) - [4.1. CLOSED Versus OPEN Options](#41-closed-versus-open-options) - [4.2. Benchmark Run Options](#42-benchmark-run-options) - [4.3. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) @@ -28,6 +27,8 @@ Even if the structure of a submission package matches the spec, the options that so we need the *submission validation checker* to check for illegal/inapproriate option settings, and for semantic mismatches between different options that were used. +The `mlpstorage` tool must be used to run the benchmarks, submitters are not allowed to run the underlying tools (eg: DLIO) directly to generate a submission package. + # 2. Directory Structure for All Submissions **2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, possibly including blanks. @@ -247,70 +248,45 @@ root_folder (or any name you prefer) └── overrides.yaml ``` -# 3. Checking the Training Options +# 3. Validating the Training Options dfg -## 3.2. Dataset Generation Options - -Minimum dataset size. 
The MLPerf Storage benchmark script must be used to run the benchmarks since it calculates the minimum dataset size for each benchmark. It does so using the provided number of simulated accelerators and the size of all of the host node’s memory in GB. The minimum dataset size computation is as follows: - -Calculate required minimum samples given number of steps per epoch (NB: num_steps_per_epoch is a minimum of 500): - min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes -Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): - min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length -Ensure we meet both constraints: - min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes) -Calculate minimum files to generate - min_total_files= min_samples / num_samples_per_file - min_files_size = min_samples * record_length / 1024 / 1024 / 1024 -A minimum of min_total_files files are required which will consume min_files_size GB of storage. - -## 3.3. Benchmark Run Options - -The benchmark performance metric for Training workloads (3D-Unet, ResNet-50, and Cosmflow) is samples per second, subject to a minimum accelerator utilization (AU) defined for that workload. Higher samples per second is better. - -To pass a benchmark run, the AU should be equal to or greater than the minimum value, and is computed as follows: - -AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100 -All the I/O operations from the first step are excluded from the AU calculation in order to avoid the disturbance in the averages caused by the startup costs of the data processing pipeline, allowing the AU to more-quickly converge on the steady-state performance of the pipeline. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however. +## 3.1. Dataset Generation Options -If all I/O operations are hidden by compute time, then the total_compute_time will equal the total_benchmark_running_time and the AU will be 100%. +**3.1.1.** The *submission validation checker* should take the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles and recompute the minimum dataset size as follows: + * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): + * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` + * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): + * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length` + * Ensure we meet both constraints: + * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)` + * Calculate minimum files to generate + * `min_total_files= min_samples / num_samples_per_file` + * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` + * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. -The total compute time can be derived from the batch size, total dataset size, number of simulated accelerators, and sleep time: +## 3.2. 
Benchmark Run Options -total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs. +**3.2.1.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: + * `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs` + * `AU = (total_compute_time/total_benchmark_running_time) * 100` + * All the I/O operations from the first step are excluded from the AU calculation. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however. +**3.2.2.** For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. +**3.2.3.** For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. +**3.2.4.** For distributed Training submissions, all the data must be accessible to all the host nodes. -8. Single-host Submissions -This section only applies to Training workloads, the equivalent topic is covered in section 2.2.2, "subset mode". - -Submitters can add load to the storage system in two orthogonal ways: (1) increase the number of simulated accelerators inside one host node (i.e., one machine), and/or (2) increase the number of host nodes connected to the storage system. - -For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. - -For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. - -9. Distributed Training Submissions -This setup simulates distributed training of a single training task, spread across multiple host nodes, on a shared dataset. The current version of the benchmark only supports data parallelism, not model parallelism. - -Submitters must respect the following for multi-host node submissions: - -All the data must be accessible to all the host nodes. -The number of simulated accelerators in each host node must be identical. +**3.2.5.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical. While it is recommended that all host nodes be as close as possible to identical, that is not required by these Rules. The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all host nodes results in the overall performance of the cluster being determined by the slowest host node. -Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. 
Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. - -For distributed training submissions, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators across all host nodes that can be run in the distributed training setup, without going below the 90% accelerator utilization threshold. Each host node must run the same number of simulated accelerators for the submission to be valid. +**3.2.6.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. +**3.2.7.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase. - -For CLOSED submissions of this benchmark, the MLPerf Storage codebase takes the place of the AI/ML algorithms and framework, and therefore cannot be changed. The sole exception to this rule is if the submitter decides to apply the code change identified in PR#299 of the DLIO repo in github, the resulting codebase will be considered "unchanged" for the purposes of this rule. - -A small number of parameters can be configured in CLOSED submissions; listed in the tables below. +**3.2.8.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. **Table: Training Workload Tunable Parameters for CLOSED** @@ -328,45 +304,46 @@ A small number of parameters can be configured in CLOSED submissions; listed in | reader.prefetch_size | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2. | | | reader.odirect | Enable ODIRECT mode for Unet3D Training | False | | | | | -| *Checkpoint parameters* | | | -| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- | -| | | | | *Storage parameters* | | | | storage.storage_root | The storage root directory | ./ | | storage.storage_type | The storage type | local_fs | -In addition to what can be changed in the CLOSED submission, the following parameters can be changed in OPEN submissions: +**3.2.9.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. 
+ +**Table: Training Workload Tunable Parameters for OPEN** -| Parameter | Description | Default | -|------------------------------|--------------------------------------------|---------------------------------------------------------------------| -| framework | The machine learning framework. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -| | | | -| *Dataset parameters* | | | -| dataset.format | Format of the dataset. | 3D U-Net: .npz
ResNet-50: .tfrecord
Cosmoflow: .tfrecord | -| dataset.num_samples_per_file | | 3D U-Net: 1
ResNet-50: 1251
Cosmoflow: 1 | -| | | | -| *Reader parameters* | | | -| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | +| Parameter | Description | Default | +|------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------| +| framework | The machine learning framework. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | +| | | | +| *Dataset parameters* | | | +| dataset.format | Format of the dataset. | 3D U-Net: .npz
ResNet-50: .tfrecord
Cosmoflow: .tfrecord | +| dataset.num_samples_per_file | | 3D U-Net: 1
ResNet-50: 1251
Cosmoflow: 1 | +| | | | +| *Reader parameters* | | | +| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -# 4. Checking the Checkpointing Options +# 4. Validating the Checkpointing Options dgh -## 4.1. CLOSED Versus OPEN Options +## 4.1. Benchmark Run Options -dgh +**4.1.1.** A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes. + +**4.1.2.** The checkpoint data written per client node musyt be more than 3x the client node's memory capacity, otherwise the filesystem cache needs to be cleared between the write and read phases. -## 4.2. Benchmark Run Options +**4.1.3.** We must verify that all the benchmark workload configuration files have set to do an fsync call at the end of each of the 10 checkpoint writes. -The checkpoints that are written are quite large. If the checkpoint size per client node is less than 3x the client node's memory capacity, then the filesystem cache needs to be cleared between the write and read phases. +**4.1.4.** The benchmark must be run with one of the four model configuration detailed below. -We enforce fsync to be applied during checkpoint writes to ensure data is flushed to persistent storage. fsync is enabled by default in all workload configuration files. +**4.1.5.** For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. -A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes as outlined in section 2.2.5 (clearing caches, storage remapping, etc) +**4.1.6.** For CLOSED submissions, submitters are not permitted to change the total number of simulated accelerators. -Benchmark results may be submitted for the following four model configurations. The associated model architectures and parallelism settings are listed below. The number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models for CLOSED submission. +**4.1.7.** For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators. -For CLOSED submissions, participants are not permitted to change the total number of simulated accelerators. However, they may adjust the number of simulated accelerators per host, as long as each host uses more than 4 simulated accelerators. This allows the use of nodes with higher simulated accelerator density and fewer total nodes. Note: the aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. +**4.1.8.** The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. **Table 2 LLM models** @@ -383,17 +360,16 @@ For CLOSED submissions, participants are not permitted to change the total numbe | Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB | | Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB | +**4.1.9.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. 
+ **Table: Checkpoint Workload Tunable Parameters for CLOSED** | Parameter | Description | Default | |----------------------------------|-------------------------------------------------------------|-----------------------| | checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ | -| checkpoint.num_checkpoints_write | The number of checkpoint writes to do in a single dlio call | 10 | -| checkpoint.num_checkpoints_read | The number of checkpoint reads to do in a single dlio call | 10 | - -For OPEN submissions, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. +**4.1.10.** For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. **Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions** @@ -405,28 +381,24 @@ For OPEN submissions, the total number of processes may be increased in multiple | --num-checkpoints-write | Number of write checkpoints | 10 or 0** | NO | NO | | --num-checkpoints-read | Number of write checkpoints | 10 or 0** | NO | NO | -**In the ``--ppn`` syntax above, the ``slotcount`` value has the same meaning as the ``ppn`` value, the number of processes per node to run.** - -** By default, --num-checkpoints-read and --num-checkpoints-write are set to be 10. To perform write only, one has to turn off read by explicitly setting ``--num-checkpoints-read=0``; to perform read only, one has to turn off write by explicitly set ``--num-checkpoints-write=0`` - -## 4.3. Storage System Must Be Simultaneously R/W or _Remappable_ - -For storage systems where 1 host has write access to a volume but all hosts have read access, the above process also satisfies the requirements so long as reads can be fulfilled immediately following a write. +**NOTE: In the ``--ppn`` syntax above, the ``slotcount`` value means the number of processes per node to run.** -For storage systems where 1 host has write access to a volume and a "remapping" process is required for other hosts to read the same data, the time to remap must be measured and included in the submission. +## 4.2. Storage System Must Be Simultaneously R/W or _Remappable_ -When a checkpoint is taken/written, it must be written to stable storage, but that checkpoint does not need to be readable by other other hosts yet. If it is not readable by other hosts immediately after the checkpoint write is complete, if it requires some additional processing or reconfiguration before the checkpoint is readable by other hosts, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different host node must be reported in the SystemDescription.yaml file. That duration between write completion and availability for reading will be added to the time to read/recover from the benchmark. +**4.2.1.** If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator needs to check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a short pause of up to 30 seconds maximum, then the write phase was started with ``--num-checkpoints-write=0`` set. -Any processes between the write and read phases of checkpointing that are required before data can be read by a different host than wrote the data must be measured and included in the submission. 
The time for these processes will be added to the recovery time and throughput calculation for submitted scores +**4.2.2.** The validator must verify that the total test duration starts from the timestamp of the first checkpoint written and ends at the ending timestamp of the last checkpoint read, notably including the "remapping" time. -The system_configuration.yaml document must list whether the solution support simultaneous reads and/or writes as such: +**4.2.3.** For a _remapping_ solution, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different host node must be reported in the `SystemDescription.yaml` file. +**4.2.4.** The system_configuration.yaml document must list whether the solution support simultaneous reads and/or writes as such: +``` System: shared_capabilities: multi_host_support: True # False is used for local storage simultaneous_write_support: False # Are simultaneous writes by multiple hosts supported in the submitted configuration simultaneous_read__support: True # Are simultaneous reads by multiple hosts supported in the submitted configuration - +``` ## 5. Validating The Phases From 6b4d47dc6f7f9caee15864cc857efaae34c5dcd5 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:42:02 -0800 Subject: [PATCH 09/23] Remove section 5 on Validating The Phases Removed section on validating phases and its related content. --- Rules.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/Rules.md b/Rules.md index 43eb5ccf..62d5c4e3 100644 --- a/Rules.md +++ b/Rules.md @@ -11,7 +11,6 @@ - [4.1. CLOSED Versus OPEN Options](#41-closed-versus-open-options) - [4.2. Benchmark Run Options](#42-benchmark-run-options) - [4.3. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) - - [5. Validating The Phases](#5-validating-the-phases) # 1. Introduction @@ -400,19 +399,6 @@ System: simultaneous_read__support: True # Are simultaneous reads by multiple hosts supported in the submitted configuration ``` -## 5. Validating The Phases - -The MLPerf Storage working group provides a benchmark implementation which includes: - -* Scripts to determine the minimum dataset size required for your system, for a given benchmark. -* Scripts for data generation. -* Benchmark tool, based on DLIO, with configuration files for the benchmarks. -* A script for running the benchmark on one host (additional setup is required if you are running a distributed training benchmark – see Section 5). -* A script for generating the results report (additional scripting and setup may be required if you are running a distributed training benchmark – see Section 5), and potentially additional supporting scripts. - -Each of the benchmarks described in this document have a requirement for multiple runs. This is to ensure consistency of operation of the system under test as well as ensure statistical significance of the measurements. Unless otherwise noted, the multiple runs for a workload need to be run consecutively. To ensure this requirement is met, the time between runs (from the stop time of one run and the start time to the next run) needs to be less than the time to execute a single run. This is to discourage cherry-picking of results which is expressly forbidden and against the spirit of the rules. 
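The `shared_capabilities` block shown in 4.2.4 above lends itself to a mechanical consistency check. The sketch below is illustrative only: the field names are taken from that example, while the file path argument and the `remap_duration_seconds` key used to report remapping time are assumptions, not names defined by these rules.

```python
# Illustrative consistency check of the shared_capabilities block from 4.2.4 above.
# Field names follow that example; the file path and the remap_duration_seconds
# key are assumptions made for this sketch only.
import yaml  # PyYAML

def check_shared_capabilities(path: str) -> list:
    """Return a list of validation problems (empty when the block is consistent)."""
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}

    caps = (doc.get("System") or {}).get("shared_capabilities") or {}
    problems = []

    for field in ("multi_host_support",
                  "simultaneous_write_support",
                  "simultaneous_read__support"):
        if field not in caps:
            problems.append(f"shared_capabilities is missing '{field}'")

    # One reasonable reading of 4.2.3: when other hosts cannot read a checkpoint
    # directly after it is written, the remapping duration must be reported.
    if caps.get("multi_host_support") and not caps.get("simultaneous_read__support"):
        if "remap_duration_seconds" not in caps:
            problems.append("remapping time must be reported when simultaneous "
                            "reads by multiple hosts are not supported")

    return problems
```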
- - From 596a5751136392fe379943a3364c4f7d41f3acbe Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:43:19 -0800 Subject: [PATCH 10/23] Update section structure in Rules.md Reorganize the sections under 'Validating the Checkpointing Options' for clarity. --- Rules.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/Rules.md b/Rules.md index 62d5c4e3..94fb84d1 100644 --- a/Rules.md +++ b/Rules.md @@ -8,9 +8,8 @@ - [3.1. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) - [3.2. Benchmark Run Options](#33-benchmark-run-options) - [4. Validating the Checkpointing Options](#3-validating-the-checkpointing-options) - - [4.1. CLOSED Versus OPEN Options](#41-closed-versus-open-options) - - [4.2. Benchmark Run Options](#42-benchmark-run-options) - - [4.3. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) + - [4.1. Benchmark Run Options](#42-benchmark-run-options) + - [4.2. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) # 1. Introduction @@ -324,8 +323,6 @@ While it is recommended that all host nodes be as close as possible to identical # 4. Validating the Checkpointing Options -dgh - ## 4.1. Benchmark Run Options **4.1.1.** A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes. From 8d8e0fb71272453c71b5d1db43c19e7c9059dea0 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:44:00 -0800 Subject: [PATCH 11/23] Fix section numbering in Rules.md --- Rules.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Rules.md b/Rules.md index 94fb84d1..1285c745 100644 --- a/Rules.md +++ b/Rules.md @@ -5,11 +5,11 @@ - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - [3. Validating the Training Options](#3-validating-the-training-options) - - [3.1. Benchmark Dataset Generation Options](#32-benchmark-dataset-generation-options) - - [3.2. Benchmark Run Options](#33-benchmark-run-options) + - [3.1. Benchmark Dataset Generation Options](#31-benchmark-dataset-generation-options) + - [3.2. Benchmark Run Options](#32-benchmark-run-options) - [4. Validating the Checkpointing Options](#3-validating-the-checkpointing-options) - - [4.1. Benchmark Run Options](#42-benchmark-run-options) - - [4.2. Storage System Must Be Simultaneously R/W or Remappable](#43-storage-system-must-be-simultaneously-rw-or-remappable) + - [4.1. Benchmark Run Options](#41-benchmark-run-options) + - [4.2. Storage System Must Be Simultaneously R/W or Remappable](#42-storage-system-must-be-simultaneously-rw-or-remappable) # 1. 
Introduction From 107be50fa5f85cbefd8a038ba5c5ea12e6da50a2 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:48:15 -0800 Subject: [PATCH 12/23] Update title in MLPerf Storage V2.0 rules document --- Rules.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Rules.md b/Rules.md index 1285c745..22c3bbf2 100644 --- a/Rules.md +++ b/Rules.md @@ -1,7 +1,7 @@ -# MLPerf™ Storage V2.0 Benchmark Validation +# MLPerf™ Storage V2.0 Benchmark Validation Rules —————————————————————————————————————————— -- [MLPerf Storage V2.0 Benchmark Validation](#mlperf-storage-v20-benchmark-validation) +- [MLPerf Storage V2.0 Benchmark Validation](#mlperf-storage-v20-benchmark-validation-rules) - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - [3. Validating the Training Options](#3-validating-the-training-options) From 4ffe61bb5a8d6fabf5bfc0a176c2193157fbeeb5 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Wed, 3 Dec 2025 14:48:29 -0800 Subject: [PATCH 13/23] Fix heading formatting in Rules.md --- Rules.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Rules.md b/Rules.md index 22c3bbf2..4b85cfdd 100644 --- a/Rules.md +++ b/Rules.md @@ -1,7 +1,7 @@ # MLPerf™ Storage V2.0 Benchmark Validation Rules —————————————————————————————————————————— -- [MLPerf Storage V2.0 Benchmark Validation](#mlperf-storage-v20-benchmark-validation-rules) +- [MLPerf Storage V2.0 Benchmark Validation Rules](#mlperf-storage-v20-benchmark-validation-rules) - [1. Introduction](#1-introduction) - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions) - [3. Validating the Training Options](#3-validating-the-training-options) From 2bb406e3d7115c1aafaac4a52164ffb989bcbf7d Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Fri, 12 Dec 2025 07:16:26 -0800 Subject: [PATCH 14/23] Update section numbers and headings in Rules.md --- Rules.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/Rules.md b/Rules.md index 4b85cfdd..09ab0704 100644 --- a/Rules.md +++ b/Rules.md @@ -248,11 +248,13 @@ root_folder (or any name you prefer) # 3. Validating the Training Options -dfg +## 3.1. Datasize Options -## 3.1. Dataset Generation Options +**3.1.1.** The *submission validation checker* should... -**3.1.1.** The *submission validation checker* should take the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles and recompute the minimum dataset size as follows: +## 3.2. 
Datagen Options + +**3.2.1.** The *submission validation checker* should take the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles and recompute the minimum dataset size as follows: * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): @@ -264,27 +266,27 @@ dfg * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. -## 3.2. Benchmark Run Options +## 3.3. Run Options -**3.2.1.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: +**3.3.1.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: * `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs` * `AU = (total_compute_time/total_benchmark_running_time) * 100` * All the I/O operations from the first step are excluded from the AU calculation. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however. -**3.2.2.** For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. +**3.3.2.** For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. **3.2.3.** For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. -**3.2.4.** For distributed Training submissions, all the data must be accessible to all the host nodes. +**3.3.4.** For distributed Training submissions, all the data must be accessible to all the host nodes. -**3.2.5.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical. +**3.3.5.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical. While it is recommended that all host nodes be as close as possible to identical, that is not required by these Rules. The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all host nodes results in the overall performance of the cluster being determined by the slowest host node. -**3.2.6.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. 
It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. +**3.3.6.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. -**3.2.7.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase. +**3.3.7.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase. -**3.2.8.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. +**3.3.8.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. **Table: Training Workload Tunable Parameters for CLOSED** @@ -306,7 +308,7 @@ While it is recommended that all host nodes be as close as possible to identical | storage.storage_root | The storage root directory | ./ | | storage.storage_type | The storage type | local_fs | -**3.2.9.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. +**3.3.9.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. 
**Table: Training Workload Tunable Parameters for OPEN** From b1bba55269b441f8e053a50b8e206aff5b574c67 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 16 Dec 2025 11:55:38 -0800 Subject: [PATCH 15/23] Revise rules for timestamp directories and configurations --- Rules.md | 57 ++++++++++++++++++++++++++++++++------------------------ 1 file changed, 33 insertions(+), 24 deletions(-) diff --git a/Rules.md b/Rules.md index 09ab0704..b2bab59c 100644 --- a/Rules.md +++ b/Rules.md @@ -68,21 +68,25 @@ configuration of storage system and to link together those results with the .pdf **2.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**2.18.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**2.18** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them. -**2.19.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**2.19.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -**2.20.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. +**2.20.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -**2.21.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +**2.21.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive. -**2.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. 
Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**2.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -**2.23.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +**2.23.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**2.24.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +**2.24** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them. -**2.25.** Pictorially, here is what this looks like: +**2.25.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. + +**2.27.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. + +**2.28.** Pictorially, here is what this looks like: ``` root_folder (or any name you prefer) ├── Closed @@ -233,7 +237,7 @@ root_folder (or any name you prefer) ├──system-name-2.yaml └──system-name-2.pdf ``` -**2.26.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: +**2.29.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below: ``` └── YYYYMMDD_HHmmss ├── [training|checkpointing]_[datagen|run].stdout.log @@ -246,7 +250,7 @@ root_folder (or any name you prefer) └── overrides.yaml ``` -# 3. Validating the Training Options +# 3. Validating the Training Workloads ## 3.1. Datasize Options @@ -323,25 +327,25 @@ While it is recommended that all host nodes be as close as possible to identical | *Reader parameters* | | | | reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -# 4. Validating the Checkpointing Options +**3.3.10** The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set and must be set to different values. -## 4.1. Benchmark Run Options +**3.3.11** The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results. -**4.1.1.** A checkpoint workload submission must include 10 checkpoints written and 10 checkpoints read as well as the logs for any optional processes. +# 4. Validating the Checkpointing Workloads -**4.1.2.** The checkpoint data written per client node musyt be more than 3x the client node's memory capacity, otherwise the filesystem cache needs to be cleared between the write and read phases. +## 4.1. Benchmark Run Options -**4.1.3.** We must verify that all the benchmark workload configuration files have set to do an fsync call at the end of each of the 10 checkpoint writes. +**4.1.1.** The checkpoint data written per client node must be more than 3x the client node's memory capacity, otherwise the filesystem cache needs to be cleared between the write and read phases. -**4.1.4.** The benchmark must be run with one of the four model configuration detailed below. +**4.1.2.** We must verify that all the benchmark workload configuration files have been set to do an fsync call at the end of each of the 10 checkpoint writes. -**4.1.5.** For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. +**4.1.3.** The benchmark must be run with one of the four model configuration detailed below. -**4.1.6.** For CLOSED submissions, submitters are not permitted to change the total number of simulated accelerators. +**4.1.4.** For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. (see table 2) -**4.1.7.** For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators. +**4.1.5.** For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators and the total number of simulated accelerators (the total number of processes) matches the requirement. (see table 2) -**4.1.8.** The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. +**4.1.6.** The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. That is, the GB of memory associated with the chosen accelerator (eg: H100) times the accelerator count must be equal to or greater than the total checkpoint size for that scale of checkpoint. 
(see table 2) **Table 2 LLM models** @@ -358,8 +362,7 @@ While it is recommended that all host nodes be as close as possible to identical | Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB | | Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB | -**4.1.9.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. - +**4.1.7.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. **Table: Checkpoint Workload Tunable Parameters for CLOSED** @@ -367,7 +370,7 @@ While it is recommended that all host nodes be as close as possible to identical |----------------------------------|-------------------------------------------------------------|-----------------------| | checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ | -**4.1.10.** For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. +**4.1.8.** For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution. **Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions** @@ -381,9 +384,15 @@ While it is recommended that all host nodes be as close as possible to identical **NOTE: In the ``--ppn`` syntax above, the ``slotcount`` value means the number of processes per node to run.** +**4.1.9** The arguments to `mlpstorage` that set the directory pathname where the checkpoints are written and read and the directory where the output logfiles are stored must both be set and must be set to different values. + +**4.1.10** The `mlpstorage` command should do a "df" command on the directory pathname where the checkpoints are written and read and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results. + +**4.1.11** The `mlpstorage` command must accept a parameter telling it that this is a *subset* run and add that info to the output log file. The *submission validator* must flag an error if the `subset` argument is given but the total number of accelerators is not exactly 8, or the model is "8B". + ## 4.2. Storage System Must Be Simultaneously R/W or _Remappable_ -**4.2.1.** If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator needs to check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a short pause of up to 30 seconds maximum, then the write phase was started with ``--num-checkpoints-write=0`` set. 
+**4.2.1.** If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator must check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a short pause of up to 30 seconds maximum, then the write phase was started with ``--num-checkpoints-write=0`` set. **4.2.2.** The validator must verify that the total test duration starts from the timestamp of the first checkpoint written and ends at the ending timestamp of the last checkpoint read, notably including the "remapping" time. From 22eb65fc33d44bda9fb73a53e5dc21719ba493e2 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 16 Dec 2025 12:14:39 -0800 Subject: [PATCH 16/23] Update rules for subdirectory and validation requirements --- Rules.md | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/Rules.md b/Rules.md index b2bab59c..a20acb7a 100644 --- a/Rules.md +++ b/Rules.md @@ -66,7 +66,7 @@ configuration of storage system and to link together those results with the .pdf **2.16.** Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -**2.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 5 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +**2.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 6 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. Note that the 1st of those 6 is the *warm up* run and will not be included in the reported performance. **2.18** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them. @@ -254,24 +254,28 @@ root_folder (or any name you prefer) ## 3.1. Datasize Options -**3.1.1.** The *submission validation checker* should... +**3.1.1.** The *submission validator* must verify that the *datasize* option was used by finding the entry(s) in the log file showing its use. + +**3.1.2.** The *submission validator* must recalculate the minimum dataset size by using the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles as described below and fail the run if the size recorded in the run's logfile doesn't exactly match the recalculated value. 
+ * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): + * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` + * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): + * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length` + * Ensure we meet both constraints: + * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)` + * Calculate minimum files to generate + * `min_total_files= min_samples / num_samples_per_file` + * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` + * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. ## 3.2. Datagen Options -**3.2.1.** The *submission validation checker* should take the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles and recompute the minimum dataset size as follows: - * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): - * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` - * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): - * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length` - * Ensure we meet both constraints: - * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)` - * Calculate minimum files to generate - * `min_total_files= min_samples / num_samples_per_file` - * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` - * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. +**3.2.1** The amount of data generated during the *datagen* phase must be equal **or larger** than the amount of data calculated during the *datasize* phase or the run must be failed. ## 3.3. Run Options +**3.3.0.** The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. + **3.3.1.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: * `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs` * `AU = (total_compute_time/total_benchmark_running_time) * 100` From eb06ed99081cb67fffac92d2a80dfb0730811729 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 16 Dec 2025 13:43:42 -0800 Subject: [PATCH 17/23] Update rules for mlpstorage command usage --- Rules.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/Rules.md b/Rules.md index a20acb7a..9bc6964f 100644 --- a/Rules.md +++ b/Rules.md @@ -27,6 +27,8 @@ and for semantic mismatches between different options that were used. The `mlpstorage` tool must be used to run the benchmarks, submitters are not allowed to run the underlying tools (eg: DLIO) directly to generate a submission package. 
+**1.1.** The `mlpstorage` command must obtain (somehow) the pathname of the output file directory hierarchy and directly create and/or append to the files within that hierarchy to successively build out the submission folder. We don't want the submitter to manually create anything in that hierarchy except for the SystemDescription.* files (if we can help it). + # 2. Directory Structure for All Submissions **2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, possibly including blanks. @@ -41,7 +43,7 @@ The `mlpstorage` tool must be used to run the benchmarks, submitters are not all **2.6.** The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. -Note that in both cases this must be the code that was actually run to generate those results. +Note that in both cases this must be the code that was actually run to generate those results. In a CLOSED submission, the *submission validator* should do an md5sum of the code directory hierarchy, compare that to a value hard-coded into the validator code, and fail the validation if there is a difference. **2.7.** The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. @@ -274,27 +276,26 @@ root_folder (or any name you prefer) ## 3.3. Run Options -**3.3.0.** The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. +**3.3.1.** The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. To express that, you can run the benchmark on a subset of that dataset by setting `num_files_train` or `num_files_eval` smaller than the number of files available in the dataset folder, but `num_subfolders_train` and `num_subfolders_eval` must be to be equal to the actual number of subfolders inside the dataset folder in order to generate valid results. -**3.3.1.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: +**3.3.2.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value: * `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs` * `AU = (total_compute_time/total_benchmark_running_time) * 100` * All the I/O operations from the first step are excluded from the AU calculation. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however. -**3.3.2.** For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. 
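As a rough illustration of the AU recomputation described above, a validator might recompute the utilization from the values recorded in the run logs along the following lines. This is a minimal sketch: the function name and parameter names are illustrative, not part of the benchmark or validator code, and the pass/fail threshold is whatever minimum AU applies to the workload in question.

```python
def recompute_au(records_per_file, total_files, simulated_accelerators,
                 batch_size, computation_time, epochs,
                 total_benchmark_running_time):
    """Recompute Accelerator Utilization (AU, in percent) from values logged for a run."""
    total_compute_time = ((records_per_file * total_files) / simulated_accelerators
                          / batch_size * computation_time * epochs)
    return (total_compute_time / total_benchmark_running_time) * 100

# A run passes only if the recomputed AU is at or above the minimum required
# for that workload, e.g.: recompute_au(...) >= required_au_percent
```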
+**3.3.3.** For single-host submissions, increase the number of simulated accelerators by changing the `--num-accelerators` parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. -**3.2.3.** For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold. +**3.2.4.** For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on one host node, in one MLPerf Storage job, without going below the 90% accelerator utilization threshold. -**3.3.4.** For distributed Training submissions, all the data must be accessible to all the host nodes. +**3.3.5.** For distributed Training submissions, all the data must be accessible to all the host nodes. **_(not clear how to check this, so maybe remove?)_** -**3.3.5.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical. -While it is recommended that all host nodes be as close as possible to identical, that is not required by these Rules. The fact that distributed training uses a pool-wide common barrier to synchronize the transition from one step to the next of all host nodes results in the overall performance of the cluster being determined by the slowest host node. +**3.3.6.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical. -**3.3.6.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPUs cores and where DRAM capacity and NUMA locality have been configured. Alternatively, larger physical nodes that have higher numbers of cores or additional memory than the others may have those additional cores or memory disabled. +**3.3.7.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. **_(not clear we should do this, so maybe remove?)_** -**3.3.7.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase. +**3.3.8.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase. -**3.3.8.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. 
Any other parameters being modified must generate a message and fail the validation. +**3.3.9.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. **Table: Training Workload Tunable Parameters for CLOSED** @@ -316,7 +317,7 @@ While it is recommended that all host nodes be as close as possible to identical | storage.storage_root | The storage root directory | ./ | | storage.storage_type | The storage type | local_fs | -**3.3.9.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. +**3.3.10.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation. **Table: Training Workload Tunable Parameters for OPEN** @@ -331,9 +332,9 @@ While it is recommended that all host nodes be as close as possible to identical | *Reader parameters* | | | | reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch
ResNet-50: Tensorflow
Cosmoflow: Tensorflow | -**3.3.10** The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set and must be set to different values. +**3.3.11** The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set and must be set to different values. -**3.3.11** The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results. +**3.3.12** The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results. # 4. Validating the Checkpointing Workloads From 3a09f4474a925d13e7a3c90e4c4cbbcc1ee2a4d8 Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 16 Dec 2025 13:44:46 -0800 Subject: [PATCH 18/23] Fix bad formatting for the minimum dataset size calculation steps --- Rules.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/Rules.md b/Rules.md index 9bc6964f..97e29ee9 100644 --- a/Rules.md +++ b/Rules.md @@ -259,16 +259,16 @@ root_folder (or any name you prefer) **3.1.1.** The *submission validator* must verify that the *datasize* option was used by finding the entry(s) in the log file showing its use. **3.1.2.** The *submission validator* must recalculate the minimum dataset size by using the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles as described below and fail the run if the size recorded in the run's logfile doesn't exactly match the recalculated value. - * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): - * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` - * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): - * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length` - * Ensure we meet both constraints: - * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)` - * Calculate minimum files to generate - * `min_total_files= min_samples / num_samples_per_file` - * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` - * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. 
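The recomputation spelled out in the list above is straightforward arithmetic; a sketch of how a validator might implement it is shown below. The constants (a minimum of 500 steps per epoch and a HOST_MEMORY_MULTIPLIER of 5) come from the rules themselves; the function name, parameter names, and the use of a ceiling when converting samples to a file count are illustrative assumptions, not part of the benchmark code.

```python
import math

HOST_MEMORY_MULTIPLIER = 5        # from the rules above
MIN_STEPS_PER_EPOCH = 500         # num_steps_per_epoch is a minimum of 500

def minimum_dataset_size(num_accelerators_across_all_nodes, batch_size,
                         number_of_hosts, memory_per_host_in_GB,
                         record_length, num_samples_per_file,
                         num_steps_per_epoch=MIN_STEPS_PER_EPOCH):
    """Recompute the minimum dataset size (file count and GB) from logged run parameters."""
    # Enough samples to cover the required number of steps per epoch.
    min_samples_steps_per_epoch = (num_steps_per_epoch * batch_size
                                   * num_accelerators_across_all_nodes)
    # Enough samples that client-side caching cannot satisfy the reads.
    min_samples_host_memory_across_all_nodes = (number_of_hosts * memory_per_host_in_GB
                                                * HOST_MEMORY_MULTIPLIER
                                                * 1024 * 1024 * 1024 / record_length)
    min_samples = max(min_samples_steps_per_epoch,
                      min_samples_host_memory_across_all_nodes)
    min_total_files = math.ceil(min_samples / num_samples_per_file)
    min_files_size = min_samples * record_length / 1024 / 1024 / 1024   # in GB
    return min_total_files, min_files_size
```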
+ * Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500): + * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes` + * Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5): + * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length` + * Ensure we meet both constraints: + * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)` + * Calculate minimum files to generate + * `min_total_files= min_samples / num_samples_per_file` + * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024` + * A minimum of `min_total_files` files are required which will consume `min_files_size` GB of storage. ## 3.2. Datagen Options From 31e06309145cc742ba3ebec51fdf814de7eef2ca Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Tue, 16 Dec 2025 15:54:20 -0800 Subject: [PATCH 19/23] Add SystemDescription schema with detailed specifications First drraft of a schema for the SystemDescription.yaml files created by submitters. This is in the format required by Yamale. --- SystemDescription_Schema.yaml | 82 +++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 SystemDescription_Schema.yaml diff --git a/SystemDescription_Schema.yaml b/SystemDescription_Schema.yaml new file mode 100644 index 00000000..7d855be7 --- /dev/null +++ b/SystemDescription_Schema.yaml @@ -0,0 +1,82 @@ +system: include('system_description',required=True) +power: include('power_requirements',required=True) +nodes: + dlio_nodes: include('node_description',required=True) + storage_data_nodes: include('node_description',required=True) + storage_metadata_nodes: include('node_description',required=False) +--- +system_description: + name: str(min=1) + description: str(min=1) + storage_location: enum('remote','local','hyper-converged') + client_software: enum('in-box','proprietary') + storage_interface: enum('block','file','object') + required_rack_units: int(min=1) + shared_capabilities: + multi_host_support: enum('True','False') # False is used for local storage + simultaneous_write_support: enum('True','False') # Are simultaneous writes by multiple hosts supported? + simultaneous_read__support: enum('True','False') # Are simultaneous reads by multiple hosts supported? 
+ max_sequential_read: int(min=1,required=True) # In GiB/s + max_sequential_write: int(min=1,required=True) # In GiB/s + max_random_read: int(min=1,required=True) # In GiB/s + max_random_write: int(min=1,required=True) # In GiB/s +--- +power_requirements: + provisioned: include('power_summary',required=True) + consumed: include('power_summary',required=False) +--- +power_summary: + dlio_client: include('power_detail') + storage_data_node: include('power_detail') + backend_switch: include('power_detail') +--- +power_detail: + quantity: int(min=1 ) + psu1_nameplate_power: int(min=1,required=True) # in watts + psu2_nameplate_power: int(min=1,required=False) # in watts + psu3_nameplate_power: int(min=1,required=False) # in watts + psu4_nameplate_power: int(min=1,required=False) # in watts + psu5_nameplate_power: int(min=1,required=False) # in watts + psu6_nameplate_power: int(min=1,required=False) # in watts + design_power: int(min=1) # in Watts + num_active_psus: int(min=1) + num_passive_psus: int(min=0) +--- +node_description: + quantity: int(min=1) + hardware: include('hardware_description') + networking: list(include('network_instance'),min=1) + operating_system: include('operating_system_description') + tuning: + # All non-default tunings for OS need to be listed + mpi_configuration: + environment_variables: + version: Open MPI 4.1.4 + sysctl_parameters: + + +--- +hardware_description: + model: str(min=1) + rack_units: int(min=1) + power_supplies: int(min=1) + psu_configuration: enum('active/passive','active/active') + psu_rating: int(min=1) + memory_capacity: int(min=1) # in GB, eg: 256 + memory_configuration: 8x32GB + cpu_qty: int(min=1) + cpu_model: str(min=1) + cpu_cores: int(min=1) +--- +network_instance: + type: enum('management','data','backend') + model: str(min=1) + speed: int(min=1) # in Gb/s + qty: int(min=1) +--- +operating_system_description: + name: str(min=1) + version: str(min=1) + release_date: str(min=1) + kernel_version: str(min=1) + cpu_architecture: enum('x86','arm') From 98ab8af18145d8fc48f034e23bd77fe97cb3202b Mon Sep 17 00:00:00 2001 From: FileSystemGuy <99758333+FileSystemGuy@users.noreply.github.com> Date: Thu, 18 Dec 2025 11:23:20 -0800 Subject: [PATCH 20/23] Create README.md for checker directory Add README for submission validation checker directory --- mlpstorage/checker/README.md | 4 ++++ 1 file changed, 4 insertions(+) create mode 100644 mlpstorage/checker/README.md diff --git a/mlpstorage/checker/README.md b/mlpstorage/checker/README.md new file mode 100644 index 00000000..6c36b1fb --- /dev/null +++ b/mlpstorage/checker/README.md @@ -0,0 +1,4 @@ +# This directory contains the submision validation checker. + +The required reviews for this directory hierarchy are different tfrom the rest of the benchmark repo, +MLCommons' internal development group are required reviewers for any changes here. From 27b48335bf612c0bb6f2d22bd709e7c8af2de329 Mon Sep 17 00:00:00 2001 From: Curtis Anderson <99758333+FileSystemGuy@users.noreply.github.com> Date: Mon, 5 Jan 2026 14:13:22 -0800 Subject: [PATCH 21/23] Clarify two things Clarify the meaning of two rules. --- Rules.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Rules.md b/Rules.md index 97e29ee9..4b3812b7 100644 --- a/Rules.md +++ b/Rules.md @@ -31,7 +31,7 @@ The `mlpstorage` tool must be used to run the benchmarks, submitters are not all # 2. Directory Structure for All Submissions -**2.1.** The submission structure must start from a single directory whose name is the name of the submitter. 
This can be any string, possibly including blanks. +**2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, but a blank or any other character in that string that cannot be part of a POSIX filename should be replaced 1-for-1 with a dash character. **2.2.** Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. @@ -285,7 +285,7 @@ root_folder (or any name you prefer) **3.3.3.** For single-host submissions, increase the number of simulated accelerators by changing the `--num-accelerators` parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator. -**3.2.4.** For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum simulated accelerators that can be run on one host node, in one MLPerf Storage job, without going below the 90% accelerator utilization threshold. +**3.2.4.** For single-host submissions, in both CLOSED and OPEN division results, the validator should fail the run if there is more than one client node used during that run. **3.3.5.** For distributed Training submissions, all the data must be accessible to all the host nodes. **_(not clear how to check this, so maybe remove?)_** From 4efd41444372af284c1d8826394959c7ba02e309 Mon Sep 17 00:00:00 2001 From: Curtis Anderson <99758333+FileSystemGuy@users.noreply.github.com> Date: Mon, 5 Jan 2026 14:53:58 -0800 Subject: [PATCH 22/23] Add section summary tokens Updated rules to clarify requirements and add description tokens for each section. --- Rules.md | 118 +++++++++++++++++++++++++++---------------------------- 1 file changed, 59 insertions(+), 59 deletions(-) diff --git a/Rules.md b/Rules.md index 4b3812b7..e58f5da4 100644 --- a/Rules.md +++ b/Rules.md @@ -27,68 +27,68 @@ and for semantic mismatches between different options that were used. The `mlpstorage` tool must be used to run the benchmarks, submitters are not allowed to run the underlying tools (eg: DLIO) directly to generate a submission package. -**1.1.** The `mlpstorage` command must obtain (somehow) the pathname of the output file directory hierarchy and directly create and/or append to the files within that hierarchy to successively build out the submission folder. We don't want the submitter to manually create anything in that hierarchy except for the SystemDescription.* files (if we can help it). +1.1. **mlpstorageGeneratesHierarchy** -- The `mlpstorage` command must obtain (somehow) the pathname of the output file directory hierarchy and directly create and/or append to the files within that hierarchy to successively build out the submission folder. We don't want the submitter to manually create anything in that hierarchy except for the SystemDescription.* files (if we can help it). # 2. Directory Structure for All Submissions -**2.1.** The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, but a blank or any other character in that string that cannot be part of a POSIX filename should be replaced 1-for-1 with a dash character. +2.1. **submitterRootDirectory** -- The submission structure must start from a single directory whose name is the name of the submitter. 
This can be any string, but a blank or any other character in that string that cannot be part of a POSIX filename should be replaced 1-for-1 with a dash character. -**2.2.** Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. +2.2. **topLevelSubdirectories** -- Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive. -**2.3.** The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. +2.3. **openMatchesClosed** -- The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy describe just below. -**2.4.** Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). +2.4. **closedSubmitterDirectory** -- Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory). -**2.5.** Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. +2.5. **requiredSubdirectories** -- Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive. -**2.6.** The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. +2.6. c**odeDirectoryContents** -- The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents. If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. In a CLOSED submission, the *submission validator* should do an md5sum of the code directory hierarchy, compare that to a value hard-coded into the validator code, and fail the validation if there is a difference. -**2.7.** The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". +2.7. **systemsDirectoryFiles** -- The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name". Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive. -**2.8.** The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". +2.8. **resultsDirectorySystems** -- The "results" directory, whether it is within the "closed' or "open" hierarchies, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered". 
This name can be anything the submitter wants, it is just a name to both idenfity the set of results that were collected from a given configuration of storage system and to link together those results with the .pdf and .yaml files that describe the system-under-test. -**2.9.** All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*, so we should compare the configuration parameters and hardware and software components to verify that they're the same across all the tests and runs within the given *system name* directory hierarchy, to the extent that we can. The *system names* are case-sensitive. +2.9. **identicalSystemConfig** -- All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*, so we should compare the configuration parameters and hardware and software components to verify that they're the same across all the tests and runs within the given *system name* directory hierarchy, to the extent that we can. The *system names* are case-sensitive. -**2.10.** Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. +2.10. **workloadCategories** -- Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training", and/or "checkpointing". These names are case-sensitive. -**2.11.** Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. +2.11. **trainingWorkloads** -- Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50" and/or "cosmoflow". These names are case-sensitive. -**2.12.** Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. +2.12. **trainingPhases** -- Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive. -**2.13.** Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. +2.13. **datagenTimestamp** -- Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. 
Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. -**2.14.** Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. +2.14. **datagenFiles** -- Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive. -**2.15.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. +2.15. **datagenDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive. -**2.16.** Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. +2.16. **runResultsJson** -- Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive. -**2.17.** Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 6 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. Note that the 1st of those 6 is the *warm up* run and will not be included in the reported performance. +2.17. **runTimestamps** -- Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 6 subdirectories named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run. Note that the 1st of those 6 is the *warm up* run and will not be included in the reported performance. -**2.18** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them. +2.18. 
-**2.18** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them.
+2.18. **runTimestampGap** -- The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from its neighboring *timestamp directories*; i.e., the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them.
-**2.19.** Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
+2.19. **runFiles** -- Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log", "*output.json", "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
-**2.20.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
+2.20. **runDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
-**2.21.** Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive.
+2.21. **checkpointingWorkloads** -- Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive.
-**2.22.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive.
+2.22. **checkpointingResultsJson** -- Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive.
-**2.23.** Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named *YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run.
+2.23. **checkpointingTimestamps** -- Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named "YYYYMMDD_HHmmss" that represent a *timestamp* of when that part of the test run was completed. Where Y's are replaced with the year the run was performed, M's are replaced with the month, D's with the day, H's with the hour (in 24-hour format), m's with the minute, and s's with the second. The timestamps should be relative to the local timezone where the test was actually run.
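A sketch of the timestamp bookkeeping behind rules 2.18 and 2.23: parse the "YYYYMMDD_HHmmss" directory names and report the gaps between consecutive completion times, which the validator would then compare against each run's duration taken from that run's logs. The function name and example values are hypothetical.

```
from datetime import datetime

def completion_gaps(ts_dir_names):
    """Parse YYYYMMDD_HHmmss directory names (local time) and return the gaps,
    in seconds, between consecutive completion timestamps."""
    stamps = sorted(datetime.strptime(n, "%Y%m%d_%H%M%S") for n in ts_dir_names)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

# Hypothetical run-phase directories; the validator compares each gap against
# the duration of a single run to confirm there was no activity in between.
print(completion_gaps(["20250701_101500", "20250701_103200", "20250701_105109"]))
# -> [1020.0, 1149.0]
```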
-**2.24** The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from it's neighboring *timestamp directories*. Ie: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them.
+2.24. **checkpointingTimestampGap** -- The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from its neighboring *timestamp directories*; i.e., the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them.
-**2.25.** Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log" file, "*output.json, "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
+2.25. **checkpointingFiles** -- Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log", "*output.json", "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
-**2.27.** The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
+2.26. **checkpointingDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
-**2.28.** Pictorially, here is what this looks like:
+2.27. **directoryDiagram** -- Pictorially, here is what this looks like:
```
root_folder (or any name you prefer)
├── Closed
@@ -239,7 +239,7 @@ root_folder (or any name you prefer)
├──system-name-2.yaml
└──system-name-2.pdf
```
-**2.29.** Since the "dlio_log" subdirectory has a similar structure in all cases, it is describe pictorially just below:
+2.28. **dlioLog** -- Since the "dlio_log" subdirectory has a similar structure in all cases, it is described pictorially just below:
```
└── YYYYMMDD_HHmmss
├── [training|checkpointing]_[datagen|run].stdout.log
@@ -256,9 +256,9 @@ root_folder (or any name you prefer)
## 3.1. Datasize Options
-**3.1.1.** The *submission validator* must verify that the *datasize* option was used by finding the entry(s) in the log file showing its use.
+3.1.1. **verifyDatasizeUsage** -- The *submission validator* must verify that the *datasize* option was used by finding the entry or entries in the log file showing its use.
-**3.1.2.** The *submission validator* must recalculate the minimum dataset size by using the provided number of simulated accelerators and the sizes of all of the host node’s memory as reported in the logfiles as described below and fail the run if the size recorded in the run's logfile doesn't exactly match the recalculated value.
+3.1.2. **recalculateDatasetSize** -- The *submission validator* must recalculate the minimum dataset size by using the provided number of simulated accelerators and the sizes of all of the host nodes' memory as reported in the logfiles, as described below, and fail the run if the size recorded in the run's logfile doesn't exactly match the recalculated value.
* Calculate required minimum samples given number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500):
  * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes`
* Calculate required minimum samples given host memory to eliminate client-side caching effects (NB: HOST_MEMORY_MULTIPLIER = 5):
@@ -272,30 +272,30 @@ root_folder (or any name you prefer)
## 3.2. Datagen Options
-**3.2.1** The amount of data generated during the *datagen* phase must be equal **or larger** than the amount of data calculated during the *datasize* phase or the run must be failed.
+3.2.1. **datagenMinimumSize** -- The amount of data generated during the *datagen* phase must be equal to **or larger** than the amount of data calculated during the *datasize* phase or the run must be failed.
## 3.3. Run Options
-**3.3.1.** The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. To express that, you can run the benchmark on a subset of that dataset by setting `num_files_train` or `num_files_eval` smaller than the number of files available in the dataset folder, but `num_subfolders_train` and `num_subfolders_eval` must be to be equal to the actual number of subfolders inside the dataset folder in order to generate valid results.
+3.3.1. **runDataMatchesDatasize** -- The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. To express that, you can run the benchmark on a subset of that dataset by setting `num_files_train` or `num_files_eval` smaller than the number of files available in the dataset folder, but `num_subfolders_train` and `num_subfolders_eval` must be equal to the actual number of subfolders inside the dataset folder in order to generate valid results.
-**3.3.2.** To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value:
+3.3.2. **acceleratorUtilizationCheck** -- To pass a benchmark run, the AU (Accelerator Utilization) should be equal to or greater than the minimum value:
* `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs`
* `AU = (total_compute_time/total_benchmark_running_time) * 100`
* All the I/O operations from the first step are excluded from the AU calculation. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however.
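As a worked example of the AU check in rule 3.3.2, the sketch below simply re-states the two formulas above in code; the input values are hypothetical and for illustration only.

```
def accelerator_utilization(records_per_file, total_files, simulated_accelerators,
                            batch_size, computation_time, epochs,
                            total_benchmark_running_time):
    """Recompute AU exactly as the rule 3.3.2 formulas define it (times in seconds)."""
    total_compute_time = ((records_per_file * total_files) / simulated_accelerators
                          / batch_size * computation_time * epochs)
    return (total_compute_time / total_benchmark_running_time) * 100

# Hypothetical numbers; a run passes when the result is at or above the
# minimum AU defined for the workload.
au = accelerator_utilization(records_per_file=1, total_files=42000,
                             simulated_accelerators=8, batch_size=4,
                             computation_time=0.323, epochs=5,
                             total_benchmark_running_time=2300)
print(f"AU = {au:.1f}%")   # -> AU = 92.2%
```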
-**3.3.3.** For single-host submissions, increase the number of simulated accelerators by changing the `--num-accelerators` parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator.
+3.3.3. **singleHostSimulatedAccelerators** -- For single-host submissions, increase the number of simulated accelerators by changing the `--num-accelerators` parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator.
-**3.2.4.** For single-host submissions, in both CLOSED and OPEN division results, the validator should fail the run if there is more than one client node used during that run.
+3.3.4. **singleHostClientLimit** -- For single-host submissions, in both CLOSED and OPEN division results, the validator should fail the run if there is more than one client node used during that run.
-**3.3.5.** For distributed Training submissions, all the data must be accessible to all the host nodes. **_(not clear how to check this, so maybe remove?)_**
+3.3.5. **distributedDataAccessibility** -- For distributed Training submissions, all the data must be accessible to all the host nodes. **_(not clear how to check this, so maybe remove?)_**
-**3.3.6.** For distributed Training submissions, the number of simulated accelerators in each host node must be identical.
+3.3.6. **identicalAcceleratorsPerNode** -- For distributed Training submissions, the number of simulated accelerators in each host node must be identical.
-**3.3.7.** For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code are widely enough different in their capability. **_(not clear we should do this, so maybe remove?)_**
+3.3.7. **nodeCapabilityConsistency** -- For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code differ widely enough in their capability. **_(not clear we should do this, so maybe remove?)_**
-**3.3.8.** For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierachy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase.
+3.3.8. **closedSubmissionChecksum** -- For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierarchy in the submission package and verify that that matches a precalculated checksum stored as a literal in the validator's codebase.
-**3.3.9.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
+3.3.9. **closedSubmissionParameters** -- For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
**Table: Training Workload Tunable Parameters for CLOSED**
@@ -317,7 +317,7 @@ root_folder (or any name you prefer)
| storage.storage_root | The storage root directory | ./ |
| storage.storage_type | The storage type | local_fs |
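The override check in rule 3.3.9 could be sketched as below. The Hydra-style "key=value" format assumed for `overrides.yaml`, the two allow-listed entries shown (the real allowlist is every parameter row of the table above), and the availability of PyYAML to the validator are all assumptions for illustration.

```
import yaml   # PyYAML, assumed available to the validator

# Illustrative subset of the CLOSED allowlist; the real set holds every
# parameter row of the table above.
CLOSED_ALLOWED = {"storage.storage_root", "storage.storage_type"}

def check_closed_overrides(overrides_yaml_path):
    """Fail the run if overrides.yaml touches a parameter outside the allowlist.
    Assumes Hydra-style overrides: a YAML list of 'key=value' strings,
    optionally prefixed with '+' or '++'."""
    with open(overrides_yaml_path) as fh:
        overrides = yaml.safe_load(fh) or []
    problems = []
    for item in overrides:
        key = str(item).split("=", 1)[0].lstrip("+")
        if key not in CLOSED_ALLOWED:
            problems.append(f"parameter {key!r} may not be modified in a CLOSED submission")
    return problems
```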
-**3.3.10.** For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
+3.3.10. **openSubmissionParameters** -- For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
**Table: Training Workload Tunable Parameters for OPEN**
@@ -332,25 +332,25 @@ root_folder (or any name you prefer)
| *Reader parameters* | | |
| reader.data_loader | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch <br> ResNet-50: Tensorflow <br> Cosmoflow: Tensorflow |
-**3.3.11** The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set and must be set to different values.
+3.3.11. **mlpstoragePathArgs** -- The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set and must be set to different values.
-**3.3.12** The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results.
+3.3.12. **mlpstorageFilesystemCheck** -- The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by accident, place the logfiles onto the storage system under test since that would skew the results.
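One way a validator (or the tool itself) could implement the separate-filesystem requirement of rule 3.3.12 at run time, instead of parsing the recorded `df` output; the device-ID comparison and the paths shown are illustrative assumptions, not the shipped implementation.

```
import os

def on_same_filesystem(path_a, path_b):
    """Two paths are on the same filesystem when they report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Hypothetical paths: reject the run if the dataset and the logfiles share a filesystem.
if on_same_filesystem("/mnt/dataset", "/var/log/mlpstorage"):
    raise SystemExit("dataset and results/log directories must be on different filesystems")
```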
# 4. Validating the Checkpointing Workloads
## 4.1. Benchmark Run Options
-**4.1.1.** The checkpoint data written per client node must be more than 3x the client node's memory capacity, otherwise the filesystem cache needs to be cleared between the write and read phases.
+4.1.1. **checkpointDataSizeRatio** -- The checkpoint data written per client node must be more than 3x the client node's memory capacity, otherwise the filesystem cache needs to be cleared between the write and read phases.
-**4.1.2.** We must verify that all the benchmark workload configuration files have been set to do an fsync call at the end of each of the 10 checkpoint writes.
+4.1.2. **fsyncVerification** -- We must verify that all the benchmark workload configuration files have been set to do an fsync call at the end of each of the 10 checkpoint writes.
-**4.1.3.** The benchmark must be run with one of the four model configuration detailed below.
+4.1.3. **modelConfigurationReq** -- The benchmark must be run with one of the four model configurations detailed below.
-**4.1.4.** For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. (see table 2)
+4.1.4. **closedMpiProcesses** -- For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. (see table 2)
-**4.1.5.** For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators and the total number of simulated accelerators (the total number of processes) matches the requirement. (see table 2)
+4.1.5. **closedAcceleratorsPerHost** -- For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators and the total number of simulated accelerators (the total number of processes) matches the requirement. (see table 2)
-**4.1.6.** The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. That is, the GB of memory associated with the chosen accelerator (eg: H100) times the accelerator count must be equal to or greater than the total checkpoint size for that scale of checkpoint. (see table 2)
+4.1.6. **aggregateAcceleratorMemory** -- The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model’s checkpoint size. That is, the GB of memory associated with the chosen accelerator (e.g., H100) times the accelerator count must be equal to or greater than the total checkpoint size for that scale of checkpoint. (see table 2)
**Table 2 LLM models**
@@ -367,7 +367,7 @@ root_folder (or any name you prefer)
| Checkpoint size | 105 GB | 912 GB | 5.29 TB | 18 TB |
| Subset: 8-Process Size | 105 GB | 114 GB | 94 GB | 161 GB |
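Rule 4.1.6 reduces to a single comparison. The sketch below uses hypothetical numbers: an 80 GB simulated accelerator, 64 processes, and the 912 GB checkpoint size taken from Table 2.

```
def enough_accelerator_memory(accelerator_memory_gb, num_accelerators, checkpoint_size_gb):
    """Rule 4.1.6: aggregate simulated-accelerator memory must be at least the
    checkpoint size for the chosen model scale."""
    return accelerator_memory_gb * num_accelerators >= checkpoint_size_gb

# 80 GB x 64 processes = 5120 GB of aggregate memory against a 912 GB checkpoint -> True.
print(enough_accelerator_memory(80, 64, 912))
```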
-**4.1.7.** For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
+4.1.7. **closedCheckpointParameters** -- For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameters being modified must generate a message and fail the validation.
**Table: Checkpoint Workload Tunable Parameters for CLOSED**
|----------------------------------|-------------------------------------------------------------|-----------------------|
| checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints | ./checkpoints/ |
-**4.1.8.** For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution.
+4.1.8. **openSubmissionScaling** -- For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution.
**Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions**
@@ -389,21 +389,21 @@ root_folder (or any name you prefer)
**NOTE: In the ``--ppn`` syntax above, the ``slotcount`` value means the number of processes per node to run.**
-**4.1.9** The arguments to `mlpstorage` that set the directory pathname where the checkpoints are written and read and the directory where the output logfiles are stored must both be set and must be set to different values.
+4.1.9. **checkpointPathArgs** -- The arguments to `mlpstorage` that set the directory pathname where the checkpoints are written and read and the directory where the output logfiles are stored must both be set and must be set to different values.
-**4.1.10** The `mlpstorage` command should do a "df" command on the directory pathname where the checkpoints are written and read and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by acccident, place the logfiles onto the storage system under test since that would skew the results.
+4.1.10. **checkpointFilesystemCheck** -- The `mlpstorage` command should do a "df" command on the directory pathname where the checkpoints are written and read and another one on the directory pathname where the output logfiles are stored and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems. We don't want the submitter to, by accident, place the logfiles onto the storage system under test since that would skew the results.
-**4.1.11** The `mlpstorage` command must accept a parameter telling it that this is a *subset* run and add that info to the output log file. The *submission validator* must flag an error if the `subset` argument is given but the total number of accelerators is not exactly 8, or the model is "8B".
+4.1.11. **subsetRunValidation** -- The `mlpstorage` command must accept a parameter telling it that this is a *subset* run and add that info to the output log file. The *submission validator* must flag an error if the `subset` argument is given but the total number of accelerators is not exactly 8, or the model is "8B".
## 4.2. Storage System Must Be Simultaneously R/W or _Remappable_
-**4.2.1.** If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator must check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a short pause of up to 30 seconds maximum, then the write phase was started with ``--num-checkpoints-write=0`` set.
+4.2.1. **cacheFlushValidation** -- If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator must check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a short pause of up to 30 seconds maximum, and then that the read phase was started with ``--num-checkpoints-write=0`` set.
-**4.2.2.** The validator must verify that the total test duration starts from the timestamp of the first checkpoint written and ends at the ending timestamp of the last checkpoint read, notably including the "remapping" time.
+4.2.2. **totalTestDuration** -- The validator must verify that the total test duration starts from the timestamp of the first checkpoint written and ends at the ending timestamp of the last checkpoint read, notably including the "remapping" time.
-**4.2.3.** For a _remapping_ solution, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different host node must be reported in the `SystemDescription.yaml` file.
+4.2.3. **remappingTimeReporting** -- For a _remapping_ solution, the time duration between the checkpoint being completed and the earliest time that that checkpoint could be read by a different host node must be reported in the `SystemDescription.yaml` file.
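A sketch of the duration accounting in rule 4.2.2; the timestamp format and the values are assumed to come from the run's logs and are illustrative only.

```
from datetime import datetime

def total_test_duration_s(first_checkpoint_write_ts, last_checkpoint_read_end_ts,
                          fmt="%Y%m%d_%H%M%S"):
    """The reported duration spans from the timestamp of the first checkpoint
    written to the end of the last checkpoint read, so any pause or remapping
    time in between is included in the measurement."""
    return (datetime.strptime(last_checkpoint_read_end_ts, fmt)
            - datetime.strptime(first_checkpoint_write_ts, fmt)).total_seconds()

print(total_test_duration_s("20250701_120000", "20250701_123415"))   # -> 2055.0
```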
-**4.2.4.** The system_configuration.yaml document must list whether the solution support simultaneous reads and/or writes as such:
+4.2.4. **simultaneousRwSupport** -- The system_configuration.yaml document must list whether the solution supports simultaneous reads and/or writes as such:
```
System:
  shared_capabilities:

From bb9c5c1ca757e7580af39229c14b891359a96407 Mon Sep 17 00:00:00 2001
From: Curtis Anderson <99758333+FileSystemGuy@users.noreply.github.com>
Date: Mon, 5 Jan 2026 14:57:03 -0800
Subject: [PATCH 23/23] Fix typo in codeDirectoryContents rule

---
 Rules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Rules.md b/Rules.md
index e58f5da4..e15c5c79 100644
--- a/Rules.md
+++ b/Rules.md
@@ -41,7 +41,7 @@ The `mlpstorage` tool must be used to run the benchmarks, submitters are not all
2.5. **requiredSubdirectories** -- Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive.
-2.6. c**odeDirectoryContents** -- The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents.
+2.6. **codeDirectoryContents** -- The "code" directory must include a complete copy of the MLPerf Storage github repo that was used to run the test that resulted in the "results" directory's contents.
If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code. Note that in both cases this must be the code that was actually run to generate those results. In a CLOSED submission, the *submission validator* should do an md5sum of the code directory hierarchy, compare that to a value hard-coded into the validator code, and fail the validation if there is a difference.
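For the md5 comparison mentioned in rule 3.3.8 and in the codeDirectoryContents rule above, one illustrative approach is to hash the submitted "code" tree in a deterministic order. This is only a sketch, not the exact `md5sum` invocation the validator will use, and the expected value shown is a placeholder.

```
import hashlib
import os

def code_tree_md5(root):
    """Hash every file under the submitted 'code' directory in a stable order,
    combining relative paths and file contents into a single digest."""
    combined = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                       # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            combined.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as fh:
                combined.update(fh.read())
    return combined.hexdigest()

EXPECTED_MD5 = "0" * 32    # placeholder, not the value baked into the validator
print("CLOSED code tree unmodified:", code_tree_md5("./code") == EXPECTED_MD5)
```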