From e77a75e06faa3b977b9fa49a207819936773a92f Mon Sep 17 00:00:00 2001 From: Getnet Date: Mon, 15 Sep 2025 15:18:34 -0500 Subject: [PATCH 1/5] Updated run requirements. --- HPL/README.md | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/HPL/README.md b/HPL/README.md index 90a3b82..eb02037 100644 --- a/HPL/README.md +++ b/HPL/README.md @@ -1,7 +1,7 @@ # HPL ## Purpose and Description -HPL solves a random dense linear system in double precision arithmetic on distributed-memory. It depends on the Message Passing Interface, Basic Linear Algebra Subprograms, or the Vector Signal Image Processing Library. The software is available at the [Netlib HPL benchmark](https://www.netlib.org/benchmark/hpl/) and most vendors including ([Nvidia](https://docs.nvidia.com/nvidia-hpc-benchmarks/HPL_benchmark.html), [AMD](https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html), and [Intel](https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-1/overview-intel-distribution-for-linpack-benchmark.html) offer hardware-optimized versions of it. The HPL benchmark assesses both the peak performance and the integrity of the system's hardware, from individual nodes to the entire system. +HPL solves a random dense linear system in double precision arithmetic on distributed-memory. It depends on the Message Passing Interface, Basic Linear Algebra Subprograms, or the Vector Signal Image Processing Library. 
The software is available at the [Netlib HPL benchmark](https://www.netlib.org/benchmark/hpl/) and most vendors including ([Nvidia](https://docs.nvidia.com/nvidia-hpc-benchmarks/HPL_benchmark.html), [AMD](https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html), and [Intel](https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-1/overview-intel-distribution-for-linpack-benchmark.html) offer hardware-optimized versions of it. The HPL benchmark assesses the floating-point performance by stressing cores and memory of the target system. ## Licensing Requirements @@ -19,6 +19,20 @@ Optimized binaries of HPL can be obtained from [Intel-HPL](https://www.intel.com ## Run Definitions and Requirements +## Tests + +Testing will include single-node and multi-node configurations. The single-node test exposes the compute capability of the unit by reducing the effect interconnect at a node level, while the multi-node test measures the overall system performance and exposes scaling and interconnect bottlenecks at the system level. In general, the problem size (`N`) should be tuned to saturate at least 80% of available system memory so that maximum peak performance would be acheived. + +## How to run + +To execute xhpl on CPUs, MPI with or without OpenMP support is required. The required inputs are the number of nodes, the total number of MPI ranks, the number of ranks per node, and the number of OpenMP threads per MPI rank. The benchmark results can be obtained with the Slurm workload manager: `srun -N <number of nodes> -n <number of MPI ranks> -c <number of CPUs per task> ./xhpl`. + +To run xhpl on GPUs, GPU-aware MPI with or without OpenMP support is needed for optimal performance. The required inputs are the number of nodes, the total number of MPI ranks, the number of ranks per node, the number of OpenMP threads per MPI rank, and the number of GPUs per node.
The benchmark results can be obtained with Slurm: `srun -N <number of nodes> -n <number of MPI ranks> --cpus-per-task=<number of CPUs per task> --gpus-per-node=<number of GPUs per node> ./xhpl`. + +The offeror should reveal any potential for performance optimization on the target system by running As-is and Optimized cases to identify an optimal task configuration. The As-is case will saturate at least 90% of the cores on a CPU node to establish baseline performance and expose potential bottlenecks in the CPU’s floating-point performance and memory-related issues such as suboptimal memory bandwidth. The optimized case will saturate all NUMA nodes and threads on the CPU node, and will include configurations exploring strategies for achieving the maximum FLOPS on the target system. On GPU nodes, the optimized case will saturate all GPUs and one thread per node, aiming to reveal GPU and interconnect performance bottlenecks. + +## How to validate + The run output should show that tests sucessfully passed, finished and ended, as in: ``` ================================================================================ ... @@ -39,22 +53,14 @@ End of Tests. ``` -## How to run - -To execute xhpl on CPUs, MPI with or without OpenMP support are required. The required input is number of nodes, total number of MPI ranks, total number of ranks per node and total number of OpenMP threads per MPI rank. The benchmark results can be obtained with Slurm cluster management: `srun -N -n -c < number of cpus per task> ./xhpl`. - -To run xhpl on GPUs, you need GPU-Aware MPI with or without OpenMP support for an optimal performance. The required input is number of nodes, total number of MPI ranks, total number of ranks per node, total number of OpenMP threads per MPI rank, and total number of GPUs per node. The benchmark results can be obtained with Slurm: `srun -N -n --cpus-per-task=< number of CPUs per task> --gpus-per-node= ./xhpl`. - -### Tests - -Testing will include single-node and multi-node configurations.
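The Slurm invocations described in the run instructions above can be sketched concretely. In the snippet below, the node, rank, thread, and GPU counts are assumed example values for a hypothetical 4-node system, not prescribed settings; the commands are printed rather than executed, since `srun` only works inside a Slurm allocation:

```shell
# All values below are assumed examples for a hypothetical 4-node system.
NODES=4
RANKS_PER_NODE=8
CPUS_PER_TASK=16              # OpenMP threads per MPI rank
GPUS_PER_NODE=4
TOTAL_RANKS=$(( NODES * RANKS_PER_NODE ))

# The OpenMP thread count is usually matched to the CPUs given to each task.
export OMP_NUM_THREADS=$CPUS_PER_TASK

# CPU run: one MPI rank per set of CPUs.
echo "srun -N $NODES -n $TOTAL_RANKS -c $CPUS_PER_TASK ./xhpl"

# GPU run: same layout plus a GPUs-per-node request.
echo "srun -N $NODES -n $TOTAL_RANKS --cpus-per-task=$CPUS_PER_TASK --gpus-per-node=$GPUS_PER_NODE ./xhpl"
```

For these example values the printed commands are `srun -N 4 -n 32 -c 16 ./xhpl` and `srun -N 4 -n 32 --cpus-per-task=16 --gpus-per-node=4 ./xhpl`.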
- ## Run Rules -Publicly available, optimized HPL versions or binaries are permitted. A single or multiple programming models might be used to optimize performance based on the architecture of the machine. +* Publicly available, optimized HPL versions or binaries are permitted. +* Optimizations to the code, compilers, and task configuration are allowed as long as the offeror provides a high-level description of the optimization techniques used and their impact on performance in the Text response. +* The offeror may use HPL v2.3.x or a later version with GNU, Intel, or accelerator-specific compilers and libraries. ## Benchmark test results to report and files to return -* The Make.myarch files or script, job submission scripts, stdout and stderr files from each run, an environment dump, and HPL.dat files shall be included in the File response. -* The Text response should include high-level descriptions of build and run optimizations. +* File response should include Make.myarch files or script, job submission scripts, stdout and stderr files, an environment dump, and HPL.dat files for each run. +* The Text response should include high-level descriptions of any optimizations and justification if the obtained performance vary from theoretical performance by less than 80%. * For performance reporting, the performance reported in the output files and the theoretical performance should be entered into the Spreadsheet (`report/HPL_benchmark.csv`) response.
\ No newline at end of file From fc3400c61d440350789d4f4ff7ab94edffcb8652 Mon Sep 17 00:00:00 2001 From: gbetrie <51185244+gbetrie@users.noreply.github.com> Date: Tue, 16 Sep 2025 08:28:25 -0500 Subject: [PATCH 2/5] Update HPL/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- HPL/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/HPL/README.md b/HPL/README.md index eb02037..daab297 100644 --- a/HPL/README.md +++ b/HPL/README.md @@ -21,7 +21,7 @@ Optimized binaries of HPL can be obtained from [Intel-HPL](https://www.intel.com ## Tests -Testing will include single-node and multi-node configurations. The single-node test exposes the compute capability of the unit by reducing the effect interconnect at a node level, while the multi-node test measures the overall system performance and exposes scaling and interconnect bottlenecks at the system level. In general, the problem size (`N`) should be tuned to saturate at least 80% of available system memory so that maximum peak performance would be acheived. +Testing will include single-node and multi-node configurations. The single-node test exposes the compute capability of the unit by reducing the effect of the interconnect at the node level, while the multi-node test measures the overall system performance and exposes scaling and interconnect bottlenecks at the system level. In general, the problem size (`N`) should be tuned to saturate at least 80% of available system memory so that maximum performance can be achieved.
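The 80%-of-memory guidance above can be turned into a quick sizing calculation: each double-precision matrix element occupies 8 bytes, so N ≈ sqrt(0.80 × memory_bytes / 8). The snippet below is a sketch with assumed example values (4 nodes with 256 GiB each, block size `NB=192`), not a prescribed formula:

```shell
# Assumed example values: 4 nodes with 256 GiB of memory each.
NODES=4
MEM_PER_NODE_GIB=256
TOTAL_BYTES=$(( NODES * MEM_PER_NODE_GIB * 1024 * 1024 * 1024 ))

# Target 80% of memory; each double-precision element takes 8 bytes,
# so N = sqrt(0.80 * total_bytes / 8).
N=$(awk -v b="$TOTAL_BYTES" 'BEGIN { printf "%d", sqrt(0.80 * b / 8) }')

# Round N down to a multiple of the block size NB (192 is an assumed example).
NB=192
N=$(( N / NB * NB ))
echo "Suggested problem size N = $N"
```

For these example values the script prints `Suggested problem size N = 331584`; real runs should start from the actual per-node memory and tune `N` and `NB` empirically.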
## How to run From 8c6ad04a3073305348b60025d81caff778303bc6 Mon Sep 17 00:00:00 2001 From: gbetrie <51185244+gbetrie@users.noreply.github.com> Date: Tue, 16 Sep 2025 08:28:45 -0500 Subject: [PATCH 3/5] Update HPL/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- HPL/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/HPL/README.md b/HPL/README.md index daab297..e9f7767 100644 --- a/HPL/README.md +++ b/HPL/README.md @@ -33,7 +33,7 @@ The offeror should reveal any potential for performance optimization on the targ ## How to validate -The run output should show that tests sucessfully passed, finished and ended, as in: +The run output should show that the tests passed, finished, and ended successfully, as in: ``` ================================================================================ ... From 2e737205555aad63a21ec85fa82b3ef076899d8d Mon Sep 17 00:00:00 2001 From: gbetrie <51185244+gbetrie@users.noreply.github.com> Date: Tue, 16 Sep 2025 08:29:33 -0500 Subject: [PATCH 4/5] Update HPL/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- HPL/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/HPL/README.md b/HPL/README.md index e9f7767..5ae5770 100644 --- a/HPL/README.md +++ b/HPL/README.md @@ -62,5 +62,5 @@ End of Tests. ## Benchmark test results to report and files to return * File response should include Make.myarch files or script, job submission scripts, stdout and stderr files, an environment dump, and HPL.dat files for each run. -* The Text response should include high-level descriptions of any optimizations and justification if the obtained performance vary from theoretical performance by less than 80%. +* The Text response should include high-level descriptions of any optimizations, and a justification if the obtained performance is less than 80% of the theoretical performance.
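The comparison against theoretical performance required above can be sketched as follows. The core count, clock speed, FLOPs per cycle, and measured result below are all assumed example numbers; the real values come from the target system's specifications and the HPL run output:

```shell
# Assumed example hardware: 4 nodes, 128 cores/node, 2.5 GHz,
# 16 double-precision FLOPs per core per cycle (e.g. dual AVX-512 FMA units).
NODES=4
CORES_PER_NODE=128
CLOCK_GHZ=2.5
FLOPS_PER_CYCLE=16

# Theoretical peak (Rpeak) in GFLOPS = nodes * cores * GHz * FLOPs/cycle.
RPEAK=$(awk -v n="$NODES" -v c="$CORES_PER_NODE" -v g="$CLOCK_GHZ" -v f="$FLOPS_PER_CYCLE" \
  'BEGIN { printf "%.1f", n * c * g * f }')

# Assumed measured HPL result (Rmax) in GFLOPS, read from the run output.
RMAX=16384.0
EFFICIENCY=$(awk -v rmax="$RMAX" -v rpeak="$RPEAK" 'BEGIN { printf "%.1f", 100 * rmax / rpeak }')
echo "Rpeak = $RPEAK GFLOPS, efficiency = $EFFICIENCY%"
```

For these example values the script prints `Rpeak = 20480.0 GFLOPS, efficiency = 80.0%`, i.e. a result sitting exactly at the 80% reporting threshold.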
* For performance reporting, the performance reported in the output files and the theoretical performance should be entered into the Spreadsheet (`report/HPL_benchmark.csv`) response. \ No newline at end of file From 20e92aa4d46250e92b11077009caa0eb8941880b Mon Sep 17 00:00:00 2001 From: gbetrie <51185244+gbetrie@users.noreply.github.com> Date: Tue, 16 Sep 2025 09:08:29 -0500 Subject: [PATCH 5/5] Update HPL/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- HPL/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/HPL/README.md b/HPL/README.md index 5ae5770..d72832e 100644 --- a/HPL/README.md +++ b/HPL/README.md @@ -1,7 +1,7 @@ # HPL ## Purpose and Description -HPL solves a random dense linear system in double precision arithmetic on distributed-memory. It depends on the Message Passing Interface, Basic Linear Algebra Subprograms, or the Vector Signal Image Processing Library. The software is available at the [Netlib HPL benchmark](https://www.netlib.org/benchmark/hpl/) and most vendors including ([Nvidia](https://docs.nvidia.com/nvidia-hpc-benchmarks/HPL_benchmark.html), [AMD](https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html), and [Intel](https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-1/overview-intel-distribution-for-linpack-benchmark.html) offer hardware-optimized versions of it. The HPL benchmark assesses the floating-point performance by stressing cores and memory of the target system. +HPL solves a random dense linear system in double precision arithmetic on distributed-memory computers. It requires the Message Passing Interface and either the Basic Linear Algebra Subprograms or the Vector Signal Image Processing Library.
The software is available at the [Netlib HPL benchmark](https://www.netlib.org/benchmark/hpl/), and most vendors, including [Nvidia](https://docs.nvidia.com/nvidia-hpc-benchmarks/HPL_benchmark.html), [AMD](https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html), and [Intel](https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-1/overview-intel-distribution-for-linpack-benchmark.html), offer hardware-optimized versions of it. The HPL benchmark assesses floating-point performance by stressing the cores and memory of the target system. ## Licensing Requirements