This repository contains the code for FFTMatvec, described in the paper "Sreeram Venkat, Milinda Fernando, Stefan Henneking, and Omar Ghattas. Fast and Scalable FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices with Application to Linear Inverse Problems Governed by Autonomous Dynamical Systems. SIAM Journal on Scientific Computing. 2025. To appear. arXiv preprint arXiv:2407.13066."
FFTMatvec is now performance portable to AMD GPUs and supports mixed-precision computations. See "Sreeram Venkat, Kasia Swirydowicz, Noah Wolfe, and Omar Ghattas. Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices. Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2025. To appear. arXiv preprint [arXiv:2508.10202](https://arxiv.org/abs/2508.10202)."
View an animation of the FFTMatvec algorithm here.
Note: This is the main branch, which only supports NVIDIA GPUs. For AMD and (experimental) Intel support as well as mixed-precision computation, use the mp branch.
The documentation for the code can be found here.
Note: To run on Intel GPUs (experimental), unzip the sycl-release directory and follow the instructions in the README.md file in that directory.
To build the code, the following dependencies are required:
- CUDA (with cuFFT, cuBLAS, and, optionally, cuTENSOR 2.x) and a CUDA-enabled GPU, OR ROCm/HIP (with rocFFT, rocBLAS, and RCCL) and an AMD GPU
- NCCL for running on NVIDIA GPUs
- HDF5 (parallel version is required)
First, clone the repository:
git clone https://github.com/s769/FFTMatvec.git
cd FFTMatvec
git checkout mp
Then, build the code (CUDA):
cmake -B build -DNCCL_ROOT=/path/to/nccl -DCMAKE_BUILD_TYPE=Release [-DCUTENSOR_ROOT=/path/to/cutensor] [-DCUDA_ARCH=XX]
cmake --build build --parallel
or (HIP):
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_HIP=ON [-DAMDGPU_TARGETS=gfxABC]
cmake --build build --parallel
Note: the -DCUTENSOR_ROOT option is only needed if the cuTENSOR 2.x library is not in the usual CUDA library path. Some systems may have the cuTENSOR 1.x library in the CUDA library path, which is not compatible with this code. In that case, the cuTENSOR 2.x library must be installed, and its path must be provided to the build command.
Tests will build by default. If you don't want to build the tests, you can disable them by adding -DENABLE_TESTING=OFF to the cmake command. To run the tests, use ctest in the build directory. Tests require a minimum of 2 GPUs to run by default. To change this, set -DNUM_PROCS=X to the desired number of GPUs.
To build the documentation, the following dependencies are required:
Then, build the documentation by running mkdocs build or mkdocs serve in the docs directory. The built documentation will be in the site directory.
The main executable is fft_matvec. It takes the following arguments:
- -pr (int): Number of processor rows (default: 1)
- -pc (int): Number of processor columns (default: 1)
- -g (bool): Use global sizes (default: false)
- -Nm (int): Number of global block columns (default: 10, ignored if -g is false)
- -Nd (int): Number of global block rows (default: 5, ignored if -g is false)
- -Nt (int): Block size (default: 7)
- -nm (int): Number of local block columns (default: 3, ignored if -g is true)
- -nd (int): Number of local block rows (default: 2, ignored if -g is true)
- -v (bool): Print input/output vectors (default: false)
- -N (int): Number of matvecs to use for timing (default: 100)
- -raw (bool): Print raw timing data instead of table (default: false)
- -t (bool): Check matvec results (default: false)
- -prec (string): Precision code: 5 characters, each D or S (case insensitive), representing the precision of the corresponding matrix/vector component (D=double, S=single). Components are: broadcast/pad, fft, sbgemv, ifft, unpad/reduce. Default is DDDDD.
- -s (string): Directory to save output files to (default "" - don't save output)
- -rand (bool): Use deterministic double precision values for the input vectors/matrices that cannot be represented as single precision floats without error. Result checking will not work with this option enabled (default: false)
- -h (bool): Print help message
Note: pr x pc must be equal to the number of processors used to run the code. If no values are provided for -pr and -pc, the code will run with pr = 1 and pc = num_mpi_procs.
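As a rough illustration of the grid rule in the note above, the following shell sketch checks that pr x pc matches the process count before printing a launch command. The check_grid helper is hypothetical and not part of FFTMatvec. It assumes that either both grid dimensions are given or neither is, since the behavior when only one is given is not described here.

```shell
# Hypothetical helper (not part of FFTMatvec): validate a processor grid
# against the number of MPI processes and print the matching launch command.
# Usage: check_grid NP [PR PC]
check_grid() {
    np=$1
    if [ "$#" -eq 1 ]; then
        # No grid given: mirror the documented fallback pr = 1, pc = num_mpi_procs.
        pr=1
        pc=$np
    elif [ "$#" -eq 3 ]; then
        pr=$2
        pc=$3
    else
        echo "usage: check_grid NP [PR PC]" >&2
        return 2
    fi
    # pr x pc must equal the number of processes.
    if [ $((pr * pc)) -ne "$np" ]; then
        echo "error: pr x pc = $((pr * pc)) but np = $np" >&2
        return 1
    fi
    echo "mpiexec -np $np ./build/fft_matvec -pr $pr -pc $pc"
}

check_grid 4 2 2   # valid 2x2 grid on 4 processes
check_grid 6       # falls back to a 1x6 grid
```

Running the check before submitting a job may catch a mismatched grid earlier than the executable would.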
For boolean arguments, just pass the flag to enable it without a value. For example:
mpiexec -np 4 ./build/fft_matvec -pr 2 -pc 2 -g -Nm 20 -Nd 10 -Nt 7 -nm 4 -nd 3 -v -N 100 -prec dssdd -rand -s .
This command runs the code with 4 processors on a 2x2 processor grid, using global sizes with 20 global block columns, 10 global block rows, and a block size of 7. The local sizes (-nm 4 -nd 3) are ignored because -g is set. Input/output vectors are printed, and 100 matvecs are used for timing. The FFT and SBGEMV will be computed in single precision; all other components in double precision. Deterministic double precision values that cannot be represented as single precision floats without error will be used to initialize the matrix/vector, and output vectors will be saved in the current directory.
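To make the -prec code easier to read, here is a small shell sketch that expands a five-character code into one line per component, in the order given in the argument list (broadcast/pad, fft, sbgemv, ifft, unpad/reduce). The decode_prec helper is hypothetical and only illustrates the mapping; it is not how FFTMatvec itself parses the option.

```shell
# Hypothetical helper (not part of FFTMatvec): expand a -prec code such as
# "dssdd" into the precision used for each component.
decode_prec() {
    # The code is case insensitive, so normalize to upper case first.
    code=$(printf '%s' "$1" | tr 'ds' 'DS')
    case $code in
        [DS][DS][DS][DS][DS]) ;;   # exactly 5 characters, each D or S
        *) echo "error: precision code must be 5 characters, each D or S" >&2
           return 1 ;;
    esac
    for phase in broadcast/pad fft sbgemv ifft unpad/reduce; do
        first=${code%"${code#?}"}  # first remaining character
        code=${code#?}             # drop it for the next component
        if [ "$first" = D ]; then prec=double; else prec=single; fi
        echo "$phase: $prec"
    done
}

decode_prec dssdd
```

For the code dssdd used above, this lists fft and sbgemv as single and the other three components as double.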
This code is released under the MIT License. See LICENSE for more information.