This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.11/.
Highlights
Support matrix changes
- Start distributing conda packages for CUDA 13.
- Port to cuSolverMp 0.7 (now the new required minimum).
- Validate cuPyNumeric on DGX Spark.
Note that currently the pip wheels do not include CUDA 13 support, nor cuSolverMp support (linear solve / matrix decomposition APIs are constrained to single-GPU execution when using the wheels).
Added functionality
- cupynumeric.histogram2d and cupynumeric.histogramdd
- cupynumeric.lexsort
- cupynumeric.isin
- Multi-GPU & multi-node implementation of QR factorization, based on cuSolverMp
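As a rough illustration of the newly added array functions, the sketch below calls them with the NumPy signatures. It assumes those signatures carry over unchanged to cuPyNumeric; which optional parameters (such as `range`) are supported should be checked against the cuPyNumeric API reference. The NumPy fallback import and all sample data are illustrative only.

```python
# Fall back to NumPy when cuPyNumeric is not installed. The calls below use
# NumPy signatures, which are ASSUMED to match cuPyNumeric's versions.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np

# Four sample points in the unit square (made-up data).
x = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0.2, 0.3, 0.7, 0.8])

# 2x2 histogram over [0, 1] x [0, 1].
counts, xedges, yedges = np.histogram2d(
    x, y, bins=2, range=[[0.0, 1.0], [0.0, 1.0]]
)

# The same histogram through the N-dimensional entry point,
# which takes one row per sample.
sample = np.stack([x, y], axis=1)
counts_dd, edges = np.histogramdd(
    sample, bins=(2, 2), range=[(0.0, 1.0), (0.0, 1.0)]
)

# lexsort: the LAST key in the tuple is the primary sort key.
primary = np.array([1, 0, 1, 0])
secondary = np.array([3, 2, 1, 0])
order = np.lexsort((secondary, primary))

# isin: element-wise membership test.
mask = np.isin(np.array([1, 2, 3, 4]), np.array([2, 4]))

print(counts)
print(order)
print(mask)
```

The tuple passed to `lexsort` lists the keys from least to most significant, so ties in `primary` are broken by `secondary`.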
Performance improvements
- Accelerate axis-wise reductions on GPUs by combining multiple kernel invocations into one.
- Parallelize specialized implementation for cupynumeric.take, and use it in more cases, including cupynumeric.take_along_axis.
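For orientation, the calls affected by these improvements look like the sketch below. It uses the NumPy signatures, assumed to match cuPyNumeric's; whether a given call actually reaches the optimized code paths depends on details these notes do not specify. The NumPy fallback import and the sample array are illustrative only.

```python
# Fall back to NumPy when cuPyNumeric is not installed. The calls below use
# NumPy signatures, which are ASSUMED to match cuPyNumeric's versions.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np

# Small 3x4 matrix: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]].
a = np.arange(12).reshape(3, 4)

# Axis-wise reductions: collapse one axis, keep the other.
col_sums = np.sum(a, axis=0)
row_max = np.max(a, axis=1)

# take: gather columns 0 and 2 from every row.
picked = np.take(a, np.array([0, 2]), axis=1)

# take_along_axis: gather one element per row, using per-row indices.
idx = np.argmax(a, axis=1)
best = np.take_along_axis(a, np.expand_dims(idx, axis=1), axis=1)

print(col_sums)
print(picked)
print(best)
```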
UX improvements
- I/O functions (e.g. hdf5 to_file) and memory offloading functions (e.g. offload_to) from Legate now accept cuPyNumeric ndarrays directly.
Known issues
- We are aware of hangs when using cuSolverMp-based APIs on 4+ Perlmutter nodes. This appears to be a cluster-specific issue, which we are investigating.
- We are aware of hangs when using UCX 1.19 with the CUDA 13 conda packages. These are typically accompanied by an error message like this:
  ib_md.c:287 UCX ERROR ibv_reg_mr(address=(nil), length=134217728, access=0xf) failed: Bad address
  ucp_mm.c:76 UCX ERROR failed to register address (nil) (cuda) length 134217728 on md[6]=mlx5_0: Input/output error (md supports: host|cuda)
  We are investigating a proper fix. For the time being, setting UCX_MEMTYPE_CACHE=no in the environment appears to resolve the hang, at the cost of potentially decreasing UCX performance.
Full Changelog: v25.10.00...v25.11.00