Releases: nv-legate/legate
v25.11.00
This is a beta release of Legate.
Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/legate/25.11/.
Highlights
Support matrix changes
- Start distributing conda packages for CUDA 13.
Added functionality
- Add `AllReduce` operation to the collective communication module.
- Add dedicated Store transformation for dimension broadcasting.
Small improvements
- Add support for the `aprun` launcher.
- Various bug fixes to the experimental streaming (a.k.a. auto-batching) execution mode.
- Expose nullable `LogicalArray` and `StructLogicalArray` (and related factory methods) to the Python API.
- Accept objects exposing the `LegateDataInterface` (e.g. cuPyNumeric ndarrays) in some Legate APIs (I/O functions, `offload_to`, `as_logical_array`).
Breaking changes
- Remove support for the CAL communicator, as it is no longer necessary for downstream libraries after the release of cuSolverMp 0.7.
- Change the default instance mapping policy to leave the dimension ordering unspecified. Tasks that don't request a specific dimension ordering (in the mapper) must be prepared to work with any ordering.
Known issues
- As of October 2025, the GASNet wrapper on Perlmutter only works when the NERSC-provided `mpich` module is loaded. Attempts to build or use the wrapper with `cray-mpich` currently fail, so make sure `module load mpich` is issued before running `build-gex-wrapper.sh`.
- As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, `--fbmem 64000`) must include `REALM_DEFAULT_ARGS='-gex:bindcuda 0'`. Otherwise the OFI provider aborts with `Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr()`.
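Taken together, the Perlmutter workarounds above amount to the following shell session (a sketch; `my_program.py` is a placeholder, and the memory size is the example value from the notes):

```shell
# Load the NERSC mpich module before building the GASNet wrapper
# (cray-mpich currently fails).
module load mpich
./build-gex-wrapper.sh

# Disable GASNet's CUDA binding when requesting >32 GB of device
# memory, to avoid the OFI fi_mr_regattr() allocation failure.
REALM_DEFAULT_ARGS='-gex:bindcuda 0' legate --fbmem 64000 my_program.py
```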
Full Changelog: https://docs.nvidia.com/legate/latest/changes/2511.html, v25.10.00...v25.11.00
v25.10.00
This is a beta release of Legate.
Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/legate/25.10/.
Highlights
Added functionality
- Implement streamable parallel HDF5 writing API.
- Implement explicitly batched HDF5 reading API.
- Add a CPU collective communication backend based on UCC.
Streaming
- Various bug fixes to the experimental streaming (a.k.a. auto-batching) execution mode.
- Add basic user documentation.
Small improvements
- Add variadic versions of the partitioning constraints `align` and `broadcast`.
- Add GDB/LLDB pretty printers for Legate internal container and smart pointer classes.
- Add support for OpenMPI 5.
Breaking changes
- Enable the GPUDirectStorage HDF5 backend by default (change the default of `--io-use-vfd-gds` from `False` to `True`).
- Move nightly conda packages to a dedicated channel, `-c legate-nightly`.
- Remove the deprecated `legate/cuda/cuda.h` header and the associated `LEGATE_CHECK_CUDA` and `LEGATE_CHECK_CUDA_STREAM` macros.
Known issues
- As of October 2025, the GASNet wrapper on Perlmutter only works when the NERSC-provided `mpich` module is loaded. Attempts to build or use the wrapper with `cray-mpich` currently fail, so make sure `module load mpich` is issued before running `build-gex-wrapper.sh`.
- As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, `--fbmem 64000`) must include `REALM_DEFAULT_ARGS='-gex:bindcuda 0'`. Otherwise the OFI provider aborts with `Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr()`.
Full Changelog: https://docs.nvidia.com/legate/latest/changes/2510.html, v25.08.00...v25.10.00
v25.08.00
This is a beta release of Legate.
Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/legate/25.08/.
New features
Streaming / auto-batching
Add experimental support for "streaming", a new execution mode that (under certain conditions) allows a series of operations to execute in batches, so that the same series of operations can run using less memory.
When streaming a section of code, a "producer" parallel launch doesn't have to complete in full before a subsequent "consumer" parallel launch can start. Instead, a "consumer" worker can start immediately after the "producer" worker it depends on has finished. Therefore, any intermediate data created by the partial execution of the "producer" can be eagerly discarded before the next batch of the "producer" operation runs, thus reducing overall memory pressure.
This feature is experimental and does not yet support all use-cases. Invalid use may lead to exceptions, hangs, or outright crashes. Upcoming releases will focus on rounding out this feature, adding safety checks etc.
Interoperability
- Support DLPack for importing to and exporting from Legate Stores.
- Add `cuda::std::mdspan`-based accessor classes for accessing data in `PhysicalStore`s.
Performance improvements
- Explicitly manage Python threadstate, which avoids some race conditions when multiple Python tasks are sharing the same device, and allows consecutive tasks to avoid re-initializing CUDA libraries at entry.
- Avoid collecting Python stack trace information by default when not profiling, as it can impose significant overhead, especially for short operations. The `--provenance` flag can be used to force the collection of Python stack trace information, which can be useful to add more context to `--show-progress`, NVTX ranges, and some error messages.
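For example, to opt back into stack-trace collection for richer progress output (a sketch; `my_program.py` is a placeholder):

```shell
# Re-enable Python stack trace collection so that progress output and
# error messages carry provenance information.
legate --provenance --show-progress my_program.py
```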
Profiling
- Build Legate profiler packages with support for exporting Legate-level information to Nsight Systems.
- More clearly visualize when a Python task is blocked on the CPython GIL.
Deprecations
- Deprecate `legate::print_dense_array()`, as it was introducing an unnecessary dependence on CUDA runtime symbols. Downstream users should either use the span accessors of physical stores (which support easy dimension-aware iteration for printing), or implement their own debugging utilities.
- Deprecate `legate::mapping::InstLayout`, which was used to select between `AOS` and `SOA`, of which only `SOA` was ever properly supported.
Miscellaneous
- Fix compilation issues with nvc++ and GCC 14.
Full Changelog: https://docs.nvidia.com/legate/latest/changes/2508.html, v25.07.00...v25.08.00
v25.07.00
This is a beta release of Legate.
Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/legate/25.07/.
New features
Support matrix changes
- macOS wheels are now available on PyPI.
- Add support for Blackwell CUDA architecture and MNNVL.
- Drop support for Python 3.10 and add support for Python 3.13.
- Remove NumPy 1.X restriction from packages (now compatible with NumPy 2.X).
Interoperability
- Add support for the PEP-3118 buffer protocol to `legate.core.InlineAllocation`.
- Add a CUDA stream to `InlineAllocation`'s `__cuda_array_interface__`.
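The PEP-3118 side of this can be sketched with the standard library alone. `memoryview` is Python's built-in consumer of the buffer protocol, so `memoryview(alloc)` on an `InlineAllocation` should behave the same way as on the stdlib `array` stand-in below:

```python
import array

# A stand-in producer that exposes the PEP-3118 buffer protocol.
buf = array.array("d", [1.0, 2.0, 3.0])

# memoryview() is the built-in consumer of the protocol: a zero-copy,
# typed view over the producer's memory.
view = memoryview(buf)

assert view.format == "d"  # struct-style type code: double (float64)
assert view.nbytes == 24   # 3 elements x 8 bytes
view[1] = 20.0             # writes go through to the producer
assert buf[1] == 20.0
```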
Python tasks
- Expose the `TaskConfig` object.
- Add support for unbound stores.
Documentation
- Start publishing nightly doc builds to https://nv-legate.github.io/legate.
- Added saxpy and manual task examples.
- Various documentation improvements (`InlineAllocation`, `ListLogicalArray`).
Miscellaneous
- Refactor argument parsing; all arguments can now be provided through the `LEGATE_CONFIG` environment variable.
- Add support for `std::vector<bool>` to `legate::Scalar`.
- Add the ability for users to statically declare task-wide configuration options, such as the task's signature, constraints, and default variant options, via a static `TASK_CONFIG` member on task declarations.
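As a sketch of the new argument plumbing (the flag values are illustrative, and `my_program.py` is a placeholder):

```shell
# Equivalent ways to pass driver arguments after this change: on the
# command line, or via the LEGATE_CONFIG environment variable.
legate --gpus 2 --fbmem 16000 my_program.py
LEGATE_CONFIG="--gpus 2 --fbmem 16000" legate my_program.py
```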
Full Changelog: v25.03.02...v25.07.00
v25.03.02
This is a beta release of Legate.
Linux x86 and ARM builds for Python 3.10 - 3.12 with multi-node support are available on PyPI at https://pypi.org/project/legate/, and as conda packages at https://anaconda.org/legate/legate. GASNet-based (rather than UCX-based) conda packages are under the gex label.
Documentation for this release can be found at https://docs.nvidia.com/legate/25.03/.
New features
PIP install support
- With this release, Linux x86 and ARM builds of Legate are available as Python wheels on PyPI, and can be installed with `pip install legate`. See https://docs.nvidia.com/legate/25.03/installation.html#installing-pypi-packages for further instructions.
- These wheels support multi-node execution through UCX, when paired with an installation of MPI; see https://docs.nvidia.com/legate/25.03/networking-wheels.html for more details.
Miscellaneous
- Support MPICH `mpirun` when running the `legate` launcher with `--launcher mpirun`.
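A minimal launch sketch, assuming a two-node allocation (the node count and program name are placeholders):

```shell
# Drive a multi-node run through MPICH's mpirun via the legate launcher.
legate --launcher mpirun --nodes 2 my_program.py
```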
v25.03.00
Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available for this release at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).
Documentation for this release can be found at https://docs.nvidia.com/legate/25.03/.
New features
Licensing
- With this release Legate is available as open-source, under the Apache-2.0 license. The full source code can be found at https://github.com/nv-legate/legate.
UX improvements
- Stop passing default options to Nsight Systems when using the `--nsys` flag of the `legate` driver. Any non-default arguments are fully in the control of the user, through `--nsys-extra`.
- Add the `legate.core.ProfileRange` Python context manager (and associated C++ API), to annotate sub-spans within a larger task span on the profiler visualization.
Documentation improvements
- Add a user guide chapter on accelerating multi-GPU HDF5 workloads.
Deprecations
- Variants no longer need to specify the size of their return value. Legate will compute this information automatically.
Miscellaneous
- The `TaskContext` is now exposed to Python tasks.
- Legate is now compatible with NumPy 2.x.
- Provide a per-processor/per-GPU caching mechanism, useful e.g. for reusing CUDA library handles across tasks.
Full changelog: https://docs.nvidia.com/legate/25.03/changes/2503.html
Known issues
- We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.
v25.01.00
This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/25.01/eula.pdf.
Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).
Documentation for this release can be found at https://docs.nvidia.com/legate/25.01/.
New features
Memory management
- There is no longer a separation between the memory pools used for ahead-of-task-execution ("deferred") allocations and task-execution-time ("eager") allocations. The `--eager-alloc-percentage` flag is thus obsolete. Instead, a task that creates temporary or output buffers during execution must be registered with `has_allocations=true`, and a new `allocation_pool_size()` mapper callback must provide an upper bound for the task's total size of allocations. See https://docs.nvidia.com/legate/25.01/api/cpp/mapping.html for more detailed instructions.
- Add the `offload_to()` API, which allows a user to offload a store or array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.
I/O
- Move the HDF5 interface out of the experimental namespace.
- Use cuFile to accelerate HDF5 reads on the GPU.
- Add support for reading "binary" HDF5 datasets.
Deprecations
- Remove the `task_target()` callback from the Legate mapper. Users should use the resource scoping mechanism instead, if they need to restrict where tasks should run.
- Drop support for the Maxwell GPU architecture. Legate now requires at least Pascal (`sm_60`).
Miscellaneous
- Increase the maximum array dimension from 4 to 6.
- Record stacktraces on Legate exceptions and error messages.
- Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
- Add the environment variable `LEGATE_LIMIT_STDOUT`, to only print the output from one of the copies of the top-level program in a multi-process execution.
- Add `legate::LogicalStore::reinterpret_as()` to reinterpret the underlying storage of a `LogicalStore` as another data type.
Full changelog: https://docs.nvidia.com/legate/25.01/changes/2501.html
v24.11.01
This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.11/eula.pdf.
Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).
Documentation for this release can be found at https://docs.nvidia.com/legate/24.11/.
New features
- Bug fixes for release 24.11.00
v24.11.00
This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.11/eula.pdf.
Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).
Documentation for this release can be found at https://docs.nvidia.com/legate/24.11/.
New features
- Provide an MPI wrapper that the user can compile against their local MPI installation and integrate with an existing build of Legate. This is useful when a user needs to use an MPI installation different from the one Legate was compiled against.
- Add support for using GASNet as the networking backend, useful on platforms not currently supported by UCX, e.g. Slingshot 11. Provide scripts for the user to compile GASNet on their local machine and integrate with an existing build of Legate.
- Automatic machine configuration; Legate will now detect the available hardware resources at startup, and no longer needs to be provided information such as the amount of memory to allocate.
- Print more information on what data is taking up memory when Legate encounters an out-of-memory error.
- Support scalar parameters, default arguments and reduction privileges in Python tasks.
- Add support for a `concurrent_task_barrier`, useful in preventing NCCL deadlocks.
- Allow tasks to specify that CUDA context synchronization at task exit can be skipped, reducing latency.
- Experimental support for distributed HDF5 and Zarr I/O.
- Experimental support for single-CPU/GPU fast-path task execution (skipping the tasking runtime dependency analysis).
- Experimental implementation of a "bloated" instance prefetching API, which instructs the runtime to create instances encompassing multiple slices of a store ahead of time, potentially reducing intermediate memory usage.
- full changelog
Known issues
The GPUDirectStorage backend of the HDF5 I/O module (off by default, and enabled with `LEGATE_IO_USE_VFD_GDS=1`) is not currently working (enabling it will result in a crash). We are working on a fix.
Legate's auto-configuration heuristics will attempt to split CPU cores and system memory evenly across all instantiated OpenMP processors, not accounting for the actual core count and memory limits of each NUMA domain. In cases where the number of OpenMP groups does not evenly divide the number of NUMA domains, this bug may cause unsatisfiable core and memory allocations, resulting in error messages such as:
- `not enough cores in NUMA domain 0 (72 < 284)`
- `reservation ('OMP0 proc 1d00000000000005 (worker 8)') cannot be satisfied`
- `insufficient memory in NUMA node 4 (102533955584 > 102005473280 bytes) - skipping allocation`
These issues should only affect performance if you are actually running computations on the OpenMP cores (rather than using the GPUs for computation). You can always adjust the automatically derived configuration values through `LEGATE_CONFIG`; see https://docs.nvidia.com/legate/latest/usage.html#resource-allocation.
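For example, the OpenMP-related values can be pinned explicitly instead of auto-derived (a sketch; the numbers are illustrative and `my_program.py` is a placeholder, under the assumption that `--omps`, `--ompthreads`, and `--numamem` are the relevant driver flags):

```shell
# Override the auto-configuration so each OpenMP group fits within a
# single NUMA domain's cores and memory.
LEGATE_CONFIG="--omps 1 --ompthreads 8 --numamem 32000" legate my_program.py
```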
v24.06.01
This is a patch release, and includes the following fixes:
- Fix for #945
- Fix for StanfordLegion/legion#1719
- Fix cuda package dependencies
This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.06/eula.pdf. x86 conda packages with multi-node support (based on UCX) are available at https://anaconda.org/legate/legate-core.
Documentation for this release can be found at https://docs.nvidia.com/legate/24.06/.