
Releases: nv-legate/legate

v25.11.00

27 Nov 06:24
807d188


This is a beta release of Legate.

Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/legate/25.11/.

Highlights

Support matrix changes

  • Start distributing conda packages for CUDA 13.

Added functionality

  • Add AllReduce operation to the collective communication module.
  • Add dedicated Store transformation for dimension broadcasting.

Small improvements

  • Add support for the aprun launcher.
  • Various bug fixes to the experimental streaming (a.k.a. auto-batching) execution mode.
  • Expose nullable LogicalArray and StructLogicalArray (and related factory methods) to the Python API.
  • Accept objects exposing the LegateDataInterface (e.g. cuPyNumeric ndarrays) in some Legate APIs (I/O functions, offload_to, as_logical_array).

Breaking changes

  • Remove support for the CAL communicator, as it is no longer necessary for downstream libraries after the release of cuSolverMp 0.7.
  • Change the default instance mapping policy to leave the dimension ordering unspecified. Tasks that don't request a specific dimension ordering (in the mapper) must be prepared to work with any ordering.

Known issues

  • As of October 2025, the GASNet wrapper on Perlmutter only works when the NERSC-provided mpich module is loaded. Attempts to build or use the wrapper with cray-mpich currently fail, so make sure module load mpich is issued before running build-gex-wrapper.sh.
  • As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, --fbmem 64000) must include REALM_DEFAULT_ARGS='-gex:bindcuda 0'. Otherwise the OFI provider aborts with Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr().
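Taken together, the two workarounds above can be applied as follows. This is an illustrative Perlmutter session (a site-specific config fragment, not runnable elsewhere); the script name `my_program.py` is a placeholder, while `module load mpich`, `build-gex-wrapper.sh`, `REALM_DEFAULT_ARGS='-gex:bindcuda 0'` and `--fbmem` come from the notes above:

```shell
# Workaround 1: load the NERSC-provided mpich module before building the GASNet wrapper
module load mpich
./build-gex-wrapper.sh

# Workaround 2: when requesting more than 32 GB of device memory,
# disable Realm's CUDA binding to avoid the fi_mr_regattr() OFI failure
REALM_DEFAULT_ARGS='-gex:bindcuda 0' legate --fbmem 64000 my_program.py
```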

Full Changelog: https://docs.nvidia.com/legate/latest/changes/2511.html, v25.10.00...v25.11.00

v25.10.00

30 Oct 21:23
b0a2071


This is a beta release of Legate.

Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/legate/25.10/.

Highlights

Added functionality

  • Implement streamable parallel HDF5 writing API.
  • Implement explicitly batched HDF5 reading API.
  • Add a CPU collective communication backend based on UCC.

Streaming

  • Various bug fixes to the experimental streaming (a.k.a. auto-batching) execution mode.
  • Add basic user documentation.

Small improvements

  • Add variadic versions of partitioning constraints align and broadcast.
  • Add GDB/LLDB pretty printers for Legate internal container and smart pointer classes.
  • Add support for OpenMPI 5.

Breaking changes

  • Enable GPUDirectStorage HDF5 backend by default (change default of --io-use-vfd-gds from False to True).
  • Move nightly conda packages to a dedicated channel, -c legate-nightly.
  • Remove the deprecated legate/cuda/cuda.h header and the associated LEGATE_CHECK_CUDA and LEGATE_CHECK_CUDA_STREAM macros.

Known issues

  • As of October 2025, the GASNet wrapper on Perlmutter only works when the NERSC-provided mpich module is loaded. Attempts to build or use the wrapper with cray-mpich currently fail, so make sure module load mpich is issued before running build-gex-wrapper.sh.
  • As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, --fbmem 64000) must include REALM_DEFAULT_ARGS='-gex:bindcuda 0'. Otherwise the OFI provider aborts with Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr().

Full Changelog: https://docs.nvidia.com/legate/latest/changes/2510.html, v25.08.00...v25.10.00

v25.08.00

05 Sep 07:38
47132da


This is a beta release of Legate.

Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/legate/25.08/.

New features

Streaming / auto-batching

Add experimental support for "streaming", a new execution mode that (under certain conditions) allows a series of operations to execute in batches, ultimately allowing the same series of operations to run using less memory.

When streaming a section of code, a "producer" parallel launch doesn't have to complete in full before a subsequent "consumer" parallel launch can start. Instead, a "consumer" worker can start immediately after the "producer" worker it depends on has finished. Therefore, any intermediate data created by the partial execution of the "producer" can be eagerly discarded before the next batch of the "producer" operation runs, thus reducing overall memory pressure.

This feature is experimental and does not yet support all use cases. Invalid use may lead to exceptions, hangs, or outright crashes. Upcoming releases will focus on rounding out this feature, adding safety checks, etc.

Interoperability

  • Support DLPack for importing to and exporting from Legate Stores.
  • Add cuda::std::mdspan-based accessor classes, for accessing data in PhysicalStores.

Performance improvements

  • Explicitly manage Python threadstate, which avoids some race conditions when multiple Python tasks are sharing the same device, and allows consecutive tasks to avoid re-initializing CUDA libraries at entry.
  • Avoid collecting Python stack trace information by default when not profiling, as it can impose significant overhead, especially for short operations. The --provenance flag can be used to force the collection of Python stack trace information, which can be useful to add more context to --show-progress, NVTX ranges, and some error messages.

Profiling

  • Build Legate profiler packages with support for exporting Legate-level information to Nsight Systems.
  • More clearly visualize when a Python task is blocked on the CPython GIL.

Deprecations

  • Deprecate legate::print_dense_array(), as it was introducing an unnecessary dependence on CUDA runtime symbols. Downstream users should either use the span accessors of physical stores (which support easy dimension-aware iteration for printing), or implement their custom debugging utilities.
  • Deprecate legate::mapping::InstLayout, which was used to select between AOS and SOA, of which only SOA was ever properly supported.

Miscellaneous

  • Fixed compilation issues with nvc++ and GCC 14.

Full Changelog: https://docs.nvidia.com/legate/latest/changes/2508.html, v25.07.00...v25.08.00

v25.07.00

09 Jul 18:36
a46dc3d


This is a beta release of Legate.

Pip wheels are available on PyPI at https://pypi.org/project/legate/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/legate, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/legate/25.07/.

New features

Support matrix changes

  • macOS wheels are now available on PyPI.
  • Add support for Blackwell CUDA architecture and MNNVL.
  • Drop support for Python 3.10 and add support for Python 3.13.
  • Remove NumPy 1.X restriction from packages (now compatible with NumPy 2.X).

Interoperability

  • Add support for the PEP-3118 buffer protocol to legate.core.InlineAllocation.
  • Add a CUDA stream to InlineAllocation's __cuda_array_interface__.

Python tasks

  • Expose TaskConfig object.
  • Add support for unbound stores.

Documentation

  • Start publishing nightly doc builds to https://nv-legate.github.io/legate.
  • Added saxpy and manual task examples.
  • Various documentation improvements (InlineAllocation, ListLogicalArray).

Miscellaneous

  • Refactor argument parsing; now all arguments can be provided through the LEGATE_CONFIG environment variable.
  • Add support for std::vector<bool> to legate::Scalar.
  • Add ability for users to statically declare task-wide configuration options, such as the task's signature, constraints, and default variant options, via a static TASK_CONFIG member on task declarations.
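The refactored argument parsing mentioned above can be sketched as follows. The `--fbmem` flag appears elsewhere in these notes, but `--gpus` and the script name `my_program.py` are illustrative assumptions; consult the Legate usage documentation for the authoritative flag list:

```shell
# Passing driver arguments on the command line...
legate --gpus 2 --fbmem 4000 my_program.py

# ...is now equivalent to supplying the same arguments via LEGATE_CONFIG
LEGATE_CONFIG="--gpus 2 --fbmem 4000" legate my_program.py
```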

Full Changelog: v25.03.02...v25.07.00

v25.03.02

09 Apr 19:08
75dc0a9


This is a beta release of Legate.

Linux x86 and ARM builds for Python 3.10 - 3.12 with multi-node support are available on PyPI at https://pypi.org/project/legate/, and as conda packages at https://anaconda.org/legate/legate. GASNet-based (rather than UCX-based) conda packages are under the gex label.

Documentation for this release can be found at https://docs.nvidia.com/legate/25.03/.

New features

Pip install support

Miscellaneous

  • Support MPICH mpirun when running the legate launcher with --launcher mpirun.

v25.03.00

17 Mar 23:04
40d2963


Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available for this release at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).

Documentation for this release can be found at https://docs.nvidia.com/legate/25.03/.

New features

Licensing

UX improvements

  • Stop passing default options to Nsight Systems when using the --nsys flag of the legate driver. Any non-default arguments are fully under the control of the user, through --nsys-extra.
  • Add the legate.core.ProfileRange Python context manager (and associated C++ API), to annotate sub-spans within a larger task span on the profiler visualization.

Documentation improvements

Deprecations

  • Variants no longer need to specify the size of their return value. Legate will compute this information automatically.

Miscellaneous

  • The TaskContext is now exposed to Python tasks.
  • Legate is now compatible with NumPy 2.x.
  • Provide a per-processor/per-GPU caching mechanism, useful e.g. for reusing CUDA library handles across tasks.

Full changelog: https://docs.nvidia.com/legate/25.03/changes/2503.html

Known issues

  • We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.

v25.01.00

08 Feb 06:20
9fc6801


This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/25.01/eula.pdf.

Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).

Documentation for this release can be found at https://docs.nvidia.com/legate/25.01/.

New features

Memory management

  • There is no longer a separation between the memory pools used for ahead-of-task-execution ("deferred") allocations, and task-execution-time ("eager") allocations. The --eager-alloc-percentage flag is thus obsolete. Instead, a task that creates temporary or output buffers during execution must be registered with has_allocations=true, and a new allocation_pool_size() mapper callback must provide an upper bound for the task's total size of allocations. See https://docs.nvidia.com/legate/25.01/api/cpp/mapping.html for more detailed instructions.
  • Add the offload_to() API, that allows a user to offload a store or array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.

I/O

  • Move the HDF5 interface out of the experimental namespace.
  • Use cuFile to accelerate HDF5 reads on the GPU.
  • Add support for reading "binary" HDF5 datasets.

Deprecations

  • Remove the task_target() callback from the Legate mapper. Users should use the resource scoping mechanism instead, if they need to restrict where tasks should run.
  • Drop support for the Maxwell GPU architecture. Legate now requires at least Pascal (sm_60).

Miscellaneous

  • Increase the maximum array dimension from 4 to 6.
  • Record stacktraces on Legate exceptions and error messages.
  • Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
  • Add environment variable LEGATE_LIMIT_STDOUT, to only print out the output from one of the copies of the top-level program in a multi-process execution.
  • Add legate::LogicalStore::reinterpret_as() to reinterpret the underlying storage of a LogicalStore as another data-type.
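As a sketch of the LEGATE_LIMIT_STDOUT behavior described above (the value `1`, the `mpirun` launcher choice, and the script name are assumptions for illustration; check the documentation for the exact accepted values):

```shell
# In a multi-process run, only one copy of the top-level program
# prints to stdout when LEGATE_LIMIT_STDOUT is set
LEGATE_LIMIT_STDOUT=1 legate --launcher mpirun my_program.py
```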

Full changelog: https://docs.nvidia.com/legate/25.01/changes/2501.html

v24.11.01

07 Dec 04:10
29368dc


This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.11/eula.pdf.

Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).

Documentation for this release can be found at https://docs.nvidia.com/legate/24.11/.

New features

  • Bug fixes for release 24.11.00

v24.11.00

17 Nov 00:49
583cbc0


This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.11/eula.pdf.

Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the gex label).

Documentation for this release can be found at https://docs.nvidia.com/legate/24.11/.

New features

  • Provide an MPI wrapper that users can compile against their local MPI installation and integrate with an existing build of Legate. This is useful when a user needs an MPI installation different from the one Legate was compiled against.
  • Add support for using GASNet as the networking backend, useful on platforms not currently supported by UCX, e.g. Slingshot 11. Provide scripts for users to compile GASNet on their local machine and integrate it with an existing build of Legate.
  • Automatic machine configuration: Legate now detects the available hardware resources at startup, and no longer needs to be provided with information such as the amount of memory to allocate.
  • Print more information on what data is taking up memory when Legate encounters an out-of-memory error.
  • Support scalar parameters, default arguments and reduction privileges in Python tasks.
  • Add support for a concurrent_task_barrier, useful in preventing NCCL deadlocks.
  • Allow tasks to specify that CUDA context synchronization at task exit can be skipped, reducing latency.
  • Experimental support for distributed HDF5 and Zarr I/O.
  • Experimental support for single-CPU/GPU fast-path task execution (skipping the tasking runtime dependency analysis).
  • Experimental implementation of a "bloated" instance prefetching API, which instructs the runtime to create instances encompassing multiple slices of a store ahead of time, potentially reducing intermediate memory usage.
  • full changelog

Known issues

The GPUDirectStorage backend of the HDF5 I/O module (off by default, and enabled with LEGATE_IO_USE_VFD_GDS=1) is not currently working; enabling it will result in a crash. We are working on a fix.

Legate's auto-configuration heuristics will attempt to split CPU cores and system memory evenly across all instantiated OpenMP processors, not accounting for the actual core count and memory limits of each NUMA domain. In cases where the number of OpenMP groups does not evenly divide the number of NUMA domains, this bug may cause unsatisfiable core and memory allocations, resulting in error messages such as:

  • not enough cores in NUMA domain 0 (72 < 284)
  • reservation ('OMP0 proc 1d00000000000005 (worker 8)') cannot be satisfied
  • insufficient memory in NUMA node 4 (102533955584 > 102005473280 bytes) - skipping allocation

These issues should only affect performance if you are actually running computations on the OpenMP cores (rather than using the GPUs for computation). You can always adjust the automatically derived configuration values through LEGATE_CONFIG, see https://docs.nvidia.com/legate/latest/usage.html#resource-allocation.
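If the auto-configuration produces unsatisfiable OpenMP reservations as described above, the derived values can be overridden through LEGATE_CONFIG. The flag names below (`--omps`, `--numamem`) and their values are hypothetical examples, not confirmed by these notes; see the linked resource-allocation documentation for the authoritative options:

```shell
# Hypothetical override: fewer OpenMP groups and a smaller
# per-NUMA-domain memory budget than the auto-derived values
LEGATE_CONFIG="--omps 1 --numamem 90000" legate my_program.py
```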

v24.06.01

10 Sep 20:11
19d55cf


This is a patch release containing bug fixes.

This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.06/eula.pdf.

x86 conda packages with multi-node support (based on UCX) are available at https://anaconda.org/legate/legate-core.

Documentation for this release can be found at https://docs.nvidia.com/legate/24.06/.