Provide a raft::copy overload for mdspan-to-mdspan copies #1818
Merged
rapids-bot[bot] merged 104 commits into rapidsai:branch-23.12 on Oct 6, 2023
Contributor, Author: /ok to test
divyegala approved these changes on Oct 2, 2023
Contributor, Author: CI is failing for the same reason as other 23.12 PRs. I believe this should be unblocked after #1868 goes through and gets merged back into this PR.
Member: /ok to test
Member: /merge
divyegala pushed a commit to divyegala/raft that referenced this pull request on Oct 6, 2023
# Purpose
This PR provides a utility for copying between generic mdspans. This includes between host and device, between mdspans of different layouts, and between mdspans of different (convertible) data types.

## API
`raft::copy(raft_resources, dest_mdspan, src_mdspan);`

# Limitations
- Currently does not support copies between mdspans on two different GPUs
- Currently not performant for generic host-to-host copies (would be much easier to optimize with submdspan for padded layouts)
- Submdspan with padded layouts would also make it easier to improve perf of some device-to-device copies, though perf should already be quite good for most device-to-device copies.

# Design
- Includes optional `RAFT_DISABLE_CUDA` build definition in order to use this utility in CUDA-free builds (important for use in the FIL backend for Triton)
- Includes a new `raft::stream_view` object which is a thin wrapper around `rmm::stream_view`. Its purpose is solely to provide a symbol that will be defined in CUDA-free builds and which will throw exceptions or log error messages if someone tries to use a CUDA stream in a CUDA-free build. This avoids a whole bunch of ifdefs that would otherwise infect the whole codebase.
- Uses (roughly in order of preference): `cudaMemcpyAsync, std::copy, cublas, custom device kernel, custom host-to-host transfer logic` for the underlying copy
- Provides two different headers: `raft/core/copy.hpp` and `raft/core/copy.cuh`. This is to accommodate the custom kernel necessary for handling completely generic device-to-device copies. See below for more details.

## Details on the header split
For many instantiations, even those which involve the device, we do not require nvcc compilation. If, however, we determine at compilation time that we must use a custom kernel for the copy, then we must invoke nvcc. We do not wish to indicate that a public header file is a C++ header when it is a CUDA header or vice versa, so we split the definitions into separate `hpp` and `cuh` files, with all template instantiations requiring the custom kernel enable-if'd out of the hpp file.

Thus, the cuh header can be used for _any_ mdspan-to-mdspan copy, but the hpp file will not compile for those specific instantiations that require a custom kernel. The recommended workflow is that if a `cpp` file requires an mdspan-to-mdspan copy, first try the `hpp` header. If that fails, the `cpp` file must be converted to a `cu` file, and the `cuh` header should be used. For source files that are already being compiled with nvcc (i.e. `.cu` files), the `cuh` header might as well be used and will not result in any additional compile time penalty.

# Remaining tasks to leave WIP status
- [x] Add benchmarks for copies
- [x] Ensure that new function is correctly added to docs

# Follow-up items
- Optimize host-to-host transfers using a cache-oblivious approach with SIMD-accelerated transposes for contiguous memory
- Test cache-oblivious device-to-device transfers and compare performance
- Provide transparent support for copies between devices.

## Relationship to mdbuffer
This utility encapsulates a substantial chunk of the core logic required for the mdbuffer implementation. It is being split into its own PR both because it is useful on its own and because the mdbuffer work has been delayed by higher priority tasks.

Close rapidsai#1779

Authors:
- William Hicks (https://github.com/wphicks)
- Tarang Jain (https://github.com/tarang-jain)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: rapidsai#1818
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this pull request on Oct 10, 2023
Something appears to have changed between 23.10 and 23.12 causing the unit test suite to require more memory than before. As of now, I see no meaningful new commits on the 23.12 branch of cuml. The only real new commit is the addition of ARM CUDA 12 conda builds, which are completely independent. Therefore, the cause of the OOMs must be a change in a dependency. I do see multiple significant commits in raft, so perhaps one of those is the cause. rapidsai/raft#1818 seems like the most plausible culprit, but that's coming from my position of absolute ignorance about raft and the fact that unintentionally copying data in mdspans would in principle be an easy way to accidentally increase memory usage. However, it really could be coming from anywhere.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
- William Hicks (https://github.com/wphicks)
- Jake Awe (https://github.com/AyodeAwe)

URL: #5611
loulankxh pushed a commit to loulankxh/raft that referenced this pull request on Oct 14, 2025
# Purpose
This PR provides a utility for copying between generic mdspans. This includes between host and device, between mdspans of different layouts, and between mdspans of different (convertible) data types.
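To make "different layouts" and "different (convertible) data types" concrete, here is a minimal host-only sketch in plain C++. It does not use RAFT: `view2d`, `layout`, `naive_copy`, and `example_copy` are hypothetical names invented for illustration, and the element-by-element loop is only the most generic fallback one could imagine, not the implementation in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2-D mdspan: a non-owning pointer plus a layout flag.
// This is NOT RAFT's mdspan type; it only mimics logical (i, j) indexing.
enum class layout { row_major, col_major };

template <typename T>
struct view2d {
  T* data;
  std::size_t rows;
  std::size_t cols;
  layout order;

  // Map a logical (i, j) index onto the underlying storage.
  T& operator()(std::size_t i, std::size_t j) const {
    return order == layout::row_major ? data[i * cols + j] : data[j * rows + i];
  }
};

// Generic fallback: iterate over logical indices, so the copy works across
// layouts and across convertible element types, at the cost of strided access.
template <typename Dst, typename Src>
void naive_copy(view2d<Dst> dst, view2d<const Src> src) {
  assert(dst.rows == src.rows && dst.cols == src.cols);
  for (std::size_t i = 0; i < src.rows; ++i) {
    for (std::size_t j = 0; j < src.cols; ++j) {
      dst(i, j) = static_cast<Dst>(src(i, j));
    }
  }
}

// Copy a 2x3 row-major int matrix into a column-major double matrix and
// return the destination storage.
inline std::vector<double> example_copy() {
  const std::vector<int> src_buf{1, 2, 3, 4, 5, 6};
  std::vector<double> dst_buf(src_buf.size());
  view2d<const int> src{src_buf.data(), 2, 3, layout::row_major};
  view2d<double> dst{dst_buf.data(), 2, 3, layout::col_major};
  naive_copy(dst, src);
  return dst_buf;
}
```

Calling `example_copy()` yields the storage `{1, 4, 2, 5, 3, 6}`: the same logical matrix, transposed in memory and widened to `double`. A loop of this shape touches one side with a stride, which is one reason a generic host-to-host copy is hard to make fast.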
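## API
`raft::copy(raft_resources, dest_mdspan, src_mdspan);`

# Limitations
- Currently does not support copies between mdspans on two different GPUs
- Currently not performant for generic host-to-host copies (would be much easier to optimize with submdspan for padded layouts)
- Submdspan with padded layouts would also make it easier to improve perf of some device-to-device copies, though perf should already be quite good for most device-to-device copies.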
# Design
- Includes optional `RAFT_DISABLE_CUDA` build definition in order to use this utility in CUDA-free builds (important for use in the FIL backend for Triton)
- Includes a new `raft::stream_view` object which is a thin wrapper around `rmm::stream_view`. Its purpose is solely to provide a symbol that will be defined in CUDA-free builds and which will throw exceptions or log error messages if someone tries to use a CUDA stream in a CUDA-free build. This avoids a whole bunch of ifdefs that would otherwise infect the whole codebase.
- Uses (roughly in order of preference): `cudaMemcpyAsync, std::copy, cublas, custom device kernel, custom host-to-host transfer logic` for the underlying copy
- Provides two different headers: `raft/core/copy.hpp` and `raft/core/copy.cuh`. This is to accommodate the custom kernel necessary for handling completely generic device-to-device copies. See below for more details.

## Details on the header split
For many instantiations, even those which involve the device, we do not require nvcc compilation. If, however, we determine at compilation time that we must use a custom kernel for the copy, then we must invoke nvcc. We do not wish to indicate that a public header file is a C++ header when it is a CUDA header or vice versa, so we split the definitions into separate `hpp` and `cuh` files, with all template instantiations requiring the custom kernel enable-if'd out of the hpp file.

Thus, the cuh header can be used for _any_ mdspan-to-mdspan copy, but the hpp file will not compile for those specific instantiations that require a custom kernel. The recommended workflow is that if a `cpp` file requires an mdspan-to-mdspan copy, first try the `hpp` header. If that fails, the `cpp` file must be converted to a `cu` file, and the `cuh` header should be used. For source files that are already being compiled with nvcc (i.e. `.cu` files), the `cuh` header might as well be used and will not result in any additional compile time penalty.

# Remaining tasks to leave WIP status
- [x] Add benchmarks for copies
- [x] Ensure that new function is correctly added to docs
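The enable-if mechanism behind the header split can be sketched in standard C++17 without RAFT or CUDA. Everything below is hypothetical: `mdspan_sketch`, the tag types, `needs_custom_kernel_v`, `copy_from_hpp`, and `hpp_header_suffices` are names made up for illustration, and the rule deciding when a kernel is needed is a deliberate simplification (the real criteria are more detailed and, for example, also route some copies through cublas).

```cpp
#include <type_traits>
#include <utility>

// Hypothetical tag types standing in for mdspan properties (not RAFT types).
struct host_memory {};
struct device_memory {};
struct row_major {};
struct col_major {};

template <typename T, typename Memory, typename Layout>
struct mdspan_sketch {
  using value_type  = T;
  using memory_type = Memory;
  using layout_type = Layout;
};

// Assumed, simplified rule for this sketch only: a custom kernel is required
// for device-to-device copies that change the layout or the element type.
template <typename Dst, typename Src>
inline constexpr bool needs_custom_kernel_v =
  std::is_same_v<typename Dst::memory_type, device_memory> &&
  std::is_same_v<typename Src::memory_type, device_memory> &&
  (!std::is_same_v<typename Dst::layout_type, typename Src::layout_type> ||
   !std::is_same_v<typename Dst::value_type, typename Src::value_type>);

// "hpp-style" overload: removed from overload resolution when a custom kernel
// would be needed, so such a call is a compile error in a plain C++ file.
template <typename Dst, typename Src,
          std::enable_if_t<!needs_custom_kernel_v<Dst, Src>, int> = 0>
void copy_from_hpp(Dst /*dst*/, Src /*src*/) {
  // A real implementation would dispatch to a kernel-free strategy here.
}

// Detection idiom: true when copy_from_hpp(dst, src) is a valid expression.
template <typename Dst, typename Src, typename = void>
struct hpp_header_suffices : std::false_type {};

template <typename Dst, typename Src>
struct hpp_header_suffices<
  Dst, Src,
  std::void_t<decltype(copy_from_hpp(std::declval<Dst>(), std::declval<Src>()))>>
  : std::true_type {};

using host_f32_row   = mdspan_sketch<float, host_memory, row_major>;
using device_f32_row = mdspan_sketch<float, device_memory, row_major>;
using device_f64_col = mdspan_sketch<double, device_memory, col_major>;

// Host-to-device with matching layout and type: the hpp-style overload exists.
static_assert(hpp_header_suffices<device_f32_row, host_f32_row>::value);
// Device-to-device with a layout and type change: the overload is enable-if'd out.
static_assert(!hpp_header_suffices<device_f64_col, device_f32_row>::value);
```

In this sketch the second case is exactly the situation where, following the recommended workflow, the `cpp` file would be converted to a `cu` file and the `cuh` header used instead.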
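# Follow-up items
- Optimize host-to-host transfers using a cache-oblivious approach with SIMD-accelerated transposes for contiguous memory
- Test cache-oblivious device-to-device transfers and compare performance
- Provide transparent support for copies between devices.

## Relationship to mdbuffer
This utility encapsulates a substantial chunk of the core logic required for the mdbuffer implementation. It is being split into its own PR both because it is useful on its own and because the mdbuffer work has been delayed by higher priority tasks.

Close #1779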
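As an illustration of the `raft::stream_view` idea from the Design section (a symbol that always exists but refuses to do CUDA work in a CUDA-free build), here is a rough sketch. It is not the class added by this PR: `stream_view_sketch`, `cuda_free_build_error`, `native_stream`, and the `SKETCH_WITH_CUDA` switch are all hypothetical, and the PR itself keys off `RAFT_DISABLE_CUDA` rather than the macro used here.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical error type for this sketch.
struct cuda_free_build_error : std::logic_error {
  using std::logic_error::logic_error;
};

#ifdef SKETCH_WITH_CUDA  // hypothetical switch; the PR itself uses RAFT_DISABLE_CUDA
#include <cuda_runtime_api.h>
using native_stream = cudaStream_t;
#else
using native_stream = void*;  // placeholder so the type exists without CUDA
#endif

// The class is defined in both kinds of build, so calling code needs no ifdefs;
// only the member function bodies differ.
class stream_view_sketch {
 public:
  stream_view_sketch() = default;
  explicit stream_view_sketch(native_stream stream) : stream_{stream} {}

  native_stream value() const { return stream_; }

  // Wait for work queued on the stream; an error in a CUDA-free build.
  void synchronize() const {
#ifdef SKETCH_WITH_CUDA
    if (cudaStreamSynchronize(stream_) != cudaSuccess) {
      throw std::runtime_error("cudaStreamSynchronize failed");
    }
#else
    throw cuda_free_build_error("CUDA stream used in a CUDA-free build");
#endif
  }

 private:
  native_stream stream_{};
};

// Returns true when using the stream is rejected with cuda_free_build_error,
// which is the expected behaviour when SKETCH_WITH_CUDA is not defined.
inline bool stream_use_is_rejected() {
  try {
    stream_view_sketch{}.synchronize();
  } catch (const cuda_free_build_error&) {
    return true;
  }
  return false;
}
```

Compiled without `SKETCH_WITH_CUDA`, constructing the wrapper is harmless and only an attempt to use the stream throws, which keeps the conditional compilation confined to one small class.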