Provide a raft::copy overload for mdspan-to-mdspan copies by wphicks · Pull Request #1818 · rapidsai/raft

wphicks · 2023-09-12T17:58:50Z

Purpose

This PR provides a utility for copying between generic mdspans, including copies between host and device, between mdspans of different layouts, and between mdspans of different (convertible) data types.

API

raft::copy(raft_resources, dest_mdspan, src_mdspan);
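
As a rough illustration of what such a call has to accomplish when layouts and element types differ, here is a host-only sketch in plain C++. `matrix_view`, `make_row_major`, `make_col_major`, and `copy_elementwise` are hypothetical stand-ins invented for this example, not RAFT or mdspan APIs, and the naive double loop is an assumption rather than the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a 2-D mdspan: a non-owning pointer plus extents and strides.
template <typename T>
struct matrix_view {
  T* data;
  std::size_t rows;
  std::size_t cols;
  std::size_t row_stride;
  std::size_t col_stride;

  // Maps a logical (row, column) index to the underlying storage.
  T& operator()(std::size_t i, std::size_t j) const
  {
    return data[i * row_stride + j * col_stride];
  }
};

// Row-major: consecutive elements of a row are adjacent in memory.
template <typename T>
matrix_view<T> make_row_major(T* data, std::size_t rows, std::size_t cols)
{
  return {data, rows, cols, cols, 1};
}

// Column-major: consecutive elements of a column are adjacent in memory.
template <typename T>
matrix_view<T> make_col_major(T* data, std::size_t rows, std::size_t cols)
{
  return {data, rows, cols, 1, rows};
}

// Copies by logical index, so the two views may differ in layout and in
// (convertible) element type. Argument order mirrors copy(dest, src).
template <typename Dst, typename Src>
void copy_elementwise(matrix_view<Dst> dst, matrix_view<Src> src)
{
  static_assert(std::is_convertible<Src, Dst>::value,
                "source elements must be convertible to destination elements");
  assert(dst.rows == src.rows && dst.cols == src.cols);
  for (std::size_t i = 0; i < src.rows; ++i) {
    for (std::size_t j = 0; j < src.cols; ++j) {
      dst(i, j) = static_cast<Dst>(src(i, j));
    }
  }
}
```

Copying a 2x3 row-major `int` matrix holding 1 through 6 into a column-major `double` buffer this way leaves the destination storage as 1, 4, 2, 5, 3, 6.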

Limitations

Currently does not support copies between mdspans on two different GPUs
Currently not performant for generic host-to-host copies (would be much easier to optimize with submdspan for padded layouts)
Submdspan with padded layouts would also make it easier to improve perf of some device-to-device copies, though perf should already be quite good for most device-to-device copies.

Design

Includes optional RAFT_DISABLE_CUDA build definition in order to use this utility in CUDA-free builds (important for use in the FIL backend for Triton)
Includes a new raft::stream_view object which is a thin wrapper around rmm::stream_view. Its purpose is solely to provide a symbol that will be defined in CUDA-free builds and which will throw exceptions or log error messages if someone tries to use a CUDA stream in a CUDA-free build. This avoids a whole bunch of ifdefs that would otherwise infect the whole codebase.
For the underlying copy, uses one of the following (roughly in order of preference): cudaMemcpyAsync, std::copy, cublas, a custom device kernel, or custom host-to-host transfer logic
Provides two different headers: raft/core/copy.hpp and raft/core/copy.cuh. This is to accommodate the custom kernel necessary for handling completely generic device-to-device copies. See below for more details.
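
The stream wrapper idea from the list above can be sketched roughly as follows. This is not the actual `raft::stream_view` code: `SKETCH_DISABLE_CUDA`, `sketch::stream_view`, `sketch::non_cuda_build_error`, and `sketch::finish_copy` are hypothetical names, and the macro is defined inside the example only so that it builds without CUDA.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical switch standing in for the RAFT_DISABLE_CUDA idea; it is
// defined here so that this sketch compiles without any CUDA headers.
#define SKETCH_DISABLE_CUDA

#ifndef SKETCH_DISABLE_CUDA
#include <cuda_runtime_api.h>
#endif

namespace sketch {

// Hypothetical exception type for stream use in a CUDA-free build.
struct non_cuda_build_error : std::logic_error {
  using std::logic_error::logic_error;
};

#ifdef SKETCH_DISABLE_CUDA
// CUDA-free build: the symbol still exists, but using it is an error.
class stream_view {
 public:
  void synchronize() const
  {
    throw non_cuda_build_error("attempted to use a CUDA stream in a CUDA-free build");
  }
};
#else
// CUDA build: a thin wrapper around a real stream (error checking omitted).
class stream_view {
 public:
  explicit stream_view(cudaStream_t stream = nullptr) : stream_(stream) {}
  void synchronize() const { cudaStreamSynchronize(stream_); }

 private:
  cudaStream_t stream_;
};
#endif

// Calling code is written once, with no #ifdef of its own.
inline void finish_copy(stream_view stream, bool touched_device)
{
  if (touched_device) { stream.synchronize(); }
}

}  // namespace sketch
```

With the switch defined, `sketch::finish_copy(stream, false)` returns normally, while `sketch::finish_copy(stream, true)` throws `sketch::non_cuda_build_error`. The preprocessor branch stays confined to the wrapper itself.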

Details on the header split

For many instantiations, even those which involve the device, we do not require nvcc compilation. If, however, we determine at compilation time that we must use a custom kernel for the copy, then we must invoke nvcc. We do not wish to indicate that a public header file is a C++ header when it is a CUDA header or vice versa, so we split the definitions into separate hpp and cuh files, with all template instantiations requiring the custom kernel enable-if'd out of the hpp file.

Thus, the cuh header can be used for any mdspan-to-mdspan copy, but the hpp file will not compile for those specific instantiations that require a custom kernel. The recommended workflow is that if a cpp file requires an mdspan-to-mdspan copy, first try the hpp header. If that fails, the cpp file must be converted to a cu file, and the cuh header should be used. For source files that are already being compiled with nvcc (i.e. .cu files), the cuh header might as well be used and will not result in any additional compile time penalty.
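
The enable-if gating described above might look roughly like the following sketch. Everything in it is hypothetical (`sketch::view`, `needs_custom_kernel`, `hpp_copy`, `hpp_copy_compiles`), and the rule used to decide when a kernel is needed is a simplifying assumption (both sides on device, with differing layout or element type). The PR's real condition is not spelled out here and also has to account for the cublas path.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {

struct row_major {};
struct col_major {};
enum class memory { host, device };

// Hypothetical view type that carries element type, layout, and memory
// space in its static type, as an mdspan with a memory-space accessor would.
template <typename T, typename Layout, memory Mem>
struct view {
  using element_type = T;
  using layout_type  = Layout;
  static constexpr memory mem = Mem;
  T* data;
  std::size_t size;
};

// ASSUMED rule, for illustration only: a custom kernel is needed when both
// views are on device and they differ in layout or element type.
template <typename Dst, typename Src>
struct needs_custom_kernel
  : std::integral_constant<
      bool,
      Dst::mem == memory::device && Src::mem == memory::device &&
        (!std::is_same<typename Dst::layout_type, typename Src::layout_type>::value ||
         !std::is_same<typename std::remove_cv<typename Dst::element_type>::type,
                       typename std::remove_cv<typename Src::element_type>::type>::value)> {};

// What a plain C++ header could expose: this overload only participates when
// no kernel is needed. Declaration only; the body (cudaMemcpyAsync,
// std::copy, ...) is irrelevant to the gating shown here.
template <typename Dst, typename Src>
typename std::enable_if<!needs_custom_kernel<Dst, Src>::value>::type hpp_copy(Dst dst, Src src);

// Detects whether a call to hpp_copy(dst, src) would compile for these types.
template <typename Dst, typename Src, typename = void>
struct hpp_copy_compiles : std::false_type {};

template <typename Dst, typename Src>
struct hpp_copy_compiles<Dst,
                         Src,
                         decltype(hpp_copy(std::declval<Dst>(), std::declval<Src>()))>
  : std::true_type {};

}  // namespace sketch
```

Under this assumed rule, a host-to-device copy with matching layout and type is reachable through the plain C++ overload, while a device-to-device copy that changes layout or element type is not, which corresponds to the "switch to the cuh header" case in the workflow.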

Remaining tasks to leave WIP status

[x] Add benchmarks for copies
[x] Ensure that new function is correctly added to docs

Follow-up items

Optimize host-to-host transfers using a cache-oblivious approach with SIMD-accelerated transposes for contiguous memory
Test cache-oblivious device-to-device transfers and compare performance
Provide transparent support for copies between devices.
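
For readers unfamiliar with the term, a cache-oblivious host-to-host copy between row-major and column-major storage can be sketched with the textbook recursive subdivision below. This is a generic illustration and not code from this PR or its follow-ups: `sketch::transpose_copy` and the leaf size of 16 are assumptions, and no SIMD is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Copies the block [r0, r1) x [c0, c1) of a rows x cols row-major source into
// a column-major destination of the same logical shape. The larger dimension
// is halved recursively until the block is small, so each leaf touches a
// compact region of both buffers regardless of the actual cache size.
template <typename Dst, typename Src>
void transpose_copy(Dst* dst,
                    const Src* src,
                    std::size_t rows,
                    std::size_t cols,
                    std::size_t r0,
                    std::size_t r1,
                    std::size_t c0,
                    std::size_t c1)
{
  const std::size_t leaf = 16;  // assumed cutoff; any small constant works
  const std::size_t nr   = r1 - r0;
  const std::size_t nc   = c1 - c0;
  if (nr <= leaf && nc <= leaf) {
    for (std::size_t i = r0; i < r1; ++i) {
      for (std::size_t j = c0; j < c1; ++j) {
        dst[j * rows + i] = static_cast<Dst>(src[i * cols + j]);
      }
    }
  } else if (nr >= nc) {
    const std::size_t mid = r0 + nr / 2;  // split the row range
    transpose_copy(dst, src, rows, cols, r0, mid, c0, c1);
    transpose_copy(dst, src, rows, cols, mid, r1, c0, c1);
  } else {
    const std::size_t mid = c0 + nc / 2;  // split the column range
    transpose_copy(dst, src, rows, cols, r0, r1, c0, mid);
    transpose_copy(dst, src, rows, cols, r0, r1, mid, c1);
  }
}

}  // namespace sketch
```

A whole-matrix copy is `sketch::transpose_copy(dst, src, rows, cols, 0, rows, 0, cols)`. The element type conversion happens in the leaf loop, so the same routine covers copies between convertible types.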

Relationship to mdbuffer

This utility encapsulates a substantial chunk of the core logic required for the mdbuffer implementation. It is being split into its own PR both because it is useful on its own and because the mdbuffer work has been delayed by higher priority tasks.

Close #1779

wphicks · 2023-09-22T18:51:00Z

/ok to test

wphicks · 2023-10-03T15:08:57Z

/ok to test

wphicks · 2023-10-04T19:15:51Z

/ok to test

wphicks · 2023-10-05T14:45:00Z

CI failing for same reason as other 23.12 PRs. I believe this should be unblocked after #1868 goes through and gets merged back into this PR.

cjnolet · 2023-10-06T02:47:31Z

/ok to test

cjnolet · 2023-10-06T15:07:47Z

/merge

Authors:
- William Hicks (https://github.com/wphicks)
- Tarang Jain (https://github.com/tarang-jain)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: rapidsai#1818

Something appears to have changed between 23.10 and 23.12 causing the unit test suite to require more memory than before. As of now, I see no meaningful new commits on the 23.12 branch of cuml. The only real new commit is the addition of ARM CUDA 12 conda builds, which are completely independent. Therefore, the cause of the OOMs must be a change in a dependency. I do see multiple significant commits in raft, so perhaps one of those is the cause. rapidsai/raft#1818 seems like the most plausible culprit, but that's coming from my position of absolute ignorance about raft and the fact that unintentionally copying data in mdspans would in principle be an easy way to accidentally increase memory usage. However, it really could be coming from anywhere.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
- William Hicks (https://github.com/wphicks)
- Jake Awe (https://github.com/AyodeAwe)

URL: #5611

tarang-jain added 30 commits April 3, 2023 11:08

Initial commit

e24fd2e

Merge branch 'branch-23.04' of https://github.com/rapidsai/raft into …

b8cda77

…fea-add-buffer

New commit

07dabfe

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

64eb461

…fea-add-buffer

Update

21c2641

Merge

c84daa6

Merge

4ad421b

Merge

ea11b07

build

ab19410

Test start

9870e9d

Test start

51a2581

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

552b21e

…fea-add-buffer

style changes

d0e7b2c

merge

f72f7f8

merge dependencies.yaml

05f9daa

Updates

0250931

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

057743d

…fea-add-buffer

Debugging

20042b0

Update gtest

2d189c3

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

53c4557

…fea-add-buffer

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

de753ae

…fea-add-buffer

Some updates after reviews

2f8b294

Use raft::resources

6539ef4

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

1709521

…fea-add-buffer

move exception

008bb5b

Updates after PR Reviews

5b97273

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

5be6ec2

…fea-add-buffer

Add container policy

838bfef

further changes with container policy

e035e2e

Merge branch 'branch-23.06' of https://github.com/rapidsai/raft into …

cd91a88

…fea-add-buffer

wphicks and others added 4 commits September 21, 2023 11:25

Correct dtype compatibility test

faa402a

Provide cleaner compile error for using copy with unsupported types

2eba34d

Merge branch 'branch-23.10' into fea-mdspan_copy

ca77cf0

Update stream_view docs

4389b64

wphicks mentioned this pull request Sep 22, 2023

[FEA] Use cache-oblivious copies for arbitrary copies in raft::copy #1842

Open

wphicks added 2 commits September 22, 2023 12:20

Merge branch 'branch-23.10' into fea-mdspan_copy

7416b73

Merge branch 'branch-23.10' into fea-mdspan_copy

7f407ed

wphicks added 2 commits September 22, 2023 15:11

Update stream view docs

62ac60a

Merge remote-tracking branch 'origin/fea-mdspan_copy' into fea-mdspan…

5bddcc8

…_copy

wphicks changed the base branch from branch-23.10 to branch-23.12 October 2, 2023 14:07

wphicks and others added 2 commits October 2, 2023 10:07

Merge branch 'branch-23.12' into fea-mdspan_copy

bd5a8f8

Add static asserts for mdspan_copyable

a8b17a8

wphicks added the "4 - Waiting on Reviewer" label Oct 2, 2023

Correct iteration in host-to-host copies

722425c

divyegala approved these changes Oct 2, 2023

wphicks added the "5 - Ready to Merge" label and removed the "4 - Waiting on Reviewer" label Oct 3, 2023

Fix double-defined target from branch merge

0863db0

Merge branch 'branch-23.12' into fea-mdspan_copy

5c4349e

rapids-bot bot merged commit c735ecb into rapidsai:branch-23.12 Oct 6, 2023

vyasr mentioned this pull request Oct 10, 2023

Reduce parallelism to avoid OOMs in wheel tests rapidsai/cuml#5611

Merged

rapids-bot[bot] merged 104 commits into rapidsai:branch-23.12 from wphicks:fea-mdspan_copy

6 participants
