Cudastf by sidelnik · Pull Request #794 · NVIDIA/MatX

sidelnik · 2024-11-05T00:00:30Z

Initial updates to the build system to get Matx working with CUDASTF

cliffburdick · 2024-11-05T21:40:27Z

examples/cgsolve.cu


  exec.sync();
+#if 1
+  ctx.finalize();


what is finalize used for vs sync? Could you hide the context in the executor so the user doesn't need it, and calling exec.sync() calls finalize()?

finalize terminates everything in the stf context, it waits for asynchronous tasks, deletes internal resources etc... you can only do it once, sync is more equivalent to a ctx.task_fence() which is a non blocking fence (it returns a CUDA stream, and waiting for that stream means everything was done).

I'd like to move finalize to the dtor of the executor, but there are some caveats if you define the executor as a static variable, is this allowed ? The caveat might be some inappropriate unload ordering of CUDA and STF libraries as usual ...

Sounds good. I think the destructor is the right place. but does sync() work as expected?

@sidelnik is it doing a task fence with a stream sync ?

@caugonnet , sync() should be calling ctx.task_fence() now. I agree, I think we should place the ctx.finalize() inside the stf executor dtor

cliffburdick · 2024-11-05T21:58:38Z

examples/fft_conv.cu

  }

+#if 0
  cudaEventRecord(stop, stream);


Eventually we should mask these events behind the executor as well so the timing is the same regardless of the executor.

Yes this makes it look like the code is very different for both executors but that timing is the sole reason especially if finalize is moved to the dtor

include/matx/core/operator_utils.h

cliffburdick · 2024-11-06T17:54:15Z

include/matx/core/tensor.h

   */
  __MATX_HOST__ tensor_t(tensor_t const &rhs) noexcept
-      : detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_}, storage_(rhs.storage_)
+      : detail::tensor_impl_t<T, RANK, Desc>{rhs.ldata_, rhs.desc_, rhs.stf_ldata_}, storage_(rhs.storage_)


It would be good to understand why this extra data member is needed, because this pointer exists on the device potentially many times, so it can increase the size of the operator.

That's where a careful review of the design is needed ... Our logical data class tracks the use of a specific piece of data, your tensor seems to be a view to some data (with shapes and so on), so it's ok to have just the pointer and shapes, but in STF we do need to keep track of the internal state of the data (who owns a copy, which tasks depend on it, etc...). This is what the logical data does on your behalf and which your tensors cannot do by merely using the pointer.

One conservative take is to say that if you slice a tensor, this is the SAME logical data, so that further concurrent write accesses are serialized. This is sub-optimal when you have non overlapping slices but we cannot do better in a simple strategy. This ensures correctness but not optimality for concurrency

@cliffburdick you say it exists many times on the device, but isn't this a host only class ?

tensor_t is host/device, but tensor_impl_t is device-only

Then i'm even surprised a logical_data can exist in device code, or the storage for it ! But this may be a pointer to an optional logical data ... We need to improve that

ugh I mistyped. tensor_t is ONLY on the host. tensor_impl_t is both.

Still, it's surprising that we allow the logical data pointer to go on a device

ldata is local data, and is ultimately just a raw pointer that points to the data needed on the device. This may be the same as the base pointer, or it may be something like a strided/offset pointer.

cliffburdick · 2024-11-14T17:52:25Z

include/matx/executors/stf.h

+       * 
+       * @param stream CUDA stream
+       */
+      stfExecutor(cudaStream_t stream) : stream_(stream) {


What does a stream do here? I thought STF had its own internal streams?

@cliffburdick In STF you can create nested/localized contexts & streams from existing (non-STF created) streams. This allows STF mechanisms to be correctly synchronized within the existing stream ecosystem. @caugonnet correct me if I am wrong.

caugonnet · 2024-11-14T19:11:47Z

include/matx/core/tensor.h


+  template <typename S2 = Storage, typename D2 = Desc,
+            std::enable_if_t<is_matx_storage_v<typename remove_cvref<S2>::type> && is_matx_descriptor_v<typename remove_cvref<D2>::type>, bool> = true>
+  tensor_t(S2 &&s, D2 &&desc, T* ldata, std::optional<stf_logicaldata_type > *stf_ldata_) :


We need to do something about that type ... std::optional<stf_logicaldata_type > *stf_ldata_

The rationale is to be able to define a tensor before it is associated to an executor, so the logical data might be set lazily.

caugonnet · 2024-11-14T19:13:08Z

include/matx/core/tensor_impl.h

     */
    tensor_impl_t() {
-
+      auto ldptr = new std::optional<stf_logicaldata_type>();


this feels bad

This won't compile anymore since we don't allow std:: types on the device. It might work with cuda::std::optional, but we don't use that anywhere currently.

@sidelnik now that we have the notion of logical_token, i believe we might simplify that. Maybe rename stf_logicaldata_type to stf_token ?

The risk with token is that if we get it wrong, it's easier to mess things : with a logical data, until you do have some "value" in it, you can't read it and you'll get runtime errors. If you have aliases which would currently use the same token under the hood, it would also have to use the same token when creating the aliased data.

caugonnet · 2024-11-14T19:13:30Z

include/matx/core/tensor_impl.h

+    template <typename DescriptorType, std::enable_if_t<is_matx_descriptor_v<typename remove_cvref<DescriptorType>::type>, bool> =     true>
+    __MATX_INLINE__ __MATX_DEVICE__ __MATX_HOST__ tensor_impl_t(T *const ldata,
+                    DescriptorType &&desc, std::optional<stf_logicaldata_type > *stf_ldata)
+        : ldata_(ldata), desc_{std::forward<DescriptorType>(desc)}, stf_ldata_(stf_ldata)


::std::move(stf_ldata) ?

caugonnet · 2024-11-14T19:17:48Z

include/matx/core/tensor_impl.h

+#endif
+
+      if (perm == 0) {
+          task.add_deps(ld.write());


We could directly build a task_dep in CUDASTF, matching perm value with the type ... But it seems there is no such thing as a clean way to do this !

caugonnet · 2024-11-14T19:19:21Z

include/matx/core/tensor_impl.h

+            place = getDataPlace(Data());
+#endif
+
+            *stf_ldata_ = ctx.logical_data(cuda::experimental::stf::void_interface());


Some comment would be welcome here :) This is creating a logical data with a void data interface because we don't rely on CUDASTF for transfers/allocation, it's just for sync.

Putting a value here, and not a shape of a void interface means we don't have to issue a "write" task in CUDASTF

caugonnet · 2024-11-14T19:29:19Z

include/matx/core/utils.h

 namespace detail {
+
+#if 0
+__MATX_INLINE__ cuda::experimental::stf::data_place getDataPlace(void *ptr) {


Why don't we keep it ? Note that for void data interface it's not super critical but still ...

caugonnet · 2024-11-14T19:31:50Z

include/matx/core/utils.h

+                        return data_place::current_device();
+                    case MATX_INVALID_MEMORY:
+                        //std::cout << "Data kind is invalid: assuming managed memory\n";
+                        return data_place::managed;


this seems like an error

caugonnet · 2024-11-14T19:33:26Z

include/matx/executors/stf.h

+            }
+            else {
+                //std::cout << " RANK 0 not on LHS operator = " << op.str() << '\n';
+                detail::matxOpT0Kernel<<<blocks, threads, 0, stream_>>>(op);


Why do we sometimes use something without a task ? Is it coherent with STF tasks?

caugonnet · 2024-11-14T19:34:48Z

include/matx/executors/stf.h

+
+            bool stride = detail::get_grid_dims<Op::Rank()>(blocks, threads, sizes, 256);
+
+            if constexpr (Op::Rank() == 1) {


It looks like we could factorize all that constexpr cascade, and move the constexpr tests into the lambda ?

caugonnet · 2024-11-14T19:35:41Z

include/matx/generators/generator1d.h

          }

+        template <typename Task>
+        __MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm=1) const noexcept { }


So this operator is defined per operator, and is STF specific ? it's not part of the executor nor relying on overloads / traits ?

caugonnet · 2024-11-14T19:37:58Z

include/matx/operators/conv.h

+              b_.apply_dep_to_task(tsk, 1);
+
+              tsk->*[&](cudaStream_t s) {
+                  auto exec = cudaExecutor(s);


So create a nested MatX executor, is that legal ?

I think it should be fine. The cache is ultimately what would possibly have side effects

What happens exactly in the dtor of the executor @cliffburdick, nothing special like a stream sync ?

No, it doesn't do anything

cliffburdick · 2024-11-20T19:06:05Z

include/matx/core/tensor_impl.h

+        if constexpr (is_cuda_executor_v<Executor>) {
+            return;
+        }
+        else if constexpr (!is_cuda_executor_v<Executor>) {


just else?

cliffburdick · 2024-11-20T19:07:12Z

include/matx/core/tensor_impl.h

    Desc desc_;
+
+  public:
+    mutable std::optional<stf_logicaldata_type > *stf_ldata_;


As discussed before this won't work since we can't use std:: objects on the device. It might work with cuda::std::optional, but we'd likely need to justify the overhead vs other options

cliffburdick · 2024-11-22T22:24:43Z

include/matx/operators/constval.h

          return v_; };

+      template <typename Task>
+      __MATX_INLINE__ void apply_dep_to_task([[maybe_unused]] Task &&task, [[maybe_unused]] int perm) const noexcept { }


Operator members typically use camel case format

cliffburdick · 2024-11-22T22:26:10Z

include/matx/operators/all.h

+            tsk.set_symbol("all_task");
+
+            output.PreRun(out_dims_, std::forward<Executor>(ex));
+            output.apply_dep_to_task(tsk, 0);


Why isn't apply_dep_to_task just part of PreRun? It looks like it's called in the same place

cliffburdick · 2024-11-22T23:17:45Z

include/matx/operators/fft.h

-            if constexpr (std::is_same_v<FFTType, fft_t>) { 
-              fft_impl(permute(cuda::std::get<0>(out), perm_), permute(a_, perm_), fft_size_, norm_, ex);
+            // stfexecutor case
+            if constexpr (!is_cuda_executor_v<Executor>) {


Do you want this to run for the host executor too?

cliffburdick · 2024-11-22T23:19:48Z

include/matx/operators/fft.h

+                output.apply_dep_to_task(tsk, 0);
+                a_.apply_dep_to_task(tsk, 1);
+
+                tsk->*[&](cudaStream_t s) {


Rather than checking if this is not a cuda executor, then creating one inside, can it somehow pull a stream from STF and just use that here?

caugonnet · 2025-01-22T07:54:10Z

include/matx/transforms/cgsolve.h

      // A*X
-      (Ap = matvec(A, X)).run(stream);
+      //(Ap = matvec(A, X)).run(stream);
+      (Ap = matvec(A, X)).run(exec);


is that the same to call run(exec) and run(stream) when we have a "classic" executor ? (won't it trigger much more work ?)

…s a .csv file of results

Merging cudastf branch to main branch

sidelnik added 6 commits November 4, 2024 15:22

Update build config to pull CUDASTF

d6dc01d

remove const expr

245b20f

Updates to get basic cudastf functionality working with matx

9b35ec8

move to void_interface

7d298d4

add stf executor

154b3f9

support for cgsolve operator and a few examples

c8ef988

cliffburdick reviewed Nov 5, 2024

View reviewed changes

cliffburdick reviewed Nov 6, 2024

View reviewed changes

include/matx/core/operator_utils.h Show resolved Hide resolved

cliffburdick reviewed Nov 6, 2024

View reviewed changes

cliffburdick reviewed Nov 14, 2024

View reviewed changes

caugonnet reviewed Nov 14, 2024

View reviewed changes

cliffburdick reviewed Nov 20, 2024

View reviewed changes

cliffburdick reviewed Nov 22, 2024

View reviewed changes

sidelnik added 2 commits December 3, 2024 13:17

make the sync() that is part of stfexecutor call ctx.task_fence()

52b18c9

fix typo

d726b10

sidelnik added 10 commits December 17, 2024 13:11

minor typo fix

b062577

update version of stf

bbf9abc

cleanup constexpr case for stfexecutor

3e831ea

cleanup constexpr case for stfexecutor

702fe79

add conditional support for cudagraph to the stf executor

5bfe21e

update to latest cudastf

f407256

switch to use logical token

221599c

update parameters for radar code

7a5bb6c

update to radar code to work with command line args

0c2432f

cleanup to support different executor

3ae267b

caugonnet reviewed Jan 22, 2025

View reviewed changes

sidelnik and others added 19 commits January 24, 2025 11:10

cleanup radar code to emit stf and cuda versions

6a75794

test script that runs simple radar with different input sizes. output…

f1facca

…s a .csv file of results

enable cuda graphs as a command line argument enableGraphs

0199e75

add support for the random/randomOp generator

39b16f4

get the basic spectrogram code working with stf

9b7c4b0

get spectrogram cudagraph code working with stf

f9e09f1

add assert in the case stream capture is turned on if creating a plan

6c9a791

Merge branch 'cudastf' into cudastf_latest

a1efd1c

Merge pull request #2 from sidelnik/cudastf_latest

6437eab

Merging cudastf branch to main branch

Apps using matx with stf should get these flags

bbb9aae

fix constructor

e13c9b6

fix typo/bug

7244399

update to example code to fix compile error

66f6850

update to example code to fix compile error

89e2a43

update test script for radar code

973886b

temp fix to the allocator dtor

92885e7

remove warning to work with latest stf

8607840

replace logical token with token

14e0985

update version to use cccl from main

92e04d5


		bool stride = detail::get_grid_dims<Op::Rank()>(blocks, threads, sizes, 256);

		if constexpr (Op::Rank() == 1) {

Conversation

sidelnik commented Nov 5, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

caugonnet Nov 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

caugonnet Nov 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

caugonnet Nov 14, 2024 •

edited

Loading

caugonnet Nov 14, 2024 •

edited

Loading