MatX for element-wise tensor calculation #1113

CrreeL · 2026-01-11T05:10:06Z

CrreeL
Jan 11, 2026

Hi! My project needs to evaluate a set of expressions of the form p = A*exp(-(E - F*x)/(kB*T)). I tried both MatX (with its NumPy- and MATLAB-like syntax) and Thrust (a functor passed to transform). MatX turned out to be about half as fast as Thrust. I think that is reasonable, since MatX provides a more user-friendly syntax, but I would still like to check whether this matches the developers' expectations.
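
The functor-plus-transform formulation mentioned above can be sketched on the host as follows. This is a minimal illustration rather than the code used in the project: std::transform stands in for thrust::transform, the names ExprFunctor and evaluate are hypothetical, and it assumes that E and x vary per element while A, F, and the product kB*T are scalars.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical functor: computes p = A * exp(-(E - F*x) / (kB*T)) for one element.
// Assumption: A, F and kB*T are scalars; E and x are per-element inputs.
struct ExprFunctor {
    double A;    // prefactor
    double F;    // coefficient multiplying x
    double kBT;  // product kB * T, assumed non-zero

    double operator()(double E, double x) const {
        return A * std::exp(-(E - F * x) / kBT);
    }
};

// Applies the functor element-wise; E and x are assumed to have the same length.
std::vector<double> evaluate(const std::vector<double>& E,
                             const std::vector<double>& x,
                             const ExprFunctor& op) {
    std::vector<double> p(E.size());
    std::transform(E.begin(), E.end(), x.begin(), p.begin(), op);
    return p;
}
```

With Thrust, the same functor would typically have its operator() marked `__host__ __device__` and be passed to thrust::transform over device vectors. A host version like this one can also serve as a reference for checking GPU results.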


cliffburdick · 2026-01-11T05:32:16Z

cliffburdick
Jan 11, 2026
Maintainer

Hi @CrreeL, thanks for reporting this. We definitely do not expect Thrust to be faster. We will test it on our end. Could you please provide a sample of your Thrust code so we can compare?

3 replies

CrreeL Jan 11, 2026
Author

I'm pleasantly surprised to hear that MatX is designed to be faster than Thrust. Sure, I'd be happy to provide a code sample. It should be available tomorrow.

CrreeL Jan 11, 2026
Author

I think I've found the problem. I wrote a test program that evaluates expressions like p = A*exp(-(E - F*x)/(kB*T)). It first creates the tensors with make_tensor, initializes them with (tensors = initialvalue).run(), and calls cudaDeviceSynchronize; it then evaluates the expressions and calls cudaDeviceSynchronize again. The evaluation was repeated 10 times in a for loop, and the first iteration took substantially longer than the remaining 9.

Maybe I missed some details in the documentation. Does the first execution of an element-wise expression incur a one-time cost? And if only the element values of the tensors are updated, that cost is not incurred again, right? For example, suppose I define the tensors as class members and, on each call to a member function, update the tensors and evaluate the expressions. Would the one-time cost of each expression be incurred only on the first call to that member function?

cliffburdick Jan 11, 2026
Maintainer

Hi @CrreeL, we do not necessarily expect MatX to be faster than Thrust, but it shouldn't be slower. Both should use the same constructs under the hood, although there are cases where I would expect MatX to be the faster of the two.

As far as benchmarking goes, make_tensor uses managed memory by default, and the first iteration of your loop will page the memory in and out of the GPU, which adds significantly to the runtime. There are other reasons for the slowness of the first iteration too, and in general you should never include the first iteration in your measurements. For simple element-wise computations like yours, there are likely only two sources of large time penalties:

1. Managed memory migration
2. GPU clocks

To fix #1, you can pass MATX_DEVICE_MEMORY as the memory-type argument to make_tensor. This gives the fastest speed possible, even faster than managed memory on later loop iterations, because there is no paging. The downside is that on most systems you cannot access the memory from the CPU, since it is equivalent to a cudaMalloc allocation. For #2, you can fix the GPU clocks using nvidia-smi, or it is usually good enough to run many iterations in a loop, start your timing events, and then run the rest.

A good pattern would be:

1. Run 100 iterations
2. Start events
3. Run 100 iterations
4. Stop events and record timing

If you do this and we're slower than thrust, please let me know.
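
The warm-up-then-measure pattern above can be sketched as a small host-side helper. This is a sketch under stated assumptions, not a MatX or CUDA API: std::chrono stands in for the CUDA events mentioned above, the name mean_time_us is hypothetical, and the callable `work` is assumed to block until its computation has finished. GPU launches are asynchronous, so a GPU workload would need to synchronize inside `work`, or the timing would be done with CUDA events instead.

```cpp
#include <chrono>

// Runs `work` `warmup` times without timing, then times `timed` further runs
// and returns the mean wall-clock time per run in microseconds.
// Assumptions: `timed` > 0 and `work` returns only after its computation is done.
template <typename Work>
double mean_time_us(Work&& work, int warmup = 100, int timed = 100) {
    // Warm-up runs absorb one-time costs (e.g. memory migration, clock ramp-up).
    for (int i = 0; i < warmup; ++i) {
        work();
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / timed;
}
```

Calling `mean_time_us(f)` with the defaults matches the 100 + 100 iteration split described above. Running the MatX and Thrust versions through the same helper keeps the comparison like-for-like.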

Answer selected by CrreeL

CrreeL · 2026-01-12T15:09:35Z

CrreeL
Jan 12, 2026
Author

I tried MATX_DEVICE_MEMORY and achieved a significant speedup. Indeed, MatX is faster than Thrust (transform + hand-written functor), at least for my case, which is evaluating element-wise expressions like p = A*exp(-(E - F*x)/(kB*T)).
Thank you so much for your patience with a GPU programming rookie!

0 replies
