Kernel fusion for particles #4888

dalarlla · 2026-01-09T15:51:07Z

Recently I've been using the kernel fusion variants of ParallelFor to get better GPU performance with AMR. I use the MultiFab isFusingCandidate() function to switch between MFIter-based launches and fused launches, and the results have been good, with much better performance at higher box counts.

I currently use particles as probe points in my code so that the user can generate time history data at specified points. However, as far as I can tell there is no kernel fusion analogue for particles; the approach seems to be to use a ParIter to loop over particle tiles and launch one kernel per tile. On GPU, as I understand it, these tiles correspond to the boxes in my MultiFabs. So if I have many particles scattered across all or most of my boxes at higher AMR levels, I am concerned that the particle kernel that extracts data from the underlying MultiFabs will become a bottleneck.

Would it be feasible to add functionality to loop over tiles within the ParallelFor when the underlying MultiFabs are on GPU and are good fusion candidates?

AlexanderSinn (Collaborator) · 2026-01-09T21:07:16Z

This sounds quite interesting. With some effort, I think this can be implemented using TagParallelFor defined in https://github.com/AMReX-Codes/amrex/blob/development/Src/Base/AMReX_TagParallelFor.H . Specifically using the VectorTag in

amrex/Src/Base/AMReX_TagParallelFor.H

Lines 96 to 103 in 1a2a948

    template <class T>
    struct VectorTag {
        T* p;
        Long m_size;
        [[nodiscard]] AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
        Long size () const noexcept { return m_size; }
    };

as a starting point, but replacing T* with an instance of ParticleTileData and adding extra information such as the number of particles (i.e. size), the local tile index, and maybe the mesh refinement level, the tile size, or even the Array4 of the data needed.

You will need an MFIter loop with no kernel launches to construct all the tags and add them to a Vector. That Vector can then be turned into a TagVector, which copies the data structure needed for the single kernel execution to the GPU; this adds a bit of overhead.

The TagVector can be reused for multiple kernel launches, but changing the number of particles in any of the tiles (e.g. using FillBoundary) invalidates it, so a new one has to be created. This will therefore need to be done much more frequently than with MultiFabs.
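To make the two-pass idea concrete, here is a minimal host-side sketch in plain C++. It is not AMReX code: `TileData`, `ParticleTileTag`, `Tile`, `build_tags` and `fused_for_each` are hypothetical names chosen for illustration, `TileData` only stands in for `ParticleTileData`, and the serial loop in `fused_for_each` stands in for the single fused kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ParticleTileData: a non-owning view of one tile.
struct TileData {
    double* x;  // one particle component, for brevity
};

// Hypothetical tag, modelled on VectorTag but carrying per-tile metadata.
struct ParticleTileTag {
    TileData ptd;    // replaces the raw T* of VectorTag
    long m_size;     // number of particles in this tile
    int level;       // mesh refinement level
    int tile_index;  // local tile index
    long size () const noexcept { return m_size; }
};

// Owning storage for one tile, standing in for a real particle tile.
struct Tile {
    int level;
    std::vector<double> x;
};

// Pass 1: visit every tile without launching any work and collect the tags.
// Empty tiles are skipped so they cost nothing in the fused pass.
inline std::vector<ParticleTileTag> build_tags (std::vector<Tile>& tiles) {
    std::vector<ParticleTileTag> tags;
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        if (tiles[t].x.empty()) { continue; }
        tags.push_back({TileData{tiles[t].x.data()},
                        static_cast<long>(tiles[t].x.size()),
                        tiles[t].level, static_cast<int>(t)});
    }
    return tags;
}

// Pass 2: one "launch" that covers all particles of all tiles.
// f is called as f(tag, i), with i the particle index local to the tile.
template <class F>
void fused_for_each (std::vector<ParticleTileTag> const& tags, F&& f) {
    for (auto const& tag : tags) {
        assert(tag.ptd.x != nullptr);
        for (long i = 0; i < tag.size(); ++i) { f(tag, i); }
    }
}
```

The tags hold raw pointers and sizes, which is why they go stale as soon as a tile gains or loses particles: in this model, any resize of a `Tile::x` vector requires calling `build_tags` again.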

1 reply

AlexanderSinn Jan 9, 2026
Collaborator

One note: MFParallelFor and TagParallelFor use different approaches internally to assign GPU threads to chunks of work. TagParallelFor uses bisection, while MFParallelFor over-allocates threads and uses a division. In principle, either could use the other's approach, or even something more advanced. It would be interesting to have performance data on which is better.
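The two thread-assignment strategies can be sketched on the host in plain C++. This is a simplified model and not the AMReX implementation: the function names, the `WorkItem` struct and the fixed number of slots per tile are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct WorkItem {
    int tile;    // tile this thread works on, or -1 if the thread is idle
    long local;  // particle index within that tile
};

// Exclusive prefix sum: offsets[t] is the global index of tile t's first
// particle, and offsets.back() is the total particle count.
inline std::vector<long> make_offsets (std::vector<long> const& sizes) {
    std::vector<long> offsets(sizes.size() + 1, 0);
    for (std::size_t t = 0; t < sizes.size(); ++t) {
        offsets[t + 1] = offsets[t] + sizes[t];
    }
    return offsets;
}

// Strategy 1 (bisection): exactly one thread per particle. Each thread
// binary-searches the offsets to find the tile it belongs to.
inline WorkItem map_bisect (std::vector<long> const& offsets, long g) {
    assert(g >= 0 && g < offsets.back());
    // First offset strictly greater than g; the tile is the one before it.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), g);
    int tile = static_cast<int>(it - offsets.begin()) - 1;
    return {tile, g - offsets[tile]};
}

// Strategy 2 (over-allocation + division): every tile gets `slots` threads,
// with slots >= the largest tile size. Finding the tile is one division,
// but threads past the end of a short tile have nothing to do.
inline WorkItem map_divide (std::vector<long> const& sizes, long slots, long g) {
    assert(slots > 0 && g >= 0 && g < slots * static_cast<long>(sizes.size()));
    int tile = static_cast<int>(g / slots);
    long local = g % slots;
    if (local >= sizes[tile]) { return {-1, 0}; }
    return {tile, local};
}
```

For tile sizes {2, 0, 3}, bisection needs 5 threads, each doing a short search, while division with 3 slots per tile needs 9 threads, of which 4 are idle. The trade-off is search cost per thread against wasted threads when tile sizes are uneven; the sketch says nothing about which is faster on real hardware.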
