This sounds quite interesting. With some effort, I think this can be implemented using TagParallelFor, defined in https://github.com/AMReX-Codes/amrex/blob/development/Src/Base/AMReX_TagParallelFor.H. Specifically, the VectorTag in amrex/Src/Base/AMReX_TagParallelFor.H (lines 96 to 103 in 1a2a948) could serve as a starting point, replacing T* with an instance of ParticleTileData and adding extra info such as the number of particles (i.e. size), the local tile index, and maybe the mesh refinement level, the tile size, or even the Array4 of the data needed. You will need an MFIter loop with no kernel launches to construct all the Tags and add them to the Vector. That Vector can then be turned into a TagVector, which copies the data structure needed for the single kernel execution to the GPU; this adds a bit of overhead. The TagVector can be reused for multiple kernel launches, but changing the number of particles in any of the tiles (e.g. using FillBoundary) will invalidate it, so a new one will need to be created. The rebuild will therefore be needed much more frequently than with the Multifabs.
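To make the idea concrete, here is a minimal CPU-only sketch of the tag approach in plain C++. It does not use AMReX: `TileTag` and `FusedTags` are hypothetical stand-ins for a particle-tile tag and for TagVector, and a raw `double*` stands in for ParticleTileData. It only illustrates the indexing logic (one flat index range covering all tiles, mapped back to a tile and a local particle index through prefix sums of the tile sizes), not how AMReX actually implements it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-tile tag. In an AMReX version the raw pointer would be
// replaced by a ParticleTileData instance, as suggested above.
struct TileTag {
    double* data;      // particle payload of this tile (stand-in)
    std::size_t size;  // number of particles in this tile
    int tile_index;    // local tile index
    int level;         // mesh refinement level
};

// Hypothetical stand-in for a tag vector: stores the tags plus prefix sums
// of their sizes so that a single flat loop can cover every tile.
class FusedTags {
public:
    explicit FusedTags(std::vector<TileTag> tags) : m_tags(std::move(tags)) {
        m_offsets.reserve(m_tags.size() + 1);
        m_offsets.push_back(0);
        for (const TileTag& t : m_tags) {
            m_offsets.push_back(m_offsets.back() + t.size);
        }
    }

    // Total number of particles over all tiles.
    std::size_t total() const { return m_offsets.back(); }

    // One "launch" over all particles of all tiles. On a GPU the body of
    // this loop would be the work done by one thread of a single kernel.
    template <class F>
    void for_each(F&& f) const {
        for (std::size_t g = 0; g < total(); ++g) {
            // The first offset greater than g belongs to the next tag, so
            // the owning tag is the one just before it. Empty tiles are
            // skipped automatically.
            auto it = std::upper_bound(m_offsets.begin(), m_offsets.end(), g);
            std::size_t itag = static_cast<std::size_t>(it - m_offsets.begin()) - 1;
            assert(itag < m_tags.size());
            f(m_tags[itag], g - m_offsets[itag]);
        }
    }

private:
    std::vector<TileTag> m_tags;
    std::vector<std::size_t> m_offsets;  // m_offsets[i] = first flat index of tag i
};
```

A caller would fill a `std::vector<TileTag>` in a loop over tiles (no launches), build one `FusedTags` from it, and then call `for_each` with a lambda taking `(const TileTag&, std::size_t)`. Because the offsets are computed from the tile sizes at construction, the object has to be rebuilt whenever any tile's particle count changes, which mirrors the invalidation caveat above.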
Recently I've been making use of the kernel fusion variants of ParallelFor to get better performance on GPU when using AMR. I've used the isFusingCandidate() function for multifabs to switch between MFIter-based and kernel-fusion launches, and the results have been good, with much better performance at higher box counts.
I currently use particles as probe points within my code to allow the user to generate time history data at specified points. However, as far as I can tell, there is no kernel fusion analogue for particles, and the approach seems to be to use a ParIter to loop over particle tiles and launch a kernel per tile. On GPU, as I understand it, these tiles will correspond to boxes in my multifabs. As such, if I have many particles scattered across all or most of my boxes at higher AMR levels, I am concerned that my particle kernel to extract data from the underlying multifabs will become a bottleneck.
Would it be feasible to add functionality to loop over tiles within the ParallelFor when the underlying multifabs are on GPU and are good fusion candidates?