GPU race condition in MPI communication #98
Merged
Bugfix
Fixes a race condition in the MPI communication used when the Color model runs on the GPU.
In the `ScaLBL_Communicator::BiSendD3Q7AA` routine, the GPU kernel must finish packing the MPI buffer before the message is sent. Currently, there is no guarantee that the kernel has finished, so the send can race with the packing and transmit a partially uninitialized message. The results are then non-reproducible and depend on the number of subdomains. This manifests as noise at the domain decomposition boundary, as shown in this water invasion of an oil-saturated cubic sphere pack:
before_bugfix.mp4
Adding a device synchronization before the `MPI_Isend` calls ensures the GPU kernels have finished packing the message. Results are then reproducible regardless of the number of subdomains, and no water phase is introduced at the domain decomposition boundary:
after_bugfix.mp4
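The ordering requirement can be illustrated with a host-only C++ analogy. This is a minimal sketch, not the code changed in this PR. `launch_pack`, `send_snapshot`, and `pack_sync_send` are hypothetical names: a background thread stands in for the asynchronous GPU packing kernel, copying the buffer stands in for `MPI_Isend`, and `future::wait()` plays the role of the device synchronization (for example `cudaDeviceSynchronize()` in CUDA).

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for an asynchronous GPU packing kernel:
// starts filling the send buffer on another thread and returns immediately.
std::future<void> launch_pack(std::vector<double>& sendbuf, double value) {
    return std::async(std::launch::async, [&sendbuf, value] {
        for (double& x : sendbuf) {
            x = value;
        }
    });
}

// Hypothetical stand-in for MPI_Isend: captures the buffer contents
// exactly as they are at the moment of the call.
std::vector<double> send_snapshot(const std::vector<double>& sendbuf) {
    return sendbuf;
}

// Pack, synchronize, then send. Without the wait() the snapshot could be
// taken while packing is still in progress, which is the race described above.
std::vector<double> pack_sync_send(std::size_t n, double value) {
    std::vector<double> sendbuf(n, -1.0);  // -1 marks "not yet packed"
    std::future<void> pack = launch_pack(sendbuf, value);
    pack.wait();  // analogue of the device synchronization before MPI_Isend
    return send_snapshot(sendbuf);
}
```

Calling `pack_sync_send(1000, 3.5)` returns a buffer in which every entry has been packed. Removing the `pack.wait()` line would make the result depend on thread timing, in the same way the unsynchronized send depended on when the packing kernel finished.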
I have not extensively checked the other models to see whether this fix needs to be extended elsewhere in the code. I am also not certain whether some compilers pick up on this dependency and force a device synchronization before sending, so the bug may or may not have affected others.
Commit 1364d10 contains some minor compilation fixes I needed to get the code to compile with nvhpc.
Resolves #94.