@jeremyfirst22
Contributor

Bugfix

Fixes a race condition in the MPI communication of the GPU implementation of the Color model.

In the ScaLBL_Communicator::BiSendD3Q7AA routine, the GPU kernel must finish packing the MPI send buffer before the message is sent. Currently, nothing guarantees that the kernel has finished. This creates a race condition in the MPI communication: a partially uninitialized message can be sent, which leads to non-reproducible results that depend on the number of subdomains:

[Image: Without_Bugfix]

This manifests as noise at the domain decomposition boundary, as shown in this water invasion of an oil-saturated cubic sphere pack:

before_bugfix.mp4

Adding a device synchronization before the MPI_Isend calls ensures that the GPU kernels have finished packing the message, which leads to reproducible results independent of the number of subdomains:

[Image: With_Bugfix]

and no introduction of water phase at the domain decomposition boundary:

after_bugfix.mp4
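The ordering the fix enforces (launch the pack kernel, synchronize the device, then post MPI_Isend) can be illustrated with a minimal CUDA sketch. This is not the LBPM code: the kernel PackHalo, the helpers PackAndSend and SelfExchangeOK, and the index-list layout are hypothetical. The buffers are allocated with cudaMallocManaged, an assumption made so the sketch also runs with an MPI library that is not CUDA-aware, and the message is sent to the calling rank so it can run as a single process.

```cuda
// Hypothetical sketch, not LBPM code: pack a halo buffer on the GPU,
// synchronize the device, and only then hand the buffer to MPI_Isend.
#include <cassert>
#include <cuda_runtime.h>
#include <mpi.h>

// Hypothetical pack kernel: buf[i] = dist[list[i]] for i in [0, count).
__global__ void PackHalo(const int *list, int count, const double *dist, double *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) buf[i] = dist[list[i]];
}

// Packs `count` values into sendbuf and posts a non-blocking send.
// Assumes sendbuf is managed memory (or device memory with CUDA-aware MPI).
void PackAndSend(const int *list, int count, const double *dist, double *sendbuf,
                 int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    assert(count > 0);  // a zero-sized grid would be an invalid launch
    const int threads = 128;
    const int blocks = (count + threads - 1) / threads;
    PackHalo<<<blocks, threads>>>(list, count, dist, sendbuf);  // returns immediately
    // The fix: block the host until the kernel has finished writing sendbuf.
    // Without this barrier, MPI_Isend may read a partially packed buffer.
    if (cudaDeviceSynchronize() != cudaSuccess) MPI_Abort(comm, 1);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, comm, req);
}

// Sends a packed buffer to this same rank and checks what arrives.
bool SelfExchangeOK(int count)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *list;
    double *dist, *sendbuf;
    cudaMallocManaged(&list, count * sizeof(int));
    cudaMallocManaged(&dist, 2 * count * sizeof(double));
    cudaMallocManaged(&sendbuf, count * sizeof(double));
    double *recvbuf = new double[count];  // plain host memory

    for (int i = 0; i < count; ++i) list[i] = 2 * i;  // pack every other site
    for (int i = 0; i < 2 * count; ++i) dist[i] = static_cast<double>(i);

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, rank, 7, MPI_COMM_WORLD, &reqs[0]);
    PackAndSend(list, count, dist, sendbuf, rank, 7, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // dist[list[i]] == 2 * i, so every received value should equal 2 * i.
    bool ok = true;
    for (int i = 0; i < count; ++i) ok = ok && (recvbuf[i] == 2.0 * i);

    cudaFree(list);
    cudaFree(dist);
    cudaFree(sendbuf);
    delete[] recvbuf;
    return ok;
}
```

SelfExchangeOK is meant to be called between MPI_Init and MPI_Finalize. Kernel launches return control to the host immediately, which is why the barrier is needed at all. If the pack kernel were launched on a dedicated stream, cudaStreamSynchronize on that stream would be a narrower alternative to the device-wide cudaDeviceSynchronize.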

I have not thoroughly checked whether the other models need the same fix elsewhere in the code. I am also not certain whether some compilers detect this dependency and force a device synchronization before sending, so this bug may or may not have affected other users.

Commit 1364d10 contains some minor fixes that I needed to compile the code with nvhpc.

Resolves #94.

@diogosiebert
Collaborator

The build-and-test action failed because of a problem downloading silo. However, I can confirm that the code builds successfully and that it resolves issue #94 for a different case than the one originally reported.

Since James has also approved the changes, and given the impact of this bug, I will proceed to merge the fix directly into the master branch.

@diogosiebert merged commit b5cac49 into OPM:master on Oct 15, 2025
1 check failed
Linked issue: Mass balance issue in GPU Color model?