Description
Some of my code uses Python tasks, because I launch a large number of operations, each sized for an individual GPU, and I want to take separate streams of these and place them on different GPUs. (E.g., the first K * 1,000 operations go to the first GPU, the second K * 1,000 go to the second GPU, etc., up to hundreds of GPUs.)
The question now is: how do I make this performance portable?
Python tasks are not permitted to use cuPyNumeric themselves, so they have to use something else. My initial version used CuPy for the implementation of my GPU Python task. As a first pass, it was easy for me to add a top-level switch to enable NumPy as an alternative to CuPy.
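For concreteness, here is a minimal sketch of the kind of top-level switch I mean (the `USE_GPU` flag and the `array_module`/`gpu_task` helpers are illustrative names, not taken from my actual code):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

USE_GPU = cp is not None  # illustrative top-level switch

def array_module():
    """Return the array module (CuPy or NumPy) the task implementation should use."""
    return cp if USE_GPU else np

def gpu_task(data):
    xp = array_module()
    a = xp.asarray(data)
    # The same code runs on either backend thanks to the shared
    # NumPy-style API, including implicit broadcasting.
    return xp.exp(-a) / (1.0 + xp.exp(-a))
```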
That gets me CPU and GPU, but still leaves out the multi-CPU/OpenMP case. I cannot afford to run a rank per CPU core (nor would that likely be efficient), and vanilla NumPy has very limited support for OpenMP (essentially only what BLAS will do on its own). Keep in mind I run a variety of NumPy operations, not limited to what ultimately boils down to a BLAS call.
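To illustrate the limitation: thread-count environment variables only reach operations that bottom out in a threaded BLAS, while ordinary element-wise ufuncs stay serial. (Which variable takes effect, e.g. `OMP_NUM_THREADS` vs. `OPENBLAS_NUM_THREADS`, depends on how the BLAS backend was built.)

```python
import os
os.environ["OMP_NUM_THREADS"] = "8"  # must be set before NumPy loads its BLAS
import numpy as np

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)

c = a @ b      # matmul goes through BLAS and may use all 8 threads
d = np.exp(a)  # an element-wise ufunc runs on a single core regardless
```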
I would like to do this without writing my code multiple times. CuPy and NumPy are close enough that I can stick to one implementation and stub out the differences. Other solutions like Numba would require a code rewrite and would not actually be more ergonomic (i.e., the CuPy/NumPy-style implementation is how I want to write this code; I do not want to be writing loops, especially with all the implicit broadcasting in the code).
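The stubbing I have in mind is small compatibility shims around the handful of places the backends differ; `module_of` and `to_host` below are illustrative names, while `cupy.get_array_module` and `cupy.asnumpy` are real CuPy APIs:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

def module_of(a):
    """Return the array module (cupy or numpy) that owns a given array."""
    if cp is not None:
        return cp.get_array_module(a)  # CuPy helper: returns cupy or numpy
    return np

def to_host(a):
    """Copy an array back to host memory, regardless of backend."""
    if cp is not None and isinstance(a, cp.ndarray):
        return cp.asnumpy(a)
    return np.asarray(a)
```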
I can think of a couple of options:
- Somehow get NumPy to run on OpenMP. As far as I'm aware, NumPy does not have complete support for this, and I would only get a subset of operations parallelized this way.
- Permit Python tasks to call cuPyNumeric operations, and somehow set things up so that, inside a Python task, these calls go directly to the leaf implementations for the appropriate architecture. See Nested parallelism #993 for details.
(For LANL/SLAC.)