(Currently: based on guesswork and second-hand information from @matthiasdiener. Away at a conference, will investigate more when I have time.)
Right now:
- arraycontext sees a CPU scalar being passed and generates a pytato Placeholder with shape=().
- That becomes an ArrayArg from Loopy's perspective.
- We then spend tons of time transferring these itty bitty things to the GPU.
Proposed remedy:
- Introduce a tag in pytato that says "this placeholder should become a ValueArg".
- Apply that in arraycontext when creating the Placeholders if appropriate.
- Respect the tag in pytato codegen.
This should remove the cost of these transfers.
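
A rough sketch of how the three steps could fit together. This is a toy model in plain Python, not actual pytato, arraycontext, or Loopy code: `ForceValueArg`, the stripped-down `Placeholder`, `make_placeholder`, and `kernel_arg_kind` are all hypothetical stand-ins, and the real tag name and hook points are still to be decided.

```python
from dataclasses import dataclass
from numbers import Number


@dataclass(frozen=True)
class ForceValueArg:
    """Hypothetical tag: 'this placeholder should become a ValueArg'."""


@dataclass(frozen=True)
class Placeholder:
    """Toy stand-in for a pytato Placeholder (name, shape, tags only)."""
    name: str
    shape: tuple
    tags: frozenset = frozenset()


def make_placeholder(name, value):
    """Step 2 (arraycontext side): tag placeholders made from CPU scalars."""
    if isinstance(value, Number):
        return Placeholder(name, (), frozenset({ForceValueArg()}))
    # Assumption: anything that is not a scalar exposes .shape like an array.
    return Placeholder(name, tuple(value.shape))


def kernel_arg_kind(placeholder):
    """Step 3 (codegen side): respect the tag when picking the argument kind."""
    if ForceValueArg() in placeholder.tags:
        if placeholder.shape != ():
            raise ValueError("only shape=() placeholders can be value args")
        return "ValueArg"  # passed by value, no device buffer to fill
    return "ArrayArg"  # backed by a device buffer, so it needs a transfer
```

In this sketch an untagged shape=() placeholder still maps to an ArrayArg, which matches the current behavior. Only placeholders that arraycontext explicitly tags would change, so anything that builds scalar-shaped placeholders for device-resident data is left alone.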