Labels: enhancement (New feature or request)
Description
In this OSG Slack thread, after the PRP folks observed/complained that they were seeing unusually low GPU utilization by RIFT jobs, @astroclark explained:
> turns out the latest batch of RIFT jobs that were submitted are intrinsically more CPU-intensive (a more expensive waveform approximant, SEOBNRv4, fwiw). The waveform generation - CPU-bound - in this case is more expensive than the likelihood calculations - the GPU part. They're aware and agree that it would make more sense to run this type of job on CPUs.
This all makes sense and is not a problem, but sparks some questions and thoughts for me:
- How common is this (as a proportion of all the RIFT workflows you run, over time)?
- In practice, do/can you know in advance which runs will behave like this? Or is it something you can only really discover after you've run a workflow?
- Does it make sense to manually assign runs to CPUs or GPUs based on this knowledge, ad hoc, or could the decision be made programmatically at either workflow-generation time or run time, so humans aren't in the loop? (See the resource-selection sketch after this list.)
- Do you think it might be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report, post facto, the CPU and GPU utilization of each run as part of its results? (See the sampling sketch after this list.) I'm going to turn this last question into its own ticket (idea: collect and record runtime performance data #16). PyCBC did this a long time ago and it's been enormously helpful; the same instrumentation would let you set alarms when things go outside expected bounds (e.g., CPU utilization approaching zero), run reports on RIFT performance over time, add automated performance regression tests between RIFT versions, and so on.
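
On the programmatic-assignment question, here is a minimal sketch of what a workflow-generation-time heuristic could look like, assuming the jobs end up as HTCondor-style resource requests. The approximant names in `CPU_BOUND_APPROXIMANTS` and the specific core/GPU counts are purely illustrative placeholders, not RIFT's actual configuration:

```python
# Hypothetical sketch: choose CPU vs GPU slots at workflow-generation time
# based on the waveform approximant. The approximant list and the idea that
# the workflow generator emits HTCondor resource requests are assumptions
# for illustration only.

# Approximants whose waveform generation is expensive enough (CPU-bound)
# that the GPU likelihood stage no longer dominates the runtime.
CPU_BOUND_APPROXIMANTS = {"SEOBNRv4"}  # illustrative, not exhaustive


def submit_resources(approximant: str) -> dict:
    """Return HTCondor-style resource requests for one job (sketch)."""
    if approximant in CPU_BOUND_APPROXIMANTS:
        # Waveform generation dominates: ask for CPU cores, skip the GPU.
        return {"request_cpus": 4, "request_gpus": 0}
    # Likelihood evaluation dominates: a single GPU plus one core.
    return {"request_cpus": 1, "request_gpus": 1}


if __name__ == "__main__":
    print(submit_resources("SEOBNRv4"))    # {'request_cpus': 4, 'request_gpus': 0}
    print(submit_resources("IMRPhenomD"))  # {'request_cpus': 1, 'request_gpus': 1}
```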
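And for the instrumentation idea, a minimal sketch of in-process utilization sampling, assuming `psutil` and `pynvml` are available on the execute node; the class name, sampling interval, and JSON summary format are placeholders rather than whatever #16 ends up specifying:

```python
# Sketch: sample process CPU% and device GPU% on a background thread and
# emit a small summary when the run finishes. Assumes psutil and pynvml.
import json
import threading

import psutil

try:
    import pynvml
    pynvml.nvmlInit()
    _GPU = pynvml.nvmlDeviceGetHandleByIndex(0)
except Exception:  # no GPU / no NVML on this node
    _GPU = None


class UtilizationSampler:
    """Periodically record CPU and GPU utilization for the current process."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.cpu, self.gpu = [], []
        self._stop = threading.Event()
        self._proc = psutil.Process()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        self._proc.cpu_percent(None)  # prime the counter; first call is meaningless
        while not self._stop.wait(self.interval):
            self.cpu.append(self._proc.cpu_percent(None))
            if _GPU is not None:
                self.gpu.append(pynvml.nvmlDeviceGetUtilizationRates(_GPU).gpu)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
        summary = {"mean_cpu_percent": mean(self.cpu),
                   "mean_gpu_percent": mean(self.gpu),
                   "n_samples": len(self.cpu)}
        print(json.dumps(summary))  # could instead be written alongside run results


# Usage: wrap the expensive part of a run so the summary lands in the logs.
# with UtilizationSampler(interval=10.0):
#     run_likelihood_evaluation()  # placeholder for the real work
```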