can we identify (and re-route) intrinsically CPU-intensive workflows? #17

@couvares

In this OSG Slack thread, after the PRP folks observed/complained that they were seeing unusually low GPU utilization by RIFT jobs, @astroclark explained:

turns out the latest batch of RIFT jobs that were submitted are intrinsically more CPU-intensive (a more expensive waveform approximant SEOBNRv4 fwiw). The waveform generation - CPU-bound - in this case is more expensive than the likelihood calculations - the GPU part. They're aware and agree that it would make more sense to run this type of job on CPUs.

This all makes sense and is not a problem, but sparks some questions and thoughts for me:

  1. How common is this (as a proportion of all the RIFT workflows you run, over time)?
  2. In practice, do/can you know in advance which runs behave like this? Or is it something you can only really discover after you've run a workflow?
  3. Does it make sense to manually assign runs to CPUs or GPUs based on this knowledge, ad-hoc, or can it be done programmatically at either workflow generation-time or run-time, so humans aren't in the loop?
  4. Do you think it might be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report the CPU and GPU utilization of each run, after the fact, as part of its results? I'm going to turn this last question into its own ticket (idea: collect and record runtime performance data #16), because PyCBC did this a long time ago and it's been enormously helpful. It lets you set alarms if things go outside expected bounds (e.g., CPU utilization approaches zero), run reports on RIFT performance over time, have automated performance regression tests between RIFT versions, etc.
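
To make questions 3 and 4 a bit more concrete, here is a rough sketch of what a post-run utilization summary with simple alarms could look like. Everything in it is hypothetical: the names `UtilizationSample`, `RunReport`, and `summarize_run` are made up for illustration and are not part of RIFT or PyCBC, and the thresholds are placeholder assumptions rather than measured values. How the samples get collected during the run is deliberately left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class UtilizationSample:
    """One reading taken while a job runs; both values are fractions in [0, 1]."""
    cpu: float
    gpu: float


@dataclass
class RunReport:
    """Post-run summary that could be stored alongside a run's results."""
    mean_cpu: float
    mean_gpu: float
    alarms: List[str]
    suggested_resource: str  # "cpu" or "gpu"


def summarize_run(samples: List[UtilizationSample],
                  gpu_floor: float = 0.2,
                  cpu_floor: float = 0.05) -> RunReport:
    """Average the samples, raise alarms, and suggest where similar runs belong.

    gpu_floor and cpu_floor are assumed thresholds, chosen only for illustration.
    """
    if not samples:
        raise ValueError("no utilization samples recorded")

    mean_cpu = mean(s.cpu for s in samples)
    mean_gpu = mean(s.gpu for s in samples)

    alarms = []
    if mean_cpu < cpu_floor:
        alarms.append("cpu utilization near zero")
    if mean_gpu < gpu_floor:
        alarms.append("gpu utilization below floor")

    # A run that barely touches the GPU is a candidate for CPU-only slots.
    suggested = "cpu" if mean_gpu < gpu_floor else "gpu"
    return RunReport(mean_cpu, mean_gpu, alarms, suggested)


# Example: a run dominated by CPU-bound work, with the GPU mostly idle.
cpu_heavy = [UtilizationSample(cpu=0.95, gpu=0.05),
             UtilizationSample(cpu=0.90, gpu=0.10)]
report = summarize_run(cpu_heavy)
print(report.suggested_resource, report.alarms)
```

A summary like this only answers the question after the fact. Whether the same decision can be made at workflow generation time depends on whether the cost is predictable from the run configuration (for example, the waveform approximant), which is what questions 2 and 3 are asking.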


Labels: enhancement (New feature or request)
