@artursarlo artursarlo commented Nov 19, 2025

Fix PerfSpect failing in containerized environments due to missing sudo

Problem

When gProfiler ran with hardware metrics collection enabled inside a container, PerfSpect became a zombie/defunct process and collected no metrics. Investigation showed that PerfSpect tried to use sudo to elevate privileges for its perf stat commands, but sudo is typically not available in container environments.

Error observed:

{"level":"ERROR","msg":"error from perf","error":"failed to run command (sudo bash /root/perfspect_temp/perfspect.tmp.3437908981/perf_stat.sh): exec: \"sudo\": executable file not found in $PATH"}

This caused PerfSpect to terminate immediately, resulting in:

  • Zombie/defunct processes visible in ps aux
  • No hardware metrics being collected
  • Silent failure with minimal error logging
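
A defunct entry stays in ps output until the parent process collects the child's exit status. The sketch below shows one way a caller could notice an early exit, reap the child, and surface its stderr instead of failing silently. The helper name `reap_if_exited` is hypothetical and is not taken from gProfiler; it assumes the child was started with `stderr=subprocess.PIPE` and `text=True`.

```python
import subprocess
from typing import Optional, Tuple


def reap_if_exited(proc: subprocess.Popen) -> Optional[Tuple[int, str]]:
    """Return (exit_code, stderr_text) if the child has exited, else None.

    Hypothetical helper, not gProfiler code. poll() collects the exit
    status of a finished child, so no defunct entry is left behind.
    """
    if proc.poll() is None:
        return None  # still running
    # The child has exited; drain whatever it wrote to its pipes.
    _, stderr = proc.communicate()
    return proc.returncode, stderr or ""
```

A caller could invoke this shortly after launch and log the returned stderr text at error level when the result is not None.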

Solution

Added the --noroot flag to the PerfSpect command invocation in hw_metrics.py. This flag prevents PerfSpect from attempting privilege elevation via sudo, which is unnecessary when:

  • Running inside containers (typically already running as root)
  • gProfiler already has sufficient privileges for system-wide monitoring
  • The environment doesn't have sudo installed

Changes

gprofiler/hw_metrics.py

  • Added --noroot flag to PerfSpect metrics command
  • This allows PerfSpect to run with current privileges without attempting to use sudo
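
For illustration only, the change amounts to something like the sketch below. The function name, its parameters, and the way the argument list is assembled are assumptions; the actual code in hw_metrics.py is not shown in this conversation. Only the metrics command and the --noroot flag come from the description above.

```python
from typing import List


def build_perfspect_metrics_cmd(perfspect_path: str, noroot: bool = True) -> List[str]:
    """Return the argv for a PerfSpect metrics run (hypothetical helper)."""
    cmd = [perfspect_path, "metrics"]
    if noroot:
        # --noroot asks PerfSpect to run with the current privileges
        # instead of trying to elevate via sudo.
        cmd.append("--noroot")
    return cmd
```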

Testing

  • ✅ Tested in containerized environment where issue was originally observed
  • ✅ Verified PerfSpect no longer attempts to use sudo
  • ✅ Confirmed hardware metrics are now collected successfully
  • ✅ No more zombie/defunct PerfSpect processes

Impact

  • Containers: Hardware metrics collection now works properly in containerized environments
  • Non-container deployments: No impact, as the flag simply prevents unnecessary privilege escalation attempts
  • Backwards compatibility: The --noroot flag is available in recent PerfSpect versions

Related Issues

This fix addresses the zombie process issue discovered during container testing, where PerfSpect would fail immediately because the sudo executable was not found in $PATH.

@mlim19
Contributor

mlim19 commented Nov 21, 2025

The noroot option in PerfSpect requires certain system configuration for the tool to work properly: https://github.com/intel/PerfSpect#metrics-without-root-permissions
Just adding the option in hw_metrics.py without checking whether those requirements are met is not the right approach, IMO. @harp-intel, can you explain the expected PerfSpect behavior if the requirements are not met?

@harp-intel

As seen in the PerfSpect readme, there are three requirements for the PerfSpect metrics command to work when using the --noroot option:

  • sysctl -w kernel.perf_event_paranoid=0: This one is required for perf stat.
  • sysctl -w kernel.nmi_watchdog=0: If the watchdog is not disabled, the cpu-cycles fixed counter will not be available to PerfSpect, so it will use a general-purpose counter to collect the cpu-cycles event. This results in additional event groups and, in theory, less accurate metrics due to additional multiplexing.
  • Write '125' to all perf_event_mux_interval_ms files found under /sys/devices/: '125' is a known reasonable interval for collecting the number of events generally collected by PerfSpect. If it is not set, metric accuracy may be impacted and/or some metrics will not be produced due to missing event data.
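
As a rough sketch, the three settings could be inspected from Python before deciding whether to pass the flag. Everything here is illustrative: the function names are hypothetical, the sysctls are read through their standard /proc/sys locations, the mux interval files are assumed to sit one directory below /sys/devices/, and treating a perf_event_paranoid value at or below 0 as sufficient is an assumption beyond the documented value of 0.

```python
import glob
import os
from typing import Dict, Optional


def _read_int(path: str) -> Optional[int]:
    """Read a single integer from a file, or None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def check_noroot_prerequisites(
    proc_sys: str = "/proc/sys", sys_devices: str = "/sys/devices"
) -> Dict[str, bool]:
    """Report which of the three settings are in place (hypothetical helper)."""
    paranoid = _read_int(os.path.join(proc_sys, "kernel", "perf_event_paranoid"))
    watchdog = _read_int(os.path.join(proc_sys, "kernel", "nmi_watchdog"))
    # Assumption: the interval files live at /sys/devices/<pmu>/perf_event_mux_interval_ms.
    mux_files = glob.glob(os.path.join(sys_devices, "*", "perf_event_mux_interval_ms"))
    mux_values = [_read_int(p) for p in mux_files]
    return {
        # Assumption: values at or below 0 are at least as permissive as 0.
        "perf_event_paranoid": paranoid is not None and paranoid <= 0,
        "nmi_watchdog": watchdog == 0,
        "mux_interval": bool(mux_values) and all(v == 125 for v in mux_values),
    }
```

The root directories are parameters so the check can be pointed at a fixture directory in tests rather than the live system.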

@mlim19
Contributor

mlim19 commented Nov 21, 2025

Thank you. It seems the first one is mandatory and the other two are recommended. @artursarlo, are you able to add a check for them before adding the --noroot option? Otherwise, we should bail out of collection if it's running in a container.
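
One possible reading of that request, sketched as a pure decision function. The enum, the function name, and both boolean inputs are hypothetical; how gProfiler would actually determine that it is running in a container, or that kernel.perf_event_paranoid is acceptable, is left open here.

```python
import enum


class PerfSpectMode(enum.Enum):
    NOROOT = "noroot"    # pass --noroot to PerfSpect
    DEFAULT = "default"  # leave the invocation unchanged
    SKIP = "skip"        # do not start hardware metrics collection


def choose_perfspect_mode(paranoid_ok: bool, in_container: bool) -> PerfSpectMode:
    """Pick how to run PerfSpect (hypothetical helper).

    paranoid_ok: the mandatory kernel.perf_event_paranoid setting is in place.
    in_container: the caller has determined it is running in a container.
    """
    if paranoid_ok:
        return PerfSpectMode.NOROOT
    if in_container:
        # The mandatory setting is missing and sudo is unlikely to exist,
        # so bail out rather than let PerfSpect fail.
        return PerfSpectMode.SKIP
    return PerfSpectMode.DEFAULT
```

The two recommended settings are deliberately not inputs here, since they affect metric accuracy rather than whether collection can run at all; a caller might log a warning when they are missing.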
