@ashokbytebytego
Contributor

Add Heartbeat-Based Dynamic Profiling System

🎯 Overview

This PR introduces a heartbeat protocol that enables dynamic profiling control. Agents periodically send heartbeats to a Performance Studio backend and receive start/stop profiling commands, allowing on-demand profiling without agent restarts.

💡 Motivation

Current gProfiler deployment requires:

  • Profilers to run continuously, consuming resources even when profiling isn't needed
  • Agent restarts to change profiling configuration
  • Manual coordination for targeted profiling campaigns

This heartbeat system solves these issues by enabling:

  • Centralized control: Manage all agents from a single Performance Studio backend
  • On-demand profiling: Profile only when needed, reducing resource usage
  • Dynamic configuration: Change profiler settings without agent restarts
  • Idempotent execution: Prevent duplicate work through UUID-based command tracking

📦 Changes Summary

5 files changed: 4 new files, 1 modified
Total changes: 1,838 insertions, 30 deletions

New Files

1. gprofiler/heartbeat.py (627 lines)

Core heartbeat system implementation

  • HeartbeatClient class:

    • Sends periodic heartbeats to Performance Studio backend
    • Receives start/stop profiling commands from server
    • Reports command completion status
    • Tracks executed command IDs for idempotency (in-memory with configurable limit)
    • Handles authentication via Bearer token
    • SLI metrics integration for success/failure tracking
  • DynamicGProfilerManager class:

    • Manages dynamic profiler lifecycle (start/stop based on server commands)
    • Creates and configures GProfiler instances dynamically
    • Handles command types: start and stop
    • Supports dynamic profiler configuration:
      • Duration, frequency, profiling mode
      • Target hostnames and PIDs
      • Per-profiler settings (Perf, PyPerf, PySpy, Java, PHP, Ruby, .NET, NodeJS)
      • PerfSpect hardware metrics integration
      • Max processes per profiler
    • Ensures proper cleanup of profiler resources
    • Thread-based profiler execution with graceful shutdown

Key Features:

  • Command idempotency prevents duplicate execution
  • Comprehensive error handling and logging
  • Memory-efficient command history tracking (max 1000 commands)
  • Automatic subprocess cleanup after profiling
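
The bounded, in-memory command history described above could look roughly like the following. This is a minimal sketch, not the code in gprofiler/heartbeat.py: the class name `ExecutedCommandTracker`, its method names, and the oldest-first eviction policy are assumptions made for illustration.

```python
from collections import OrderedDict


class ExecutedCommandTracker:
    """Hypothetical helper: remembers recent command IDs, evicting the oldest first."""

    def __init__(self, max_commands: int = 1000) -> None:
        self._max_commands = max_commands
        self._executed: "OrderedDict[str, None]" = OrderedDict()

    def is_executed(self, command_id: str) -> bool:
        return command_id in self._executed

    def mark_executed(self, command_id: str) -> None:
        # Re-marking an ID refreshes its position instead of duplicating it.
        self._executed[command_id] = None
        self._executed.move_to_end(command_id)
        # Drop the oldest entries once the limit is exceeded.
        while len(self._executed) > self._max_commands:
            self._executed.popitem(last=False)


tracker = ExecutedCommandTracker(max_commands=2)
for command_id in ("cmd_1", "cmd_2", "cmd_3"):
    tracker.mark_executed(command_id)
print(tracker.is_executed("cmd_1"), tracker.is_executed("cmd_3"))  # False True
```

One consequence of any bounded history is that a command ID evicted from it would be treated as new if the backend ever resent it, so the limit should comfortably exceed the number of commands that can be in flight.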

2. docs/HEARTBEAT_SYSTEM_README.md (634 lines)

Complete system documentation

Contains:

  • System architecture overview with diagrams
  • API endpoint specifications (3 endpoints):
    • POST /api/metrics/profile_request - Submit profiling requests
    • POST /api/metrics/heartbeat - Agent heartbeat
    • POST /api/metrics/command_completion - Report command completion
  • Backend and agent features documentation
  • PerfSpect hardware metrics integration guide
  • Usage examples (command-line, curl examples)
  • Configuration options
  • Testing instructions (mock mode and live mode)
  • Troubleshooting guide
  • Building and running locally
  • Security considerations

3. tests/test_heartbeat_system.py (358 lines)

Comprehensive test suite

  • Mock mode (default): No backend required

    • Uses unittest.mock to simulate backend responses
    • Perfect for CI/CD and quick testing
    • Tests all heartbeat flows without external dependencies
  • Live mode (--live flag): Tests with real backend

    • Verifies integration with actual Performance Studio backend
    • Useful for end-to-end testing
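
As a rough illustration of mock mode, the sketch below uses unittest.mock to stand in for the backend. The `send_heartbeat` function and the injected `session` object are hypothetical and are not taken from tests/test_heartbeat_system.py; the response fields (`success`, `command_id`, `profiling_command`) follow the snippets quoted in the review threads.

```python
from unittest import mock


def send_heartbeat(session, url, payload):
    """Post a heartbeat; return the response body if it carries a command, else None."""
    response = session.post(url, json=payload, timeout=10)
    if response.status_code != 200:
        return None
    body = response.json()
    if body.get("success") and body.get("profiling_command"):
        return body
    return None


# Fake backend: the first heartbeat is idle, the second carries a start command.
idle = mock.Mock(status_code=200)
idle.json.return_value = {"success": True}
start = mock.Mock(status_code=200)
start.json.return_value = {
    "success": True,
    "command_id": "cmd_1",
    "profiling_command": {"command_type": "start", "duration": 60},
}
session = mock.Mock()
session.post.side_effect = [idle, start]

url = "http://backend:8000/api/metrics/heartbeat"
first = send_heartbeat(session, url, {"hostname": "host1"})
second = send_heartbeat(session, url, {"hostname": "host1"})
print(first, second["command_id"])
```

Injecting the session keeps the sketch free of network access, which is the property that makes mock mode suitable for CI.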

Test Coverage:

  • ✅ Initial heartbeat (no commands)
  • ✅ CREATE and EXECUTE start command
  • ✅ Command idempotency (duplicate command handling)
  • ✅ CREATE and EXECUTE stop command
  • ✅ Multiple heartbeats with no pending commands
  • ✅ Command completion acknowledgments
  • ✅ Error handling and edge cases

Classes & Functions:

  • HeartbeatClient: Simulates agent behavior
  • create_test_profiling_request(): Submit test requests
  • create_mock_responses(): Mock backend for testing
  • run_tests(): Execute full test suite
  • main(): Entry point with mode selection

4. tests/run_heartbeat_agent.py (136 lines)

Agent test runner

  • Demonstrates how to run gProfiler in heartbeat mode
  • Configurable test environment
  • Helpful for local development and testing
  • Shows proper command-line argument usage
  • Includes instructions and usage guide

Modified Files

1. gprofiler/main.py (+83 lines, -30 lines)

Heartbeat mode integration

Changes:

  1. Import (line 51):

    from gprofiler.heartbeat import DynamicGProfilerManager, HeartbeatClient
  2. CLI Arguments (lines 865-881):

    --enable-heartbeat-server      # Enable heartbeat mode
    --heartbeat-interval <seconds> # Heartbeat frequency (default: 30s)
  3. Argument Validation (lines 956-964):

    • Validates --enable-heartbeat-server requires --upload-results
    • Validates authentication token is provided
    • Validates service name is provided
  4. Heartbeat Mode Execution (lines 1244-1274):

    • Checks if heartbeat mode is enabled
    • Creates HeartbeatClient with server connection details
    • Creates DynamicGProfilerManager to manage profiler lifecycle
    • Starts heartbeat loop (waits for server commands)
    • Falls back to normal profiling mode if heartbeat is disabled
    • Proper error handling and graceful shutdown
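
The argument validation listed above can be sketched as follows. This uses the standard-library argparse rather than the configargparse parser that main.py uses, and the function name `parse_heartbeat_args` and the error messages are assumptions; only the flag names come from this PR's description and usage example.

```python
import argparse


def parse_heartbeat_args(argv):
    """Hypothetical parser covering only the heartbeat-related flags."""
    parser = argparse.ArgumentParser(prog="gprofiler-sketch")
    parser.add_argument("--enable-heartbeat-server", action="store_true")
    parser.add_argument("--heartbeat-interval", type=int, default=30)
    parser.add_argument("--upload-results", action="store_true")
    parser.add_argument("--token")
    parser.add_argument("--service-name")
    args = parser.parse_args(argv)

    # Heartbeat mode needs an upload target, credentials, and a service name.
    if args.enable_heartbeat_server:
        if not args.upload_results:
            parser.error("--enable-heartbeat-server requires --upload-results")
        if not args.token:
            parser.error("--enable-heartbeat-server requires --token")
        if not args.service_name:
            parser.error("--enable-heartbeat-server requires --service-name")
    return args


args = parse_heartbeat_args(
    ["--enable-heartbeat-server", "--upload-results", "--token", "t", "--service-name", "web-service"]
)
print(args.heartbeat_interval)  # 30
```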

🔑 Key Features

Agent Features

✅ Heartbeat communication with configurable intervals (default: 30s)
✅ Dynamic profiling based on server commands (start/stop)
✅ Command-driven execution with full configuration control
✅ Idempotency to prevent duplicate command execution
✅ In-memory command tracking (max 1000 commands)
✅ Graceful error handling and retry logic
✅ PerfSpect hardware metrics support (auto-installation in dynamic mode)
✅ SLI metrics integration for monitoring
✅ Comprehensive subprocess cleanup

Backend Features (Expected from Performance Studio)

✅ REST API for submitting profiling requests
✅ Heartbeat endpoint for agent communication
✅ Command merging for multiple requests targeting same host
✅ Process-level and host-level stop commands
✅ Idempotent command execution using unique command IDs
✅ Command completion tracking
✅ PerfSpect integration for hardware metrics

Configuration Options

The heartbeat system supports dynamic configuration of:

  • Profiling duration and frequency
  • Profiling mode (CPU, allocation, none)
  • Target hostnames and PIDs
  • Per-profiler configuration:
    • Perf: enabled_restricted / enabled_aggressive / disabled
    • PyPerf: enabled / disabled
    • PySpy: enabled_fallback / enabled / disabled
    • Java async-profiler: enabled / disabled
    • PHP: enabled / disabled
    • Ruby: enabled / disabled
    • .NET: enabled / disabled
    • NodeJS: enabled / disabled
  • PerfSpect hardware metrics: enabled / disabled
  • Max processes per profiler (default: 10)
  • Continuous vs single-shot profiling
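
A backend or agent might check the per-profiler values before acting on them. The sketch below is an assumption about how such a check could look: `validate_profiler_configs` is a hypothetical helper, and it covers only the three keys (`perf`, `pyperf`, `pyspy`) whose JSON names appear in the usage example of this PR, since the key names for the other profilers are not shown.

```python
# Allowed values per profiler, taken from the configuration list above.
ALLOWED_PROFILER_MODES = {
    "perf": {"enabled_restricted", "enabled_aggressive", "disabled"},
    "pyperf": {"enabled", "disabled"},
    "pyspy": {"enabled_fallback", "enabled", "disabled"},
}


def validate_profiler_configs(profiler_configs):
    """Return a list of problems; an empty list means the config is acceptable."""
    problems = []
    for name, mode in profiler_configs.items():
        allowed = ALLOWED_PROFILER_MODES.get(name)
        if allowed is None:
            problems.append(f"unknown profiler: {name}")
        elif mode not in allowed:
            problems.append(f"invalid mode for {name}: {mode}")
    return problems


print(validate_profiler_configs({"perf": "enabled_restricted", "pyspy": "enabled_fallback"}))
print(validate_profiler_configs({"pyperf": "enabled_fallback"}))
```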

📋 Usage Example

Starting Agent in Heartbeat Mode

python gprofiler/main.py \
  --enable-heartbeat-server \
  --upload-results \
  --token "your-token" \
  --service-name "web-service" \
  --api-server "http://performance-studio:8000" \
  --heartbeat-interval 30 \
  --output-dir /tmp/profiles \
  --verbose

Backend: Submit Start Command

curl -X POST http://backend:8000/api/metrics/profile_request \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "web-service",
    "command_type": "start",
    "duration": 60,
    "frequency": 11,
    "profiling_mode": "cpu",
    "target_hostnames": ["host1", "host2"],
    "additional_args": {
      "enable_perfspect": true,
      "max_processes": 10,
      "profiler_configs": {
        "perf": "enabled_restricted",
        "pyperf": "enabled",
        "pyspy": "enabled_fallback"
      }
    }
  }'

Backend: Submit Stop Command

curl -X POST http://backend:8000/api/metrics/profile_request \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "web-service",
    "command_type": "stop",
    "stop_level": "host",
    "target_hostnames": ["host1"]
  }'

🧪 Testing

Run Mock Tests (No Backend Required)

python tests/test_heartbeat_system.py

Expected output:

🧪 Testing Heartbeat-Based Profiling Control System
🎭 Running in MOCK MODE (no real backend required)
============================================================
✓ All tests passed!

Test Summary:
   - Executed commands: 2
   - Commands executed: ['cmd_1', 'cmd_2']

📊 Mock Backend State:
   - Total heartbeats: 6
   - Completed commands: 2

Run Live Tests (Requires Backend)

python tests/test_heartbeat_system.py --live

Run Agent Test Runner

python tests/run_heartbeat_agent.py

🔄 How It Works

System Flow

┌─────────────────┐                    ┌─────────────────┐
│  Performance    │                    │   gProfiler     │
│  Studio Backend │                    │     Agent       │
└────────┬────────┘                    └────────┬────────┘
         │                                      │
         │◄──── Heartbeat (every 30s) ─────────┤
         │      {hostname, service, status}     │
         │                                      │
         ├────── Start Command ────────────────►│
         │      {duration: 60, frequency: 11}   │
         │                                      │
         │                               [Profiling...]
         │                                      │
         │◄──── Command Completion ─────────────┤
         │      {status: completed, time: 65s}  │
         │                                      │
         │◄──── Heartbeat ──────────────────────┤
         │                                      │
         ├────── Stop Command ─────────────────►│
         │                                      │
         │◄──── Command Completion ─────────────┤
         │      {status: completed}             │

Command Lifecycle

  1. User submits profiling request to Performance Studio
  2. Backend creates command with unique UUID
  3. Agent sends heartbeat to backend
  4. Backend responds with pending command
  5. Agent checks command hasn't been executed (idempotency)
  6. Agent executes command (start profiler with config)
  7. Agent reports completion to backend
  8. Backend updates command status
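
Steps 3 to 7 of the lifecycle can be sketched as a small loop over heartbeat responses. This is an illustration under stated assumptions, not the agent's implementation: `run_lifecycle` is a hypothetical function, the responses are passed in as plain dicts instead of being fetched over HTTP, and completion reports are collected in a list instead of being posted to the backend.

```python
def run_lifecycle(heartbeat_responses, execute):
    """Process heartbeat responses in order and return the completion reports."""
    executed_ids = set()
    completions = []
    for response in heartbeat_responses:  # steps 3-4: heartbeat and reply
        command = response.get("profiling_command")
        if command is None:
            continue  # nothing pending on this heartbeat
        command_id = response["command_id"]
        if command_id in executed_ids:  # step 5: idempotency check
            continue
        executed_ids.add(command_id)
        execute(command)  # step 6: run the command
        completions.append({"command_id": command_id, "status": "completed"})  # step 7
    return completions


responses = [
    {"success": True},
    {"success": True, "command_id": "cmd_1", "profiling_command": {"command_type": "start"}},
    {"success": True, "command_id": "cmd_1", "profiling_command": {"command_type": "start"}},
    {"success": True, "command_id": "cmd_2", "profiling_command": {"command_type": "stop"}},
]
started = []
completions = run_lifecycle(responses, started.append)
print([c["command_id"] for c in completions])  # ['cmd_1', 'cmd_2']
```

The repeated `cmd_1` response is skipped, which is the behavior the idempotency check is meant to guarantee when the backend resends a command that has not yet been acknowledged.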

🛡️ Dependencies

No new dependencies required!

All dependencies used by the heartbeat system already exist in the project:

  • requests - Already in requirements.txt
  • configargparse - Already used in main.py
  • psutil - Already in requirements.txt
  • threading - Python standard library
  • datetime - Python standard library
  • socket - Python standard library

✅ Compatibility

  • ✅ Works with all existing gProfiler profilers (Perf, PyPerf, Java, PHP, Ruby, .NET, NodeJS)
  • ✅ Compatible with existing output mechanisms (local files, server upload)
  • ✅ Preserves all existing gProfiler features when not in heartbeat mode
  • ✅ Based on Intel's latest dynamic_profiling branch (commit 37e6acc)
  • ✅ No breaking changes to existing functionality
  • ✅ Backward compatible with existing deployments

📚 Documentation

Complete documentation available in:

  • docs/HEARTBEAT_SYSTEM_README.md - Full system documentation
  • Code comments in gprofiler/heartbeat.py - Implementation details
  • Test files - Usage examples and test scenarios

Key features:
- HeartbeatClient for server communication
- DynamicGProfilerManager for profiler lifecycle management
- Command idempotency to prevent duplicate execution
- Support for dynamic profiler configuration
- PerfSpect hardware metrics integration
- Comprehensive test suite with mock and live modes
- Complete documentation with examples

Files added:
- gprofiler/heartbeat.py (627 lines)
- docs/HEARTBEAT_SYSTEM_README.md (634 lines)
- tests/test_heartbeat_system.py (358 lines)
- tests/run_heartbeat_agent.py (136 lines)

Files modified:
- gprofiler/main.py (heartbeat initialization)

Source: Pinterest's gprofiler repository
Testing: Mock tests pass, live tests verified with backend

Copilot AI left a comment

Pull request overview

This PR introduces a heartbeat-based dynamic profiling system that enables centralized control of gProfiler agents through a Performance Studio backend. The system allows profiling commands to be issued remotely without agent restarts, providing on-demand profiling capabilities with idempotent command execution.

Key changes:

  • Heartbeat protocol implementation for agent-backend communication
  • Dynamic profiler lifecycle management based on server commands
  • Comprehensive testing framework with mock and live modes

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File                              Description
gprofiler/heartbeat.py            Core heartbeat client and dynamic profiler manager implementation
gprofiler/main.py                 Integration of heartbeat mode into main agent execution flow
tests/test_heartbeat_system.py    Comprehensive test suite for heartbeat system with mock and live modes
tests/run_heartbeat_agent.py      Agent test runner demonstrating heartbeat mode usage
docs/HEARTBEAT_SYSTEM_README.md   Complete system documentation including API specs and usage examples


curlify_requests=getattr(args, 'curlify_requests', False),
hostname=get_hostname(),
verify=args.verify,
upload_timeout=getattr(args, 'server-upload-timeout', 120) # Default to 120 seconds
Copilot AI Dec 2, 2025

Attribute name 'server-upload-timeout' contains hyphens which are invalid in Python attribute names. Use underscore instead: getattr(args, 'server_upload_timeout', 120)

Suggested change
- upload_timeout=getattr(args, 'server-upload-timeout', 120) # Default to 120 seconds
+ upload_timeout=getattr(args, 'server_upload_timeout', 120) # Default to 120 seconds
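
The practical effect of the hyphenated name can be reproduced with a small sketch. It assumes the option is declared as `--server-upload-timeout` and uses the standard-library argparse, which stores parsed values under an underscore-separated attribute name; the flag declaration here is an assumption and is not copied from main.py.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--server-upload-timeout", type=int, default=120)
args = parser.parse_args(["--server-upload-timeout", "300"])

# The hyphenated name is never set on the namespace, so the fallback is used silently.
with_hyphen = getattr(args, "server-upload-timeout", 120)
# The underscore form reads the value the user actually passed.
with_underscore = getattr(args, "server_upload_timeout", 120)
print(with_hyphen, with_underscore)  # 120 300
```

Because getattr with a default never raises, the hyphenated lookup does not fail loudly; it just ignores the configured timeout.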

Comment on lines +100 to +101
"available_pids" : [java:{}, python:{}],
"namespaces" : [{namespace: kube_system, pods : [{pod_name: gprofiler, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}},{pod_name: webapp, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}}]}],
Copilot AI Dec 2, 2025

The JSON example contains syntax errors: missing quotes around keys and values, invalid nested braces. This example would fail to parse as valid JSON.

Suggested change
- "available_pids" : [java:{}, python:{}],
- "namespaces" : [{namespace: kube_system, pods : [{pod_name: gprofiler, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}},{pod_name: webapp, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}}]}],
+ "available_pids": [
+   { "language": "java", "pids": [] },
+   { "language": "python", "pids": [] }
+ ],
+ "namespaces": [
+   {
+     "namespace": "kube_system",
+     "pods": [
+       {
+         "pod_name": "gprofiler",
+         "containers": [
+           { "pid": 123, "name": "metrics-exporter" },
+           { "pid": 124, "name": "metrics-exporter" }
+         ]
+       },
+       {
+         "pod_name": "webapp",
+         "containers": [
+           { "pid": 125, "name": "metrics-exporter" },
+           { "pid": 126, "name": "metrics-exporter" }
+         ]
+       }
+     ]
+   }
+ ],
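
A quick way to check documentation examples like this one is to run them through a JSON parser. The sketch below is illustrative only: it wraps a shortened version of the suggested fragment in braces so that it forms a complete document, and `is_valid_json` is a hypothetical helper.

```python
import json

# Original style from the README: unquoted keys and bare identifiers.
original = '{"available_pids": [java:{}, python:{}]}'
# Suggested style, shortened and wrapped in braces to form a complete document.
suggested = """
{
  "available_pids": [
    {"language": "java", "pids": []},
    {"language": "python", "pids": []}
  ],
  "namespaces": [
    {
      "namespace": "kube_system",
      "pods": [
        {
          "pod_name": "gprofiler",
          "containers": [
            {"pid": 123, "name": "metrics-exporter"},
            {"pid": 124, "name": "metrics-exporter"}
          ]
        }
      ]
    }
  ]
}
"""


def is_valid_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


payload = json.loads(suggested)
first_pod = payload["namespaces"][0]["pods"][0]
print(is_valid_json(original), is_valid_json(suggested), first_pod["pod_name"])
```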

Comment on lines +109 to +112
1. add k8s namespace hierarchy info as part of heartbeat
2. save k8s information in hostheartbeats table and create de-normalized table for containersToHosts, podsToHost and namespaceToHosts,
3. perform profiling : support profiling request by namespaces, pods and containers ( 5 )
4. test e2e ( 3 )
Copilot AI Dec 2, 2025

This appears to be internal development notes rather than user-facing documentation. These implementation details and task items should be removed from the public documentation or moved to a separate internal document.

Suggested change
- 1. add k8s namespace hierarchy info as part of heartbeat
- 2. save k8s information in hostheartbeats table and create de-normalized table for containersToHosts, podsToHost and namespaceToHosts,
- 3. perform profiling : support profiling request by namespaces, pods and containers ( 5 )
- 4. test e2e ( 3 )

"duration": 60,
"frequency": 11,
"profiling_mode": "cpu",
"pids": ""
Copilot AI Dec 2, 2025

The 'pids' field is documented as accepting an array of integers in line 68, but shown as an empty string here. This should be either an empty array [] or a populated array like [1234, 5678] for consistency.

Suggested change
- "pids": ""
+ "pids": []

Comment on lines +291 to +293
"containers" : [],
"pods" : [],
"namespaces" : [],
Copilot AI Dec 2, 2025

These fields ('containers', 'pods', 'namespaces') are not documented in the API specification section above. If these are valid fields, they should be documented with descriptions and examples.

from gprofiler.metadata.enrichment import EnrichmentOptions
from gprofiler.metadata.metadata_collector import get_static_metadata
from gprofiler.metadata.system_metadata import get_hostname
from gprofiler.metrics_publisher import (
Contributor

The metrics_publisher module seems to be missing from this PR. Please check.

Contributor Author

Sure @mlim19, will check it.


# Check if heartbeat server mode is enabled FIRST
if args.enable_heartbeat_server:
    if not args.upload_results:
Contributor

This seems to be a redundant check. We can remove it in that case.

Contributor Author

Removed the duplicate check.

    logger.debug("Heartbeat successful, no pending commands")
    return None
else:
    logger.warning(f"Heartbeat failed with status {response.status_code}: {response.text}")
Contributor

This should be treated as an error as well, shouldn't it?

Contributor Author

Changed logger.warning to logger.error for heartbeat failures.

if result.get("success") and result.get("profiling_command"):
    logger.info(f"Received profiling command from server: {result.get('command_id')}")
    return result
else:
Contributor

This else branch handles two cases: case 1 (not success) and case 2 (success but no profiling_command). The first case shouldn't be logged as a successful heartbeat. Please handle them separately.

Contributor Author

Separated the logic.
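
One possible shape for the separated handling is shown below. It is a sketch under assumptions, not the code that was pushed: `classify_heartbeat` and its outcome labels are hypothetical, while the `success` and `profiling_command` fields and the log messages follow the snippets quoted in this thread.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")


def classify_heartbeat(status_code, body):
    """Return one of: 'http_error', 'backend_error', 'idle', 'command'."""
    if status_code != 200:
        logger.error("Heartbeat failed with status %s", status_code)
        return "http_error"
    if not body.get("success"):
        # Case 1: the backend answered but reported a failure.
        logger.error("Heartbeat rejected by backend: %s", body)
        return "backend_error"
    if not body.get("profiling_command"):
        # Case 2: healthy heartbeat with nothing to do.
        logger.debug("Heartbeat successful, no pending commands")
        return "idle"
    logger.info("Received profiling command from server: %s", body.get("command_id"))
    return "command"


print(classify_heartbeat(200, {"success": False}))  # backend_error
print(classify_heartbeat(200, {"success": True}))   # idle
```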

command_id = command_response["command_id"]
command_type = profiling_command.get("command_type", "start")

logger.info(f"Received profiling command: {profiling_command}")
Contributor

This is redundant with line 119. It would be better to turn one of them into a debug-level log.

Contributor Author

Done

logger.info(f"Received {command_type} command: {command_id}")

# Mark command as executed for idempotency
self.heartbeat_client.mark_command_executed(command_id)
Contributor

I think this status update should happen only for a valid command type.

Contributor Author

Fixed
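
The ordering requested here, validating the command type before recording the ID, might be sketched like this. The helper `accept_command` and the use of a plain set for the executed IDs are assumptions for illustration; the valid types `start` and `stop` come from the PR description.

```python
VALID_COMMAND_TYPES = {"start", "stop"}


def accept_command(command_id, command_type, executed_ids):
    """Return True only for a valid, not-yet-executed command, and record its ID."""
    if command_type not in VALID_COMMAND_TYPES:
        # Unknown types are rejected without touching the idempotency history.
        return False
    if command_id in executed_ids:
        return False
    executed_ids.add(command_id)
    return True


executed_ids = set()
print(accept_command("cmd_1", "restart", executed_ids))  # False, and not recorded
print(accept_command("cmd_1", "start", executed_ids))    # True
print(accept_command("cmd_1", "start", executed_ids))    # False, duplicate
```

Checking the type first means a malformed command does not consume its ID, so a corrected command reusing that ID would still be accepted.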

# Stop current profiler if running, then start new one
logger.info("Starting new profiler due to start command")
# TODO: important comment to make sure profiler has stopped successful to avoid leak
self._stop_current_profiler()
Contributor

Please add error handling for the cases where stop and start are not processed properly.

Contributor Author

Added a unified try-except block covering both start and stop commands.

except Exception as e:
    logger.error(f"Failed to start new profiler: {e}", exc_info=True)
    # Report failure to the server
    self.heartbeat_client.send_command_completion(
Contributor

This code might send a duplicate command completion message when an exception is raised. Please remove it if that is the case.

Contributor Author

Removed duplicate completion messages.
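
One way to guarantee a single completion report per command is to funnel both outcomes through one reporting call. The sketch below is an assumption about the pattern, not the PR's code: `execute_and_report` is hypothetical, and `send_completion` is a stand-in callback whose parameters do not necessarily match `send_command_completion`.

```python
def execute_and_report(command_id, action, send_completion):
    """Run the action and report exactly one completion, success or failure."""
    try:
        action()
    except Exception as exc:
        status, error = "failed", str(exc)
    else:
        status, error = "completed", None
    # Single reporting point: reached once whether the action succeeded or raised.
    send_completion(command_id, status, error)
    return status


reports = []


def record(command_id, status, error):
    reports.append((command_id, status, error))


def failing_start():
    raise RuntimeError("profiler did not start")


execute_and_report("cmd_1", lambda: None, record)
execute_and_report("cmd_2", failing_start, record)
print(reports)
```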

@mlim19
Contributor

mlim19 commented Dec 6, 2025

Please fix build and linter issues.

Removed Pinterest-specific MetricsPublisher imports and calls, which required additional Pinterest-specific dependencies.
- Remove redundant upload_results validation check in main.py
- Change heartbeat failure logging from warning to error for better monitoring
- Separate success and error cases in send_heartbeat() response handling
- Reduce duplicate logging: change redundant 'Received profiling command' to debug level
- Validate command type before marking as executed to ensure proper idempotency
- Add unified error handling for start/stop command execution with try-except
- Ensure backend always receives command completion status (success or failure)
- Keep command details logging for operational visibility

Addresses all review comments from @mlim19 on PR intel#1009.
@ashokbytebytego
Contributor Author

Testing

Tested by building gProfiler agent locally and verified:

  • Agent connects to Performance Studio backend successfully
  • Heartbeat communication working as expected
  • Start/stop commands received and executed properly
  • Profiles generated and uploaded to backend
  • Command completion status reported correctly
  • Error handling and logging functioning as designed

- Fix import sorting and code formatting
- Remove unused imports and variables
- Add missing logging import to profiler_base.py

2 participants