-
Notifications
You must be signed in to change notification settings - Fork 71
Add heartbeat-based dynamic profiling system to intel repo #1009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dynamic_profiling
Are you sure you want to change the base?
Add heartbeat-based dynamic profiling system to intel repo #1009
Conversation
This PR introduces a heartbeat protocol that enables dynamic profiling control. Agents periodically send heartbeats to a Performance Studio backend and receive start/stop profiling commands, allowing on-demand profiling without agent restarts. Key features: - HeartbeatClient for server communication - DynamicGProfilerManager for profiler lifecycle management - Command idempotency to prevent duplicate execution - Support for dynamic profiler configuration - PerfSpect hardware metrics integration - Comprehensive test suite with mock and live modes - Complete documentation with examples Files added: - gprofiler/heartbeat.py (627 lines) - docs/HEARTBEAT_SYSTEM_README.md (634 lines) - tests/test_heartbeat_system.py (358 lines) - tests/run_heartbeat_agent.py (136 lines) Files modified: - gprofiler/main.py (heartbeat initialization) Source: Pinterest's gprofiler repository Testing: Mock tests pass, live tests verified with backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a heartbeat-based dynamic profiling system that enables centralized control of gProfiler agents through a Performance Studio backend. The system allows profiling commands to be issued remotely without agent restarts, providing on-demand profiling capabilities with idempotent command execution.
Key changes:
- Heartbeat protocol implementation for agent-backend communication
- Dynamic profiler lifecycle management based on server commands
- Comprehensive testing framework with mock and live modes
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
gprofiler/heartbeat.py |
Core heartbeat client and dynamic profiler manager implementation |
gprofiler/main.py |
Integration of heartbeat mode into main agent execution flow |
tests/test_heartbeat_system.py |
Comprehensive test suite for heartbeat system with mock and live modes |
tests/run_heartbeat_agent.py |
Agent test runner demonstrating heartbeat mode usage |
docs/HEARTBEAT_SYSTEM_README.md |
Complete system documentation including API specs and usage examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
gprofiler/heartbeat.py
Outdated
| curlify_requests=getattr(args, 'curlify_requests', False), | ||
| hostname=get_hostname(), | ||
| verify=args.verify, | ||
| upload_timeout=getattr(args, 'server-upload-timeout', 120) # Default to 120 seconds |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attribute name 'server-upload-timeout' contains hyphens which are invalid in Python attribute names. Use underscore instead: getattr(args, 'server_upload_timeout', 120)
| upload_timeout=getattr(args, 'server-upload-timeout', 120) # Default to 120 seconds | |
| upload_timeout=getattr(args, 'server_upload_timeout', 120) # Default to 120 seconds |
| "available_pids" : [java:{}, python:{}], | ||
| "namespaces" : [{namespace: kube_system, pods : [{pod_name: gprofiler, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}},{pod_name: webapp, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}}]}], |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON example contains syntax errors: missing quotes around keys and values, invalid nested braces. This example would fail to parse as valid JSON.
| "available_pids" : [java:{}, python:{}], | |
| "namespaces" : [{namespace: kube_system, pods : [{pod_name: gprofiler, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}},{pod_name: webapp, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}}]}], | |
| "available_pids": [ | |
| { "language": "java", "pids": [] }, | |
| { "language": "python", "pids": [] } | |
| ], | |
| "namespaces": [ | |
| { | |
| "namespace": "kube_system", | |
| "pods": [ | |
| { | |
| "pod_name": "gprofiler", | |
| "containers": [ | |
| { "pid": 123, "name": "metrics-exporter" }, | |
| { "pid": 124, "name": "metrics-exporter" } | |
| ] | |
| }, | |
| { | |
| "pod_name": "webapp", | |
| "containers": [ | |
| { "pid": 125, "name": "metrics-exporter" }, | |
| { "pid": 126, "name": "metrics-exporter" } | |
| ] | |
| } | |
| ] | |
| } | |
| ], |
| 1. add k8s namespace hierarchy info as part of heartbeat | ||
| 2. save k8s information in hostheartbeats table and create de-normalized table for containersToHosts, podsToHost and namespaceToHosts, | ||
| 3. perform profiling : support profiling request by namespaces, pods and containers ( 5 ) | ||
| 4. test e2e ( 3 ) |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be internal development notes rather than user-facing documentation. These implementation details and task items should be removed from the public documentation or moved to a separate internal document.
| 1. add k8s namespace hierarchy info as part of heartbeat | |
| 2. save k8s information in hostheartbeats table and create de-normalized table for containersToHosts, podsToHost and namespaceToHosts, | |
| 3. perform profiling : support profiling request by namespaces, pods and containers ( 5 ) | |
| 4. test e2e ( 3 ) |
| "duration": 60, | ||
| "frequency": 11, | ||
| "profiling_mode": "cpu", | ||
| "pids": "" |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'pids' field is documented as accepting an array of integers in line 68, but shown as an empty string here. This should be either an empty array [] or a populated array like [1234, 5678] for consistency.
| "pids": "" | |
| "pids": [] |
| "containers" : [], | ||
| "pods" : [], | ||
| "namespaces" : [], |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields ('containers', 'pods', 'namespaces') are not documented in the API specification section above. If these are valid fields, they should be documented with descriptions and examples.
gprofiler/heartbeat.py
Outdated
| from gprofiler.metadata.enrichment import EnrichmentOptions | ||
| from gprofiler.metadata.metadata_collector import get_static_metadata | ||
| from gprofiler.metadata.system_metadata import get_hostname | ||
| from gprofiler.metrics_publisher import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics_publisher module seems missing in this PR. please check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure @mlim19 , will check it
gprofiler/main.py
Outdated
|
|
||
| # Check if heartbeat server mode is enabled FIRST | ||
| if args.enable_heartbeat_server: | ||
| if not args.upload_results: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a redundant check. we can remove in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the duplicate check
gprofiler/heartbeat.py
Outdated
| logger.debug("Heartbeat successful, no pending commands") | ||
| return None | ||
| else: | ||
| logger.warning(f"Heartbeat failed with status {response.status_code}: {response.text}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be treated as error as well, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed logger.warning to logger.error for heartbeat failures
| if result.get("success") and result.get("profiling_command"): | ||
| logger.info(f"Received profiling command from server: {result.get('command_id')}") | ||
| return result | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else may handle 2 cases: case1 (not success) and case2 (success but no profiling_command). The first case shouldn't be handled as heartbeat successful. Please handle them separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separated logic
gprofiler/heartbeat.py
Outdated
| command_id = command_response["command_id"] | ||
| command_type = profiling_command.get("command_type", "start") | ||
|
|
||
| logger.info(f"Received profiling command: {profiling_command}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant with the line 119. It would be better to turn one of them into debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gprofiler/heartbeat.py
Outdated
| logger.info(f"Received {command_type} command: {command_id}") | ||
|
|
||
| # Mark command as executed for idempotency | ||
| self.heartbeat_client.mark_command_executed(command_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these status update should happen only for valid command type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gprofiler/heartbeat.py
Outdated
| # Stop current profiler if running, then start new one | ||
| logger.info("Starting new profiler due to start command") | ||
| # TODO: important comment to make sure profiler has stopped successful to avoid leak | ||
| self._stop_current_profiler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the error handling for the cases stop and start are not processed properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unified try-exception block covering both start , stop commands
| except Exception as e: | ||
| logger.error(f"Failed to start new profiler: {e}", exc_info=True) | ||
| # Report failure to the server | ||
| self.heartbeat_client.send_command_completion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code might send the duplicated command completion message when exception is raised. Please remove it if it's right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed duplicate completion messages.
|
Please fix build and linter issues. |
Removed Pinterest-specific MetricsPublisher imports and calls. requiring additional Pinterest-specific dependencies.
- Remove redundant upload_results validation check in main.py - Change heartbeat failure logging from warning to error for better monitoring - Separate success and error cases in send_heartbeat() response handling - Reduce duplicate logging: change redundant 'Received profiling command' to debug level - Validate command type before marking as executed to ensure proper idempotency - Add unified error handling for start/stop command execution with try-except - Ensure backend always receives command completion status (success or failure) - Keep command details logging for operational visibility Addresses all review comments from @mlim19 on PR intel#1009.
TestingTested by building gProfiler agent locally and verified:
|
7fb40b8 to
46b728b
Compare
- Fix import sorting and code formatting - Remove unused imports and variables - Add missing logging import to profiler_base.py
Add Heartbeat-Based Dynamic Profiling System
🎯 Overview
This PR introduces a heartbeat protocol that enables dynamic profiling control. Agents periodically send heartbeats to a Performance Studio backend and receive start/stop profiling commands, allowing on-demand profiling without agent restarts.
💡 Motivation
Current gProfiler deployment requires:
This heartbeat system solves these issues by enabling:
📦 Changes Summary
5 files changed: 4 new files, 1 modified
Total changes: 1,838 insertions, 30 deletions
New Files
1.
gprofiler/heartbeat.py(627 lines)Core heartbeat system implementation
HeartbeatClientclass:DynamicGProfilerManagerclass:startandstopKey Features:
2.
docs/HEARTBEAT_SYSTEM_README.md(634 lines)Complete system documentation
Contains:
POST /api/metrics/profile_request- Submit profiling requestsPOST /api/metrics/heartbeat- Agent heartbeatPOST /api/metrics/command_completion- Report command completion3.
tests/test_heartbeat_system.py(358 lines)Comprehensive test suite
Mock mode (default): No backend required
unittest.mockto simulate backend responsesLive mode (
--liveflag): Tests with real backendTest Coverage:
Classes & Functions:
HeartbeatClient: Simulates agent behaviorcreate_test_profiling_request(): Submit test requestscreate_mock_responses(): Mock backend for testingrun_tests(): Execute full test suitemain(): Entry point with mode selection4.
tests/run_heartbeat_agent.py(136 lines)Agent test runner
Modified Files
1.
gprofiler/main.py(+83 lines, -30 lines)Heartbeat mode integration
Changes:
Import (line 51):
CLI Arguments (lines 865-881):
Argument Validation (lines 956-964):
--enable-heartbeat-serverrequires--upload-resultsHeartbeat Mode Execution (lines 1244-1274):
HeartbeatClientwith server connection detailsDynamicGProfilerManagerto manage profiler lifecycle🔑 Key Features
Agent Features
✅ Heartbeat communication with configurable intervals (default: 30s)
✅ Dynamic profiling based on server commands (start/stop)
✅ Command-driven execution with full configuration control
✅ Idempotency to prevent duplicate command execution
✅ In-memory command tracking (max 1000 commands)
✅ Graceful error handling and retry logic
✅ PerfSpect hardware metrics support (auto-installation in dynamic mode)
✅ SLI metrics integration for monitoring
✅ Comprehensive subprocess cleanup
Backend Features (Expected from Performance Studio)
✅ REST API for submitting profiling requests
✅ Heartbeat endpoint for agent communication
✅ Command merging for multiple requests targeting same host
✅ Process-level and host-level stop commands
✅ Idempotent command execution using unique command IDs
✅ Command completion tracking
✅ PerfSpect integration for hardware metrics
Configuration Options
The heartbeat system supports dynamic configuration of:
📋 Usage Example
Starting Agent in Heartbeat Mode
Backend: Submit Start Command
Backend: Submit Stop Command
🧪 Testing
Run Mock Tests (No Backend Required)
Expected output:
Run Live Tests (Requires Backend)
Run Agent Test Runner
🔄 How It Works
System Flow
Command Lifecycle
🛡️ Dependencies
No new dependencies required! ✅
All dependencies used by the heartbeat system already exist in the project:
requests- Already in requirements.txtconfigargparse- Already used in main.pypsutil- Already in requirements.txtthreading- Python standard librarydatetime- Python standard librarysocket- Python standard library✅ Compatibility
dynamic_profilingbranch (commit 37e6acc)📚 Documentation
Complete documentation available in:
docs/HEARTBEAT_SYSTEM_README.md- Full system documentationgprofiler/heartbeat.py- Implementation details