Add Ask Holmes response principles #1299
base: master
Conversation
Signed-off-by: Codex <codex@openai.com>
Walkthrough
This change introduces a shared response principles template and refactors prompt guidance across multiple Holmes prompt templates.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks: ✅ Passed checks (3 passed)
📜 Recent review details
Configuration used: Organization UI Review profile: CHILL · Plan: Pro
📒 Files selected for processing (4)
🧰 Additional context used
📓 Path-based instructions (1): holmes/plugins/prompts/**/*.jinja2 (📄 CodeRabbit inference engine: CLAUDE.md)
🧠 Learnings (2): Common learnings; Learning from 2025-12-29T08:35:37.668Z
⏰ Context from checks skipped due to timeout of 90000ms (5). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
🔇 Additional comments (9)
✅ Docker image ready for ed04b04
Use this tag to pull the image for testing.
📋 Copy commands:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ed04b04
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ed04b04 me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ed04b04
docker push us-central1-docker.pkg.dev/robusta-development/development/holmes-dev:ed04b04

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:
helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ed04b04

Robusta wrapper chart:
helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ed04b04
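After patching the Helm values, it can help to confirm that the cluster actually picked up the dev tag. A minimal sketch, assuming the release runs in a `robusta` namespace with a deployment named `holmes`; both names are placeholders, substitute the ones from your install:

```bash
# Placeholder names -- adjust to your installation
NAMESPACE=robusta
DEPLOYMENT=holmes

# Show the image currently configured on the deployment
kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'

# Wait for the rollout with the new holmes-dev:ed04b04 tag to finish
kubectl -n "$NAMESPACE" rollout status deployment "$DEPLOYMENT"
```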
✅ Results of HolmesGPT evals
Automatically triggered by commit 37248e1 on branch codex/linear-mention-rob-129-holmesgpt-update-system-prompt.
Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.
Historical Comparison Details
Filter: excluding branch 'codex/linear-mention-rob-129-holmesgpt-update-system-prompt'
Status: Success - 13 test/model combinations loaded
/eval
@aantn Your eval run has finished. ✅ Completed successfully
🧪 Manual Eval Results
Results of HolmesGPT evals (branch: codex/linear-mention-rob-129-holmesgpt-update-system-prompt)
| Status | Test case | Time | Turns | Tools | Cost |
|---|---|---|---|---|---|
| ✅ | 09_crashpod | 46.2s ↑26% | 7 | 16 | $0.1342 |
| ✅ | 101_loki_historical_logs_pod_deleted | 43.5s ↓21% | 7 | 15 | $0.1443 |
| ✅ | 111_pod_names_contain_service | 37.6s ↓13% | 7 | 15 | $0.1215 |
| ✅ | 12_job_crashing | 45.6s ↓12% | 8 | 17 | $0.1551 |
| ✅ | 162_get_runbooks | 42.0s ↓21% | 7 | 17 | $0.1610 |
| ✅ | 176_network_policy_blocking_traffic_no_runbooks | 29.2s ↓30% | 5 | 12 | $0.1015 |
| ✅ | 24_misconfigured_pvc | 30.3s ↓24% | 6 | 15 | $0.1057 |
| ✅ | 43_current_datetime_from_prompt | 3.5s ±0% | 1 | — | $0.0086 |
| ✅ | 61_exact_match_counting | 10.2s ±0% | 3 | 3 | $0.0326 |
| | Total | 32.0s avg | 5.7 avg | 13.8 avg | $0.9645 |
Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.
Historical Comparison Details
Filter: excluding branch 'master'
Status: Success - 10 test/model combinations loaded
Experiments compared (30):
- github-20639554704.1463.1 (branch: codex/linear-mention-rob-129-holmesgpt-update-system-prompt)
- github-20639543222.1462.1 (branch: codex/linear-mention-rob-39-update-grafana-tool-to-handle-large-n2m92v)
- github-20639409982.1459.1 (branch: focused-benchmarks)
- ...and 27 more
Comparison indicators:
- ±0% — diff under 10% (within noise threshold)
- ↑N% / ↓N% — diff 10-25%
- ↑N% / ↓N% — diff over 25% (significant)
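The indicator buckets follow directly from the percentage change against the historical average. A rough shell sketch of that classification; the `classify_diff` helper and the historical value in the example call are illustrative assumptions, not part of the eval tooling:

```bash
# Illustrative only: map a current/historical pair to the buckets above.
classify_diff() {
  current=$1; historical=$2
  # Percentage change vs. the historical average, rounded to a whole percent
  pct=$(awk -v c="$current" -v h="$historical" 'BEGIN { printf "%.0f", (c - h) / h * 100 }')
  abs=${pct#-}
  if   [ "$abs" -lt 10 ]; then echo "±0% (within noise threshold)"
  elif [ "$abs" -le 25 ]; then echo "${pct}% (10-25% band)"
  else                         echo "${pct}% (over 25%, significant)"
  fi
}

# 09_crashpod time from the table vs. an assumed historical average of 36.7s
classify_diff 46.2 36.7
```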
📖 Legend
| Icon | Meaning |
|---|---|
| ✅ | The test was successful |
| ➖ | The test was skipped |
| | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| ❌ | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually
⚠️ Warning: Manual re-runs have NO default markers and will run ALL LLM tests (~100+), which can take 1+ hours. Use `markers: regression` or `filter: test_name` to limit scope.
Option 1: Comment on this PR with /eval:
/eval
markers: regression
Or with more options (one per line):
/eval
model: gpt-4o
markers: regression
filter: 09_crashpod
iterations: 5
Run evals on a different branch (e.g., master) for comparison:
/eval
branch: master
markers: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| markers | Pytest markers (no default - runs all tests!) |
| filter | Pytest -k filter |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |
Quick re-run: Use /last to re-run the most recent /eval on this PR with the same parameters.
Option 2: Trigger via GitHub Actions UI → "Run workflow"
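The same workflow can also be started from the command line with the GitHub CLI. A minimal sketch; the workflow file name and input names below are hypothetical placeholders that mirror the options table, so check the repository's .github/workflows directory for the real ones:

```bash
# Hypothetical workflow file and input names -- verify them under .github/workflows first
gh workflow run llm-evals.yaml \
  --ref codex/linear-mention-rob-129-holmesgpt-update-system-prompt \
  -f markers=regression \
  -f filter=09_crashpod
```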
🏷️ Valid markers
benchmark, chain-of-causation, compaction, context_window, coralogix, counting, database, datadog, datetime, easy, embeds, grafana-dashboard, hard, kafka, kubernetes, leaked-information, logs, loki, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency
📋 Valid eval names (use with filter)
test_ask_holmes:
01_how_many_pods, 02_what_is_wrong_with_pod, 03_what_is_the_command_to_port_forward, 04_related_k8s_events, 05_image_version, 06_explain_issue, 07_high_latency, 08_sock_shop_frontend, 09_crashpod, 100a_loki_historical_logs, 101_loki_historical_logs_pod_deleted, 102_loki_label_discovery, 102a_loki_logs_transparency, 102b_loki_multiple_pods, 103_logs_transparency_default_limit, 104a_postgres_root_issue, 104b_postgres_missing_index_pgstat, 104c_postgres_minimal_missing_index, 105_redis_wrong_data_structure, 107_log_filter_http_status_code, 108_logs_nearby_lines, 109_logs_transparency_not_found, 10_image_pull_backoff, 110_cpu_graph_robusta_runner, 110_k8s_events_image_pull, 111_disabled_datadog_traces, 111_pod_names_contain_service, 111_tool_hallucination, 112_find_pvcs_by_uuid, 114_checkout_latency_tracing_rebuild, 115_checkout_errors_tracing, 117_new_relic_tracing, 117b_new_relic_block_embed, 118_new_relic_logs, 119_new_relic_metrics, 11_init_containers, 120_new_relic_traces2, 121_new_relic_checkout_errors_tracing, 122_new_relic_checkout_latency_tracing_rebuild, 123_new_relic_checkout_errors_tracing, 124_checkout_latency_prometheus, 12_job_crashing, 13a_pending_node_selector_basic, 13b_pending_node_selector_detailed, 14_pending_resources, 151_disabled_toolsets_fallback_only, 156_kafka_opensearch_latency, 157_disk_full_statefulset, 158_slack_chat_correct_date, 159_prometheus_high_cardinality_cpu, 15_failed_readiness_probe, 160_electricity_market_bidding_bug, 160a_cpu_per_namespace_graph, 160b_cpu_per_namespace_graph_with_prom_truncation, 160c_cpu_per_namespace_graph_with_global_truncation, 161_bidding_version_performance, 161_conversation_compaction, 162_get_runbooks, 163_compaction_follow_up, 164_datadog_traces_coupon_code, 165_alert_with_multiple_runbooks, 16_failed_no_toolset_found, 173_coralogix_logs, 174_coralogix_traces_ad, 175_coralogix_metrics_frontend, 176_network_policy_blocking_traffic_no_runbooks, 177_grafana_home_dashboard, 178_grafana_search_dashboard_query, 179_grafana_big_dashboard_query, 17_oom_kill, 180_connectivity_check_tcp, 181_connectivity_check_http, 182_connectivity_check_http_url, 18_oom_kill_from_issues_history, 19_detect_missing_app_details, 20_long_log_file_search, 21_job_fail_curl_no_svc_account, 22_high_latency_dbi_down, 23_app_error_in_current_logs, 24_misconfigured_pvc, 25_misconfigured_ingress_class, 26_page_render_times, 27a_multi_container_logs, 27b_multi_container_logs, 28_permissions_error, 30_basic_promql_graph_cluster_memory, 32_basic_promql_graph_pod_cpu, 33_cpu_metrics_discovery, 34_memory_graph, 35_tempo, 36_argocd_find_resource, 37_argocd_wrong_namespace, 38_rabbitmq_split_head, 39_failed_toolset, 41_setup_argo, 42_dns_issues_result_all_tools, 42_dns_issues_result_new_tools, 42_dns_issues_result_new_tools_no_runbook, 42_dns_issues_result_old_tools, 42_dns_issues_steps_new_all_tools, 42_dns_issues_steps_new_tools, 42_dns_issues_steps_old_tools, 43_current_datetime_from_prompt, 43_slack_deployment_logs, 44_slack_statefulset_logs, 45_fetch_deployment_logs_simple, 46_job_crashing_no_longer_exists, 47_truncated_logs_context_window, 48_logs_since_thursday, 49_logs_since_last_week, 50_logs_since_specific_date, 50a_logs_since_last_specific_month, 51_logs_summarize_errors, 52_logs_login_issues, 53_logs_find_term, 54_azure_sql, 54_not_truncated_when_getting_pods, 55_kafka_runbook, 57_cluster_name_confusion, 57_wrong_namespace, 58_counting_pods_by_status, 59_label_based_counting, 60_count_less_than, 61_exact_match_counting, 62_fetch_error_logs_with_errors, 63_fetch_error_logs_no_errors, 64_keda_vs_hpa_confusion, 65_health_check_followup, 66_http_error_needle, 67_performance_degradation, 68_cascading_failures, 69_rate_limit_exhaustion, 70_memory_leak_detection, 71_connection_pool_starvation, 73a_time_window_anomaly, 73b_time_window_anomaly, 74_config_change_impact, 75_network_flapping, 76_service_discovery_issue, 77_liveness_probe_misconfiguration, 78a_missing_cpu_limits, 78b_cpu_quota_exceeded, 79_configmap_mount_issue, 80_pvc_storage_class_mismatch, 81_service_account_permission_denied, 82_pod_anti_affinity_conflict, 83_secret_not_found, 84_network_policy_blocking_traffic, 85_hpa_not_scaling, 86_configmap_like_but_secret, 89_runbook_missing_cloudwatch, 90_runbook_basic_selection, 91a_datadog_metrics_missing_namespace, 91b_datadog_metrics_pod_exists, 91c_datadog_metrics_deployment, 91d_datadog_metrics_historical_pod, 91e_datadog_custom_metrics, 91f_datadog_logs_historical_pod, 91g_datadog_metrics_mismatched_pod, 91h_datadog_logs_empty_query_with_url, 91i_datadog_metrics_empty_query_with_url, 92_cpu_graph_conversation, 93_calling_datadog, 93_events_since_specific_date, 94_runbook_transparency, 95_runbook_memory_leak_detection, 96_no_matching_runbook, 97_logs_clarification_needed, 99_logs_transparency_custom_time
test_investigate:
01_oom_kill, 02_crashloop_backoff, 03_cpu_throttling, 04_image_pull_backoff, 05_crashpod, 06_job_failure, 07_job_syntax_error, 08_memory_pressure, 09_high_latency, 10_KubeDeploymentReplicasMismatch, 11_KubePodCrashLooping, 12_KubePodNotReady, 13_Watchdog, 14_tempo, 15_dns_resolution, 16_dns_resolution_no_tool, 17_investigate_correct_date
Summary
Testing
Eval Guidance
- /eval
- make test-llm-ask-holmes to validate ask-holmes prompt behavior (see the sketch below)

Codex Task
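For local validation before pushing, a rough sketch of how the ask-holmes evals could be narrowed down, combining the Makefile target above with the pytest -k filter described in the re-run options. The tests/llm/test_ask_holmes.py path is an assumption; adjust it to the repository layout:

```bash
# Full ask-holmes eval suite via the Makefile target from the PR description
make test-llm-ask-holmes

# Narrow to a single eval with a pytest -k filter
# (tests/llm/test_ask_holmes.py is an assumed path; adjust to the repo layout)
pytest tests/llm/test_ask_holmes.py -k "09_crashpod"
```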