status: use hive jobs for probes by MitchLewis930 · Pull Request #4 · Signal65/cilium-Bugbot

MitchLewis930 · 2026-01-30T23:04:20Z

PR_044

Note

Medium Risk
Changes the startup/shutdown and readiness/health-reporting mechanics for status probes, which could affect agent health signals or probe lifetime under cancellation/timeouts.

Overview
Status probe startup is now driven by Hive jobs instead of lifecycle hooks. newStatusCollector registers a job.OneShot("probes") that starts probes, waits (with timeout) for the first successful run, marks allProbesInitialized, and closes probes when the job exits.

The previous startStatusCollector goroutine-based startup (and its cell.Health scope reporting) is removed, along with explicit probe shutdown from the OnStop hook; OnStop now only logs a KVStore-not-OK hint.

Written by Cursor Bugbot for commit 829f56e. This will update automatically on new commits.

This commit refactors the probe initialization to use a Hive job instead of a plain lifecycle start hook. This also lets us get rid of the raw goroutine that checked whether every probe had executed successfully at least once before exposing the status.

Note: the kvstore "shutdown check" is still part of its own lifecycle stop hook. It should probably be moved to the kvstore module eventually.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

cursor · 2026-01-30T23:09:47Z

pkg/status/cell.go

+
+		<-ctx.Done()
+		return nil
+	}))


Probes stopped prematurely on initial timeout error

High Severity

When WaitForFirstRun times out, the job returns an error, so the deferred collector.statusCollector.Close() runs immediately and stops all running probes. The previous implementation kept the probes running even when the initial check failed and closed them only in OnStop. Now, if any probe takes longer than StatusCollectorProbeCheckTimeout (default 5 minutes) to complete its first run, all probes are terminated and the status collector stops functioning.

cursor · 2026-01-30T23:09:47Z

pkg/status/cell.go

+
+		<-ctx.Done()
+		return nil
+	}))


Health parameter received but never used for reporting

Medium Severity

The health cell.Health parameter passed to the job.OneShot function is never used. The old code explicitly called probeCheckHealth.Degraded() on failure and probeCheckHealth.OK() on success to report probe initialization status. The new code receives the health parameter but doesn't call any health methods, so the probe check health is never reported to the hive health system. The comment "Report health whether all probes have been executed at least once" remains, but the actual health reporting was omitted.

cursor bot reviewed Jan 30, 2026

MitchLewis930 wants to merge 1 commit into pr_044_before from pr_044_after

MitchLewis930 commented Jan 30, 2026 (edited by cursor bot)

Loading

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Jan 30, 2026

Uh oh!

cursor bot Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants
