Skip to content

Conversation

@joamaki
Copy link
Contributor

@joamaki joamaki commented Dec 11, 2025

The job scheduler used to register entire job.Groups as lifecycle hooks via module-private providers (cilium#pkg/hive/hive.go). That meant a group (and any “dynamic” jobs attached to it) could be started before all of its dependencies were known, so a goroutine might run prior to the component it relies on. Or a job might be stopped before other jobs depending on it were stopped.

For example:

  cell.Module(...,
    cell.Invoke(func(jg job.Group, a A) { jg.Add(... use A ...) },
    cell.Invoke(func(jg job.Group, b B) { jg.Add(... use B ...) },
  )

The jg is constructed by the first invoke and this causes it to be appended to the hive lifecycle.
Assuming B wasn't constructed yet, the second invoke will construct it and potentially its
constructor adds Bs start hooks to lifecycle. Now jg will start after A's start hooks but
before Bs start hooks. This means there's a chance that the job that uses B will execute before
B's start hook is called.

The refactor fixes that by:

  • Teaching cell.DefaultLifecycle about a HookDescriptiveInterface, so lifecycle logs/PrintHooks can include job names.
  • Letting Lifecycle.Start run again to start newly appended hooks; the registry now appends dynamic jobs after Hive startup and immediately calls Start to bring them online.
  • Moving lifecycle ownership out of individual groups and into the registry: the registry now holds the lifecycles and manages indvidual jobs. Groups are lightweight containers forwarding Add calls to the registry, so jobs added before Hive start still get queued, while jobs added later are started immediately but shut down in reverse order when the registry stops.
  • Updating all job implementations/tests and call sites to the simplified API, plus adding a log-driven test (TestJobLifecycleOrderingAcrossGroups) that asserts jobs start in the order added (even across multiple groups) and stop in reverse order, covering both jobs added before starting and after.

Together these changes eliminate the race where a job could start before its dependency, while also giving more precise lifecycle logging.

@joamaki joamaki force-pushed the pr/joamaki/job-lifecycle-refactor branch from bc2c578 to a57ae1d Compare December 11, 2025 14:48
@joamaki joamaki marked this pull request as ready for review December 12, 2025 14:14
@joamaki joamaki requested a review from a team as a code owner December 12, 2025 14:14
@joamaki joamaki requested review from pippolo84 and removed request for a team December 12, 2025 14:14
Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a style nit left inline. 💯

joamaki and others added 6 commits December 19, 2025 08:12
Refactor job groups so registry owns the lifecycle. This ensures that jobs
are always started in the order they're added and not in the order the job
groups are created.

- move lifecyle hooks onto the registry type and keep start/stop logic centralized
- drop group-level context/wait-group management and route Group.Add through the registry
- adjust job implementations/tests and all callers to the simplified API
"make test-race" was failing due to racy access to `times`.
Use a mutex.

Signed-off-by: Jussi Maki <jussi.maki@isovalent.com>
Looks like this hadn't been updated when an indirect dependency
became a direct dependency.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/job-lifecycle-refactor branch from a57ae1d to 6c85d8c Compare December 19, 2025 07:12
Copy link
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, I like re-using the lifecycles instead of the custom implementation, nice cleanup.

@joamaki joamaki merged commit 391da12 into main Dec 22, 2025
1 check passed
@joamaki joamaki deleted the pr/joamaki/job-lifecycle-refactor branch December 22, 2025 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants