feat: Lukasz/prod 4436 stack run sentinel harness controllers should each use an #661

Open
zreigz wants to merge 17 commits into main from
lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an

Conversation


@zreigz zreigz commented Feb 16, 2026

Introduce SentinelRunJob and StackRunJob CRDs and controllers to manage Job lifecycle via controller-runtime, improving reconciliation reliability, ownership handling, and status syncing back to Console.

(from michael: Also fixes PROD-4454 (maybe, should e2e test to confirm))

Test Plan

Test environment: https://console.your-env.onplural.sh/

Checklist

  • I have added a meaningful title and summary to convey the impact of this PR to a user.
  • I have deployed the agent to a test environment and verified that it works as expected.
    • Agent starts successfully.
    • Service creation works without any issues when using raw manifests and Helm templates.
    • Service creation works when resources contain both CRD and CRD instances.
    • Service templating works correctly.
    • Service errors are reported properly and visible in the UI.
    • Service updates are reflected properly in the cluster.
    • Service resync triggers immediately and works as expected.
    • Sync waves annotations are respected.
    • Sync phases annotations are respected. Phases are executed in the correct order.
    • Sync hook delete policies are respected. Resources are not recreated once they reach the desired state.
  • Service deletion works and cleans up resources properly.
    • Services can be recreated after deletion.
    • Service detachment works and keeps resources unaffected.
    • Services can be recreated after detachment.
    • Service component trees are working as expected.
    • Cluster health statuses are being updated.
    • Agent logs do not contain any errors (after running for at least 30 minutes).
    • There are no visible anomalies in Datadog (after running for at least 30 minutes).
  • I have added tests to cover my changes.
  • If required, I have updated the Plural documentation accordingly.

@zreigz zreigz requested a review from a team as a code owner February 16, 2026 15:32

linear bot commented Feb 16, 2026

@zreigz zreigz added the enhancement New feature or request label Feb 16, 2026

greptile-apps bot commented Feb 16, 2026

Greptile Summary

This PR introduces a SentinelRunJob CRD and a new controller-runtime-based SentinelRunJobReconciler to manage Sentinel run Job lifecycle. The existing webhook-based SentinelReconciler is modified to create SentinelRunJob CRs instead of directly managing Jobs, delegating Job creation and status syncing to the new controller.

  • Critical bug in hasRunSecretData: envConsoleURL is used instead of envRunID to check the run ID (line 32 of sentinelrunjob_secret.go), causing the secret to be unnecessarily updated on every reconcile. This same bug exists in the old code at pkg/controller/sentinel/secret.go:32.
  • CR always recreated in reconcileSentinelRunJobCR: The Get result is ignored and Create is always called, which will produce AlreadyExists errors on subsequent reconciles of the same run.
  • CRD/deepcopy out of sync with Go types: The SentinelRunJobSpec Go struct only has RunID, but the generated CRD includes format (marked required) and jobSpec fields, and the deepcopy references a JobSpec pointer field. These need to be regenerated.
  • Finalizer never performs cleanup: addOrRemoveFinalizer doesn't return early on deletion, so the reconciler continues to reconcile Jobs/secrets for a deleting CR.
  • Switch fallthrough bug in volume mount filtering: defaultJobVolumeName case has an empty body that doesn't filter, only defaultJobTmpVolumeName returns false — this is a pre-existing bug copied from the old sentinel code.
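The duplicate-creation issue called out above is a missing get-before-create guard. The actual PR code isn't shown here, so the sketch below uses a stand-in `store` interface and `errNotFound` sentinel in place of the controller-runtime client and `apierrors.IsNotFound`; all names are illustrative, not the PR's:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// store stands in for the controller-runtime client (illustrative only).
type store interface {
	Get(name string) error // returns errNotFound when the CR is absent
	Create(name string) error
}

// ensureCR creates the CR only when Get reports it missing, rather than
// calling Create unconditionally (which raises AlreadyExists on every
// re-reconcile of the same run).
func ensureCR(s store, name string) error {
	err := s.Get(name)
	if err == nil {
		return nil // CR already exists; nothing to do
	}
	if !errors.Is(err, errNotFound) {
		return err // a real error: surface it
	}
	return s.Create(name)
}

// fakeStore records Create calls for the demonstration below.
type fakeStore struct {
	existing map[string]bool
	creates  int
}

func (f *fakeStore) Get(name string) error {
	if f.existing[name] {
		return nil
	}
	return errNotFound
}

func (f *fakeStore) Create(name string) error {
	f.existing[name] = true
	f.creates++
	return nil
}

func main() {
	s := &fakeStore{existing: map[string]bool{}}
	_ = ensureCR(s, "run-1") // first reconcile: creates
	_ = ensureCR(s, "run-1") // second reconcile: no-op
	fmt.Println(s.creates)   // 1
}
```

In the real reconciler the same shape applies with `client.Get` plus `apierrors.IsNotFound(err)` before `client.Create`.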

Confidence Score: 2/5

  • This PR has multiple logic bugs that will cause runtime errors and should not be merged as-is.
  • The PR introduces a well-structured CRD and controller pattern, but contains several bugs: (1) a copy-paste bug using the wrong constant in secret validation, (2) missing guard against duplicate CR creation, (3) CRD/deepcopy out of sync with Go types, (4) finalizer that doesn't short-circuit on deletion, and (5) a pre-existing switch fallthrough bug that was carried forward. These issues collectively mean the controller will produce errors in normal operation.
  • Pay close attention to internal/controller/sentinelrunjob_secret.go, pkg/controller/sentinel/reconciler.go, and api/v1alpha1/zz_generated.deepcopy.go.
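The finalizer issue (point 4) is about the reconcile flow not short-circuiting on deletion. A minimal sketch of the intended shape, with illustrative types rather than the PR's actual code (in the real reconciler the deletion check would be `cr.GetDeletionTimestamp() != nil`):

```go
package main

import "fmt"

// runJobCR is an illustrative stand-in for the SentinelRunJob CR.
type runJobCR struct {
	deleting   bool
	finalizers []string
}

// reconcile returns early once the CR is being deleted: run cleanup,
// drop the finalizer, and stop, instead of continuing to reconcile
// Jobs/secrets for a dying object.
func reconcile(cr *runJobCR, cleanup func() error) (requeue bool, err error) {
	if cr.deleting {
		if err := cleanup(); err != nil {
			return true, err // retry cleanup later
		}
		cr.finalizers = nil // allow Kubernetes to finish the delete
		return false, nil   // early return: skip Job/secret reconciliation
	}
	// ...normal path: reconcile secret, Job, status sync...
	return false, nil
}

func main() {
	cleaned := false
	cr := &runJobCR{deleting: true, finalizers: []string{"deployments.plural.sh/finalizer"}}
	_, _ = reconcile(cr, func() error { cleaned = true; return nil })
	fmt.Println(cleaned, len(cr.finalizers)) // true 0
}
```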

Important Files Changed

Filename Overview
api/v1alpha1/sentinelrunjob_types.go New CRD type definition for SentinelRunJob. The struct itself is clean but is out of sync with the generated CRD YAML and deepcopy (which include format and jobSpec fields not present in the Go type).
api/v1alpha1/zz_generated.deepcopy.go Generated deepcopy functions for SentinelRunJob types. Out of sync with the Go types — references a JobSpec field that doesn't exist in the current SentinelRunJobSpec struct.
internal/controller/sentinelrunjob_controller.go Core reconciler for SentinelRunJob CRs. Has issues with finalizer logic (no cleanup on delete, no early return), and the overall flow continues even during deletion.
internal/controller/sentinelrunjob_job.go Job generation logic for sentinel runs. Contains a switch fallthrough bug in ensureDefaultVolumeMounts that prevents filtering duplicate volume mounts. Largely duplicated from pkg/controller/sentinel/job.go.
internal/controller/sentinelrunjob_secret.go Secret reconciliation for sentinel run jobs. Contains a critical bug: hasRunSecretData uses envConsoleURL instead of envRunID to check the run ID, causing unnecessary secret updates on every reconcile.
pkg/controller/sentinel/reconciler.go Modified sentinel reconciler that now creates SentinelRunJob CRs. Contains a bug where reconcileSentinelRunJobCR always tries to Create the CR even when it already exists, causing AlreadyExists errors on every re-reconcile.
cmd/agent/kubernetes.go Registers the new SentinelRunJobReconciler controller with the manager. Follows the same pattern as other controller registrations in this file.
charts/deployment-operator/crds/deployments.plural.sh_sentinelrunjobs.yaml Helm chart CRD for SentinelRunJob. Out of sync with Go types — includes format (required) and jobSpec fields not present in the Go struct.
config/crd/bases/deployments.plural.sh_sentinelrunjobs.yaml Base CRD definition for SentinelRunJob. Same sync issue as the Helm chart CRD — generated from a different version of the type that had additional fields.
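The switch fallthrough noted for internal/controller/sentinelrunjob_job.go can be sketched as follows. Constant values here are illustrative, not the repository's:

```go
package main

import "fmt"

// Illustrative constants; the real values live in the PR's job code.
const (
	defaultJobVolumeName    = "default"
	defaultJobTmpVolumeName = "default-tmp"
)

// keepMount reports whether a user-supplied volume mount should be kept.
// The buggy shape described in the review looks roughly like:
//
//	switch name {
//	case defaultJobVolumeName: // empty body: matches, then exits the switch
//	case defaultJobTmpVolumeName:
//		return false
//	}
//
// Go switch cases do not fall through by default, so the empty case
// silently keeps the default volume mount instead of filtering it.
// Listing both names in one case filters both:
func keepMount(name string) bool {
	switch name {
	case defaultJobVolumeName, defaultJobTmpVolumeName:
		return false
	}
	return true
}

func main() {
	fmt.Println(keepMount(defaultJobVolumeName), keepMount("user-data")) // false true
}
```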

Sequence Diagram

sequenceDiagram
    participant Console as Console API
    participant SR as SentinelReconciler (webhook)
    participant K8s as Kubernetes API
    participant SRJC as SentinelRunJobReconciler
    participant Job as Kubernetes Job

    Console->>SR: Poll / WebSocket event (pending run)
    SR->>K8s: Create SentinelRunJob CR
    K8s-->>SRJC: Reconcile event
    SRJC->>Console: GetSentinelRunJob(runID)
    Console-->>SRJC: SentinelRunJobFragment
    SRJC->>K8s: Create/Get Secret (env vars)
    SRJC->>K8s: Create/Get Job
    SRJC->>K8s: Set OwnerRef (Job→Secret)
    SRJC->>K8s: Set ControllerRef (CR→Job)
    SRJC->>SRJC: Check Job health status
    SRJC->>Console: UpdateSentinelRunJobStatus
    SRJC->>K8s: Patch CR status (conditions)
    Note over SRJC,Job: On Job status change (Owns), re-reconcile
    Job-->>SRJC: Job status update event
    SRJC->>Console: UpdateSentinelRunJobStatus (failed if degraded)

Last reviewed commit: f603c04


@greptile-apps greptile-apps bot left a comment


9 files reviewed, 6 comments


@zreigz zreigz changed the title wip: Lukasz/prod 4436 stack run sentinel harness controllers should each use an feat: Lukasz/prod 4436 stack run sentinel harness controllers should each use an Feb 18, 2026
@michaeljguarino michaeljguarino force-pushed the lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an branch 2 times, most recently from 1e2b71e to a9b14bd Compare February 19, 2026 04:50
@michaeljguarino michaeljguarino force-pushed the lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an branch from 20845e1 to 36d883a Compare February 20, 2026 17:48
@michaeljguarino

Are you posting the right job reference back to the plural api? Seems like the ui isn't appropriately loading that for stacks at least

}

var status *console.SentinelRunJobStatus
if health != nil {

when the job completes, we should clean up the SentinelRunJob resource and that should cascade to deleting the k8s job as well


// Reconcile StackRun's Job ensures that Console stays in sync with the Kubernetes cluster.
func (r *StackRunJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
func (r *StackRunJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ reconcile.Result, retErr error) {

is there any logic to ensure the stack run job is cleaned up once complete here?


(we don't want these crs just living forever on the cluster, especially since they're reconciled again on each pod restart)

Member Author


good point, I have to add this logic


Labels

enhancement New feature or request size/XXL
