feat: Lukasz/prod 4436 stack run sentinel harness controllers should each use an #661

Open
zreigz wants to merge 17 commits into main from
lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an

Conversation


@zreigz zreigz commented Feb 16, 2026

Introduce SentinelRunJob and StackRunJob CRDs and controllers to manage Job lifecycle via controller-runtime, improving reconciliation reliability, ownership handling, and status syncing back to Console.

(from michael: Also fixes PROD-4454 (maybe, should e2e test to confirm))

Test Plan

Test environment: https://console.your-env.onplural.sh/

Checklist

  • I have added a meaningful title and summary to convey the impact of this PR to a user.
  • I have deployed the agent to a test environment and verified that it works as expected.
    • Agent starts successfully.
    • Service creation works without any issues when using raw manifests and Helm templates.
    • Service creation works when resources contain both CRD and CRD instances.
    • Service templating works correctly.
    • Service errors are reported properly and visible in the UI.
    • Service updates are reflected properly in the cluster.
    • Service resync triggers immediately and works as expected.
    • Sync waves annotations are respected.
    • Sync phases annotations are respected. Phases are executed in the correct order.
    • Sync hook delete policies are respected. Resources are not recreated once they reach the desired state.
  • Service deletion works and cleans up resources properly.
    • Services can be recreated after deletion.
    • Service detachment works and keeps resources unaffected.
    • Services can be recreated after detachment.
    • Service component trees are working as expected.
    • Cluster health statuses are being updated.
    • Agent logs do not contain any errors (after running for at least 30 minutes).
    • There are no visible anomalies in Datadog (after running for at least 30 minutes).
  • I have added tests to cover my changes.
  • If required, I have updated the Plural documentation accordingly.

@zreigz zreigz requested a review from a team as a code owner February 16, 2026 15:32

linear bot commented Feb 16, 2026

@zreigz zreigz added the enhancement New feature or request label Feb 16, 2026

greptile-apps bot commented Feb 16, 2026

Greptile Summary

This PR introduces a SentinelRunJob CRD and a new controller-runtime-based SentinelRunJobReconciler to manage Sentinel run Job lifecycle. The existing webhook-based SentinelReconciler is modified to create SentinelRunJob CRs instead of directly managing Jobs, delegating Job creation and status syncing to the new controller.

  • Critical bug in hasRunSecretData: envConsoleURL is used instead of envRunID to check the run ID (line 32 of sentinelrunjob_secret.go), causing the secret to be unnecessarily updated on every reconcile. This same bug exists in the old code at pkg/controller/sentinel/secret.go:32.
  • CR always recreated in reconcileSentinelRunJobCR: The Get result is ignored and Create is always called, which will produce AlreadyExists errors on subsequent reconciles of the same run.
  • CRD/deepcopy out of sync with Go types: The SentinelRunJobSpec Go struct only has RunID, but the generated CRD includes format (marked required) and jobSpec fields, and the deepcopy references a JobSpec pointer field. These need to be regenerated.
  • Finalizer never performs cleanup: addOrRemoveFinalizer doesn't return early on deletion, so the reconciler continues to reconcile Jobs/secrets for a deleting CR.
  • Switch fallthrough bug in volume mount filtering: defaultJobVolumeName case has an empty body that doesn't filter, only defaultJobTmpVolumeName returns false — this is a pre-existing bug copied from the old sentinel code.
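The duplicate-creation issue called out above is a missing get-before-create guard. The actual PR code isn't shown here, so the sketch below uses a stand-in `store` interface and `errNotFound` sentinel in place of the controller-runtime client and `apierrors.IsNotFound`; all names are illustrative, not the PR's:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// store stands in for the controller-runtime client (illustrative only).
type store interface {
	Get(name string) error // returns errNotFound when the CR is absent
	Create(name string) error
}

// ensureCR creates the CR only when Get reports it missing, rather than
// calling Create unconditionally (which raises AlreadyExists on every
// re-reconcile of the same run).
func ensureCR(s store, name string) error {
	err := s.Get(name)
	if err == nil {
		return nil // CR already exists; nothing to do
	}
	if !errors.Is(err, errNotFound) {
		return err // a real error: surface it
	}
	return s.Create(name)
}

// fakeStore records Create calls for the demonstration below.
type fakeStore struct {
	existing map[string]bool
	creates  int
}

func (f *fakeStore) Get(name string) error {
	if f.existing[name] {
		return nil
	}
	return errNotFound
}

func (f *fakeStore) Create(name string) error {
	f.existing[name] = true
	f.creates++
	return nil
}

func main() {
	s := &fakeStore{existing: map[string]bool{}}
	_ = ensureCR(s, "run-1") // first reconcile: creates
	_ = ensureCR(s, "run-1") // second reconcile: no-op
	fmt.Println(s.creates)   // 1
}
```

In the real reconciler the same shape applies with `client.Get` plus `apierrors.IsNotFound(err)` before `client.Create`.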

Confidence Score: 2/5

  • This PR has multiple logic bugs that will cause runtime errors and should not be merged as-is.
  • The PR introduces a well-structured CRD and controller pattern, but contains several bugs: (1) a copy-paste bug using the wrong constant in secret validation, (2) missing guard against duplicate CR creation, (3) CRD/deepcopy out of sync with Go types, (4) finalizer that doesn't short-circuit on deletion, and (5) a pre-existing switch fallthrough bug that was carried forward. These issues collectively mean the controller will produce errors in normal operation.
  • Pay close attention to internal/controller/sentinelrunjob_secret.go, pkg/controller/sentinel/reconciler.go, and api/v1alpha1/zz_generated.deepcopy.go.
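The finalizer issue (point 4) is about the reconcile flow not short-circuiting on deletion. A minimal sketch of the intended shape, with illustrative types rather than the PR's actual code (in the real reconciler the deletion check would be `cr.GetDeletionTimestamp() != nil`):

```go
package main

import "fmt"

// runJobCR is an illustrative stand-in for the SentinelRunJob CR.
type runJobCR struct {
	deleting   bool
	finalizers []string
}

// reconcile returns early once the CR is being deleted: run cleanup,
// drop the finalizer, and stop, instead of continuing to reconcile
// Jobs/secrets for a dying object.
func reconcile(cr *runJobCR, cleanup func() error) (requeue bool, err error) {
	if cr.deleting {
		if err := cleanup(); err != nil {
			return true, err // retry cleanup later
		}
		cr.finalizers = nil // allow Kubernetes to finish the delete
		return false, nil   // early return: skip Job/secret reconciliation
	}
	// ...normal path: reconcile secret, Job, status sync...
	return false, nil
}

func main() {
	cleaned := false
	cr := &runJobCR{deleting: true, finalizers: []string{"deployments.plural.sh/finalizer"}}
	_, _ = reconcile(cr, func() error { cleaned = true; return nil })
	fmt.Println(cleaned, len(cr.finalizers)) // true 0
}
```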

Important Files Changed

Filename Overview
api/v1alpha1/sentinelrunjob_types.go New CRD type definition for SentinelRunJob. The struct itself is clean but is out of sync with the generated CRD YAML and deepcopy (which include format and jobSpec fields not present in the Go type).
api/v1alpha1/zz_generated.deepcopy.go Generated deepcopy functions for SentinelRunJob types. Out of sync with the Go types — references a JobSpec field that doesn't exist in the current SentinelRunJobSpec struct.
internal/controller/sentinelrunjob_controller.go Core reconciler for SentinelRunJob CRs. Has issues with finalizer logic (no cleanup on delete, no early return), and the overall flow continues even during deletion.
internal/controller/sentinelrunjob_job.go Job generation logic for sentinel runs. Contains a switch fallthrough bug in ensureDefaultVolumeMounts that prevents filtering duplicate volume mounts. Largely duplicated from pkg/controller/sentinel/job.go.
internal/controller/sentinelrunjob_secret.go Secret reconciliation for sentinel run jobs. Contains a critical bug: hasRunSecretData uses envConsoleURL instead of envRunID to check the run ID, causing unnecessary secret updates on every reconcile.
pkg/controller/sentinel/reconciler.go Modified sentinel reconciler that now creates SentinelRunJob CRs. Contains a bug where reconcileSentinelRunJobCR always tries to Create the CR even when it already exists, causing AlreadyExists errors on every re-reconcile.
cmd/agent/kubernetes.go Registers the new SentinelRunJobReconciler controller with the manager. Follows the same pattern as other controller registrations in this file.
charts/deployment-operator/crds/deployments.plural.sh_sentinelrunjobs.yaml Helm chart CRD for SentinelRunJob. Out of sync with Go types — includes format (required) and jobSpec fields not present in the Go struct.
config/crd/bases/deployments.plural.sh_sentinelrunjobs.yaml Base CRD definition for SentinelRunJob. Same sync issue as the Helm chart CRD — generated from a different version of the type that had additional fields.
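The switch fallthrough noted for internal/controller/sentinelrunjob_job.go can be sketched as follows. Constant values here are illustrative, not the repository's:

```go
package main

import "fmt"

// Illustrative constants; the real values live in the PR's job code.
const (
	defaultJobVolumeName    = "default"
	defaultJobTmpVolumeName = "default-tmp"
)

// keepMount reports whether a user-supplied volume mount should be kept.
// The buggy shape described in the review looks roughly like:
//
//	switch name {
//	case defaultJobVolumeName: // empty body: matches, then exits the switch
//	case defaultJobTmpVolumeName:
//		return false
//	}
//
// Go switch cases do not fall through by default, so the empty case
// silently keeps the default volume mount instead of filtering it.
// Listing both names in one case filters both:
func keepMount(name string) bool {
	switch name {
	case defaultJobVolumeName, defaultJobTmpVolumeName:
		return false
	}
	return true
}

func main() {
	fmt.Println(keepMount(defaultJobVolumeName), keepMount("user-data")) // false true
}
```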

Sequence Diagram

sequenceDiagram
    participant Console as Console API
    participant SR as SentinelReconciler (webhook)
    participant K8s as Kubernetes API
    participant SRJC as SentinelRunJobReconciler
    participant Job as Kubernetes Job

    Console->>SR: Poll / WebSocket event (pending run)
    SR->>K8s: Create SentinelRunJob CR
    K8s-->>SRJC: Reconcile event
    SRJC->>Console: GetSentinelRunJob(runID)
    Console-->>SRJC: SentinelRunJobFragment
    SRJC->>K8s: Create/Get Secret (env vars)
    SRJC->>K8s: Create/Get Job
    SRJC->>K8s: Set OwnerRef (Job→Secret)
    SRJC->>K8s: Set ControllerRef (CR→Job)
    SRJC->>SRJC: Check Job health status
    SRJC->>Console: UpdateSentinelRunJobStatus
    SRJC->>K8s: Patch CR status (conditions)
    Note over SRJC,Job: On Job status change (Owns), re-reconcile
    Job-->>SRJC: Job status update event
    SRJC->>Console: UpdateSentinelRunJobStatus (failed if degraded)

Last reviewed commit: f603c04


@greptile-apps greptile-apps bot left a comment


9 files reviewed, 6 comments


@zreigz zreigz changed the title wip: Lukasz/prod 4436 stack run sentinel harness controllers should each use an feat: Lukasz/prod 4436 stack run sentinel harness controllers should each use an Feb 18, 2026
@michaeljguarino michaeljguarino force-pushed the lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an branch 2 times, most recently from 1e2b71e to a9b14bd Compare February 19, 2026 04:50
@michaeljguarino michaeljguarino force-pushed the lukasz/prod-4436-stack-run-sentinel-harness-controllers-should-each-use-an branch from 20845e1 to 36d883a Compare February 20, 2026 17:48
@michaeljguarino

Are you posting the right job reference back to the plural api? Seems like the ui isn't appropriately loading that for stacks at least

}

var status *console.SentinelRunJobStatus
if health != nil {

when the job completes, we should clean up the SentinelRunJob resource and that should cascade to deleting the k8s job as well


// Reconcile StackRun's Job ensures that Console stays in sync with the Kubernetes cluster.
func (r *StackRunJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
func (r *StackRunJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ reconcile.Result, retErr error) {

is there any logic to ensure the stack run job is cleaned up once complete here?


(we don't want these crs just living forever on the cluster, especially since they're reconciled again on each pod restart)

Member Author


good point, I have to add this logic


Labels

enhancement New feature or request size/XXL
