
[SPARK-54328] Add configurable startupProbe and enhance liveness/readiness probes in Helm chart#417

Closed
jiangzho wants to merge 1 commit into apache:main from jiangzho:probe_override

Conversation

@jiangzho
Contributor

@jiangzho jiangzho commented Nov 13, 2025

What changes were proposed in this pull request?

This PR adds a configurable startupProbe and enhances the existing livenessProbe and readinessProbe with additional configurable parameters in the Helm chart for the spark-kubernetes-operator.

Why are the changes needed?

Previously, these settings either used the Kubernetes defaults or were not configured at all. This change makes them explicitly configurable via Helm values, giving operators more control over pod lifecycle management in different cluster environments (small clusters vs large production clusters).

Does this PR introduce any user-facing change?

Yes. Users can now configure these probe settings in their values.yaml.
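For illustration, an override of these settings might look like the following in `values.yaml`. The nesting under `operatorDeployment.operatorPod.operatorContainer.probes` follows the value path quoted in review below; the individual fields and numbers are placeholders, not the chart's shipped defaults:

```
operatorDeployment:
  operatorPod:
    operatorContainer:
      probes:
        startupProbe:
          periodSeconds: 10
          failureThreshold: 30     # allow up to ~300 s for slow operator starts
        livenessProbe:
          periodSeconds: 10
          failureThreshold: 3      # restart only after ~30 s of consecutive failures
        readinessProbe:
          periodSeconds: 10
          failureThreshold: 1      # drop from Service endpoints after one failed check
```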

How was this patch tested?

E2E coverage for the default values, and a local dry-run for value overrides.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Nov 13, 2025
@jiangzho
Contributor Author

cc @peter-toth for review - thanks a lot in advance!

Contributor

@peter-toth peter-toth left a comment


Thank you @jiangzho, it is always better to have more control over these values.

Could you please check the PR description though? The link seems to be broken.

@jiangzho
Contributor Author

Thanks for the catch! It seems GitHub interprets that as a JIRA reference automatically. Updated the description.

@peter-toth
Contributor

Thank you @jiangzho.

Merged to main (0.7.0).

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @jiangzho and @peter-toth.

Unfortunately, this seems to introduce a severe race condition which causes the Apache Spark Operator to restart frequently. Technically, the WebSocket watch is expected to reconnect regularly, and the reconnect task is scheduled after 1000 ms. HealthProbe seems to run during that window and kill the operator, because this PR sets failureThreshold=1, which means no failure is allowed. Here is the full log.

```
26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```

```
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  failureThreshold: 1
```
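For context, Kubernetes only acts after `failureThreshold` consecutive failed checks, so the tolerated unhealthy window is roughly `periodSeconds × failureThreshold`. A sketch of liveness settings that would ride out the ~1 s watch reconnect (numbers illustrative; `failureThreshold: 3` is the direction the follow-up fix below takes):

```
livenessProbe:
  periodSeconds: 10        # probe every 10 s
  failureThreshold: 3      # ~30 s of consecutive failures before a restart
```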
Member


Also, this PR introduces an inconsistency for readinessProbe.failureThreshold. Here, this PR uses 1, while the default introduced in `_helpers.tpl` is 30.

```
{{- default 30 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```
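The follow-up fix (SPARK-55422, #492) aligns that template fallback with the chart value. A sketch of what the corrected `_helpers.tpl` fragment would look like (the surrounding template context is an assumption):

```
failureThreshold: {{ default 1 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```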

@dongjoon-hyun
Member

I made a PR to fix this stability issue.

@dongjoon-hyun
Member

dongjoon-hyun added a commit that referenced this pull request Feb 8, 2026
### What changes were proposed in this pull request?

This PR aims to increase `livenessProbe.failureThreshold` to 3.

### Why are the changes needed?

SPARK-54328 introduced a severe race condition which causes the `Apache Spark Operator` to restart too frequently. Technically, the `WebSocket` watch is expected to reconnect regularly, and the reconnect task is scheduled after 1000 ms. `HealthProbe` checks during this reconnection window and kills the operator because SPARK-54328 used `failureThreshold=1`, which means no failure is allowed. Here is the full log.

- #417

**Spark Operator Restart Log**
```
26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```

**AbstractWatchManager Behavior**
```
$ k logs -f spark-kubernetes-operator-68c55d48d9-548mz| grep 'AbstractWatchManager'
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:06:24 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:15 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:34 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:08:42 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...
```

### Does this PR introduce _any_ user-facing change?

This will fix the regression at v0.7.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #491 from dongjoon-hyun/SPARK-55421.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun added a commit that referenced this pull request Feb 8, 2026
…ld` to 1

### What changes were proposed in this pull request?

This PR aims to fix the default value of `readinessProbe.failureThreshold` to 1.

### Why are the changes needed?

When SPARK-54328 changed `readinessProbe.failureThreshold` to 1, it didn't update the default value in the `_helpers.tpl` file. We had better make them consistent.
- #417

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/values.yaml#L69-L70

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl#L131

### Does this PR introduce _any_ user-facing change?

No behavior change, because the behavior change already happened in v0.7.0 via SPARK-54328.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #492 from dongjoon-hyun/SPARK-55422.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>