[SPARK-54328] Add configurable startupProbe and enhance liveness/readiness probes in Helm chart #417
jiangzho wants to merge 1 commit into apache:main
Conversation
### What changes were proposed in this pull request?

This PR adds a configurable `startupProbe` and enhances the existing `livenessProbe` and `readinessProbe` with additional configurable parameters in the Helm chart for the spark-kubernetes-operator.

### Why are the changes needed?

Previously, these values either used Kubernetes defaults or were not configured at all. This change makes them explicitly configurable via Helm values, giving operators more control over pod lifecycle management in different cluster environments (small clusters vs. large production clusters).

### Does this PR introduce _any_ user-facing change?

Yes. Users can now configure these probe settings in their `values.yaml`.

### How was this patch tested?

E2E coverage for default values, and a local dry-run for value overrides.

### Was this patch authored or co-authored using generative AI tooling?

No
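For orientation, the new knobs live under the chart's probes section. The following `values.yaml` excerpt is a hedged sketch: the key path is inferred from the template quoted later in this thread, the threshold values come from the review snippet below, and the `periodSeconds` placement is illustrative rather than a confirmed chart default.

```yaml
# Hypothetical excerpt; path inferred from
# .Values.operatorDeployment.operatorPod.operatorContainer.probes.*
operatorDeployment:
  operatorPod:
    operatorContainer:
      probes:
        startupProbe:
          failureThreshold: 30   # value shown in the review snippet below
          periodSeconds: 10      # value shown in the review snippet below
        readinessProbe:
          failureThreshold: 1    # value shown in the review snippet below
        livenessProbe:
          failureThreshold: 1    # per the discussion below; later raised to 3
```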
cc @peter-toth for review - thanks a lot in advance!
peter-toth
left a comment
Thank you @jiangzho, it is always better to have more control over these values.
Could you please check the PR description though? The link seems to be broken.
Thanks for the catch! It seems GitHub automatically interprets that as a JIRA link. Updated the description.
Thank you @jiangzho. Merged to main.
dongjoon-hyun
left a comment
Hi, @jiangzho and @peter-toth.
Unfortunately, this seems to introduce a severe race condition which causes the Apache Spark Operator to restart frequently. Technically, the WebSocket is supposed to restart regularly, and the reconnect is scheduled 1000 ms later. HealthProbe seems to check during this window and kills the operator, because this PR sets failureThreshold=1, which means no failure is allowed. Here is the full log.
```
26/02/08 01:59:42 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```
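To make the race concrete: with `failureThreshold: 1`, a single UNHEALTHY response is enough for the kubelet to restart the container, so any liveness check that fires inside the roughly one-second reconnect window is fatal. A hedged illustration (the `periodSeconds` value is assumed for the example):

```yaml
livenessProbe:
  periodSeconds: 10     # assumed interval, for illustration only
  failureThreshold: 1   # one UNHEALTHY result -> kubelet restarts the pod
# Timeline from the log above:
#   01:59:51  watch closed; reconnect scheduled in 1000 ms
#   01:59:52  HealthProbe runs, sees "is watching: false" -> UNHEALTHY
# With failureThreshold=1, the operator is killed before the reconnect lands.
```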
```yaml
failureThreshold: 30
periodSeconds: 10
readinessProbe:
  failureThreshold: 1
```
Also, this PR introduces an inconsistency for `readinessProbe.failureThreshold`. Here, this PR uses 1, while the default introduced in `_helpers.tpl` is 30:

```
{{- default 30 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```
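For context, Helm's `default` function only applies when the referenced value is empty or unset, so the 30 in `_helpers.tpl` never fires while `values.yaml` ships 1, but it silently takes over for users who delete the key. A hedged sketch of the mismatch:

```yaml
# values.yaml as of this PR:
readinessProbe:
  failureThreshold: 1
# _helpers.tpl fallback (quoted above) renders 30 only when the key is unset:
#   {{- default 30 ...readinessProbe.failureThreshold }}
# -> normally renders 1, but flips to 30 if a user removes the key;
#    that is the inconsistency being flagged here.
```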
I made a PR to fix this stability issue.
### What changes were proposed in this pull request?

This PR aims to increase `livenessProbe.failureThreshold` to 3.

### Why are the changes needed?

SPARK-54328 introduced a severe race condition which causes the `Apache Spark Operator` to restart too frequently. Technically, `WebSocket` is supposed to restart regularly, and the reconnect is scheduled after 1000 ms. `HealthProbe` checks during this reconnection and kills the operator because SPARK-54328 used `failureThreshold=1`, which means no failure is allowed. Here is the full log.

- #417

**Spark Operator Restart Log**

```
26/02/08 01:59:42 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```

**AbstractWatchManager Behavior**

```
$ k logs -f spark-kubernetes-operator-68c55d48d9-548mz | grep 'AbstractWatchManager'
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:06:24 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:15 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:07:33 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:33 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:34 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:08:41 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:08:41 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:08:42 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...
```

### Does this PR introduce _any_ user-facing change?

This will fix the regression at v0.7.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #491 from dongjoon-hyun/SPARK-55421.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
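As a rough sanity check on the fix, assuming the probe interval stays at 10 seconds (an assumption, not confirmed in this thread):

```yaml
livenessProbe:
  periodSeconds: 10     # assumed interval for the arithmetic below
  failureThreshold: 3   # 3 consecutive failures required -> ~30 s of grace,
                        # comfortably covering the ~1 s watch reconnect window
```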
### What changes were proposed in this pull request?

This PR aims to fix the default value of `readinessProbe.failureThreshold` to 1.

### Why are the changes needed?

When SPARK-54328 changed `readinessProbe.failureThreshold` to 1, it didn't update the default value in the `_helpers.tpl` file. We had better make it consistent.

- #417

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/values.yaml#L69-L70

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl#L131

### Does this PR introduce _any_ user-facing change?

No behavior changes, because the behavior change already happened in v0.7.0 via SPARK-54328.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #492 from dongjoon-hyun/SPARK-55422.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
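A hedged sketch of what the `_helpers.tpl` change likely amounts to; the exact template text is not quoted in this thread:

```yaml
# Before: fallback disagreed with the shipped values.yaml default of 1
#   {{- default 30 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
# After: fallback matches values.yaml
#   {{- default 1 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```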