
[SPARK-54328] Add configurable startupProbe and enhance liveness/readiness probes in Helm chart#417

Closed
jiangzho wants to merge 1 commit into apache:main from jiangzho:probe_override

Conversation

@jiangzho
Contributor

@jiangzho jiangzho commented Nov 13, 2025

What changes were proposed in this pull request?

This PR adds a configurable startupProbe and enhances the existing livenessProbe and readinessProbe with additional configurable parameters in the Helm chart for the spark-kubernetes-operator.

Why are the changes needed?

Previously, these settings either used the Kubernetes defaults or were not configured at all. This change makes them explicitly configurable via Helm values, giving operators more control over pod lifecycle management in different cluster environments (small clusters vs large production clusters).

Does this PR introduce any user-facing change?

Yes. Users can now configure these probe settings in their values.yaml.
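For illustration, an override of these settings might look like the following in `values.yaml`. The nesting under `operatorDeployment.operatorPod.operatorContainer.probes` follows the value path quoted in review below; the individual fields and numbers are placeholders, not the chart's shipped defaults:

```
operatorDeployment:
  operatorPod:
    operatorContainer:
      probes:
        startupProbe:
          periodSeconds: 10
          failureThreshold: 30     # allow up to ~300 s for slow operator starts
        livenessProbe:
          periodSeconds: 10
          failureThreshold: 3      # restart only after ~30 s of consecutive failures
        readinessProbe:
          periodSeconds: 10
          failureThreshold: 1      # drop from Service endpoints after one failed check
```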

How was this patch tested?

E2E coverage for the default values, and a local dry-run for value overrides.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Nov 13, 2025
@jiangzho
Contributor Author

cc @peter-toth for review - thanks a lot in advance!

Contributor

@peter-toth peter-toth left a comment


Thank you @jiangzho, it is always better to have more control over these values.

Could you please check the PR description though? The link seems to be broken.

@jiangzho
Contributor Author

Thanks for the catch! It seems GitHub interprets that as a JIRA reference automatically. Updated the description.

@peter-toth
Contributor

Thank you @jiangzho.

Merged to main (0.7.0).

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @jiangzho and @peter-toth.

Unfortunately, this seems to introduce a severe race condition which causes the Apache Spark Operator to restart frequently. Technically, the WebSocket watch is expected to reconnect regularly, and the reconnect task is scheduled after 1000 ms. HealthProbe seems to run during that window and kill the operator, because this PR sets failureThreshold=1, which means no failure is allowed. Here is the full log.

```
26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```

```
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  failureThreshold: 1
```
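For context, Kubernetes only acts after `failureThreshold` consecutive failed checks, so the tolerated unhealthy window is roughly `periodSeconds × failureThreshold`. A sketch of liveness settings that would ride out the ~1 s watch reconnect (numbers illustrative; `failureThreshold: 3` is the direction the follow-up fix below takes):

```
livenessProbe:
  periodSeconds: 10        # probe every 10 s
  failureThreshold: 3      # ~30 s of consecutive failures before a restart
```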
Member


Also, this PR introduces an inconsistency for readinessProbe.failureThreshold. Here, this PR uses 1, while the default introduced in `_helpers.tpl` is 30.

```
{{- default 30 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```
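The follow-up fix (SPARK-55422, #492) aligns that template fallback with the chart value. A sketch of what the corrected `_helpers.tpl` fragment would look like (the surrounding template context is an assumption):

```
failureThreshold: {{ default 1 .Values.operatorDeployment.operatorPod.operatorContainer.probes.readinessProbe.failureThreshold }}
```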

@dongjoon-hyun
Member

I made a PR to fix this stability issue.

@dongjoon-hyun
Member

dongjoon-hyun added a commit that referenced this pull request Feb 8, 2026
### What changes were proposed in this pull request?

This PR aims to increase `livenessProbe.failureThreshold` to 3.

### Why are the changes needed?

SPARK-54328 introduced a severe race condition which causes the `Apache Spark Operator` to restart too frequently. Technically, the `WebSocket` watch is expected to reconnect regularly, and the reconnect task is scheduled after 1000 ms. `HealthProbe` checks during this reconnection window and kills the operator because SPARK-54328 used `failureThreshold=1`, which means no failure is allowed. Here is the full log.

- #417

**Spark Operator Restart Log**
```
26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
```

**AbstractWatchManager Behavior**
```
$ k logs -f spark-kubernetes-operator-68c55d48d9-548mz| grep 'AbstractWatchManager'
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:06:24 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:15 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:34 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:08:42 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...
```

### Does this PR introduce _any_ user-facing change?

This will fix the regression at v0.7.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #491 from dongjoon-hyun/SPARK-55421.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun added a commit that referenced this pull request Feb 8, 2026
…ld` to 1

### What changes were proposed in this pull request?

This PR aims to fix the default value of `readinessProbe.failureThreshold` to 1.

### Why are the changes needed?

When SPARK-54328 changed `readinessProbe.failureThreshold` to 1, it didn't update the default value in the `_helpers.tpl` file. We had better make them consistent.
- #417

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/values.yaml#L69-L70

https://github.com/apache/spark-kubernetes-operator/blob/5836f15c91a85dd5f39a2e60bf46630ad1f3dc42/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl#L131

### Does this PR introduce _any_ user-facing change?

No behavior change, because the behavior change already happened in v0.7.0 via SPARK-54328.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: `Opus 4.5` on `Claude Code`

Closes #492 from dongjoon-hyun/SPARK-55422.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>