
[SPARK-55421] Increase livenessProbe.failureThreshold to 3 #491

Closed
dongjoon-hyun wants to merge 1 commit into apache:main from dongjoon-hyun:SPARK-55421

Conversation

dongjoon-hyun (Member) commented on Feb 8, 2026

What changes were proposed in this pull request?

This PR aims to increase livenessProbe.failureThreshold from 1 to 3.
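
As a rough sketch, the resulting probe configuration in the operator pod spec would look something like the following; apart from failureThreshold: 3, every value here (endpoint path, port, probe period) is an assumed placeholder rather than the chart's actual setting.

livenessProbe:
  httpGet:
    path: /healthz        # hypothetical probe endpoint, not taken from the chart
    port: 19091           # hypothetical probe port
  periodSeconds: 10       # assumed probe interval
  failureThreshold: 3     # the change: up to 3 consecutive failed probes are tolerated before a restart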

Why are the changes needed?

SPARK-54328 introduced a severe race condition that causes the Apache Spark Operator to restart too frequently. The fabric8 Kubernetes client closes and re-establishes its WebSocket watches regularly, scheduling the reconnect task 1000 ms later. If HealthProbe runs inside that reconnection window, the affected informer reports is watching: false and the probe fails; because SPARK-54328 set failureThreshold=1, meaning no failure is tolerated, that single failed probe kills the operator. Here is the full log.

Spark Operator Restart Log

26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: SparkApplication, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: HEALTHY for type: Pod, namespace: default, details [is running: true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: UNHEALTHY for type: SparkCluster, namespace: default, details [is running: true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: UNHEALTHY is in default, not a healthy state
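
Reading the timestamps above, the race is visible directly in the log; the reconstruction below is derived from those timestamps, with the probe period inferred from the 10-second gap between the two HealthProbe checks.

01:59:42  liveness check  -> all informers HEALTHY
01:59:51  SparkCluster watch closed; reconnect scheduled in 1000 ms
01:59:52  liveness check  -> SparkCluster informer UNHEALTHY (is watching: false)

With failureThreshold=1 that single failed check already triggers a restart, even though the watch would have reconnected about one second later. With failureThreshold=3 the informer would have to stay unhealthy across three consecutive checks (roughly 20 more seconds at the observed 10-second period) before the operator is restarted, which comfortably covers the 1000 ms reconnect window.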

AbstractWatchManager Behavior

$ k logs -f spark-kubernetes-operator-68c55d48d9-548mz | grep 'AbstractWatchManager'
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:06:24 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:15 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:07:34 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the current watch
26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms
26/02/08 02:08:42 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...

Does this PR introduce any user-facing change?

This fixes the regression introduced in v0.7.0.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Opus 4.5 on Claude Code

dongjoon-hyun (Member, Author) commented:
Thank you so much, @viirya ! Merged to main.

dongjoon-hyun added this to the 0.8.0 milestone on Feb 8, 2026
peter-toth (Contributor) commented:
Late LGTM, thanks @dongjoon-hyun for the fix.

dongjoon-hyun (Member, Author) commented:
Thank you, @peter-toth !

dongjoon-hyun deleted the SPARK-55421 branch on February 8, 2026 at 15:25