Skip to content

Conversation

@greenwich
Copy link

@greenwich greenwich commented Dec 23, 2025

What changes were proposed in this pull request?

This PR makes maxFailovers and failoverCount in S3g apply on a per-request basis.

Rationale:

  1. in hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java line 93: we have private int failoverCount = 0; - All threads share this counter; it never resets.
  2. Also, in GrpcOmTransport.shouldRetry(258) we run action = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true); which is also shared between requests.
  3. Next in OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction, we still use that global failoverCount checking if (failovers < maxFailovers)(258), which always returns return RetryAction.FAIL;(263) once we reached the maxFailovers
  4. maxFailovers from above is defined by OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY

I propose to change the value of failoverCount per request, rather than making it a global flag. So basically, it makes OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY configuration to serve on a per-request basis.

Detailed explanation and discussion are here: #9477

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14212

How was this patch tested?

Branch pipeline: https://github.com/greenwich/ozone/actions/runs/20430980506

  • Added a unittest to check that failoverCount is set for each request
  • Before fixing the bug, I created a new integration test that mimics concurrent user requests during the S3G failover from om0 to om2. It reproduces the issue, then I used it to test the fix.

Concurrent test results before the fix - demonstrating the bug -> failover didn't happen to om2

--- Failed Requests (Failover Attempts) ---
om0: 500 failures (10.0 %)
om1: 4510 failures (90.0 %)
om2: 0 failures (0.0 %) NEVER TRIED!

--- Successful Requests ---
om0: 5 successes (100.0 %) LEADER
om1: 0 successes (0.0 %)
om2: 0 successes (0.0 %)

Concurrent test results after the fix - demonstrating the correct behaviour -> failover to om2

--- Failed Requests (Failover Attempts) ---
om0: 500 failures (97.1 %)
om1: 15 failures (2.9 %)
om2: 0 failures (0.0 %) NEVER TRIED!
--- Successful Requests ---
om0: 5 successes (0.1 %) LEADER
om1: 0 successes (0.0 %)
om2: 5000 successes (99.9 %) LEADER

@greenwich greenwich changed the title HDDS-14212 Make failoverCount in GrpcOmTransport to be per request HDDS-14212 Make maxFailovers in S3g apply on a per-request basis. Dec 23, 2025
@adoroszlai adoroszlai requested a review from ChenSammi December 23, 2025 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant