Core: Make sequence number conflicts retryable in concurrent commits #15126

agnes-xinyi-lu · 2026-01-23T18:52:37Z

When multiple processes concurrently commit to different branches of the
same table through the REST catalog, sequence number validation failures
in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
instead of retryable CommitFailedException.
This fix catches the sequence number validation error in CatalogHandlers.commit()
and wraps it in ValidationFailureException(CommitFailedException) to:
- Skip server-side retry (which won't help since sequence number is in the request)
- Return CommitFailedException to the client so it can retry with refreshed metadata
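
The client-side half of this behaviour can be illustrated with a small, self-contained sketch. It does not use Iceberg classes: `DemoCommitFailedException`, `DemoServer`, and `DemoClient` are hypothetical stand-ins, and the rule "reject any sequence number not greater than the last one" is a simplification assumed for illustration. It only shows why a server-side retry cannot help (the sequence number is fixed in the request) while a client retry with refreshed metadata can.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Iceberg classes.
class DemoCommitFailedException extends RuntimeException {
  DemoCommitFailedException(String message) {
    super(message);
  }
}

class DemoServer {
  private long lastSequenceNumber = 0;

  synchronized long lastSequenceNumber() {
    return lastSequenceNumber;
  }

  // The sequence number arrives fixed in the request, so the server can only
  // accept or reject it; re-running the same request would fail the same way.
  synchronized void commit(long snapshotSequenceNumber) {
    if (snapshotSequenceNumber <= lastSequenceNumber) {
      throw new DemoCommitFailedException(
          "Stale sequence number " + snapshotSequenceNumber
              + ", last committed is " + lastSequenceNumber);
    }
    lastSequenceNumber = snapshotSequenceNumber;
  }
}

class DemoClient {
  // Returns the number of attempts used. staleBase simulates metadata that was
  // loaded before another writer committed to a different branch.
  static int commitWithRetry(DemoServer server, long staleBase, int maxAttempts) {
    long base = staleBase;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        server.commit(base + 1); // the client assigns the sequence number
        return attempt;
      } catch (DemoCommitFailedException e) {
        base = server.lastSequenceNumber(); // refresh metadata, then retry
      }
    }
    throw new DemoCommitFailedException("Gave up after " + maxAttempts + " attempts");
  }
}
```

In this sketch, if another writer has already committed sequence number 1 and the client starts from a stale base of 0, the first attempt is rejected and the second, made after the refresh, succeeds.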

Issue #15001

core/src/test/java/org/apache/iceberg/rest/TestRestCatalogConcurrentWrites.java

singhpk234 · 2026-01-26T05:29:42Z

core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java

+                  request.updates().forEach(update -> update.applyTo(metadataBuilder));
+                } catch (ValidationException e) {
+                  // Sequence number conflicts from concurrent commits are retryable by the client,
+                  // but server-side retry won't help since the sequence number is in the request.


This is an interesting point! Since the snapshot object is created on the client and sent to the server, the sequence number is locked in and the server can't do much, so failing fast seems reasonable.

I wonder if we can refactor / introduce some other mechanism rather than relying on exception message text.

Thanks for the review, @singhpk234!
Checking the exception message is not an uncommon pattern within the Iceberg repo; it helps target particular scenarios that are thrown as a more generic exception type. Refactoring the exception itself would require a TableMetadata change, which increases risk.
I'm trying to minimize the change needed to fix this issue, per my understanding of the comment on the issue. My original idea was to add an UpdateRequirement to the spec for this assertion.
Any thoughts?

I agree with @singhpk234 here. It's a bit brittle to rely on the message in the exception. A lot of the cases where that's done in the code base are at integration points with other libraries/systems like JDBC/Hive, where there isn't a better exception provided. Here it's all Iceberg core code, where there can be other use cases where this behavior is desirable, and we have an opportunity to do it cleanly.

In that case, I think what @rdblue suggested in the issue makes a lot of sense: we could define a RetryableValidationException and throw it instead of the ValidationException in addSnapshot when performing the sequence number comparison. Here in the catalog handlers, we'd then throw a ValidationFailureException wrapping a CommitFailedException to break out of the server-side retry and kick the CommitFailedException back to the client to retry.

I'm not too sure it's an additional "risk" to define a new exception and throw it in a very specific case.
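
A rough sketch of the shape being proposed, written with hypothetical stand-in classes rather than the real Iceberg ones. The `Sketch*` names, the assumption that the retryable exception extends the plain validation exception, and the simplified `addSnapshot` signature are all illustrative assumptions, not the merged implementation:

```java
// Hypothetical stand-ins; names mirror the discussion but are NOT the Iceberg classes.
class SketchValidationException extends RuntimeException {
  SketchValidationException(String message) {
    super(message);
  }
}

// Assumed here to be a subtype, so existing catch blocks for the parent keep working.
class SketchRetryableValidationException extends SketchValidationException {
  SketchRetryableValidationException(String message) {
    super(message);
  }
}

class SketchCommitFailedException extends RuntimeException {
  SketchCommitFailedException(String message) {
    super(message);
  }
}

// Wrapper that tells the server-side retry loop to stop and surface the cause.
class SketchValidationFailureException extends RuntimeException {
  SketchValidationFailureException(RuntimeException cause) {
    super(cause);
  }
}

class SketchHandlers {
  // Simplified stand-in for the sequence number comparison in addSnapshot.
  static void addSnapshot(long snapshotSequenceNumber, long lastSequenceNumber) {
    if (snapshotSequenceNumber <= lastSequenceNumber) {
      throw new SketchRetryableValidationException(
          "Cannot add snapshot with sequence number " + snapshotSequenceNumber
              + " older than last sequence number " + lastSequenceNumber);
    }
  }

  // The handler matches on the exception type, not on the message text.
  static void commit(long snapshotSequenceNumber, long lastSequenceNumber) {
    try {
      addSnapshot(snapshotSequenceNumber, lastSequenceNumber);
    } catch (SketchRetryableValidationException e) {
      throw new SketchValidationFailureException(
          new SketchCommitFailedException(e.getMessage()));
    }
  }
}
```

The point of the design is that the catch clause keys on a dedicated type, so other validation failures keep their non-retryable behaviour and no string matching is needed.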

Thanks, @amogh-jahagirdar. I don't have a strong opinion on how to implement the fix. My goal is to get a solution in ASAP because it's a hard blocker for our REST catalog migration.
If we all agree on adding a wrapping exception and throwing it from addSnapshot, I'm happy to implement that as well.

@singhpk234 @amogh-jahagirdar I've updated the PR with the new solution, please take a look, thanks!

amogh-jahagirdar · 2026-01-28T05:07:47Z

Thanks, @agnes-xinyi-lu, I will take a look with fresh eyes tomorrow.

amogh-jahagirdar · 2026-01-29T21:43:06Z

core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java

+    // Verify the fix: with AssertLastSequenceNumber, there should be NO validation failures
+    // All concurrent conflicts should be caught as CommitFailedException (retryable)
+    assertThat(validationFailureCount.get())
+        .as(
+            "With the fix, sequence number conflicts should be caught by AssertLastSequenceNumber "
+                + "and throw CommitFailedException (retryable), not ValidationException")
+        .isEqualTo(0);
+
+    // At least some should succeed (commits that don't conflict or succeed after retry)
+    assertThat(successCount.get()).as("At least some appends should succeed").isGreaterThan(0);


I think we should aim to give this test stronger assertions. I think we could use an AtomicInteger barrier to synchronize different rounds of commits and deterministically cause conflicts. At the end, we'd be able to have a deterministic number of failures (I'd probably organize it so the barrier causes 1 conflict per branch per round?). Check out https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/jdbc/TestJdbcTableConcurrency.java#L130 for another example of this pattern.
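
As a minimal sketch of the barrier idea, assuming a plain compare-and-set counter as a stand-in for the table's sequence number (this is not Iceberg test code, and `BarrierConflictSketch` is a hypothetical name): every thread reads the same base value, waits on an AtomicInteger barrier until all threads have read, and only then attempts its "commit". With that ordering, each round yields exactly one success and `threads - 1` conflicts.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class BarrierConflictSketch {
  // Returns {successes, conflicts} after running the given number of rounds.
  static int[] run(int threads, int rounds) {
    AtomicLong version = new AtomicLong(0); // stand-in for the last sequence number
    AtomicInteger readBarrier = new AtomicInteger(0);
    AtomicInteger doneBarrier = new AtomicInteger(0);
    AtomicInteger successes = new AtomicInteger(0);
    AtomicInteger conflicts = new AtomicInteger(0);

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int round = 1; round <= rounds; round++) {
          long base = version.get(); // every thread sees the same base this round
          readBarrier.incrementAndGet();
          while (readBarrier.get() < round * threads) {
            Thread.yield(); // wait until all threads have read the base
          }
          if (version.compareAndSet(base, base + 1)) {
            successes.incrementAndGet();
          } else {
            conflicts.incrementAndGet(); // a real test would refresh and retry here
          }
          doneBarrier.incrementAndGet();
          while (doneBarrier.get() < round * threads) {
            Thread.yield(); // wait until all attempts of this round are finished
          }
        }
      });
    }

    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("Workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new int[] {successes.get(), conflicts.get()};
  }
}
```

Here the barrier sits between the read and the commit attempt, which is straightforward because the sketch controls both steps. In a real table commit the refresh happens inside the commit path, so an equivalent barrier would need a hook at that point.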

I understand the ask. I updated the test to verify the total number of commits and to make sure there are failures/conflicts. For the total number of conflicts, it's hard to get a deterministic number (please let me know if you have an easy way, I'm happy to implement it). The main problem is that table.appendFiles.commit is not an atomic operation: it refreshes TableMetadata in SnapshotProducer. To create a deterministic conflict, we would need to put a barrier there or in TableOperations to make sure every thread gets exactly the same base TableMetadata.
I think the purpose of this test is to verify that conflicts can occur during commit when multiple threads commit to non-current branches at the same time, and that the fix guarantees the resulting exception is retryable.

core/src/main/java/org/apache/iceberg/RetryableValidationException.java

github-actions bot added the core label Jan 23, 2026

agnes-xinyi-lu mentioned this pull request Jan 23, 2026

[iceberg-core]Rest Catalog Concurrent Commit Race Condition #15001

Open

3 tasks

agnes-xinyi-lu force-pushed the xinyilu/fix2 branch from 38e36ca to 59a899f Compare January 23, 2026 20:46

singhpk234 reviewed Jan 26, 2026

singhpk234 requested review from amogh-jahagirdar and nastra January 26, 2026 05:30

agnes-xinyi-lu force-pushed the xinyilu/fix2 branch from 5f76559 to b245667 Compare January 26, 2026 22:27

amogh-jahagirdar reviewed Jan 29, 2026

agnes-xinyi-lu added 3 commits January 29, 2026 21:38

move test to TestRestCatalog class

b14cbfe

add RetryableValidationException

5ace567

agnes-xinyi-lu force-pushed the xinyilu/fix2 branch from b245667 to 5ace567 Compare January 30, 2026 17:07

github-actions bot added the API label Jan 30, 2026

fix checkstyle error

8732caf

amogh-jahagirdar reviewed Jan 30, 2026

core/src/main/java/org/apache/iceberg/RetryableValidationException.java

move validation class to core

0f65a65

github-actions bot removed the API label Jan 30, 2026
