fix(JDBC): Add retry condition for acquisition timeout error #3566
fivetran-ashokborra wants to merge 1 commit into apache:main
Conversation
singhpk234 left a comment
LGTM thanks @fivetran-ashokborra !
```diff
 return e.getMessage().toLowerCase(Locale.ROOT).contains("connection refused")
-    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset");
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset")
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("acquisition timeout");
```
Could it be preferable to increase the acquisition timeout?
Re-trying with a short timeout is conceptually the same as waiting longer on the original attempt, but the code is simpler 🤔
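For reference, in a Quarkus-based deployment (which Polaris uses), raising Agroal's acquisition timeout would be a one-line datasource setting; the 10-second value below is purely illustrative, not a recommendation:

```properties
# Wait up to 10s for a pooled connection before failing
# (the "5000 milliseconds" in the stack trace suggests the 5s default).
quarkus.datasource.jdbc.acquisition-timeout=10S
```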
We can definitely increase the acquisition timeout. But under load, a retry gives each request multiple opportunities (up to the max retries) to grab a connection as one becomes available, rather than a single long wait.
Sorry, I'm not sure I understand... I assume the "acquisition timeout" exception comes from the connection pool, right? So with a longer timeout the pool should be able to acquire and return a connection as soon as one becomes available, without going through an exception and retry... WDYT?
While I'm aligned with increasing the timeout to a higher value, I think that would make sense for deterministic scenarios - like waiting for a specific connection. In the case of connection pool, we are not waiting for a specific connection, but for any of the available connections. Having a retry would make it resilient.
@fivetran-ashokborra : could you post the full stack trace of the "acquisition timeout" exception involved in this case?
So, if the timeout were longer, io.agroal.pool.ConnectionPool.getConnection() should immediately return the next available connection when it becomes available... even without retries.
I'm trying to understand what can be more efficient with retries in this situation 🤔 WDYT?
I tend to agree with @dimas-b here.
Increasing the acquisition timeout seems better, because it simply means each thread will stay longer on the waiting line during spikes, but they will eventually get a connection. This is efficient and preserves queue fairness.
Instead, if you keep the timeout short and implement a retry, threads will get kicked out of the waiting line during spikes. They will throw an exception (which is expensive), and then try to re-enter the queue. This effectively turns your queuing system into a polling system. Threads lose their original position on the line, which hurts queue fairness.
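The queuing-vs-polling argument can be sketched with a fair `java.util.concurrent.Semaphore` standing in for the pool's wait queue (a simplified model for illustration, not Agroal's actual implementation): a single long `tryAcquire` keeps the thread's position in the queue, while short timeouts plus retries make the thread drop out and rejoin at the back.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class AcquisitionDemo {

    // One long wait: the thread keeps its position in the fair queue and
    // returns as soon as a permit ("connection") is released.
    static boolean longWait(Semaphore pool, long timeoutMs) throws InterruptedException {
        return pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Short timeout plus retries: every failed attempt drops the thread out of
    // the queue, and the next attempt joins at the back -- effectively polling.
    static boolean shortWaitWithRetries(Semaphore pool, long timeoutMs, int maxRetries)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore pool = new Semaphore(1, true); // 1 "connection", fair queue
        pool.acquire(); // pool exhausted

        // Another thread frees the connection after 100 ms.
        new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
            }
            pool.release();
        }).start();

        // A single 500 ms wait picks the permit up as soon as it is released.
        System.out.println("long wait acquired: " + longWait(pool, 500));
    }
}
```

With many competing threads the retry variant additionally pays for exception construction and loses FIFO ordering, which is the fairness cost described above.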
That's a fair point @adutra. Retry does look like an anti-pattern from a queue-fairness perspective. I made the changes from a resiliency perspective: a retry might eventually secure a connection instead of failing on a short timeout.
@dimas-b, yes, retry would add some overhead. And if the connection pool is saturated for a prolonged period, a longer timeout would result in more threads piling up in memory.
Maybe these changes are solving the wrong problem, and we should be adjusting the max pool size instead. Nevertheless, this situation can happen.
Agree that retrying on an 'acquisition timeout' leads to unfair request execution, which may or may not be desirable.
I wonder whether it is a good strategy to further increase the acquisition timeout.
In such situations the connection pool is already exhausted or, in other retry cases, the database is in a "bad shape".
Retrying or waiting longer lets requests pile up in the Polaris service, leading to additional resource usage by waiting threads holding resources.
User/client requests might have already been aborted, but the threads in the service still occupy resources, meaning that newer client requests are stalled because earlier requests haven't finished.
This can easily lead to a situation where the Polaris service unnecessarily appears unresponsive for clients.
The situation gets worse with pieces that add significant additional load against the database, like Polaris events or the idempotency-keys proposal.
Instead of increasing the acquisition timeout, operators should consider adjusting the connection pool size to the load. Or shed load by limiting the number of concurrent requests via the ingress.
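Sizing the pool to the load, as suggested above, is likewise a single datasource setting in a Quarkus-based deployment; the value below is illustrative, not a recommendation:

```properties
# Allow more pooled JDBC connections (tune to the database's actual capacity).
quarkus.datasource.jdbc.max-size=40
```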
Polaris' RateLimiterFilter does not help here; it actually makes the situation worse, because rate-limited events are persisted via JDBC and therefore themselves require a usable JDBC connection. This leads to blocked servlet request filter instances, which I am not sure is a good idea.
+1 to adjusting the connection pool size
Description
When the Agroal connection pool is exhausted, requests fail with "Acquisition timeout while waiting for new connection". This error is often transient during traffic spikes and should be retried.
Added "acquisition timeout" to the list of retryable error conditions in the DatasourceOperations.isRetryable() method.
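A minimal sketch of the resulting predicate, assuming message-based matching as shown in the diff above (the surrounding class and method shape here are illustrative, not the actual DatasourceOperations code):

```java
import java.sql.SQLException;
import java.util.Locale;

public class RetryCheck {

    // Hypothetical stand-in for DatasourceOperations.isRetryable(): treats
    // connection-level failures, now including pool acquisition timeouts,
    // as transient and therefore retryable.
    static boolean isRetryable(SQLException e) {
        if (e.getMessage() == null) {
            return false;
        }
        String msg = e.getMessage().toLowerCase(Locale.ROOT);
        return msg.contains("connection refused")
            || msg.contains("connection reset")
            || msg.contains("acquisition timeout");
    }
}
```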
Stack trace:
```
Caused by: java.sql.SQLException: Failed due to 'Acquisition timeout while waiting for new connection' (error code 0, sql-state 'null'), after 1 attempts and 5000 milliseconds
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:300)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelectOverStream(DatasourceOperations.java:153)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelect(DatasourceOperations.java:134)
    at org.apache.polaris.persistence.relational.jdbc.JdbcBasePersistenceImpl.lookupEntities(JdbcBasePersistenceImpl.java:399)
    ... 68 more
Caused by: java.sql.SQLException: Acquisition timeout while waiting for new connection
    at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:367)
    at io.agroal.pool.ConnectionPool.getConnection(ConnectionPool.java:293)
    at io.agroal.pool.DataSource.getConnection(DataSource.java:86)
    at io.agroal.api.AgroalDataSource_sqqLi56D50iCdXmOjyjPSAxbLu0_Synthetic_ClientProxy.getConnection(Unknown Source)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.borrowConnection(DatasourceOperations.java:339)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.lambda$executeSelectOverStream$2(DatasourceOperations.java:156)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:271)
    ... 71 more
Caused by: java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:344)
    ... 77 more
```
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)