
fix(JDBC): Add retry condition for acquisition timeout error#3566

Open
fivetran-ashokborra wants to merge 1 commit into apache:main from fivetran-ashokborra:jdbc-retry-connection-pool-error

Conversation

@fivetran-ashokborra (Contributor) commented Jan 27, 2026

Description

When the Agroal connection pool is exhausted, requests fail with "Acquisition timeout while waiting for new connection". This error is often transient during traffic spikes and should be retried.

Added "acquisition timeout" to the list of retryable error conditions in DatasourceOperations.isRetryable() method.

Stack trace:

```
Caused by: java.sql.SQLException: Failed due to 'Acquisition timeout while waiting for new connection' (error code 0, sql-state 'null'), after 1 attempts and 5000 milliseconds
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:300)
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelectOverStream(DatasourceOperations.java:153)
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelect(DatasourceOperations.java:134)
	at org.apache.polaris.persistence.relational.jdbc.JdbcBasePersistenceImpl.lookupEntities(JdbcBasePersistenceImpl.java:399)
	... 68 more
Caused by: java.sql.SQLException: Acquisition timeout while waiting for new connection
	at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:367)
	at io.agroal.pool.ConnectionPool.getConnection(ConnectionPool.java:293)
	at io.agroal.pool.DataSource.getConnection(DataSource.java:86)
	at io.agroal.api.AgroalDataSource_sqqLi56D50iCdXmOjyjPSAxbLu0_Synthetic_ClientProxy.getConnection(Unknown Source)
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.borrowConnection(DatasourceOperations.java:339)
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.lambda$executeSelectOverStream$2(DatasourceOperations.java:156)
	at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:271)
	... 71 more
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
	at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:344)
	... 77 more
```

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

@singhpk234 (Contributor) left a comment


LGTM thanks @fivetran-ashokborra !

github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jan 27, 2026
@singhpk234 requested a review from dimas-b January 27, 2026 16:15
```diff
 return e.getMessage().toLowerCase(Locale.ROOT).contains("connection refused")
-    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset");
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset")
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("acquisition timeout");
```

Contributor

Could it be preferable to increase the acquisition timeout?

Re-trying with a short timeout is conceptually the same as waiting longer on the original attempt, but the code is simpler 🤔
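
(For context, a hedged sketch of what raising the acquisition timeout could look like in a Quarkus/Agroal-based deployment; the property name and value are assumptions about the deployment configuration and are not part of this PR.)

```properties
# Assumed Quarkus/Agroal datasource setting, shown only for illustration;
# verify the exact property name and pick a value suited to the deployment.
quarkus.datasource.jdbc.acquisition-timeout=10S
```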

Contributor Author

We can definitely increase the acquisition timeout. But under load, retrying gives multiple opportunities (up to the max retries) to grab a connection as one becomes available, rather than relying on a single long wait.

Contributor

Sorry, I'm not sure I understand... I assume the "acquisition timeout" exception comes from the connection pool, right? So with a longer timeout the pool should be able to acquire and return a connection as soon as it becomes available without going through an exception and retry... WDYT?

@fivetran-ashokborra (Contributor Author) commented Jan 28, 2026

While I'm aligned with increasing the timeout to a higher value, I think that makes sense for deterministic scenarios, like waiting for a specific connection. With a connection pool, we are not waiting for a specific connection but for any available connection. Having a retry would make it more resilient.

Contributor

@fivetran-ashokborra : could you post the full stack trace of the "acquisition timeout" exception involved in this case?

Contributor

So, if the timeout were longer, io.agroal.pool.ConnectionPool.getConnection() should return the next available connection as soon as one becomes available... even without retries.

I'm trying to understand what can be more efficient with retries in this situation 🤔 WDYT?

Contributor

I tend to agree with @dimas-b here.

Increasing the acquisition timeout seems better, because it simply means each thread will stay longer on the waiting line during spikes, but they will eventually get a connection. This is efficient and preserves queue fairness.

Instead, if you keep the timeout short and implement a retry, threads will get kicked out of the waiting line during spikes. They will throw an exception (which is expensive), and then try to re-enter the queue. This effectively turns your queuing system into a polling system. Threads lose their original position on the line, which hurts queue fairness.

Contributor Author

That's a fair point @adutra. Retrying does look like an anti-pattern from a queue-fairness perspective. I made the change from a resiliency perspective: a retry gives the request another chance at a connection instead of failing on a short timeout.

@dimas-b, Yes, retry would add some overhead. If the connection pool is saturated for a prolonged period, a longer timeout would result in more threads piling up in memory.

Maybe these changes are solving the wrong problem, and adjusting the max pool size is the right fix. Nevertheless, this situation can happen.

Member

Agreed, retrying on an 'acquisition timeout' leads to unfair request execution, which may or may not be desirable.

I wonder whether it is a good strategy to further increase the acquisition timeout.

In such situations the connection pool is already exhausted or, in other retry cases, the database is in a "bad shape".

Retrying or waiting longer lets requests pile up in the Polaris service, leading to additional resource usage by waiting threads holding resources.

User/client requests might have already been aborted, but the threads in the service still occupy resources, meaning that newer client requests are stalled because earlier requests haven't finished.
This can easily lead to a situation where the Polaris service unnecessarily appears unresponsive for clients.

The situation gets worse with pieces that add significant additional load against the database, like Polaris events or the idempotency-keys proposal.

Instead of increasing the acquisition timeout, operators should consider adjusting the connection pool size to the load. Or shed load by limiting the number of concurrent requests via the ingress.

Polaris' RateLimiterFilter does not help; it actually makes the situation worse, because rate-limited events are persisted via JDBC and therefore require a usable JDBC connection. This leads to blocked servlet request filter instances, which I am not sure is a good idea.
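
(As a hedged illustration of the pool-size tuning suggested above: the property names are assumptions about a Quarkus/Agroal-based deployment and the values are placeholders, not recommendations from this thread.)

```properties
# Assumed Quarkus/Agroal datasource settings, shown only for illustration;
# size the pool to the database's capacity and the observed concurrency.
quarkus.datasource.jdbc.min-size=8
quarkus.datasource.jdbc.max-size=32
```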

Contributor

+1 to adjusting the connection pool size

