fix(JDBC): Add retry condition for acquisition timeout error #3566
fivetran-ashokborra wants to merge 1 commit into apache:main
Conversation
singhpk234 left a comment
LGTM thanks @fivetran-ashokborra !
```diff
 return e.getMessage().toLowerCase(Locale.ROOT).contains("connection refused")
-    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset");
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("connection reset")
+    || e.getMessage().toLowerCase(Locale.ROOT).contains("acquisition timeout");
```
Could it be preferable to increase the acquisition timeout?
Re-trying with a short timeout is conceptually the same as waiting longer on the original attempt, but the code is simpler 🤔
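For reference, in a Quarkus-based deployment (which Polaris uses), raising Agroal's acquisition timeout would be a one-line datasource setting; the 10-second value below is purely illustrative, not a recommendation:

```properties
# Wait up to 10s for a pooled connection before failing
# (the "5000 milliseconds" in the stack trace suggests the 5s default).
quarkus.datasource.jdbc.acquisition-timeout=10S
```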
We can definitely increase the acquisition timeout. But under load, a retry gives each request multiple opportunities (up to the max retries) to grab a connection as one becomes available, rather than a single long wait.
Sorry, I'm not sure I understand... I assume the "acquisition timeout" exception comes from the connection pool, right? So with a longer timeout the pool should be able to acquire and return a connection as soon as one becomes available, without going through an exception and retry... WDYT?
While I'm aligned with increasing the timeout to a higher value, I think that would make sense for deterministic scenarios - like waiting for a specific connection. In the case of connection pool, we are not waiting for a specific connection, but for any of the available connections. Having a retry would make it resilient.
@fivetran-ashokborra : could you post the full stack trace of the "acquisition timeout" exception involved in this case?
So, if the timeout were longer, io.agroal.pool.ConnectionPool.getConnection() should immediately return the next available connection when it becomes available... even without retries.
I'm trying to understand what can be more efficient with retries in this situation 🤔 WDYT?
I tend to agree with @dimas-b here.
Increasing the acquisition timeout seems better, because it simply means each thread will stay longer on the waiting line during spikes, but they will eventually get a connection. This is efficient and preserves queue fairness.
Instead, if you keep the timeout short and implement a retry, threads will get kicked out of the waiting line during spikes. They will throw an exception (which is expensive), and then try to re-enter the queue. This effectively turns your queuing system into a polling system. Threads lose their original position on the line, which hurts queue fairness.
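The queuing-vs-polling argument can be sketched with a fair `java.util.concurrent.Semaphore` standing in for the pool's wait queue (a simplified model for illustration, not Agroal's actual implementation): a single long `tryAcquire` keeps the thread's position in the queue, while short timeouts plus retries make the thread drop out and rejoin at the back.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class AcquisitionDemo {

    // One long wait: the thread keeps its position in the fair queue and
    // returns as soon as a permit ("connection") is released.
    static boolean longWait(Semaphore pool, long timeoutMs) throws InterruptedException {
        return pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Short timeout plus retries: every failed attempt drops the thread out of
    // the queue, and the next attempt joins at the back -- effectively polling.
    static boolean shortWaitWithRetries(Semaphore pool, long timeoutMs, int maxRetries)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore pool = new Semaphore(1, true); // 1 "connection", fair queue
        pool.acquire(); // pool exhausted

        // Another thread frees the connection after 100 ms.
        new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
            }
            pool.release();
        }).start();

        // A single 500 ms wait picks the permit up as soon as it is released.
        System.out.println("long wait acquired: " + longWait(pool, 500));
    }
}
```

With many competing threads the retry variant additionally pays for exception construction and loses FIFO ordering, which is the fairness cost described above.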
That's a fair point @adutra. Retry does look like an anti-pattern from a queue-fairness perspective. I made the changes from a resiliency perspective: a retry might eventually secure a connection instead of failing on a short timeout.
@dimas-b, yes, retry would add some overhead. And if the connection pool is saturated for a prolonged period, a longer timeout would result in more threads piling up in memory.
Maybe these changes are solving the wrong problem, and we should be adjusting the max pool size instead. Nevertheless, this situation can happen.
Agree that retrying on an 'acquisition timeout' leads to unfair request execution, which may or may not be desirable.
I wonder whether it is a good strategy to further increase the acquisition timeout.
In such situations the connection pool is already exhausted or, in other retry cases, the database is in a "bad shape".
Retrying or waiting longer lets requests pile up in the Polaris service, leading to additional resource usage by waiting threads holding resources.
User/client requests might have already been aborted, but the threads in the service still occupy resources, meaning that newer client requests are stalled because earlier requests haven't finished.
This can easily lead to a situation where the Polaris service unnecessarily appears unresponsive for clients.
The situation gets worse with pieces that add significant additional load against the database, like Polaris events or the idempotency-keys proposal.
Instead of increasing the acquisition timeout, operators should consider adjusting the connection pool size to the load. Or shed load by limiting the number of concurrent requests via the ingress.
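Sizing the pool to the load, as suggested above, is likewise a single datasource setting in a Quarkus-based deployment; the value below is illustrative, not a recommendation:

```properties
# Allow more pooled JDBC connections (tune to the database's actual capacity).
quarkus.datasource.jdbc.max-size=40
```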
Polaris' RateLimiterFilter does not help here; it actually makes the situation worse, because rate-limited events are persisted via JDBC and therefore themselves require a usable JDBC connection. This leads to blocked servlet request filter instances, which I am not sure is a good idea.
+1 to adjusting the connection pool size
Description
When the Agroal connection pool is exhausted, requests fail with "Acquisition timeout while waiting for new connection". This error is often transient during traffic spikes and should be retried.
Added "acquisition timeout" to the list of retryable error conditions in the DatasourceOperations.isRetryable() method.
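A minimal sketch of the resulting predicate, assuming message-based matching as shown in the diff above (the surrounding class and method shape here are illustrative, not the actual DatasourceOperations code):

```java
import java.sql.SQLException;
import java.util.Locale;

public class RetryCheck {

    // Hypothetical stand-in for DatasourceOperations.isRetryable(): treats
    // connection-level failures, now including pool acquisition timeouts,
    // as transient and therefore retryable.
    static boolean isRetryable(SQLException e) {
        if (e.getMessage() == null) {
            return false;
        }
        String msg = e.getMessage().toLowerCase(Locale.ROOT);
        return msg.contains("connection refused")
            || msg.contains("connection reset")
            || msg.contains("acquisition timeout");
    }
}
```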
Stack trace:
```
Caused by: java.sql.SQLException: Failed due to 'Acquisition timeout while waiting for new connection' (error code 0, sql-state 'null'), after 1 attempts and 5000 milliseconds
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:300)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelectOverStream(DatasourceOperations.java:153)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.executeSelect(DatasourceOperations.java:134)
    at org.apache.polaris.persistence.relational.jdbc.JdbcBasePersistenceImpl.lookupEntities(JdbcBasePersistenceImpl.java:399)
    ... 68 more
Caused by: java.sql.SQLException: Acquisition timeout while waiting for new connection
    at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:367)
    at io.agroal.pool.ConnectionPool.getConnection(ConnectionPool.java:293)
    at io.agroal.pool.DataSource.getConnection(DataSource.java:86)
    at io.agroal.api.AgroalDataSource_sqqLi56D50iCdXmOjyjPSAxbLu0_Synthetic_ClientProxy.getConnection(Unknown Source)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.borrowConnection(DatasourceOperations.java:339)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.lambda$executeSelectOverStream$2(DatasourceOperations.java:156)
    at org.apache.polaris.persistence.relational.jdbc.DatasourceOperations.withRetries(DatasourceOperations.java:271)
    ... 71 more
Caused by: java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at io.agroal.pool.ConnectionPool.handlerFromSharedCache(ConnectionPool.java:344)
    ... 77 more
```
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)