StFS commented Jan 18, 2025

This is a DRAFT PR to add a Redis-backed command router.

It consists of the following changes:

  • Adds a client-device-connection-redis module that implements a device-connection-info repository using the Quarkus Redis Client extension.
  • Adds a services/command-router-redis module that creates a command-router-redis Docker image.
  • Adds the necessary configuration to the integration test project to start up a Redis server for the new Redis-backed Command Router service.

There are some recently discovered problems, though: the CommandAndControlAmqpIT and CommandAndControlMqttIT tests fail when run against the Redis-backed command router service. There also seems to be a problem with the Redis configuration: the config interceptor that is supposed to "redirect" hono.commandRouter.cache.redis.* config options to the quarkus.redis.* namespace does not seem to be working, so it is currently necessary to set the quarkus.redis.hosts property explicitly.
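
For reference, here is a minimal sketch (not the code from this PR; the class name and registration details are made up) of how such a relocation can be expressed with SmallRye Config's RelocateConfigSourceInterceptor, mapping a requested quarkus.redis.* name onto the corresponding hono.commandRouter.cache.redis.* property:

import io.smallrye.config.RelocateConfigSourceInterceptor;

/**
 * Hypothetical sketch: when a quarkus.redis.* property is looked up, the value of the
 * corresponding hono.commandRouter.cache.redis.* property is returned (if present), so
 * that values set under the Hono namespace are picked up by the Quarkus Redis client.
 * The interceptor still has to be registered with SmallRye Config, e.g. via a
 * META-INF/services/io.smallrye.config.ConfigSourceInterceptor entry.
 */
public class RedisConfigRelocationSketch extends RelocateConfigSourceInterceptor {

    private static final String QUARKUS_PREFIX = "quarkus.redis.";
    private static final String HONO_PREFIX = "hono.commandRouter.cache.redis.";

    public RedisConfigRelocationSketch() {
        super(name -> name.startsWith(QUARKUS_PREFIX)
                ? HONO_PREFIX + name.substring(QUARKUS_PREFIX.length())
                : name);
    }
}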

StFS and others added 2 commits January 17, 2025 23:57
…redis module and a services/command-router-redis module. Also add appropriate configurations to integration tests to run them against a Redis backed command router with a Redis server

StFS commented Jan 18, 2025

Addresses eclipse-hono#3532


StFS commented Jan 18, 2025

This comment is a reminder to myself.

To run the integration tests against the Redis-backed command router:

mvn clean verify -Prun-tests,redis_cache

Currently I'm trying to run only the tests that fail, but they only seem to fail when I run the whole integration test suite; if I limit the run to just those tests, they pass. Here's the command I'm using to run only the failing tests:

mvn clean verify -Prun-tests,redis_cache -Dit.test=CommandAndControlAmqpIT -Dit.test=CommandAndControlMqttIT

So what I'm going to try now is to run the tests once using the -Ddocker.keepRunning option and then repeatedly run the tests that fail against the containers that were started until a run fails (if that happens). Here are the commands I'm using for this:

# Run once to start all the containers
mvn clean verify -Prun-tests,redis_cache -Dit.test=CommandAndControlAmqpIT -Dit.test=CommandAndControlMqttIT -Ddocker.keepRunning

# Run the tests repeatedly until a run fails
while mvn verify -Prun-tests,redis_cache,useRunningContainers -Dit.test=CommandAndControlAmqpIT -Dit.test=CommandAndControlMqttIT ; do :; done

Finally, you shut down the running containers with:

mvn verify -PstopContainers


StFS commented Jan 18, 2025

I have verified that, when I run the sequence described above, the tests in the while loop will eventually fail with a timeout error (on the test-run side). The command router service logs contain some warnings related to the Kafka client. These warnings also seem to appear in the Infinispan command router when I run the tests against that, but there they do not result in the same timeout error. However, when the tests error out against the Redis command router, I found the following error in its logs, which does not appear in the logs of the Infinispan-based command router:

2025-01-18 03:10:32,855 WARN  [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | event-bde4a74d-3f49-4e1c-b1e8-f1585409253e) [Producer clientId=event-bde4a74d-3f49-4e1c-b1e8-f1585409253e] Error while fetching metadata with correlation id 193 : {hono.event.a530c0a3-2ca0-4055-b360-02c8b892f387=UNKNOWN_TOPIC_OR_PARTITION, hono.event.1cd1fb42-bb19-45e8-8e55-41e335b36deb=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a44a8fb0-3e9b-4359-a4a4-502ec57bf3b4=UNKNOWN_TOPIC_OR_PARTITION, hono.event.ef07c6f2-2fb7-4ff3-93af-9f8b27906b49=UNKNOWN_TOPIC_OR_PARTITION, hono.event.942dea39-14e4-47bf-9cdf-d4cd83cd48b5=UNKNOWN_TOPIC_OR_PARTITION, hono.event.febe7482-629e-487f-8b79-3625db931d9e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.9721c210-c90a-40a0-80c9-8b5f5ef10589=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b81bf5d2-b9e5-40a9-9df4-3cbc5e24c5b5=UNKNOWN_TOPIC_OR_PARTITION, hono.event.14ac4593-fa4f-4d5d-a1d5-0d7c272d6033=UNKNOWN_TOPIC_OR_PARTITION, hono.event.d12ff377-565a-404a-a1b1-40aece6a089e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.44cf0462-e3f4-4256-8124-b3cde64999ac=UNKNOWN_TOPIC_OR_PARTITION, hono.event.5b4e4805-e8f7-4180-9b8b-fb9b513a43f0=UNKNOWN_TOPIC_OR_PARTITION, hono.event.2b51dfc8-6b00-4f64-b241-aa15a0e58e5e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0a14698e-a3be-4ac2-99e9-4616e946de0d=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8de73437-63f4-4723-b647-82a3ad0d58b9=UNKNOWN_TOPIC_OR_PARTITION, hono.event.21ad0229-a364-4d9e-9e85-3883e65034a3=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8803b596-dc36-4a64-b4f0-ba01aa9903c2=UNKNOWN_TOPIC_OR_PARTITION, hono.event.9ee42165-e83e-4369-91e2-7978a6b07800=UNKNOWN_TOPIC_OR_PARTITION, hono.event.99ab674f-63e0-4521-890e-2d6fced95d36=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0661fc86-2b08-41f6-9fbf-dd831c6ad5fd=UNKNOWN_TOPIC_OR_PARTITION, hono.event.60160670-f222-4e5f-99fe-7928063d0bba=UNKNOWN_TOPIC_OR_PARTITION, hono.event.2633c7ec-152d-448e-9dd0-a5617b2f40ed=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b9f90aab-dc4c-4b7b-b03e-fb0c0a624839=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0660782c-7317-43b1-a71c-af1c0b39131f=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a838f1dc-0009-4354-aff5-261ebd23eb01=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b1e56267-da24-4fa4-a996-6a5a3aa4ac9b=UNKNOWN_TOPIC_OR_PARTITION, hono.event.d37ca8b0-af8d-4393-8c32-61be503da711=UNKNOWN_TOPIC_OR_PARTITION, hono.event.d8ee36b0-b49c-4849-8510-f922a414c2ea=UNKNOWN_TOPIC_OR_PARTITION, hono.event.4889160b-cf0e-4bed-b987-9f07983affc4=UNKNOWN_TOPIC_OR_PARTITION, hono.event.ff972a2d-f300-41ba-bb1a-c48fbde40163=UNKNOWN_TOPIC_OR_PARTITION, hono.event.1a8afec8-8354-40ea-aabe-367e13452fc0=UNKNOWN_TOPIC_OR_PARTITION, hono.event.51800a66-553b-4b5d-8760-b4b3d2547159=UNKNOWN_TOPIC_OR_PARTITION, hono.event.17c4c8ae-5150-4e23-99ae-4faeee9e0489=UNKNOWN_TOPIC_OR_PARTITION, hono.event.502e0622-bba2-49d3-b5ec-8a4fcca14241=UNKNOWN_TOPIC_OR_PARTITION, hono.event.6b283196-bca9-43a4-a799-ac3fe1c789b3=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0dfaf89d-33e6-4138-b3c6-cd4e6474bf58=UNKNOWN_TOPIC_OR_PARTITION, hono.event.7046706f-127b-461e-b35a-09cfd091ff22=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0ee960c1-b9cf-4dd8-95c6-01d13ac07b39=UNKNOWN_TOPIC_OR_PARTITION, hono.event.bbcd2940-6804-4b6e-bff0-13181104c116=UNKNOWN_TOPIC_OR_PARTITION, hono.event.87b6d19c-89c7-46ca-ae69-8b0ce470c58e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.44966256-0c97-47e8-830d-92f76997b3a0=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a98a395f-2202-4f1d-b84a-749e7e268e95=UNKNOWN_TOPIC_OR_PARTITION, hono.event.48cf99f4-df5b-435a-b252-4e05728b2898=UNKNOWN_TOPIC_OR_PARTITION, 
hono.event.e5e8049a-6a4e-4455-9df1-c685ae3c6427=UNKNOWN_TOPIC_OR_PARTITION, hono.event.136eaf85-4a55-4c6a-94b7-de6d6ec95da3=UNKNOWN_TOPIC_OR_PARTITION, hono.event.917664e5-edac-4f03-be88-98182d198ddc=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b439f69f-eb8e-47dd-a010-ad5faba0fa0e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.fb59448a-6238-4dd8-a517-c9b4f35fbaa0=UNKNOWN_TOPIC_OR_PARTITION, hono.event.5067b2cf-3ddd-4907-bb39-38cc8868ce9e=UNKNOWN_TOPIC_OR_PARTITION}

This might be a red herring though... I have not confirmed that a similar error appears every time the tests error out.


StFS commented Jan 18, 2025

Correction: the error mentioned above does seem to appear every now and then in the Infinispan-based command router as well, but to be clear, those test runs do not error out. So the error found in the command router logs does not seem to be related to the test failures.

For reference, here is a similar error log line from the Infinispan command router that was logged while tests were running successfully:

2025-01-18 03:24:34,549 WARN  [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | event-2dbb1cf3-4f7a-46a3-97bf-53a37b95fd9e) [Producer clientId=event-2dbb1cf3-4f7a-46a3-97bf-53a37b95fd9e] Error while fetching metadata with correlation id 400 : {hono.event.33f6673d-fd5f-4f40-a163-79e475be9639=UNKNOWN_TOPIC_OR_PARTITION, hono.event.1c34ca87-2f55-4fe5-9529-b894d828f8ee=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b9bc2de8-35b8-42d3-aa36-fa830184dd00=UNKNOWN_TOPIC_OR_PARTITION, hono.event.1b830943-a8d6-42f5-b5db-6f63df5915f8=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a7b059d5-149f-4edf-964d-7bc27f00dae2=UNKNOWN_TOPIC_OR_PARTITION, hono.event.dbd5e165-5150-4f1c-aac5-cbed25fc8b79=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8066ccfc-f915-4e4c-a5db-2bdc5327ffdf=UNKNOWN_TOPIC_OR_PARTITION, hono.event.6e620301-fbe7-4815-895d-1bfb8a497d93=UNKNOWN_TOPIC_OR_PARTITION, hono.event.02cf5756-1168-4fe6-940e-ca7b2b249ce1=UNKNOWN_TOPIC_OR_PARTITION, hono.event.59ca1b4f-4610-4f68-a278-e25e300d03ea=UNKNOWN_TOPIC_OR_PARTITION, hono.event.b98fdbb0-dad1-481f-a7f6-03cad74dba34=UNKNOWN_TOPIC_OR_PARTITION, hono.event.c16d2fc0-9159-46db-9eb6-011889e787c8=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8b139e23-7f0f-4636-a6c7-22cf545c7db3=UNKNOWN_TOPIC_OR_PARTITION, hono.event.bd2b0553-4c59-4914-a750-07de11076ee2=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8f5a34bc-2c72-4871-919a-958cc41467f0=UNKNOWN_TOPIC_OR_PARTITION, hono.event.7ab04826-f118-41f6-b3e2-bb2ed6d0ac33=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a8c1efde-44b4-446a-8e4f-ff1293f2afec=UNKNOWN_TOPIC_OR_PARTITION, hono.event.6f0d1596-beba-4edc-97e3-5ef73dfb3b63=UNKNOWN_TOPIC_OR_PARTITION, hono.event.53211385-c399-4599-be0d-be0efe995373=UNKNOWN_TOPIC_OR_PARTITION, hono.event.5744d151-0ab5-4ee5-8f86-47a1f32dedbb=UNKNOWN_TOPIC_OR_PARTITION, hono.event.ba649b07-8f1d-4818-a966-8e094d9ec191=UNKNOWN_TOPIC_OR_PARTITION, hono.event.741bb2ca-6a74-4f30-a7bb-d4953f16302f=UNKNOWN_TOPIC_OR_PARTITION, hono.event.442b0067-e19d-4e9a-ba22-f0e21a908060=UNKNOWN_TOPIC_OR_PARTITION, hono.event.3eb43744-cd05-4f05-8200-b71d3ee48e5f=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8e1c8c92-c797-429e-940b-58070441b3a7=UNKNOWN_TOPIC_OR_PARTITION, hono.event.a68682c1-f6de-47cb-abaa-22017c1ce13d=UNKNOWN_TOPIC_OR_PARTITION, hono.event.591e9c7a-6eec-45e1-9a31-1e06b330f66c=UNKNOWN_TOPIC_OR_PARTITION, hono.event.71398b9f-1407-4e9d-b5b2-b3a75cd4eb80=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0f49b5e6-6196-4a34-a1ed-6e4555c3b447=UNKNOWN_TOPIC_OR_PARTITION, hono.event.6166a5c3-4d98-41da-9ed2-d1f5a595951c=UNKNOWN_TOPIC_OR_PARTITION, hono.event.7caae0e8-0b13-46cc-926c-98f955b8d7de=UNKNOWN_TOPIC_OR_PARTITION, hono.event.d86c9c63-0822-42ea-8f7d-a99d6e18644d=UNKNOWN_TOPIC_OR_PARTITION, hono.event.ff52b378-e18e-40bd-86ee-f657b45a7ed2=UNKNOWN_TOPIC_OR_PARTITION, hono.event.093ef358-c89a-4391-98bd-65a6d9209e52=UNKNOWN_TOPIC_OR_PARTITION, hono.event.f51bbed1-6d96-4084-a065-66aec2cdbc9c=UNKNOWN_TOPIC_OR_PARTITION, hono.event.6d7008b0-3811-48d8-b396-789973bb22db=UNKNOWN_TOPIC_OR_PARTITION, hono.event.2044a15c-0ed7-4afa-928b-7cfde63fd485=UNKNOWN_TOPIC_OR_PARTITION, hono.event.e78eba0d-fa11-47aa-b63e-c9fcaec9537e=UNKNOWN_TOPIC_OR_PARTITION, hono.event.321c29aa-044f-496b-ba69-24c45a7bd827=UNKNOWN_TOPIC_OR_PARTITION, hono.event.268d340f-98f2-4bb7-8289-644a0a180ce6=UNKNOWN_TOPIC_OR_PARTITION, hono.event.8b928a44-ec9c-4555-b56f-6b327cc719ef=UNKNOWN_TOPIC_OR_PARTITION, hono.event.44d72121-d8e5-4864-ae1f-9e802c988aa9=UNKNOWN_TOPIC_OR_PARTITION, hono.event.cb65cbae-d686-435a-9bf5-23b139672714=UNKNOWN_TOPIC_OR_PARTITION, 
hono.event.ce591207-6c3a-45ae-8410-64661625cda1=UNKNOWN_TOPIC_OR_PARTITION, hono.event.311f670d-18c4-42f9-9938-6ae6b32ddfab=UNKNOWN_TOPIC_OR_PARTITION, hono.event.9361b53d-a8e9-42d9-8e97-889ef56e3bd8=UNKNOWN_TOPIC_OR_PARTITION, hono.event.2e8374ed-8692-4567-aaab-5db4ac5173d5=UNKNOWN_TOPIC_OR_PARTITION, hono.event.942f0a7e-61c1-4632-9a34-f3e531e3485c=UNKNOWN_TOPIC_OR_PARTITION, hono.event.dd9e682e-8ee6-475f-bdd4-ab68704763fa=UNKNOWN_TOPIC_OR_PARTITION, hono.event.0ad4f733-47f1-41b0-997f-f7f21fe9e511=UNKNOWN_TOPIC_OR_PARTITION, hono.event.3e36a0f9-367f-42d1-81c7-8d31d8adecbf=UNKNOWN_TOPIC_OR_PARTITION, hono.event.4e0bdd48-ff19-4fe1-8487-4aa51b102ac9=UNKNOWN_TOPIC_OR_PARTITION}


StFS commented Jan 18, 2025

Sometimes the tests run successfully and sometimes they fail, and there is quite a lot of variance in when they fail: sometimes they fail on the first run, while other times there are 5-6 successful runs before a failure. If a test run is repeated immediately after a failed run (without restarting the containers), the tests sometimes fail consistently, indicating that the command router service may get into a corrupted state. However, this does not always happen; sometimes successful runs follow a failed run.

Attached is a log of a failed test run.
hono.test.error.log

Also, when the containers are stopped I'm seeing the following error in the command router logs. It appears regardless of whether a test run has failed previously, so I always get it when shutting down the Redis command router container in the integration tests:

2025-01-18 03:54:55,132 ERROR [io.qua.run.StartupContext] (Shutdown thread) Running a shutdown task failed [Error Occurred After Shutdown]: java.lang.IllegalStateException
        at io.vertx.core.net.impl.pool.Endpoint.close(Endpoint.java:105)
        at io.vertx.core.net.impl.pool.ConnectionManager.close(ConnectionManager.java:59)
        at io.vertx.redis.client.impl.RedisConnectionManager.close(RedisConnectionManager.java:411)
        at io.vertx.redis.client.impl.BaseRedisClient.close(BaseRedisClient.java:31)
        at io.quarkus.redis.runtime.client.ObservableRedis.close(ObservableRedis.java:77)
        at io.vertx.mutiny.redis.client.Redis.close(Redis.java:139)
        at io.quarkus.redis.runtime.client.RedisClientRecorder$10.run(RedisClientRecorder.java:216)
        at io.quarkus.runtime.StartupContext.runAllInReverseOrder(StartupContext.java:84)
        at io.quarkus.runtime.StartupContext.close(StartupContext.java:73)
        at io.quarkus.runner.ApplicationImpl.doStop(Unknown Source)
        at io.quarkus.runtime.Application.stop(Application.java:208)
        at io.quarkus.runtime.Application.stop(Application.java:155)
        at io.quarkus.runtime.ApplicationLifecycleManager$ShutdownHookThread.run(ApplicationLifecycleManager.java:437)

Exception in thread "BatchSpanProcessor_WorkerThread-1" java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:934)
        at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:351)
        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:344)
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:836)
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute0(SingleThreadEventExecutor.java:827)
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:817)
        at io.vertx.core.impl.EventLoopExecutor.execute(EventLoopExecutor.java:35)
        at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:300)
        at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:288)
        at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:57)
        at io.vertx.core.impl.future.FutureImpl.addListener(FutureImpl.java:231)
        at io.vertx.core.impl.future.FutureImpl.onSuccess(FutureImpl.java:87)
        at io.quarkus.opentelemetry.runtime.exporter.otlp.VertxGrpcExporter.shutdown(VertxGrpcExporter.java:164)
        at io.opentelemetry.sdk.trace.export.BatchSpanProcessor$Worker.lambda$shutdown$3(BatchSpanProcessor.java:294)
        at io.opentelemetry.sdk.common.CompletableResultCode.succeed(CompletableResultCode.java:85)
        at io.opentelemetry.sdk.trace.export.BatchSpanProcessor$Worker.flush(BatchSpanProcessor.java:278)
        at io.opentelemetry.sdk.trace.export.BatchSpanProcessor$Worker.run(BatchSpanProcessor.java:239)
        at java.base/java.lang.Thread.run(Unknown Source)
2025-01-18 03:54:55,644 INFO  [io.quarkus] (Shutdown thread) hono-service-command-router-redis stopped in 0.733s


StFS commented Jan 18, 2025

It seems that the failure eventually happens even if you only run a single test (although I've seen tests from the MQTT test class fail as well). The following is sufficient to make the failure appear eventually:

# Run once to start all the containers
mvn clean verify -Prun-tests,redis_cache -Dit.test=CommandAndControlMqttIT#testSendOneWayCommandSucceeds -Ddocker.keepRunning

# Run the tests repeatedly until a run fails
while mvn verify -Prun-tests,redis_cache,useRunningContainers -Dit.test=CommandAndControlMqttIT#testSendOneWayCommandSucceeds ; do :; done

This will eventually result in a failed test run, and often (but not always) subsequent test runs will then fail consistently. Sometimes, however, subsequent test runs pass successfully.

sophokles73 left a comment

This looks pretty solid to me :-)
At first glance, I cannot see why the tests should fail. I haven't run the tests yet but will first take another look at the integration test implementation. We sometimes have odd bugs regarding async behavior in them ...

tests/pom.xml Outdated
</profile>
</profiles>
</project>
</project> No newline at end of file

missing EOL

"io.quarkus.vertx.core.runtime":
level: DEBUG
vertx:
max-event-loop-execute-time: ${max.event-loop.execute-time} No newline at end of file

missing EOL

}
futures.add(api.set(params));
});
return Future.all(Collections.unmodifiableList(futures));

IMHO there is no value in creating this list of futures because you are not doing anything with it anyway, right?

})
.compose(ignored -> api.exec())
// null reply means transaction aborted
.map(Objects::nonNull)

Are you sure that this handles the response properly? Does api.exec() return a failed future if the transaction has aborted? Or does it return a succeeded future that contains an empty array? In the latter case, this code will need to be adapted in order to create a failed future ...
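
Assuming the client represents an aborted transaction as a succeeded future with a null reply (which is what the code comment above suggests), a sketch of how the chain could surface the abort as a failure instead of mapping it to a boolean (just an illustration, not the PR's code):

.compose(ignored -> api.exec())
// a null reply means the transaction was aborted because a WATCHed key changed;
// turn that into a failed future instead of a 'false' result
.compose(reply -> reply == null
        ? Future.failedFuture(new IllegalStateException("Redis transaction aborted by concurrent modification"))
        : Future.succeededFuture(Boolean.TRUE));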

Objects.requireNonNull(key);

return api.get(key)
.compose(value -> Future.succeededFuture(String.valueOf(value)));

how about

.map(value -> value.toString(UTF8));

// key does not exist
return Future.succeededFuture(false);
}
if (String.valueOf(response).equals(value)) {

FMPOV this should better be

Suggested change
if (String.valueOf(response).equals(value)) {
if (value.equals(response.toString(UTF8))) {

/**
* TODO.
*/
public class RedisConfigInterceptor extends RelocateConfigSourceInterceptor {

FMPOV we should simply use the default Quarkus mechanism for configuring the connection to the Redis cache ...
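
For comparison, a minimal example of configuring the connection through the standard Quarkus Redis client configuration (the host URL shown is illustrative):

quarkus:
  redis:
    hosts: "redis://redis:6379"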

#
# SPDX-License-Identifier: EPL-2.0

Args = -H:+RunReachabilityHandlersConcurrently

Is this strictly necessary?

<removeVolumes>true</removeVolumes>
<removeNamePattern>**/hono-*-test:*</removeNamePattern>
<stopNamePattern>hono-*-test-*,all-in-one-*,cp-*,postgres-*</stopNamePattern>
<stopNamePattern>hono-*-test-*,all-in-one-*,cp-*,postgres-*,redis-*</stopNamePattern>

shouldn't the Redis container be covered by the hono-*-test-* pattern?

@sophokles73

@StFS I believe I have found the culprit. When the integration test sends the one-way commands at a high rate (as it does), the Command Router sometimes fails to query the Redis cache for the command handling adapter instances, which it needs to do for each command to be forwarded to a protocol adapter instance. The Jaeger trace includes the following stack trace:

io.vertx.core.http.ConnectionPoolTooBusyException: Connection pool reached max wait queue size of 40
	at io.vertx.core.net.impl.pool.SimpleConnectionPool$Acquire$6.run(SimpleConnectionPool.java:630)
	at io.vertx.core.net.impl.pool.Task.runNextTasks(Task.java:43)
	at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:91)
	at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:244)
	at io.vertx.core.net.impl.pool.SimpleConnectionPool.acquire(SimpleConnectionPool.java:639)
	at io.vertx.core.net.impl.pool.SimpleConnectionPool.acquire(SimpleConnectionPool.java:643)
	at io.vertx.redis.client.impl.RedisConnectionManager$RedisEndpoint.requestConnection(RedisConnectionManager.java:430)
	at io.vertx.core.net.impl.pool.Endpoint.getConnection(Endpoint.java:41)
	at io.vertx.core.net.impl.pool.ConnectionManager.getConnection(ConnectionManager.java:51)
	at io.vertx.core.net.impl.pool.ConnectionManager.getConnection(ConnectionManager.java:40)
	at io.vertx.redis.client.impl.RedisConnectionManager.getConnection(RedisConnectionManager.java:394)
	at io.vertx.redis.client.impl.RedisClient.connect(RedisClient.java:36)
	at io.vertx.redis.client.impl.BaseRedisClient.send(BaseRedisClient.java:42)
	at io.quarkus.redis.runtime.client.ObservableRedis.send(ObservableRedis.java:83)
	at io.vertx.redis.client.impl.RedisAPIImpl.send(RedisAPIImpl.java:56)
	at io.vertx.redis.client.RedisAPI.mget(RedisAPI.java:3791)
	at io.vertx.redis.client.RedisAPI_DVNloIyh16wAC4m7ssePIaeqbbc_Synthetic_ClientProxy.mget(Unknown Source)
	at org.eclipse.hono.deviceconnection.redis.client.RedisCache.getAll(RedisCache.java:183)
	at org.eclipse.hono.deviceconnection.common.CacheBasedDeviceConnectionInfo.getInstancesQueryingAllGatewaysFirst(CacheBasedDeviceConnectionInfo.java:320)
	at org.eclipse.hono.deviceconnection.common.CacheBasedDeviceConnectionInfo.getCommandHandlingAdapterInstances(CacheBasedDeviceConnectionInfo.java:296)
	at org.eclipse.hono.commandrouter.impl.CommandTargetMapperImpl.lambda$getTargetGatewayAndAdapterInstance$2(CommandTargetMapperImpl.java:96)
	at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Composition$1.onSuccess(Composition.java:62)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.SucceededFuture.addListener(SucceededFuture.java:88)
	at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:43)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Mapping.onSuccess(Mapping.java:40)
	at io.vertx.core.impl.future.FutureImpl$ListenerArray.onSuccess(FutureImpl.java:310)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Mapping.onSuccess(Mapping.java:40)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Composition$1.onSuccess(Composition.java:62)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.SucceededFuture.addListener(SucceededFuture.java:88)
	at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:43)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Composition$1.onSuccess(Composition.java:62)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Mapping.onSuccess(Mapping.java:40)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.Composition$1.onSuccess(Composition.java:62)
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66)
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259)
	at io.vertx.core.impl.future.PromiseImpl.onSuccess(PromiseImpl.java:49)
	at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:41)
	at org.eclipse.hono.client.amqp.RequestResponseClient.handleResponse(RequestResponseClient.java:368)
	at org.eclipse.hono.client.amqp.connection.impl.HonoConnectionImpl.lambda$createReceiver$13(HonoConnectionImpl.java:688)
	at io.vertx.proton.impl.ProtonReceiverImpl.onDelivery(ProtonReceiverImpl.java:236)
	at io.vertx.proton.impl.ProtonTransport.handleSocketBuffer(ProtonTransport.java:168)
	at io.vertx.core.net.impl.NetSocketImpl.lambda$new$1(NetSocketImpl.java:104)
	at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:255)
	at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:134)
	at io.vertx.core.net.impl.NetSocketImpl$DataMessageHandler.handle(NetSocketImpl.java:412)
	at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:328)
	at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:321)
	at io.vertx.core.net.impl.NetSocketImpl.handleMessage(NetSocketImpl.java:388)
	at io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:159)
	at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:801)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

I have increased the connection pool:

  commandRouter:
    amqp:
      insecurePortEnabled: true
      insecurePortBindAddress: "0.0.0.0"
    cache:
      redis:
        hosts: "redis://redis:6379"
        max-pool-size: 8
        max-pool-waiting: 40

but this did not (yet) fix the issue. I can also hardly believe that just running a few queries already brings Redis to its limits. Maybe you want to investigate a little further ...
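
For reference, the pool settings above, expressed directly as standard Quarkus Redis client properties (assuming the hono.commandRouter.cache.redis.* options are meant to map 1:1 onto that namespace):

quarkus:
  redis:
    max-pool-size: 8
    max-pool-waiting: 40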

StFS and others added 4 commits July 30, 2025 23:57
Use dedicated connections for Redis operations that require
WATCH/MULTI/EXEC transactions, since the pooled RedisAPI does
not support stateful transactional commands. Also remove the
api.close() call from stop() as the RedisAPI lifecycle is
managed by the Quarkus CDI container.
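
For illustration, a minimal sketch of the approach described in the commit message (hypothetical class and method names, not the actual code from this PR): a dedicated connection is obtained for the WATCH/MULTI/EXEC sequence instead of issuing these stateful commands through the pooled RedisAPI, and the connection is closed once the transaction has completed.

import java.util.List;

import io.vertx.core.Future;
import io.vertx.redis.client.Redis;
import io.vertx.redis.client.RedisAPI;
import io.vertx.redis.client.Response;

public class RedisTransactionSketch {

    /**
     * Sets a key inside a WATCH/MULTI/EXEC transaction on a dedicated connection,
     * since transactional commands are stateful and must not be interleaved with
     * other commands issued through a shared, pooled client.
     */
    public static Future<Response> setInTransaction(final Redis redis, final String key, final String value) {
        return redis.connect()
                .compose(connection -> {
                    final RedisAPI api = RedisAPI.api(connection);
                    return api.watch(List.of(key))
                            .compose(ok -> api.multi())
                            .compose(ok -> api.set(List.of(key, value)))
                            .compose(ok -> api.exec())
                            // release the dedicated connection regardless of the outcome
                            .onComplete(ignored -> connection.close());
                });
    }
}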