Fix prepare timeout problem which will exit the caller and some other changes #13
Ekaf stores each prepare request in a dict and replies only once the worker-up messages arrive. In my production setup, which has many Kafka partitions, the wait for those messages is long enough to hit the timeout, which exits the caller process. There is also a small chance of missing the worker-up messages, because the ekaf_server state-change logic is separate from the handling of worker-up messages.
First, I've changed prepare to return immediately, since there is already a pick operation when producing sync messages to a non-prepared topic.
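To illustrate the difference between the two styles, here is a minimal sketch of a generic `gen_server`. It is not ekaf's actual code: the module name `prepare_sketch`, the function names, the state record, and the 200 ms timeout are all hypothetical. The deferred variant parks the caller until a worker-up event arrives, so `gen_server:call/3` exits the caller if that takes longer than the timeout. The instant variant replies right away with the current status.

```erlang
-module(prepare_sketch).
-behaviour(gen_server).

-export([start_link/0, prepare_deferred/2, prepare_instant/2, worker_up/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical state: topics that already have a worker, and parked callers.
-record(state, {ready, waiting}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Deferred style: no reply until worker_up/2 is seen for Topic.
%% gen_server:call/3 exits the caller if no reply arrives within 200 ms.
prepare_deferred(Pid, Topic) ->
    gen_server:call(Pid, {prepare_deferred, Topic}, 200).

%% Instant style: always replies immediately with the current status.
prepare_instant(Pid, Topic) ->
    gen_server:call(Pid, {prepare_instant, Topic}, 200).

worker_up(Pid, Topic) ->
    gen_server:cast(Pid, {worker_up, Topic}).

init([]) ->
    {ok, #state{ready = sets:new(), waiting = dict:new()}}.

handle_call({prepare_deferred, Topic}, From, S = #state{ready = R, waiting = W}) ->
    case sets:is_element(Topic, R) of
        true ->
            {reply, ok, S};
        false ->
            %% Park the caller; it is answered later from handle_cast/2.
            {noreply, S#state{waiting = dict:append(Topic, From, W)}}
    end;
handle_call({prepare_instant, Topic}, _From, S = #state{ready = R}) ->
    Status = case sets:is_element(Topic, R) of
                 true -> ready;
                 false -> pending
             end,
    {reply, {ok, Status}, S}.

handle_cast({worker_up, Topic}, S = #state{ready = R, waiting = W}) ->
    Callers = case dict:find(Topic, W) of
                  {ok, Froms} -> Froms;
                  error -> []
              end,
    %% Answer every caller that was parked on this topic.
    [gen_server:reply(From, ok) || From <- Callers],
    {noreply, S#state{ready = sets:add_element(Topic, R),
                      waiting = dict:erase(Topic, W)}}.

handle_info(_Msg, S) ->
    {noreply, S}.
```

With this sketch, `prepare_sketch:prepare_instant(Pid, <<"events">>)` returns `{ok, pending}` before any worker is up, while `prepare_sketch:prepare_deferred(Pid, <<"events">>)` exits the caller with a `timeout` reason unless `worker_up/2` is called for that topic within 200 ms.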
I've also made three small changes: a purge operation, for operational convenience, that drops buffered messages when too many have accumulated in memory; a connection timeout, for faster recovery after a Kafka cluster restart or a network problem; and a bug fix for worker restart, which previously caused two reconnections for each connection failure.
We've run this version in our production environment for about a month, and I think it's time to contribute these changes back. HTH.