feat(rebalance): error if a topic migration is ongoing #26
Closed
fpacifici
reviewed
Feb 18, 2025
Some high level comments:
I think we would have to revisit our Kafka workflow in a world where long operations can happen and interact with the behavior of topicctl, whether or not the long running task is managed by topicctl.
Specifically: imagine a scenario where we have a large shared cluster and we want to scale it or rebalance the partitions. This process could be concurrent with others and it should not prevent them:
- Creation of a new topic on the cluster. We should not block topic creation for hours.
- Emergency changes during an incident (like increasing topic partitions). The rebalancing cannot prevent incident management.
So I would start by outlining which short operations can be safely run concurrently with a rebalance, then we can discuss whether this is acceptable.
Member
Author
Closing this as I won't have time to produce a better solution.
Currently, if:
- a `topicctl apply` job runs to rebalance partitions across the new brokers (which takes >1 hour)
- another `topicctl apply` job runs while the 1st job is still rebalancing

The 2nd job outputs this in the logs, then errors:

The `leader.replication.throttled.replicas` and `follower.replication.throttled.replicas` settings mentioned above are throttles automatically created by the 1st job doing the rebalancing on the partition. Currently, the 2nd job will delete these settings before erroring out while the 1st job is still replicating (as we run topicctl with the `--destructive` flag).

This PR changes this behaviour so that topicctl errors out immediately if it detects a topic migration is ongoing.