feat(rebalance): error if a topic migration is ongoing #26

Closed
bmckerry wants to merge 2 commits into master from ben/error-on-ongoing-migration

bmckerry commented Feb 18, 2025

Currently, if:

  • 3 new brokers are added to a cluster
  • a topicctl apply job runs to rebalance partitions across the new brokers (which takes >1 hour)
  • a second topicctl apply job runs while the first job is still rebalancing

then the second job logs the following and errors out:

time="2025-02-18 19:45:14" level=warning msg="One or more replicas are not in-sync; there may be an ongoing migration:
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------
  ID | LEADER |  REPLICAS   |   ISR   | DISTINCT |                          RACKS                           |   STATUS
     |        |             |         |  RACKS   |                                                          |
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------
   1 |      1 | [1 6 8 0 2] | [1 0 2] |        3 | [us-east1-c us-east1-b us-east1-d us-east1-b us-east1-d] | Out-of-sync
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------"
time="2025-02-18 19:45:14" level=info msg="Checking topic config settings..."
time="2025-02-18 19:45:14" level=info msg="Found 2 key(s) set in cluster but missing from config to be deleted:
------------------------------------------+----------------
                    KEY                   | CLUSTER VALUE
------------------------------------------+----------------
  leader.replication.throttled.replicas   | 1:0,1:1,1:2
  follower.replication.throttled.replicas | 1:6,1:8
------------------------------------------+----------------"

The leader.replication.throttled.replicas and follower.replication.throttled.replicas settings shown above are throttles that the 1st job creates automatically while it rebalances the partition. Because we run topicctl with the --destructive flag, the 2nd job currently deletes these settings before erroring out, even though the 1st job is still replicating.
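
For readers unfamiliar with the throttle values in the log: Kafka encodes them as a comma-separated list of partition:broker pairs, so 1:6,1:8 means partition 1 is throttled on brokers 6 and 8 (the new replicas that are still catching up). A minimal sketch of reading such a value follows. ThrottledReplica and ParseThrottledReplicas are hypothetical names for illustration, not part of topicctl, and the wildcard value * is not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ThrottledReplica is a hypothetical type for one "partition:broker" entry.
type ThrottledReplica struct {
	Partition int
	Broker    int
}

// ParseThrottledReplicas parses a value such as "1:0,1:1,1:2".
// Assumption: the wildcard value "*" is out of scope for this sketch
// and is reported as a malformed entry.
func ParseThrottledReplicas(value string) ([]ThrottledReplica, error) {
	var out []ThrottledReplica
	for _, entry := range strings.Split(value, ",") {
		parts := strings.Split(strings.TrimSpace(entry), ":")
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		partition, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, fmt.Errorf("bad partition in %q: %w", entry, err)
		}
		broker, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, fmt.Errorf("bad broker in %q: %w", entry, err)
		}
		out = append(out, ThrottledReplica{Partition: partition, Broker: broker})
	}
	return out, nil
}

func main() {
	// Values taken from the log output above.
	leaders, _ := ParseThrottledReplicas("1:0,1:1,1:2")
	followers, _ := ParseThrottledReplicas("1:6,1:8")
	fmt.Println("leader throttles:", leaders)
	fmt.Println("follower throttles:", followers)
}
```

In the log above, the follower throttles (brokers 6 and 8) are exactly the replicas missing from the ISR, which is why deleting them mid-rebalance is a problem.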

This PR changes this behaviour so that topicctl errors out immediately if it detects a topic migration is ongoing.
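
To illustrate the intended ordering (check first, touch configs only afterwards), here is a rough sketch. It is not the code in this PR: PartitionInfo, OutOfSyncPartitions, CheckNoMigration and ErrMigrationOngoing are hypothetical names, and it assumes "replicas not in the ISR" is an acceptable signal for an ongoing migration, which is only a heuristic (the existing warning itself says "there may be" a migration).

```go
package main

import (
	"errors"
	"fmt"
)

// PartitionInfo is a hypothetical, simplified view of one partition.
type PartitionInfo struct {
	ID       int
	Replicas []int // assigned replica broker IDs
	ISR      []int // in-sync replica broker IDs
}

// ErrMigrationOngoing is a hypothetical sentinel error for this sketch.
var ErrMigrationOngoing = errors.New("topic migration appears to be ongoing")

// OutOfSyncPartitions returns the IDs of partitions that have at least
// one assigned replica missing from the ISR.
func OutOfSyncPartitions(partitions []PartitionInfo) []int {
	var ids []int
	for _, p := range partitions {
		inSync := make(map[int]bool, len(p.ISR))
		for _, broker := range p.ISR {
			inSync[broker] = true
		}
		for _, broker := range p.Replicas {
			if !inSync[broker] {
				ids = append(ids, p.ID)
				break
			}
		}
	}
	return ids
}

// CheckNoMigration returns an error if any partition looks out of sync.
// The caller is expected to run it before applying any config changes.
func CheckNoMigration(partitions []PartitionInfo) error {
	if ids := OutOfSyncPartitions(partitions); len(ids) > 0 {
		return fmt.Errorf("%w: out-of-sync partitions %v", ErrMigrationOngoing, ids)
	}
	return nil
}

func main() {
	// Partition 1 as shown in the log output above.
	partitions := []PartitionInfo{
		{ID: 1, Replicas: []int{1, 6, 8, 0, 2}, ISR: []int{1, 0, 2}},
	}
	if err := CheckNoMigration(partitions); err != nil {
		fmt.Println("aborting before any config changes:", err)
		return
	}
	fmt.Println("no migration detected; safe to continue")
}
```

Because the check runs before the config diff, the throttle settings created by the 1st job are never deleted by the 2nd job.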

fpacifici left a comment

Some high level comments:

I think we need to revisit our Kafka workflow for a world where long-running operations can happen and interact with the behavior of topicctl, whether or not the long-running task is managed by topicctl.

Specifically: imagine a scenario where we have a large shared cluster and we want to scale it or rebalance the partitions. This process could run concurrently with other operations, and it should not block them:

  • Creation of a new topic on the cluster. We should not block topic creation for hours.
  • Emergency changes during an incident (like increasing topic partitions). The rebalancing cannot block incident management.

So I would start by outlining which short operations can safely run concurrently with a rebalance; then we can discuss whether this is acceptable.

bmckerry commented

Closing this as I won't have time to produce a better solution.

bmckerry closed this Feb 19, 2025