feat(rebalance): error if a topic migration is ongoing by bmckerry · Pull Request #26 · getsentry/topicctl

bmckerry · 2025-02-18T21:01:08Z

Currently, if:

1. 3 new brokers are added to a cluster
2. a topicctl apply job runs to rebalance partitions across the new brokers (which takes >1 hour)
3. another topicctl apply job runs while the 1st job is still rebalancing

The 2nd job outputs this in the logs, then errors:

time="2025-02-18 19:45:14" level=warning msg="One or more replicas are not in-sync; there may be an ongoing migration:
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------
  ID | LEADER |  REPLICAS   |   ISR   | DISTINCT |                          RACKS                           |   STATUS
     |        |             |         |  RACKS   |                                                          |
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------
   1 |      1 | [1 6 8 0 2] | [1 0 2] |        3 | [us-east1-c us-east1-b us-east1-d us-east1-b us-east1-d] | Out-of-sync
-----+--------+-------------+---------+----------+----------------------------------------------------------+--------------"
time="2025-02-18 19:45:14" level=info msg="Checking topic config settings..."
time="2025-02-18 19:45:14" level=info msg="Found 2 key(s) set in cluster but missing from config to be deleted:
------------------------------------------+----------------
                    KEY                   | CLUSTER VALUE
------------------------------------------+----------------
  leader.replication.throttled.replicas   | 1:0,1:1,1:2
  follower.replication.throttled.replicas | 1:6,1:8
------------------------------------------+----------------"

The leader.replication.throttled.replicas and follower.replication.throttled.replicas settings shown above are throttles that the 1st job creates automatically on the partition it is rebalancing. Because we run topicctl with the --destructive flag, the 2nd job currently deletes these settings before erroring out, while the 1st job is still replicating.
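For readers unfamiliar with these keys: Kafka stores each throttle as a comma-separated list of partition:broker pairs, so `1:6,1:8` means partition 1 is throttled on brokers 6 and 8. The sketch below decodes the two values from the log above. `ThrottledReplica` and `ParseThrottledReplicas` are hypothetical names used for illustration and are not part of topicctl; the wildcard value `*` that Kafka also accepts is deliberately not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ThrottledReplica is a hypothetical type holding one "partition:broker" entry.
type ThrottledReplica struct {
	Partition int
	Broker    int
}

// ParseThrottledReplicas decodes a value such as "1:0,1:1,1:2".
// Assumption: the wildcard value "*" is out of scope and is reported as malformed.
func ParseThrottledReplicas(value string) ([]ThrottledReplica, error) {
	var out []ThrottledReplica
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		parts := strings.SplitN(entry, ":", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		partition, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, fmt.Errorf("bad partition in %q: %v", entry, err)
		}
		broker, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, fmt.Errorf("bad broker in %q: %v", entry, err)
		}
		out = append(out, ThrottledReplica{Partition: partition, Broker: broker})
	}
	return out, nil
}

func main() {
	// Values copied from the log output in this description.
	leaders, err := ParseThrottledReplicas("1:0,1:1,1:2")
	if err != nil {
		panic(err)
	}
	followers, err := ParseThrottledReplicas("1:6,1:8")
	if err != nil {
		panic(err)
	}
	fmt.Println("leader throttles:", leaders)
	fmt.Println("follower throttles:", followers)
}
```

Read this way, the leader throttle covers the brokers that already hold partition 1 (0, 1, 2) and the follower throttle covers the brokers being added to it (6, 8), which matches the REPLICAS and ISR columns in the table above.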

This PR changes this behaviour so that topicctl errors out immediately if it detects a topic migration is ongoing.
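A minimal sketch of the intended check, not the actual diff in this PR: treat any partition whose in-sync replica set is missing one of its assigned replicas as a sign of an ongoing migration, and return an error before any config is touched. `PartitionInfo`, `ErrMigrationOngoing`, and `CheckNoOngoingMigration` are hypothetical names; topicctl's real types and call sites may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// PartitionInfo is a hypothetical, simplified view of one partition's state.
type PartitionInfo struct {
	ID       int
	Replicas []int // brokers assigned to the partition
	ISR      []int // brokers currently in sync
}

// ErrMigrationOngoing is a hypothetical sentinel error for this sketch.
var ErrMigrationOngoing = errors.New("topic migration appears to be ongoing")

// outOfSync reports whether any assigned replica is missing from the ISR.
func outOfSync(p PartitionInfo) bool {
	inSync := make(map[int]bool, len(p.ISR))
	for _, broker := range p.ISR {
		inSync[broker] = true
	}
	for _, broker := range p.Replicas {
		if !inSync[broker] {
			return true
		}
	}
	return false
}

// CheckNoOngoingMigration returns an error listing every out-of-sync partition,
// so the caller can stop before applying (or deleting) any topic config.
func CheckNoOngoingMigration(partitions []PartitionInfo) error {
	var ids []int
	for _, p := range partitions {
		if outOfSync(p) {
			ids = append(ids, p.ID)
		}
	}
	if len(ids) > 0 {
		return fmt.Errorf("%w: out-of-sync partitions %v", ErrMigrationOngoing, ids)
	}
	return nil
}

func main() {
	// Partition state taken from the log table in this description.
	partitions := []PartitionInfo{
		{ID: 1, Replicas: []int{1, 6, 8, 0, 2}, ISR: []int{1, 0, 2}},
	}
	if err := CheckNoOngoingMigration(partitions); err != nil {
		fmt.Println("aborting apply:", err)
	}
}
```

Note that an out-of-sync replica can also be caused by a slow or failed broker, so this heuristic errs on the side of refusing to apply.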

fpacifici

Some high-level comments:

I think we would have to revisit our Kafka workflow in a world where long operations can happen and interact with the behavior of topicctl, whether or not the long-running task is managed by topicctl.

Specifically: imagine a scenario where we have a large shared cluster and we want to scale it or rebalance the partitions. This process could be concurrent with others and it should not prevent them:

- Creation of a new topic on the cluster. We should not prevent topic creation for hours.
- Emergency changes during an incident (like increasing topic partitions). The rebalancing cannot prevent incident management.

So I would start by outlining which short operations can be safely run concurrently with a rebalance, then we can discuss whether this is acceptable.

bmckerry · 2025-02-19T19:48:37Z

Closing this as I won't have time to produce a better solution

bmckerry added 2 commits February 18, 2025 15:46

feat(rebalance): error if a topic migration is ongoing

b963143

better message for slack

7b07bb7

fpacifici reviewed Feb 18, 2025

bmckerry closed this Feb 19, 2025

bmckerry wants to merge 2 commits into master from ben/error-on-ongoing-migration
