KAFKA-13392 Resolve Timeout Exception triggering reassign partitions with --bootstrap-server option #21388
brokerListWithoutThrottleOpt = parser.accepts("broker-list-without-throttle", "Optional. Comma-separated broker ID list (e.g. 1,2) that " +
    "should be excluded from broker-level throttle config updates during partition reassignment execution. " +
    "When --execute and --throttle are used, it normally applies throttle configs on all brokers involved in the reassignment. " +
    "If any of those brokers are known to be down or unreachable, adding them to --broker-list-without-throttle makes it " +
    "skip the throttle-setting step for those brokers, avoiding retries/timeouts, while still throttling the remaining reachable brokers.")
    .withRequiredArg()
    .describedAs("broker list without throttle")
    .ofType(String.class);
This adds a new parameter to a command-line tool (public API), so it would need a KIP to be discussed and approved by the community. The process is described here, and you can take it from there: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process
Thanks for looking into this!
@lianetm Thank you very much for the information you provided! I have learned that adding new parameters to the command-line tool (public API) requires going through the KIP process, and it needs to be discussed and approved by the community. I will carefully read the link you provided to understand the specific steps of the KIP process and start preparing the proposal. Thank you again for your guidance!
I am closing this PR because I have a better solution; see #21654.
This PR resolves the TimeoutException raised when triggering partition reassignment with the --bootstrap-server option. More details can be found at https://issues.apache.org/jira/browse/KAFKA-13392.
Root cause
When we run a reassignment using a plan file (e.g. xxx.json), the plan may still include replicas on a broker that is down. During execution, we try to apply throttling by calling adminClient.incrementalAlterConfigs(configs). The issue is that this API needs to connect to the target broker to set the broker-level throttle configs. If the broker is down, it is unreachable, so the client keeps retrying and eventually fails with a TimeoutException.
My proposed solution
Add a new parameter:
--broker-list-without-throttle
Description: Optional. Comma-separated broker ID list (e.g. 1,2) that should be excluded from broker-level throttle config updates during partition reassignment execution. When --execute and --throttle are used, the tool normally applies throttle configs on all brokers involved in the reassignment. If any of those brokers are known to be down or unreachable, adding them to --broker-list-without-throttle makes the tool skip the throttle-setting step for those brokers, avoiding retries/timeouts, while still throttling the remaining reachable brokers.
Value: a list of broker IDs, comma-separated
Example: 1001 or 1001,1002
If broker 1001 is known to be down, and the reassignment plan includes it, then we exclude 1001 from throttle config changes.
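The exclusion described above can be sketched in plain Java. This is only an illustration of the idea, not the code from this PR: the class name ThrottleBrokerFilter and the helpers parseBrokerList and brokersToThrottle are hypothetical, and the sketch uses no Kafka classes. It assumes the set of brokers involved in the reassignment is already known.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Kafka: shows how an excluded-broker list
// could be parsed and removed from the set of brokers that receive throttle configs.
final class ThrottleBrokerFilter {

    // Parses a comma-separated broker ID list such as "1001,1002".
    // Null or blank input yields an empty set; a non-numeric entry throws NumberFormatException.
    static Set<Integer> parseBrokerList(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return new TreeSet<>();
        }
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::parseInt)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Returns the brokers that should still get broker-level throttle configs:
    // every broker involved in the reassignment minus the excluded ones.
    static Set<Integer> brokersToThrottle(Set<Integer> reassigningBrokers, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(reassigningBrokers);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Brokers referenced by the reassignment plan (assumed values for illustration).
        Set<Integer> involved = new TreeSet<>(Arrays.asList(1001, 1002, 1003));
        // Broker 1001 is known to be down, so it is excluded from throttle updates.
        Set<Integer> excluded = parseBrokerList("1001");
        System.out.println(brokersToThrottle(involved, excluded)); // prints [1002, 1003]
    }
}
```

With this shape, only the remaining brokers (1002 and 1003 in the example) would be passed on to the throttle-setting step, so no config request is sent to the unreachable broker.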
Why this is needed