manage/k8s: document decommission timing settings (--decommission-wait-interval, RequeueAfter)#1761

Merged: micheleRP merged 4 commits into main from dyu/decommission-wait-interval-docs on Jun 24, 2026
@david-yu david-yu commented Jun 23, 2026

What

Adds a "Tune automatic decommission timing" section to manage/kubernetes/k-decommission-brokers.adoc that documents the re-check / requeue interval settings for the automatic decommissioner. These settings were previously undocumented for the Operator deployment mode.

Covers, with defaults and a worked example:

| Setting | Default | Mode |
| --- | --- | --- |
| --decommission-wait-interval (via operator additionalCmdFlags) | 8s | Operator |
| decommissionRequeueTimeout | 10s | Helm sidecar |
| decommissionAfter | 60s | Helm sidecar |

The section explains:

  • How to pass --decommission-wait-interval through the operator chart's additionalCmdFlags.
  • That this flag sets the Decommission controller's RequeueAfter, i.e. the "next run in <interval>" value visible in the operator logs.
  • Guidance for adjusting the values: re-check cadence vs. the decommissionAfter debounce window, and that these intervals do not affect partition reallocation throughput (that's raft_learner_recovery_rate / partition_autobalancing_concurrent_moves).
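
As a rough sketch of the first bullet, the snippet below assembles a helm command that passes the flag through additionalCmdFlags. The release name `redpanda-operator` and the namespace `redpanda` are placeholders assumed for illustration, not values taken from this PR, and 30s is only an example interval. The script prints the command rather than running it, so it has no effect on a cluster.

```shell
# Hypothetical release name and namespace; replace with your own.
RELEASE="redpanda-operator"
NAMESPACE="redpanda"
INTERVAL="30s"   # example value; the documented default is 8s

# Value for --set. It must stay a single shell word, hence the quotes.
SET_ARG="additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=${INTERVAL}}"

# Print the command (dry run) instead of executing it.
render_helm_command() {
  printf 'helm upgrade --install %s redpanda/operator --namespace %s --set "%s"\n' \
    "$RELEASE" "$NAMESPACE" "$SET_ARG"
}

render_helm_command
```

Keeping the interval in one variable makes it easy to try a larger re-check interval without retyping the whole flag list.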

Also adds a TIP cross-link from the existing Operator enablement step.

Why

Customer question (Arctic Wolf) on the Decommission controller's reconcile cadence and the 8s default; the flag and RequeueAfter behavior were not documented. Tracked in DOC-2270.

Verification (EKS)

Validated on a fresh EKS cluster:

  • Operator 26.1.6 with --decommission-wait-interval=300s: the DecommissionReconciler watching the V2 (Redpanda CRD) StatefulSet logs "successful reconciliation finished in 1m0s, next run in 5m0s", confirming that the flag sets the requeue cadence (the default would be 8s). The ~1-minute reconcile duration is the old controller's inline cluster-health stability wait.
  • Operator 26.2.1-beta.2: identical behavior (still the old reconciler).
  • An intentional scale-down is decommissioned promptly (~2s) by the operator's core ClusterReconciler; this interval governs the secondary re-check cadence, not scale-down speed.
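
To check the effective interval from the logs, one option is to pull the duration that follows "next run in". The sketch below runs against the message fragment quoted in the first bullet. The real log line carries additional fields that are not reproduced here, and the `next_run` helper name is made up for this example.

```shell
# Message fragment quoted in the verification notes; not a full log line.
sample='successful reconciliation finished in 1m0s, next run in 5m0s'

# Print the duration that follows "next run in" (nothing if the phrase is absent).
next_run() {
  printf '%s\n' "$1" | sed -n 's/.*next run in \([0-9hms.]*\).*/\1/p'
}

next_run "$sample"   # prints 5m0s
```

Against a live cluster, the same filter could be fed from `kubectl logs` for the operator Deployment; the Deployment name and namespace depend on how the operator was installed.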

Notes for reviewers

  • Source of truth: operator cmd/run/run.go (--decommission-wait-interval, default 8s) → internal/controller/olddecommission/redpanda_decommission_controller.go (uses it as the RequeueAfter fallback over the 30s constant).
  • Version sensitivity: in all currently installable operators (25.3.x, 26.1.x, and 26.2.1-beta.2) --additional-controllers=decommission selects the controller that consumes this flag for V2 clusters (verified on EKS). A refactor on main (commit e9e70de, 2026-06-22) switches decommission to the new NodePool-aware StatefulSetDecommissioner (which ignores this flag) and renames the old one to legacy-decommission. That change is not in any release or pre-release yet (latest pre-release is beta.2 from 2026-05-29); it will ship in a later 26.2 build. This page documents current behavior; it should get a version note when that build ships.

🤖 Generated with Claude Code

Preview pages

…t-interval, RequeueAfter)

Add a "Tune automatic decommission timing" section to the Kubernetes
decommission guide explaining the re-check/requeue interval settings for
both deployment modes:

- Operator: --decommission-wait-interval (default 8s), passed via the
  operator chart's additionalCmdFlags, which sets the Decommission
  controller's RequeueAfter (surfaced as the "next run in" log line).
- Helm sidecar: decommissionRequeueTimeout (10s) and decommissionAfter (60s).

Includes defaults, a worked helm example, how to read the interval from
operator logs, and guidance for adjusting the values (recheck vs debounce,
reallocation throughput is separate).

Ref: DOC-2270

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@david-yu david-yu requested a review from a team as a code owner June 23, 2026 20:54
@netlify

netlify Bot commented Jun 23, 2026

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 8052d9c
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/6a3b66da4af52a000801fad0
😎 Deploy Preview https://deploy-preview-1761--redpanda-docs-preview.netlify.app

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2353a3f7-5cd5-40b6-b6ee-92bf2822ec73

Walkthrough

The pull request updates modules/manage/pages/kubernetes/k-decommission-brokers.adoc with two additions. A tip is inserted in the BrokerDecommissioner setup steps directing users to pass --decommission-wait-interval via additionalCmdFlags and linking to a new section. That new section, "Tune automatic decommission timing," documents the polling interval and debounce parameters for both the Operator's Decommission controller (--decommission-wait-interval) and the Helm sidecar deployment (decommissionRequeueTimeout, decommissionAfter), along with defaults, example commands, sample log output, and a clarification that these settings affect re-check timing only, not partition reallocation throughput.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • redpanda-data/docs#1717: Directly related — both PRs modify the same file's BrokerDecommissioner instructions, specifically around --decommission-wait-interval via additionalCmdFlags and decommission timing behavior.

Suggested reviewers

  • kbatuigas
  • joe-redpanda
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The pull request provides a detailed description with 'What', 'Why', 'Verification', and 'Notes for reviewers' sections, but is missing required template sections including a JIRA ticket link, review deadline, page previews, and checkbox selections. | Add the missing template sections: link to the JIRA ticket (DOC-2270), set review deadline, include Netlify page preview URL, and mark appropriate checkboxes (likely 'Content gap' based on the undocumented settings being addressed). |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: documenting decommission timing settings for Kubernetes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…cale-in gate

Per EKS end-to-end testing: a user-initiated scale-in (reducing
statefulset.replicas) is detected from a StatefulSet watch event and acted
on promptly (~seconds) regardless of --decommission-wait-interval. The
interval governs the periodic re-check cadence for conditions that arise
without a triggering event (for example, a broker that becomes
unreachable), so raising it does not delay routine scale-ins.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@micheleRP

Docs review

Overall: Solid, technically accurate addition. No critical issues — all xrefs/anchors resolve and the behavior was EKS-verified. A few minor consistency suggestions below.

Critical issues

None. Both property xrefs are valid (reference:cluster-properties.adoc and reference:tunable-properties.adoc both alias to properties/cluster-properties.adoc, and both anchors exist in the included partial).

Suggestions

  1. Capitalization: the heading "Set the interval for the Operator" capitalizes the bare noun. The docs convention is lowercase "operator" in prose (~7:1 across the repo), reserving "Redpanda Operator" for the product name — and this same file already uses lowercase ("the operator detects the change"). Recommend: "Set the interval for the operator".
  2. Helm command quoting: the new operator example uses --set "additionalCmdFlags={...=decommission,--decommission-wait-interval=30s}" (whole arg quoted) while the existing example on the page uses --set additionalCmdFlags={--additional-controllers="decommission"}. The new form is more shell-correct — worth aligning the two for consistency.
  3. Intro precision: the intro says the decommissioner "polls the cluster on a regular interval," but the later clarification (and your EKS testing) notes operator scale-ins are event-driven via a StatefulSet watch. Softening "polls… to detect" → "re-checks" would align the intro with the later bullet.

Impact on other files

None. Single existing page; no nav or What's New entry needed. Related PR #1717 (same file) is already merged — no conflict.

What works well

  • EKS-verified behavior; clean Operator-vs-Helm separation; good internal cross-linking; correct AsciiDoc throughout (table, [.no-copy] log block, {latest-operator-version} attribute, anchors).
  • Good catch on the version-sensitivity note (the main refactor to legacy-decommission) — worth a version note when that build ships, as you flagged.

These are all minor; the PR is in good shape to merge.

@micheleRP micheleRP left a comment

lovely, thanks David! Claude has some minor suggestions you can consider

…ator, re-check wording

- Align both additionalCmdFlags examples to the shell-correct form
  (--set "additionalCmdFlags={...}"): outer-quoted to protect {}/comma from
  brace expansion, no pointless inner quotes. Verified the rendered list with
  `helm template`: ["--additional-controllers=decommission","--decommission-wait-interval=30s"].
- Lowercase bare-noun "operator" (heading + table label) per docs convention.
- Intro: "polls ... to detect" -> "re-checks ... for" to match the event-driven note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@david-yu

Thanks, I addressed the feedback from the review. We should keep "Operator" instead of "operator" because the capitalized form is typically used in K8s to describe a Kubernetes Operator rather than a human operator.

…re noun

Per maintainer preference, capitalize bare-noun "Operator" page-wide (heading,
table label, prose) — reverts the earlier lowercasing. Chart path
`redpanda/operator` and the `{latest-operator-version}` attribute stay lowercase.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@david-yu

Thanks @micheleRP! Addressed:

  1. Helm --set quoting — aligned both additionalCmdFlags examples to the shell-correct form --set "additionalCmdFlags={...}" (outer-quoted, no inner quotes). I verified the rendered list with helm template:

    • --set "additionalCmdFlags={--additional-controllers=decommission}" → ["--additional-controllers=decommission"]
    • --set "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}" → ["--additional-controllers=decommission","--decommission-wait-interval=30s"]

      The outer quotes matter for the multi-flag form specifically: without them the shell brace-expands {a,b} and mangles the list; the previous inner-quoted form (={...="decommission"}) only worked because an interactive shell strips the quotes.
  2. Intro precision — "polls … to detect" → "re-checks … for", matching the event-driven note below.

  3. Capitalization — went the other way here, per maintainer preference: using capital "Operator" (the bare-noun product reference) since it's the more accepted form in Kubernetes docs. Applied page-wide for consistency; the redpanda/operator chart path and {latest-operator-version} attribute stay lowercase. Flagging so it's clear that's a deliberate choice, not an unaddressed comment.
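
The brace-expansion point in item 1 can be reproduced without helm. The sketch below is an illustration only: it prints each shell word in brackets, once for the unquoted form and once for the quoted form. Brace expansion is a bash feature (POSIX sh implementations such as dash do not perform it), so the demonstration calls bash explicitly.

```shell
# Unquoted: bash splits {a,b} into two words, each prefixed with "additionalCmdFlags=".
unquoted=$(bash -c 'printf "[%s]\n" additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}')

# Quoted: the braces and comma stay literal, so helm would receive one word.
quoted=$(bash -c 'printf "[%s]\n" "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}"')

printf 'unquoted:\n%s\n' "$unquoted"
# [additionalCmdFlags=--additional-controllers=decommission]
# [additionalCmdFlags=--decommission-wait-interval=30s]

printf 'quoted:\n%s\n' "$quoted"
# [additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}]
```

A single-flag value such as {--additional-controllers=decommission} contains no comma, so bash leaves it untouched even when unquoted, which is why only the multi-flag form was affected.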

@micheleRP micheleRP merged commit 9e3a661 into main Jun 24, 2026
7 checks passed
@micheleRP micheleRP deleted the dyu/decommission-wait-interval-docs branch June 24, 2026 13:40
@david-yu

Will backport to 25.3.x and 25.2.x

@david-yu

Backports opened:

Each adds the "Tune automatic decommission timing" section + TIP cross-ref + shell-correct --set quoting. The main-only Operator-example rewrite (the "do not add brokerDecommissioner" paragraph/callout) was intentionally omitted on both, since the version branches' Operator examples still use the brokerDecommissioner sidecar.
