Fix prometheus pods not scheduling to infra nodes after rebalance#79335
Conversation
WalkthroughThe script expands the Prometheus migration in rebalanceInfra: it logs current state, applies a cluster-monitoring-config, waits for operator reconciliation, restarts the prometheus-k8s StatefulSet, verifies pods land on infra nodes with retries, and updates the HCP flow to call checkInfra for prometheus-k8s. ChangesPrometheus HyperShift Migration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 54-118: The current heredoc creates/replaces the entire ConfigMap
"cluster-monitoring-config" (data.config.yaml) which wipes unrelated settings;
instead, modify this step to merge only the node-placement stanza into the
existing ConfigMap: fetch the existing "cluster-monitoring-config" (namespace
openshift-monitoring), parse data.config.yaml, inject/merge the nodeSelector and
tolerations into each component (alertmanagerMain, prometheusK8s,
prometheusOperator, k8sPrometheusAdapter, kubeStateMetrics, telemeterClient,
openshiftStateMetrics, thanosQuerier) and then update the ConfigMap (e.g., via
oc get -> merge YAML -> oc apply/oc patch) rather than replacing
data.config.yaml via the heredoc used with "cat << 'EOF' | oc apply -f -".
- Around line 120-121: The current sleep 30 after "Wait for
cluster-monitoring-operator to reconcile the configuration" is insufficient;
replace the fixed sleep with a polling loop that queries the prometheus-k8s
StatefulSet spec.template (using kubectl -n openshift-monitoring get statefulset
prometheus-k8s -o jsonpath=... or equivalent) and waits until the infra
nodeSelector/tolerations (the infra placement) are present in
spec.template.spec.template.spec.nodeSelector and/or
spec.template.spec.template.spec.tolerations, then proceed to perform the
rollout restart of prometheus-k8s; ensure the loop has a timeout and sleeps
between polls to avoid tight looping.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e4b84612-9189-4f54-aed5-aeffcd5bc337
📒 Files selected for processing (1)
ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh (1)
154-180: 💤 Low valueOuter
checkInfraretry is now a no-op for prometheus-k8s.
rebalanceInfranow performs its own 12-retry verification andexit 1on failure (lines 176-180). Combined withset -o errexit, that means the wrappingcheckInfraloop on line 259 cannot ever retry for prometheus-k8s, and its post-call verification at lines 202-210 just duplicates whatrebalanceInfraalready proved. If "fail fast on placement failure" is the intent (per the PR description), this is fine — but consider either dropping the redundant outer pass for prometheus-k8s, or returning a non-zero status fromrebalanceInfrasocheckInfra'sTRYloop can actually exercise the retries it advertises. Right now the script presents two retry layers but only the inner one ever runs.Also applies to: 259-259
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh` around lines 154 - 180, The inner rebalanceInfra verification for prometheus-k8s currently calls exit 1 on failure (and sets VERIFY_SUCCESS), which with set -o errexit makes the outer checkInfra retry loop (RETRY/TRY/MAX_RETRIES) a no-op; either remove the redundant prometheus-k8s verification from checkInfra or make rebalanceInfra return a non-zero status instead of exiting so the outer loop can actually retry: change the exit 1 in rebalanceInfra to return 1 (and ensure VERIFY_SUCCESS is set appropriately), and update the caller (checkInfra) to test the rebalanceInfra return code and continue its RETRY loop (or delete the outer prometheus-k8s branch if you prefer fail-fast behavior).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 86-92: The cluster-monitoring-config uses the removed
k8sPrometheusAdapter key; update the manifest to use metricsServer instead:
replace the top-level k8sPrometheusAdapter mapping with metricsServer and keep
the nested nodeSelector and tolerations (the node-role.kubernetes.io/infra
selector and the NoSchedule toleration with key node-role.kubernetes.io/infra
and operator Exists) so the Cluster Monitoring Operator on OCP 4.22 will accept
and apply the configuration.
---
Nitpick comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 154-180: The inner rebalanceInfra verification for prometheus-k8s
currently calls exit 1 on failure (and sets VERIFY_SUCCESS), which with set -o
errexit makes the outer checkInfra retry loop (RETRY/TRY/MAX_RETRIES) a no-op;
either remove the redundant prometheus-k8s verification from checkInfra or make
rebalanceInfra return a non-zero status instead of exiting so the outer loop can
actually retry: change the exit 1 in rebalanceInfra to return 1 (and ensure
VERIFY_SUCCESS is set appropriately), and update the caller (checkInfra) to test
the rebalanceInfra return code and continue its RETRY loop (or delete the outer
prometheus-k8s branch if you prefer fail-fast behavior).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4d255d7c-6c5f-42e2-b51d-6172c5690787
📒 Files selected for processing (1)
ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh
de86cf0 to
1b7df99
Compare
The rebalanceInfra function was restarting prometheus-k8s statefulset without configuring node placement, causing pods to randomly land on worker nodes instead of infra nodes. This led to OOM issues on workers as prometheus workload is resource-intensive. Root cause: Missing nodeSelector and tolerations configuration for prometheus before pod restart. Previously, topologySpreadConstraints helped ensure at least one prometheus pod landed on infra nodes (as described in RFE-5107), but topologySpreadConstraints is no longer present in the current prometheus-k8s StatefulSet. Without explicit nodeSelector and tolerations, prometheus pods schedule to workers. Changes: - Apply cluster-monitoring-config ConfigMap with nodeSelector and tolerations for prometheusK8s to explicitly target infra nodes (other monitoring components consume minimal resources and remain on workers) - Wait for cluster-monitoring-operator to reconcile the StatefulSet template spec before restarting pods (poll up to 5 minutes using jq to verify nodeSelector and tolerations are present in the spec) - Add inline verification after rollout to ensure prometheus pods actually land on infra nodes (12 retries over 2 minutes) - Fail fast with explicit error if StatefulSet reconciliation times out or pods don't schedule to infra nodes, preventing silent OOM failures on workers Related: https://redhat.atlassian.net/browse/RFE-5107 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[REHEARSALNOTIFIER]
A total of 48 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
Looking good |
|
/lgtm |
|
/pj-rehearse ack |
|
@mcornea: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mcornea, Sandeepyadav93 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@Sandeepyadav93: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
…enshift#79335) The rebalanceInfra function was restarting prometheus-k8s statefulset without configuring node placement, causing pods to randomly land on worker nodes instead of infra nodes. This led to OOM issues on workers as prometheus workload is resource-intensive. Root cause: Missing nodeSelector and tolerations configuration for prometheus before pod restart. Previously, topologySpreadConstraints helped ensure at least one prometheus pod landed on infra nodes (as described in RFE-5107), but topologySpreadConstraints is no longer present in the current prometheus-k8s StatefulSet. Without explicit nodeSelector and tolerations, prometheus pods schedule to workers. Changes: - Apply cluster-monitoring-config ConfigMap with nodeSelector and tolerations for prometheusK8s to explicitly target infra nodes (other monitoring components consume minimal resources and remain on workers) - Wait for cluster-monitoring-operator to reconcile the StatefulSet template spec before restarting pods (poll up to 5 minutes using jq to verify nodeSelector and tolerations are present in the spec) - Add inline verification after rollout to ensure prometheus pods actually land on infra nodes (12 retries over 2 minutes) - Fail fast with explicit error if StatefulSet reconciliation times out or pods don't schedule to infra nodes, preventing silent OOM failures on workers Related: https://redhat.atlassian.net/browse/RFE-5107 Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix prometheus pods not scheduling to infra nodes after rebalance
The rebalanceInfra function was restarting prometheus-k8s statefulset
without configuring node placement, causing pods to randomly land on
worker nodes instead of infra nodes. This led to OOM issues on workers
as prometheus workload is resource-intensive.
Root cause: Missing nodeSelector and tolerations configuration for
prometheus before pod restart. Previously, topologySpreadConstraints
helped ensure at least one prometheus pod landed on infra nodes (as
described in RFE-5107), but topologySpreadConstraints is no longer
present in the current prometheus-k8s StatefulSet. Without explicit
nodeSelector and tolerations, prometheus pods schedule to workers.
Changes:
tolerations for prometheusK8s to explicitly target infra nodes
(other monitoring components consume minimal resources and remain
on workers)
template spec before restarting pods (poll up to 5 minutes using jq
to verify nodeSelector and tolerations are present in the spec)
actually land on infra nodes (12 retries over 2 minutes)
or pods don't schedule to infra nodes, preventing silent OOM failures
on workers
Related: https://redhat.atlassian.net/browse/RFE-510