Fix prometheus pods not scheduling to infra nodes after rebalance by Sandeepyadav93 · Pull Request #79335 · openshift/release

Sandeepyadav93 · 2026-05-15T07:35:47Z

Fix prometheus pods not scheduling to infra nodes after rebalance

The rebalanceInfra function was restarting the prometheus-k8s StatefulSet
without configuring node placement, so pods could land on worker nodes
instead of infra nodes. This led to OOM issues on workers, because the
Prometheus workload is resource-intensive.

Root cause: Missing nodeSelector and tolerations configuration for
prometheus before pod restart. Previously, topologySpreadConstraints
helped ensure at least one prometheus pod landed on infra nodes (as
described in RFE-5107), but topologySpreadConstraints is no longer
present in the current prometheus-k8s StatefulSet. Without explicit
nodeSelector and tolerations, prometheus pods schedule to workers.

Changes:

- Apply cluster-monitoring-config ConfigMap with nodeSelector and
  tolerations for prometheusK8s to explicitly target infra nodes
  (other monitoring components consume minimal resources and remain
  on workers)
- Wait for cluster-monitoring-operator to reconcile the StatefulSet
  template spec before restarting pods (poll up to 5 minutes using jq
  to verify nodeSelector and tolerations are present in the spec)
- Add inline verification after rollout to ensure prometheus pods
  actually land on infra nodes (12 retries over 2 minutes)
- Fail fast with explicit error if StatefulSet reconciliation times out
  or pods don't schedule to infra nodes, preventing silent OOM failures
  on workers
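
For illustration, the first change above might look roughly like the sketch below. The helper name `print_monitoring_config` is hypothetical and not taken from the script. The selector and toleration values follow the ones named in the review comments on this PR (`node-role.kubernetes.io/infra`, `NoSchedule`, `Exists`). The exact manifest in the PR may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): prints a cluster-monitoring-config
# manifest that pins only prometheusK8s to infra nodes.
print_monitoring_config() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        operator: Exists
        effect: NoSchedule
EOF
}

# Usage against a live cluster (not executed here):
#   print_monitoring_config | oc apply -f -
```

Applying a full manifest like this replaces any existing `config.yaml` content, so a cluster that already carries monitoring settings would need a merge instead.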

Related: https://redhat.atlassian.net/browse/RFE-5107

coderabbitai · 2026-05-15T07:36:12Z

Walkthrough

The script expands the Prometheus migration in rebalanceInfra: it logs current state, applies a cluster-monitoring-config, waits for operator reconciliation, restarts the prometheus-k8s StatefulSet, verifies pods land on infra nodes with retries, and updates the HCP flow to call checkInfra for prometheus-k8s.

Changes

Prometheus HyperShift Migration

Layer / File(s)	Summary
Pre-migration logging and state `ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`	Log migration start and print current `prometheus-k8s` pods and StatefulSet prior to changes.
Apply cluster-monitoring-config and wait for reconciliation `ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`	Apply a `cluster-monitoring-config` ConfigMap in `openshift-monitoring` to set `prometheusK8s` nodeSelector/tolerations and poll the `prometheus-k8s` StatefulSet JSON until the template reflects the expected nodeSelector/tolerations or timeout with failure logging.
Restart StatefulSet and wait for rollout `ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`	Restart the `prometheus-k8s` StatefulSet and block until `oc rollout status` completes.
Verify pods scheduled on infra nodes `ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`	Run a bounded retry loop verifying all running `prometheus-k8s-*` pods are scheduled on nodes labeled as infra; log warnings for mismatches and fail (listing pods) if not achieved within retries.
HCP cluster flow call site change `ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`	In the HCP hypershift flow, the post-“Re-balance infra components” step now calls `checkInfra "prometheus-k8s" "openshift-monitoring"` instead of `rebalanceInfra "prometheus-k8s"`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and specifically describes the main change: fixing prometheus pods scheduling to infra nodes after rebalance, which is the core issue addressed in the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The check is not applicable to this PR. The modified file is a shell script, not a Ginkgo test file. No Go test files with Ginkgo test definitions were modified.
Test Structure And Quality	✅ Passed	The custom check reviews Ginkgo test code quality (Go tests). This PR modifies only bash shell scripts and contains no Ginkgo tests. Not applicable.
Microshift Test Compatibility	✅ Passed	This PR does not add any Ginkgo e2e tests. The modified file is a bash shell script for CI infrastructure setup, not a Go test file. The custom check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR does not add any new Ginkgo e2e tests. It only modifies a bash script (openshift-qe-hypershift-infra-commands.sh) for CI infrastructure operations. The custom check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	CI test script for HyperShift. Checks topology at runtime and exits if not HyperShift. Scheduling constraints are HyperShift-specific.
Ote Binary Stdout Contract	✅ Passed	Check not applicable. PR modifies bash scripts, config files, and documentation—not OTE binary source code. OTE stdout contract check applies only to OTE binaries with process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests are added. The PR modifies only a bash script in the CI step registry, not test code. This check applies only to new Ginkgo e2e tests.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Sandeepyadav93 · 2026-05-15T07:38:53Z

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 54-118: The current heredoc creates/replaces the entire ConfigMap
"cluster-monitoring-config" (data.config.yaml) which wipes unrelated settings;
instead, modify this step to merge only the node-placement stanza into the
existing ConfigMap: fetch the existing "cluster-monitoring-config" (namespace
openshift-monitoring), parse data.config.yaml, inject/merge the nodeSelector and
tolerations into each component (alertmanagerMain, prometheusK8s,
prometheusOperator, k8sPrometheusAdapter, kubeStateMetrics, telemeterClient,
openshiftStateMetrics, thanosQuerier) and then update the ConfigMap (e.g., via
oc get -> merge YAML -> oc apply/oc patch) rather than replacing
data.config.yaml via the heredoc used with "cat << 'EOF' | oc apply -f -".
- Around line 120-121: The current sleep 30 after "Wait for
cluster-monitoring-operator to reconcile the configuration" is insufficient;
replace the fixed sleep with a polling loop that queries the prometheus-k8s
StatefulSet spec.template (using kubectl -n openshift-monitoring get statefulset
prometheus-k8s -o jsonpath=... or equivalent) and waits until the infra
nodeSelector/tolerations (the infra placement) are present in
spec.template.spec.nodeSelector and/or
spec.template.spec.tolerations, then proceed to perform the
rollout restart of prometheus-k8s; ensure the loop has a timeout and sleeps
between polls to avoid tight looping.
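
A minimal sketch of such a bounded polling loop follows. It assumes `oc` and `jq` are available. The names `poll_until` and `statefulset_has_infra_placement` are hypothetical, and the 10-second interval is an assumption (the PR description only states a 5-minute ceiling).

```shell
#!/usr/bin/env bash
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or the timeout elapses.
poll_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical check: succeeds once the StatefulSet pod template carries both
# the infra nodeSelector and an infra toleration.
statefulset_has_infra_placement() {
  oc -n openshift-monitoring get statefulset prometheus-k8s -o json |
    jq -e '
      ((.spec.template.spec.nodeSelector // {}) | has("node-role.kubernetes.io/infra"))
      and
      ((.spec.template.spec.tolerations // []) | any(.key == "node-role.kubernetes.io/infra"))
    ' >/dev/null
}

# Usage against a live cluster (not executed here):
#   poll_until 300 10 statefulset_has_infra_placement ||
#     { echo "StatefulSet was not reconciled in time" >&2; exit 1; }
```

Keeping the wait logic separate from the check makes the timeout path easy to exercise with a stub command.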

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e4b84612-9189-4f54-aed5-aeffcd5bc337

📥 Commits

Reviewing files that changed from the base of the PR and between 26eb294 and c4e854e.

📒 Files selected for processing (1)

ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh

Sandeepyadav93 · 2026-05-15T10:16:01Z

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector

openshift-merge-bot · 2026-05-15T10:16:04Z

@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh (1)
154-180: 💤 Low value

Outer checkInfra retry is now a no-op for prometheus-k8s.

rebalanceInfra now performs its own 12-retry verification and exit 1 on failure (lines 176-180). Combined with set -o errexit, that means the wrapping checkInfra loop on line 259 cannot ever retry for prometheus-k8s, and its post-call verification at lines 202-210 just duplicates what rebalanceInfra already proved. If "fail fast on placement failure" is the intent (per the PR description), this is fine — but consider either dropping the redundant outer pass for prometheus-k8s, or returning a non-zero status from rebalanceInfra so checkInfra's TRY loop can actually exercise the retries it advertises. Right now the script presents two retry layers but only the inner one ever runs.

Also applies to: 259-259
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`
around lines 154 - 180, The inner rebalanceInfra verification for prometheus-k8s
currently calls exit 1 on failure (and sets VERIFY_SUCCESS), which with set -o
errexit makes the outer checkInfra retry loop (RETRY/TRY/MAX_RETRIES) a no-op;
either remove the redundant prometheus-k8s verification from checkInfra or make
rebalanceInfra return a non-zero status instead of exiting so the outer loop can
actually retry: change the exit 1 in rebalanceInfra to return 1 (and ensure
VERIFY_SUCCESS is set appropriately), and update the caller (checkInfra) to test
the rebalanceInfra return code and continue its RETRY loop (or delete the outer
prometheus-k8s branch if you prefer fail-fast behavior).
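
The difference between the two options can be sketched in isolation. All function names below are hypothetical stand-ins, not the script's `rebalanceInfra` or `checkInfra`. The point is only that `return 1` hands control back to a caller's retry loop, while `exit 1` ends the shell before any retry can happen.

```shell
#!/usr/bin/env bash
attempt=0

# Hypothetical stand-in for the placement verification: fails twice, then passes.
verify_placement() {
  attempt=$(( attempt + 1 ))
  if (( attempt < 3 )); then
    return 1   # "return" gives control back, so the caller may retry
  fi
  return 0
}

# Outer retry loop. Calling the function inside "if" also keeps
# "set -o errexit" from aborting the script on a failed attempt.
retry_verify() {
  local max_retries=$1 try
  for (( try = 1; try <= max_retries; try++ )); do
    if verify_placement; then
      echo "verified on try ${try}"
      return 0
    fi
  done
  echo "not verified after ${max_retries} tries" >&2
  return 1
}

# Contrast: "exit" terminates the whole shell, so an outer loop never retries.
verify_or_exit() {
  exit 1
}
```

With these definitions, `retry_verify 5` prints `verified on try 3`, whereas calling `verify_or_exit` from the same loop would end the script on the first attempt.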

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 86-92: The cluster-monitoring-config uses the removed
k8sPrometheusAdapter key; update the manifest to use metricsServer instead:
replace the top-level k8sPrometheusAdapter mapping with metricsServer and keep
the nested nodeSelector and tolerations (the node-role.kubernetes.io/infra
selector and the NoSchedule toleration with key node-role.kubernetes.io/infra
and operator Exists) so the Cluster Monitoring Operator on OCP 4.22 will accept
and apply the configuration.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4d255d7c-6c5f-42e2-b51d-6172c5690787

📥 Commits

Reviewing files that changed from the base of the PR and between c4e854e and 61f4ed6.

📒 Files selected for processing (1)

ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh

The rebalanceInfra function was restarting prometheus-k8s statefulset
without configuring node placement, causing pods to randomly land on
worker nodes instead of infra nodes. This led to OOM issues on workers
as prometheus workload is resource-intensive.

Root cause: Missing nodeSelector and tolerations configuration for
prometheus before pod restart. Previously, topologySpreadConstraints
helped ensure at least one prometheus pod landed on infra nodes (as
described in RFE-5107), but topologySpreadConstraints is no longer
present in the current prometheus-k8s StatefulSet. Without explicit
nodeSelector and tolerations, prometheus pods schedule to workers.

Changes:

- Apply cluster-monitoring-config ConfigMap with nodeSelector and
  tolerations for prometheusK8s to explicitly target infra nodes
  (other monitoring components consume minimal resources and remain
  on workers)
- Wait for cluster-monitoring-operator to reconcile the StatefulSet
  template spec before restarting pods (poll up to 5 minutes using jq
  to verify nodeSelector and tolerations are present in the spec)
- Add inline verification after rollout to ensure prometheus pods
  actually land on infra nodes (12 retries over 2 minutes)
- Fail fast with explicit error if StatefulSet reconciliation times out
  or pods don't schedule to infra nodes, preventing silent OOM failures
  on workers

Related: https://redhat.atlassian.net/browse/RFE-5107

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Sandeepyadav93 · 2026-05-15T14:03:44Z

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector

openshift-merge-bot · 2026-05-15T14:03:47Z

@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-merge-bot · 2026-05-15T14:04:46Z

[REHEARSALNOTIFIER]
@Sandeepyadav93: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-120nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-249nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-498nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-498nodes-onperfsector	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-24nodes-onperfsector	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-node-density-heavy-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-node-density-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-24nodes-onperfsector	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-120nodes-onperfsector	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-nd	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-nd-cni	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-cdv2	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-249nodes-onperfsector-crd	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-nd	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-nd-cni	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-cdv2	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-498nodes-onperfsector-crd	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-data-path-9nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-node-density-heavy-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-node-density-24nodes	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-249nodes-onperfsector-nd	openshift-eng/ocp-qe-perfscale-ci	presubmit	Registry content changed

A total of 48 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Sandeepyadav93 · 2026-05-15T14:57:38Z

/assign @mukrishn @mcornea

Sandeepyadav93 · 2026-05-15T15:06:38Z

Looking good

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/79335/rehearse-79335-periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-24nodes-onperfsector/2055288859511492608/artifacts/control-plane-24nodes-onperfsector/openshift-qe-hypershift-infra/build-log.txt

15-05-2026T14:45:20  Fri May 15 14:45:20 UTC 2026 - Initiate migration of prometheus to infra nodepools
prometheus-k8s-0                                         6/6     Running   0          12m     10.131.10.11   ip-10-0-110-117.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-1                                         6/6     Running   0          12m     10.128.6.11    ip-10-0-113-94.us-east-2.compute.internal    <none>           <none>
NAME             READY   AGE
prometheus-k8s   2/2     12m
15-05-2026T14:45:20  Fri May 15 14:45:20 UTC 2026 - Apply cluster-monitoring-config to move prometheus to infra nodes
configmap/cluster-monitoring-config created
15-05-2026T14:45:21  Fri May 15 14:45:21 UTC 2026 - Wait for cluster-monitoring-operator to reconcile the configuration
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - StatefulSet reconciled with infra nodeSelector and tolerations
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - Restart stateful set pods
rollout restart -n openshift-monitoring statefulset/prometheus-k8s
statefulset.apps/prometheus-k8s restarted
15-05-2026T14:45:31  Fri May 15 14:45:31 UTC 2026 - Wait till they are completely restarted
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision prometheus-k8s-7d5c6ccc66...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 2 pods at revision prometheus-k8s-7d5c6ccc66...
15-05-2026T14:47:35  Fri May 15 14:47:35 UTC 2026 - Verify prometheus pods are running on infra nodes
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - prometheus pod on ip-10-0-64-208.us-east-2.compute.internal (infra node) ✓
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - prometheus pod on ip-10-0-119-27.us-east-2.compute.internal (infra node) ✓
15-05-2026T14:47:36  Fri May 15 14:47:36 UTC 2026 - All prometheus-k8s pods are on infra nodes ✓
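
The final verification step in this log could be approximated as below. This is a sketch under assumptions, not the script's actual code: the function name `pods_off_infra` and its input format are hypothetical, and the `oc` queries in the trailing comments assume infra nodes carry the `node-role.kubernetes.io/infra` label.

```shell
#!/usr/bin/env bash
# pods_off_infra INFRA_NODES POD_NODE_PAIRS
#   INFRA_NODES:    newline-separated node names
#   POD_NODE_PAIRS: newline-separated "pod node" pairs
# Prints each pod whose node is missing from the infra list and
# returns 0 only when every pod sits on an infra node.
pods_off_infra() {
  local infra_nodes=$1 pod_nodes=$2 pod node misplaced=0
  while read -r pod node; do
    [[ -z "$pod" ]] && continue
    # An unscheduled pod (empty node name) counts as misplaced.
    if [[ -z "$node" ]] || ! grep -qxF -- "$node" <<<"$infra_nodes"; then
      echo "$pod"
      misplaced=1
    fi
  done <<<"$pod_nodes"
  return "$misplaced"
}

# Gathering the inputs from a live cluster (not executed here):
#   infra=$(oc get nodes -l node-role.kubernetes.io/infra \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
#   pods=$(oc -n openshift-monitoring get pods \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
#     grep '^prometheus-k8s-')
#   pods_off_infra "$infra" "$pods" || echo "some pods are not on infra nodes" >&2
```

Wrapping the call in a bounded retry loop would give the "12 retries over 2 minutes" behavior the PR description mentions.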

mcornea · 2026-05-18T07:13:59Z

/lgtm

mcornea · 2026-05-18T07:14:12Z

/pj-rehearse ack

openshift-merge-bot · 2026-05-18T07:14:15Z

@mcornea: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2026-05-18T07:14:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mcornea, Sandeepyadav93

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/openshift-qe/OWNERS~~ [Sandeepyadav93,mcornea]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-05-18T07:29:31Z

@Sandeepyadav93: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai Bot reviewed May 15, 2026

Sandeepyadav93 force-pushed the hcp_fix branch from c4e854e to 61f4ed6 Compare May 15, 2026 10:10

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2026

coderabbitai Bot reviewed May 15, 2026

Sandeepyadav93 force-pushed the hcp_fix branch 2 times, most recently from de86cf0 to 1b7df99 Compare May 15, 2026 13:47

Sandeepyadav93 force-pushed the hcp_fix branch from 1b7df99 to e298e1a Compare May 15, 2026 14:01

openshift-ci Bot assigned mcornea and mukrishn May 15, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2026

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026

openshift-merge-bot Bot merged commit 12c766e into openshift:main May 18, 2026
11 checks passed

Sandeepyadav93 deleted the hcp_fix branch May 18, 2026 13:08
