
HYPERFLEET-1124 - docs: spike for dedicated GKE node pool for maestro #165

Merged
openshift-merge-bot[bot] merged 1 commit into openshift-hyperfleet:main from kuudori:HYPERFLEET-1124-dedicated-maestro-node-pool-spike
Jun 19, 2026

Conversation

@kuudori

@kuudori kuudori commented Jun 18, 2026

Contributor

Summary

  • Spike evaluating dedicated GKE node pool for maestro on hyperfleet-dev-prow cluster
  • All pricing, sizing, and feature claims verified against primary sources (GCP docs, upstream chart, live cluster data)
  • Includes measured E2E load data (tier0), DaemonSet memory breakdown, and single-node SPOF analysis
  • Recommendation: adopt, blocked on upstream chart PR adding nodeSelector/tolerations to openshift-online/maestro

Ticket

HYPERFLEET-1124

@openshift-ci openshift-ci Bot requested review from ciaranRoche and ldornele June 18, 2026 18:09
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f8b308fb-9311-4742-9271-883493b08432

📥 Commits

Reviewing files that changed from the base of the PR and between 0958e79 and 97affed.

📒 Files selected for processing (1)
  • hyperfleet/docs/spike-dedicated-maestro-node-pool.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Added infrastructure design documentation evaluating system reliability improvements for Maestro, including analysis, recommendations, implementation strategies, and decision tracking for planned cluster optimization work.

Walkthrough

A new spike document (hyperfleet/docs/spike-dedicated-maestro-node-pool.md) is added to evaluate running Maestro on a dedicated tainted node pool within the hyperfleet-dev-prow GKE cluster. It records a June 2026 incident, current cluster resource measurements, a proposed fixed-count node pool configuration with taints and labels, Terraform and Helm change sketches, Terraform operational caveats, a migration and rollback procedure, GKE maintenance exclusion strategy, interaction with existing disruption protections, a recommendation with explicit upstream chart blockers, and a pending team decision record template.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Security observations (no praise, direct):

  • Taint/toleration strategy (CWE-284): The proposed taint dedicated=maestro:NoSchedule with a matching toleration in Helm values isolates Maestro workloads. If any Helm subchart does not receive the toleration, E2E workloads that happen to carry a broad toleration could still schedule onto the dedicated pool. The document acknowledges this as a blocker but does not enumerate which subcharts lack upstream support — that list must be exhaustive before merging the implementation.
  • Terraform state drift (CWE-16): The document explicitly notes that changing taints forces node pool recreation. If terraform apply is interrupted mid-recreation, the cluster can enter a state with no Maestro-eligible nodes and no record of the drift. The migration plan should reference a state backup step (terraform state pull) before any taint modification.
  • No network policy scoping: Isolating by node pool does not enforce network-layer separation. Maestro API, Broker, and Adapter components on the dedicated pool remain reachable from E2E pods on the shared pool unless a NetworkPolicy restricts ingress. This is not addressed in the document.
  • Single-node failure acceptance: The document explicitly accepts the single-node failure tradeoff for the dedicated pool. This is a known availability gap (no redundancy without autoscaling) and should appear in the decision record with an explicit sign-off.
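The first observation, that a broad toleration lets non-Maestro pods onto the tainted pool, can be illustrated with a small model of the documented Kubernetes taint/toleration matching rules. This is a hedged sketch, not scheduler code: the `Taint` and `Toleration` classes and the `tolerates` and `can_schedule` helpers are hypothetical names introduced here, and only the taint `dedicated=maestro:NoSchedule` comes from the review above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    # Empty key or effect acts as a wildcard, as in the Kubernetes API.
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified model of whether one toleration matches one taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def can_schedule(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """A pod fits only if every NoSchedule taint is matched by some toleration."""
    tols = list(tolerations)
    return all(
        any(tolerates(t, taint) for t in tols)
        for taint in taints
        if taint.effect == "NoSchedule"
    )


# Taint from the review: dedicated=maestro:NoSchedule
MAESTRO_TAINT = Taint("dedicated", "maestro", "NoSchedule")

# Toleration the Maestro Helm values would need (assumed shape).
MAESTRO_TOLERATION = Toleration("dedicated", "Equal", "maestro", "NoSchedule")

# Hypothetical E2E pod toleration: no key, operator Exists, matches every taint.
BROAD_TOLERATION = Toleration(operator="Exists")

# Hypothetical toleration for a different dedicated pool.
OTHER_TOLERATION = Toleration("dedicated", "Equal", "e2e", "NoSchedule")

if __name__ == "__main__":
    print("maestro pod:", can_schedule([MAESTRO_TOLERATION], [MAESTRO_TAINT]))
    print("broad E2E pod:", can_schedule([BROAD_TOLERATION], [MAESTRO_TAINT]))
    print("untolerating pod:", can_schedule([], [MAESTRO_TAINT]))
```

In this model the pod with the keyless `Exists` toleration is admitted to the dedicated pool just like the Maestro pod, which is why the taint alone does not guarantee isolation. A toleration also only permits scheduling onto the pool; pinning Maestro there still depends on the nodeSelector mentioned in the PR summary.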
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly references the spike evaluation of a dedicated GKE node pool for maestro, which is the primary focus of the changeset. |
| Description check | ✅ Passed | The description accurately summarizes the spike content: dedicated node pool evaluation, verified data sources, measured load data, and blocking upstream dependency. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Sec-02: Secrets In Log Output | ✅ Passed | PR adds only Markdown documentation. No code files, no log statements (slog/log/logr/zap/fmt.Print*), no secrets/tokens/passwords/credentials exposed in log output. |
| No Hardcoded Secrets | ✅ Passed | Documentation-only PR with no hardcoded secrets. Terraform/Kubernetes examples use standard configuration (oauth_scopes, taints, labels) not credentials. No API keys, tokens, passwords, base64 secr... |
| No Weak Cryptography | ✅ Passed | Documentation-only PR containing no code; spike document discusses GKE node pool infrastructure with no cryptographic primitives, custom crypto implementations, or timing attacks on secrets. |
| No Injection Vectors | ✅ Passed | Documentation-only spike (single .md file). No executable code added. Reference code blocks (HCL, YAML) contain no injection patterns: zero SQL queries, exec.Command, template.HTML, or unsafe deser... |
| No Privileged Containers | ✅ Passed | PR adds only documentation spike (markdown file). No Kubernetes manifests, Helm templates, or Dockerfiles with privileged container settings detected. Example code blocks show only node pool taints... |
| No Pii Or Sensitive Data In Logs | ✅ Passed | Documentation-only PR (spike analysis) contains zero logging statements and no PII, credentials, session IDs, or sensitive customer data patterns. |


Evaluate whether maestro should run on a dedicated tainted node pool
instead of sharing with E2E workloads. Includes verified pricing,
machine sizing with measured DaemonSet memory, Terraform/Helm diffs,
migration plan, single-node SPOF analysis, and E2E load measurements.
@kuudori kuudori force-pushed the HYPERFLEET-1124-dedicated-maestro-node-pool-spike branch from 03cd500 to 97affed Compare June 18, 2026 18:11
**Forum:** sprint planning / architecture sync / Slack thread
**Participants:** ...
**Notes:** ...
--> No newline at end of file

Tip

nit — non-blocking suggestion

Category: Pattern

The file is missing a trailing newline. Ending the file with one follows POSIX convention and keeps future diffs clean; just add a newline character after the final line.

Suggested change
-->
-->

@rafabene

Contributor

/lgtm

@openshift-ci

openshift-ci Bot commented Jun 19, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rafabene

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 29d5a67 into openshift-hyperfleet:main Jun 19, 2026
5 checks passed