OCPEDGE-2286: Add instance availability check for AWS hypervisor deployment by fonta-rh · Pull Request #46 · openshift-eng/two-node-toolbox

fonta-rh · 2026-01-23T13:50:36Z

Summary

- Add pre-deployment EC2 capacity validation before CloudFormation stack creation
- Auto-detect first available AZ with capacity in the configured region
- Create targeted capacity reservation to guarantee instance provisioning
- Provide actionable error messages when no capacity is available
- Extract CloudFormation template to separate file for maintainability
- NEW: Add capacity error detection when starting stopped instances

Bug Fixes (follow-up commit)

- Fix platform mismatch: RHEL AMIs require Red Hat Enterprise Linux platform for capacity reservations
- Fix CloudFormation error: Use LaunchTemplate resource for CapacityReservationSpecification (not valid directly on EC2::Instance)

Capacity Reservation Lifecycle

The capacity reservation guarantees instance availability during creation and is then released:

- Creation: Reservation is created before CloudFormation stack, guaranteeing capacity in the chosen AZ
- Instance Launch: CloudFormation uses the targeted reservation to launch the instance
- Release: Once the instance is successfully running, the reservation is released:
  - Instance is modified to use "open" capacity preference (on-demand)
  - Reservation is cancelled (no longer needed)
  - This allows the instance to start/stop freely without reservation dependency

Safety net: Reservations have a 30-minute time limit in case release fails, preventing orphaned reservations.
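The release step can be sketched as two AWS CLI calls. This is only an illustration, not the code in this PR: the function name `release_reservation` and the `AWS_CLI` override (which lets the flow be dry-run with `AWS_CLI=echo`) are assumptions, and the sketch does not check which instance states AWS accepts for the modify call.

```shell
# Hypothetical override: set AWS_CLI=echo to print the commands instead of running them.
AWS_CLI="${AWS_CLI:-aws}"

# Release a targeted reservation once the instance no longer needs it.
# Arguments: instance ID, capacity reservation ID, region.
release_reservation() {
    local instance_id="$1" reservation_id="$2" region="$3"

    # Step 1: stop targeting the reservation; "open" lets the instance
    # fall back to regular on-demand capacity.
    "$AWS_CLI" ec2 modify-instance-capacity-reservation-attributes \
        --region "$region" \
        --instance-id "$instance_id" \
        --capacity-reservation-specification "CapacityReservationPreference=open" \
        || return 1

    # Step 2: cancel the reservation so it stops accruing cost.
    "$AWS_CLI" ec2 cancel-capacity-reservation \
        --region "$region" \
        --capacity-reservation-id "$reservation_id"
}
```

Doing the modify call first matters: cancelling the reservation while the instance still targets it would leave the instance pointing at a reservation that no longer exists.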

Instance Start Error Handling

When starting a stopped instance fails due to capacity issues:

- Detects InsufficientInstanceCapacity errors from AWS
- Explains that EC2 instances are permanently bound to their AZ (cannot be moved)
- Provides clear resolution: make destroy && make create (will find AZ with capacity)
- Warns about data loss (clusters, images, etc. on the hypervisor)

Why not auto-migrate? EC2 instances cannot be moved between AZs. "Migration" would require snapshotting volumes, creating a new instance in another AZ, and restoring - essentially a destroy/recreate with extra steps.
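A minimal sketch of how such detection could look in a start script. The function names, the exit codes, the message wording, and the `AWS_CLI` override are assumptions made for illustration; the actual start.sh in this PR may differ.

```shell
# Hypothetical override so the AWS call can be stubbed in tests.
AWS_CLI="${AWS_CLI:-aws}"

# Return 0 if the given error text reports an EC2 capacity shortage.
is_capacity_error() {
    case "$1" in
        *InsufficientInstanceCapacity*) return 0 ;;
        *) return 1 ;;
    esac
}

# Start a stopped instance. Returns 0 on success, 2 on a capacity error,
# and 1 on any other failure.
start_hypervisor() {
    local instance_id="$1" region="$2" err

    # Capture stderr only; the JSON printed on success is discarded.
    if err="$("$AWS_CLI" ec2 start-instances --region "$region" \
            --instance-ids "$instance_id" 2>&1 >/dev/null)"; then
        return 0
    fi

    if is_capacity_error "$err"; then
        {
            echo "ERROR: no EC2 capacity for this instance in its Availability Zone."
            echo "An instance is bound to its AZ and cannot be moved to another one."
            echo "To redeploy in an AZ with capacity, run: make destroy && make create"
            echo "WARNING: this deletes everything on the hypervisor (clusters, images, etc.)."
        } >&2
        return 2
    fi

    # Any other failure: pass the original AWS error through unchanged.
    printf '%s\n' "$err" >&2
    return 1
}
```

Returning a distinct exit code for the capacity case lets a calling Makefile target or wrapper tell it apart from other failures without parsing the message again.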

Testing

Tested successfully in us-east-1, eu-west-1, and eu-north-1 regions.

JIRA

OCPEDGE-2286

🤖 Generated with Claude Code

…oyment

Add pre-deployment capacity validation before CloudFormation stack creation to prevent failures due to EC2 capacity constraints.

Changes:
- Add create_capacity_reservation() function that auto-detects available AZs and creates a targeted capacity reservation
- Add cancel_capacity_reservation() for cleanup on destroy or failure
- Update CloudFormation template with CapacityReservationId and AvailabilityZone parameters with conditional usage
- Add ENABLE_CAPACITY_RESERVATION config variable (default: true)
- Add error handling flags to destroy.sh for consistency

The capacity check runs before stack creation, trying each AZ in the region until one with available capacity is found. On failure, users get actionable error messages suggesting alternative regions or instance types.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
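The "try each AZ until one has capacity" loop described in this commit might look roughly like the sketch below. `find_az_with_capacity` and its `try_reserve` callback are hypothetical names, not the PR's create_capacity_reservation(); the callback stands in for whatever command attempts the reservation in a given AZ.

```shell
# Print the first AZ for which the callback succeeds and return 0;
# return 1 with guidance on stderr if no AZ has capacity.
# Arguments: callback name, then the candidate AZs in order.
find_az_with_capacity() {
    local try_reserve="$1" az
    shift

    for az in "$@"; do
        # The callback returns 0 when a reservation could be created in this AZ.
        if "$try_reserve" "$az"; then
            printf '%s\n' "$az"
            return 0
        fi
    done

    echo "ERROR: no capacity in any AZ; try another region or instance type." >&2
    return 1
}
```

In a real script the AZ list would come from something like `aws ec2 describe-availability-zones`, and the callback would wrap the reservation call. Stopping at the first success keeps the number of reservation attempts as small as possible.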

openshift-ci-robot · 2026-01-23T13:50:41Z

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

openshift-ci · 2026-01-23T13:50:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh


Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [fonta-rh]


…n template

Bug fixes:
- Fix platform mismatch: RHEL requires "Red Hat Enterprise Linux" platform, not "Linux/UNIX" for capacity reservations
- Fix CloudFormation error: CapacityReservationSpecification is a LaunchTemplate property, not EC2::Instance property. Added conditional RHELLaunchTemplate resource that's created when capacity reservation is used

Refactoring:
- Extract CloudFormation template from heredoc in create.sh to separate file at templates/rhel-instance.yaml for better maintainability and validation
- Clean up YAML formatting and comments

Tested in us-east-1, eu-west-1, and eu-north-1 regions successfully.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

openshift-ci-robot · 2026-01-27T15:18:44Z

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.

…aned reservations

Capacity reservations now auto-expire after 60 minutes (configurable via CAPACITY_RESERVATION_DURATION_MINUTES). This prevents orphaned reservations that could accumulate costs if the deployment script crashes or is terminated.

The reservation only needs to exist until the EC2 instance launches - once running, the instance is independent of the reservation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
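One way to get an auto-expiring reservation is to pass a limited end date when creating it. The sketch below assumes GNU `date` and uses hypothetical function names and a hypothetical `AWS_CLI` override for dry runs; only CAPACITY_RESERVATION_DURATION_MINUTES and its 60-minute default come from the commit message.

```shell
# Hypothetical override: set AWS_CLI=echo to print the command instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Print a UTC timestamp the given number of minutes from now (GNU date syntax).
reservation_end_date() {
    local minutes="${1:-60}"
    date -u -d "+${minutes} minutes" '+%Y-%m-%dT%H:%M:%SZ'
}

# Create a targeted, time-limited reservation for one RHEL instance and
# print its ID. Arguments: region, availability zone, instance type.
create_timed_reservation() {
    local region="$1" az="$2" instance_type="$3"
    local minutes="${CAPACITY_RESERVATION_DURATION_MINUTES:-60}"

    "$AWS_CLI" ec2 create-capacity-reservation \
        --region "$region" \
        --availability-zone "$az" \
        --instance-type "$instance_type" \
        --instance-platform "Red Hat Enterprise Linux" \
        --instance-count 1 \
        --instance-match-criteria targeted \
        --end-date-type limited \
        --end-date "$(reservation_end_date "$minutes")" \
        --query 'CapacityReservation.CapacityReservationId' \
        --output text
}
```

Because the expiry is enforced by AWS rather than by the script, the reservation is cleaned up even if the deployment process is killed before it reaches its own cleanup code.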

…deployment

When an EC2 instance is created with a targeted capacity reservation, the instance remains bound to that reservation. If the reservation expires while the instance is stopped, the instance cannot be restarted because AWS requires the targeted reservation to be active.

This change releases the capacity reservation immediately after the instance is successfully deployed:
- Modifies the instance to use "open" capacity preference (on-demand)
- Cancels the capacity reservation (no longer needed)
- Cleans up local tracking files

The reservation's purpose is to guarantee capacity during creation - once the instance exists, it's no longer needed and releasing it allows the instance to start/stop freely.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

openshift-ci-robot · 2026-02-03T09:13:38Z

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.

Add enhanced error handling in start.sh to detect InsufficientInstanceCapacity errors when starting a stopped instance.

When capacity issues occur, provide clear, actionable error messages explaining:
- EC2 instances are permanently bound to their AZ and cannot be moved
- The resolution path: make destroy && make create
- The trade-off: data loss warning for hypervisor contents

This complements the pre-deployment capacity reservation system by handling the edge case where an instance was created successfully but later encounters capacity issues when restarting after being stopped.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

openshift-ci-robot · 2026-02-03T09:30:17Z

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.

Add targeted SC2153 disable for the line where shellcheck incorrectly flags REGION as a possible misspelling of the local 'region' variable. REGION is sourced from instance.env via common.sh.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 23, 2026

openshift-ci bot requested review from jaypoulz and qJkee January 23, 2026 13:50

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2026

fonta-rh force-pushed the OCPEDGE-2286-aws-capacity-reservation-check branch from 517dc0e to 4571b79 Compare January 27, 2026 15:15

fonta-rh and others added 2 commits January 30, 2026 14:09
