
OCPEDGE-2286: Add instance availability check for AWS hypervisor deployment #46

Open
fonta-rh wants to merge 6 commits into openshift-eng:main from fonta-rh:OCPEDGE-2286-aws-capacity-reservation-check

Conversation

@fonta-rh
Contributor

@fonta-rh fonta-rh commented Jan 23, 2026

Summary

  • Add pre-deployment EC2 capacity validation before CloudFormation stack creation
  • Auto-detect first available AZ with capacity in the configured region
  • Create targeted capacity reservation to guarantee instance provisioning
  • Provide actionable error messages when no capacity is available
  • Extract CloudFormation template to separate file for maintainability
  • NEW: Add capacity error detection when starting stopped instances

Bug Fixes (follow-up commit)

  • Fix platform mismatch: capacity reservations for RHEL AMIs must use the "Red Hat Enterprise Linux" platform, not "Linux/UNIX"
  • Fix CloudFormation error: Use LaunchTemplate resource for CapacityReservationSpecification (not valid directly on EC2::Instance)

Capacity Reservation Lifecycle

The capacity reservation guarantees instance availability during creation and is then released:

  1. Creation: Reservation is created before CloudFormation stack, guaranteeing capacity in the chosen AZ
  2. Instance Launch: CloudFormation uses the targeted reservation to launch the instance
  3. Release: Once the instance is successfully running, the reservation is released:
    • Instance is modified to use "open" capacity preference (on-demand)
    • Reservation is cancelled (no longer needed)
    • This allows the instance to start/stop freely without reservation dependency

Safety net: Reservations have a 30-minute time limit in case release fails, preventing orphaned reservations.
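
The creation step above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `reserve_capacity` and its argument order are hypothetical, and it assumes the script calls the AWS CLI `create-capacity-reservation` subcommand with targeted matching and a limited end date.

```shell
# Sketch only: reserve_capacity and its parameters are hypothetical names.

# Request a targeted, time-limited capacity reservation for one instance.
# Prints the reservation ID on success; returns non-zero if AWS rejects it
# (for example, when the AZ has no capacity for the instance type).
reserve_capacity() {
    local region="$1" az="$2" instance_type="$3" end_date="$4"
    aws ec2 create-capacity-reservation \
        --region "$region" \
        --availability-zone "$az" \
        --instance-type "$instance_type" \
        --instance-platform "Red Hat Enterprise Linux" \
        --instance-count 1 \
        --instance-match-criteria targeted \
        --end-date-type limited \
        --end-date "$end_date" \
        --query 'CapacityReservation.CapacityReservationId' \
        --output text
}
```

With `targeted` matching, only instances that name this reservation explicitly consume it, which is why the launch step has to pass the reservation ID through to CloudFormation.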

Instance Start Error Handling

When starting a stopped instance fails due to capacity issues:

  • Detects InsufficientInstanceCapacity errors from AWS
  • Explains that EC2 instances are permanently bound to their AZ (cannot be moved)
  • Provides clear resolution: make destroy && make create (will find AZ with capacity)
  • Warns about data loss (clusters, images, etc. on the hypervisor)

Why not auto-migrate? EC2 instances cannot be moved between AZs. "Migration" would require snapshotting volumes, creating a new instance in another AZ, and restoring - essentially a destroy/recreate with extra steps.
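
A rough sketch of the detection described above, assuming the script inspects the stderr text of `aws ec2 start-instances`. The function names `is_capacity_error` and `start_instance` and the exact wording of the messages are illustrative, not taken from start.sh.

```shell
# Sketch only: function names and message wording are hypothetical.

# Return 0 when the given AWS CLI error text reports a capacity shortage.
is_capacity_error() {
    case "$1" in
        *InsufficientInstanceCapacity*) return 0 ;;
        *) return 1 ;;
    esac
}

# Start an instance; on a capacity failure, print guidance instead of the
# raw AWS error. Any other failure is passed through unchanged.
start_instance() {
    local region="$1" instance_id="$2" err
    # Capture stderr only; the normal JSON output on stdout is discarded.
    if err=$(aws ec2 start-instances --region "$region" \
            --instance-ids "$instance_id" 2>&1 >/dev/null); then
        return 0
    fi
    if is_capacity_error "$err"; then
        echo "ERROR: no capacity for this instance type in the instance's AZ." >&2
        echo "EC2 instances are bound to their AZ and cannot be moved." >&2
        echo "To recover: make destroy && make create" >&2
        echo "WARNING: this deletes all data on the hypervisor." >&2
    else
        printf '%s\n' "$err" >&2
    fi
    return 1
}
```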

Testing

Tested successfully in us-east-1, eu-west-1, and eu-north-1 regions.

JIRA

OCPEDGE-2286

🤖 Generated with Claude Code

…oyment

Add pre-deployment capacity validation before CloudFormation stack creation
to prevent failures due to EC2 capacity constraints.

Changes:
- Add create_capacity_reservation() function that auto-detects available AZs
  and creates a targeted capacity reservation
- Add cancel_capacity_reservation() for cleanup on destroy or failure
- Update CloudFormation template with CapacityReservationId and
  AvailabilityZone parameters with conditional usage
- Add ENABLE_CAPACITY_RESERVATION config variable (default: true)
- Add error handling flags to destroy.sh for consistency

The capacity check runs before stack creation, trying each AZ in the region
until one with available capacity is found. On failure, users get actionable
error messages suggesting alternative regions or instance types.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
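
The AZ search described in the commit message above can be sketched as a simple loop. This is an illustration under assumptions: `list_azs`, `find_az_with_capacity`, and the `try_reserve` callback are hypothetical names, and the real create_capacity_reservation() may be structured differently.

```shell
# Sketch only: all function names here are hypothetical.

# Print the available AZs of a region, one per line.
list_azs() {
    aws ec2 describe-availability-zones \
        --region "$1" \
        --filters Name=state,Values=available \
        --query 'AvailabilityZones[].ZoneName' \
        --output text | tr '\t' '\n'
}

# Try each AZ in turn and print the first one where try_reserve succeeds.
# try_reserve is supplied by the caller: it takes an AZ name and returns 0
# if a capacity reservation could be created there.
find_az_with_capacity() {
    local region="$1" az
    for az in $(list_azs "$region"); do
        if try_reserve "$az" >/dev/null 2>&1; then
            printf '%s\n' "$az"
            return 0
        fi
    done
    echo "ERROR: no capacity in any AZ of $region." >&2
    echo "Try a different region or instance type." >&2
    return 1
}
```

Attempting the reservation itself is the capacity check: a successful reservation both proves capacity exists and holds it until the stack is created.
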
@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Jan 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Jan 23, 2026

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


@openshift-ci openshift-ci bot requested review from jaypoulz and qJkee January 23, 2026 13:50
@openshift-ci

openshift-ci bot commented Jan 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh


@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jan 23, 2026
…n template

Bug fixes:
- Fix platform mismatch: RHEL requires "Red Hat Enterprise Linux" platform,
  not "Linux/UNIX" for capacity reservations
- Fix CloudFormation error: CapacityReservationSpecification is a
  LaunchTemplate property, not EC2::Instance property. Added conditional
  RHELLaunchTemplate resource that's created when capacity reservation is used

Refactoring:
- Extract CloudFormation template from heredoc in create.sh to separate file
  at templates/rhel-instance.yaml for better maintainability and validation
- Clean up YAML formatting and comments

Tested in us-east-1, eu-west-1, and eu-north-1 regions successfully.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@fonta-rh fonta-rh force-pushed the OCPEDGE-2286-aws-capacity-reservation-check branch from 517dc0e to 4571b79 Compare January 27, 2026 15:15
@openshift-ci-robot

openshift-ci-robot commented Jan 27, 2026

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.


fonta-rh and others added 2 commits January 30, 2026 14:09
…aned reservations

Capacity reservations now auto-expire after 60 minutes (configurable via
CAPACITY_RESERVATION_DURATION_MINUTES). This prevents orphaned reservations
that could accumulate costs if the deployment script crashes or is terminated.

The reservation only needs to exist until the EC2 instance launches - once
running, the instance is independent of the reservation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
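
For illustration, the expiry timestamp for such a reservation could be derived as below. The variable name CAPACITY_RESERVATION_DURATION_MINUTES and the 60-minute default come from the commit message above (the PR description mentions a 30-minute limit instead); the helper `reservation_end_date` is a hypothetical name, and the sketch assumes GNU `date`.

```shell
# Sketch only: reservation_end_date is a hypothetical helper name.

# Minutes until the reservation auto-expires; overridable from the environment.
CAPACITY_RESERVATION_DURATION_MINUTES="${CAPACITY_RESERVATION_DURATION_MINUTES:-60}"

# Print a UTC timestamp N minutes from now in ISO 8601 form.
# Relies on the GNU date "-d" option; BSD/macOS date uses different flags.
reservation_end_date() {
    date -u -d "+$1 minutes" '+%Y-%m-%dT%H:%M:%SZ'
}

end_date=$(reservation_end_date "$CAPACITY_RESERVATION_DURATION_MINUTES")
echo "Reservation would expire at: $end_date"
```

The resulting value is the kind of timestamp that `aws ec2 create-capacity-reservation` accepts for `--end-date` when `--end-date-type limited` is set.
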
…deployment

When an EC2 instance is created with a targeted capacity reservation, the
instance remains bound to that reservation. If the reservation expires while
the instance is stopped, the instance cannot be restarted because AWS requires
the targeted reservation to be active.

This change releases the capacity reservation immediately after the instance
is successfully deployed:
- Modifies the instance to use "open" capacity preference (on-demand)
- Cancels the capacity reservation (no longer needed)
- Cleans up local tracking files

The reservation's purpose is to guarantee capacity during creation - once the
instance exists, it's no longer needed and releasing it allows the instance to
start/stop freely.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
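
The three release steps listed in the commit message above map onto two AWS CLI calls and a file removal. The sketch below assumes those are the calls the script makes; `release_reservation` and the tracking-file argument are hypothetical names.

```shell
# Sketch only: release_reservation and its parameters are hypothetical names.

# Detach an instance from its targeted reservation, cancel the reservation,
# and remove the local file that tracked the reservation ID.
release_reservation() {
    local region="$1" instance_id="$2" reservation_id="$3" tracking_file="$4"

    # 1. Switch the instance to the "open" capacity preference so it no
    #    longer depends on the specific reservation.
    aws ec2 modify-instance-capacity-reservation-attributes \
        --region "$region" \
        --instance-id "$instance_id" \
        --capacity-reservation-specification CapacityReservationPreference=open \
        || return 1

    # 2. Cancel the reservation, which is no longer needed.
    aws ec2 cancel-capacity-reservation \
        --region "$region" \
        --capacity-reservation-id "$reservation_id" \
        || return 1

    # 3. Clean up local tracking state.
    rm -f "$tracking_file"
}
```

The order matters in this sketch: if the reservation were cancelled first and the modify call then failed, the instance would still point at a reservation that no longer exists.
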
@openshift-ci-robot

openshift-ci-robot commented Feb 3, 2026

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.


Add enhanced error handling in start.sh to detect InsufficientInstanceCapacity
errors when starting a stopped instance. When capacity issues occur, provide
clear, actionable error messages explaining:

- EC2 instances are permanently bound to their AZ and cannot be moved
- The resolution path: make destroy && make create
- The trade-off: data loss warning for hypervisor contents

This complements the pre-deployment capacity reservation system by handling
the edge case where an instance was created successfully but later encounters
capacity issues when restarting after being stopped.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@openshift-ci-robot

openshift-ci-robot commented Feb 3, 2026

@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.


Add targeted SC2153 disable for the line where shellcheck incorrectly
flags REGION as a possible misspelling of the local 'region' variable.
REGION is sourced from instance.env via common.sh.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
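
As a small illustration of a targeted disable, a ShellCheck directive placed on the line directly above a command applies only to that command. The function `describe_region` below is hypothetical and is not the line changed in this commit.

```shell
# Sketch only: describe_region is a hypothetical function.

# In the real scripts REGION comes from a sourced file; a default is set
# here only so the sketch runs on its own.
REGION="${REGION:-us-east-1}"

describe_region() {
    local region="$1"
    # REGION (uppercase) is intentional here, not a typo of 'region'.
    # shellcheck disable=SC2153
    echo "requested=${region} configured=${REGION}"
}
```

Scoping the directive to one command keeps SC2153 active for the rest of the file, so genuine misspellings elsewhere are still reported.
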