OCPEDGE-2286: Add instance availability check for AWS hypervisor deployment #46
…oyment
Add pre-deployment capacity validation before CloudFormation stack creation to prevent failures due to EC2 capacity constraints.
Changes:
- Add create_capacity_reservation() function that auto-detects available AZs and creates a targeted capacity reservation
- Add cancel_capacity_reservation() for cleanup on destroy or failure
- Update CloudFormation template with CapacityReservationId and AvailabilityZone parameters with conditional usage
- Add ENABLE_CAPACITY_RESERVATION config variable (default: true)
- Add error handling flags to destroy.sh for consistency
The capacity check runs before stack creation, trying each AZ in the region until one with available capacity is found. On failure, users get actionable error messages suggesting alternative regions or instance types.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
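The "try each AZ until one has capacity" loop described above could look roughly like the sketch below. It is not the PR's create_capacity_reservation(); the names list_azs and find_az_with_capacity are hypothetical, and the reservation attempt is passed in as a caller-supplied command so the loop stays independent of how capacity is actually probed.

```shell
# Sketch only: list_azs and find_az_with_capacity are hypothetical helper names.

# Print the names of the available AZs in a region (tab-separated on one line).
list_azs() {
    local region="$1"
    aws ec2 describe-availability-zones \
        --region "${region}" \
        --filters Name=state,Values=available \
        --query 'AvailabilityZones[].ZoneName' \
        --output text
}

# Try each AZ in turn and print the first one for which the caller-supplied
# command succeeds. The command receives the AZ name as its last argument.
find_az_with_capacity() {
    local region="$1"
    shift
    local az
    for az in $(list_azs "${region}"); do
        if "$@" "${az}"; then
            echo "${az}"
            return 0
        fi
        echo "No capacity in ${az}, trying the next AZ" >&2
    done
    echo "No AZ in ${region} has capacity; consider another region or instance type" >&2
    return 1
}
```

A caller would pass its own reservation function, for example `find_az_with_capacity "${REGION}" my_try_reserve`, where my_try_reserve is whatever command attempts the reservation in a given AZ.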
@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fonta-rh
…n template
Bug fixes:
- Fix platform mismatch: RHEL requires "Red Hat Enterprise Linux" platform, not "Linux/UNIX" for capacity reservations
- Fix CloudFormation error: CapacityReservationSpecification is a LaunchTemplate property, not EC2::Instance property. Added conditional RHELLaunchTemplate resource that's created when capacity reservation is used
Refactoring:
- Extract CloudFormation template from heredoc in create.sh to separate file at templates/rhel-instance.yaml for better maintainability and validation
- Clean up YAML formatting and comments
Tested in us-east-1, eu-west-1, and eu-north-1 regions successfully.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
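To illustrate the platform fix, here is a minimal sketch of a reservation call that passes the RHEL platform string. The function name reserve_rhel_capacity and its argument order are assumptions for illustration, not taken from the PR; the AWS CLI flags themselves are standard `aws ec2 create-capacity-reservation` options.

```shell
# Sketch only: reserve_rhel_capacity is a hypothetical helper name.
# Creates a one-instance targeted reservation and prints its ID.
reserve_rhel_capacity() {
    local region="$1" az="$2" instance_type="$3"
    # The platform must match the AMI's platform; for RHEL this is
    # "Red Hat Enterprise Linux" rather than "Linux/UNIX".
    aws ec2 create-capacity-reservation \
        --region "${region}" \
        --availability-zone "${az}" \
        --instance-type "${instance_type}" \
        --instance-platform "Red Hat Enterprise Linux" \
        --instance-count 1 \
        --instance-match-criteria targeted \
        --query 'CapacityReservation.CapacityReservationId' \
        --output text
}
```

With `targeted` match criteria, only instances that explicitly reference the reservation ID consume it, which is why the template needs the CapacityReservationId parameter.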
517dc0e to 4571b79
@fonta-rh: This pull request references OCPEDGE-2286 which is a valid jira issue.
…aned reservations
Capacity reservations now auto-expire after 60 minutes (configurable via CAPACITY_RESERVATION_DURATION_MINUTES). This prevents orphaned reservations that could accumulate costs if the deployment script crashes or is terminated.
The reservation only needs to exist until the EC2 instance launches - once running, the instance is independent of the reservation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
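One way the auto-expiry could be wired up is to compute an end timestamp and pass it to `aws ec2 create-capacity-reservation` via `--end-date-type limited --end-date`. The sketch below assumes GNU date (the `-d "+N minutes"` syntax does not work with BSD/macOS date); the helper names reservation_end_date and reservation_expiry_args are hypothetical.

```shell
# Sketch only: helper names are hypothetical; requires GNU date.

# Print a UTC timestamp N minutes from now. N defaults to
# CAPACITY_RESERVATION_DURATION_MINUTES, or 60 if that is unset.
reservation_end_date() {
    local minutes="${1:-${CAPACITY_RESERVATION_DURATION_MINUTES:-60}}"
    date -u -d "+${minutes} minutes" '+%Y-%m-%dT%H:%M:%SZ'
}

# Print the extra arguments that make a reservation expire on its own.
reservation_expiry_args() {
    echo "--end-date-type limited --end-date $(reservation_end_date "$@")"
}
```

The expiry is a fallback: if the script dies before it can cancel the reservation, AWS ends it at the given time.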
…deployment
When an EC2 instance is created with a targeted capacity reservation, the instance remains bound to that reservation. If the reservation expires while the instance is stopped, the instance cannot be restarted because AWS requires the targeted reservation to be active.
This change releases the capacity reservation immediately after the instance is successfully deployed:
- Modifies the instance to use "open" capacity preference (on-demand)
- Cancels the capacity reservation (no longer needed)
- Cleans up local tracking files
The reservation's purpose is to guarantee capacity during creation - once the instance exists, it's no longer needed and releasing it allows the instance to start/stop freely.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
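The three release steps above might be sketched as follows. The function name release_capacity_reservation, its argument order, and the tracking-file argument are assumptions for illustration. The source does not say which instance state the modify call is issued in, so treat that as something to verify against the AWS documentation.

```shell
# Sketch only: release_capacity_reservation and its arguments are hypothetical.
release_capacity_reservation() {
    local region="$1" instance_id="$2" reservation_id="$3" tracking_file="${4:-}"

    # Step 1: switch the instance from the targeted reservation to the
    # "open" preference so it no longer depends on that reservation.
    aws ec2 modify-instance-capacity-reservation-attributes \
        --region "${region}" \
        --instance-id "${instance_id}" \
        --capacity-reservation-specification CapacityReservationPreference=open \
        || return 1

    # Step 2: cancel the reservation, which is no longer needed.
    aws ec2 cancel-capacity-reservation \
        --region "${region}" \
        --capacity-reservation-id "${reservation_id}" \
        || return 1

    # Step 3: remove the local tracking file, if one was given.
    if [ -n "${tracking_file}" ]; then
        rm -f -- "${tracking_file}"
    fi
}
```

The order matters: the instance is detached before the reservation is cancelled, so at no point does it target a reservation that no longer exists.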
Add enhanced error handling in start.sh to detect InsufficientInstanceCapacity errors when starting a stopped instance.
When capacity issues occur, provide clear, actionable error messages explaining:
- EC2 instances are permanently bound to their AZ and cannot be moved
- The resolution path: make destroy && make create
- The trade-off: data loss warning for hypervisor contents
This complements the pre-deployment capacity reservation system by handling the edge case where an instance was created successfully but later encounters capacity issues when restarting after being stopped.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
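A minimal sketch of this kind of detection is shown below: capture the CLI's error output and look for the InsufficientInstanceCapacity error code. The function name start_instance, the return codes, and the message wording are assumptions, not the actual contents of start.sh.

```shell
# Sketch only: start_instance, its return codes, and the message text are hypothetical.
start_instance() {
    local region="$1" instance_id="$2" output

    if output="$(aws ec2 start-instances \
            --region "${region}" \
            --instance-ids "${instance_id}" 2>&1)"; then
        echo "Instance ${instance_id} is starting"
        return 0
    fi

    case "${output}" in
        *InsufficientInstanceCapacity*)
            # Capacity problem: explain why and how to recover.
            cat >&2 <<EOF
ERROR: AWS has no capacity for ${instance_id} in its availability zone.
EC2 instances are bound to their AZ and cannot be moved.
To recover, run: make destroy && make create
WARNING: this destroys the hypervisor and all data on it.
EOF
            return 2
            ;;
    esac

    # Any other failure: pass the original AWS error through unchanged.
    echo "${output}" >&2
    return 1
}
```

Using a distinct return code for the capacity case lets a caller such as a Makefile target tell it apart from other start failures.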
Add targeted SC2153 disable for the line where shellcheck incorrectly flags REGION as a possible misspelling of the local 'region' variable. REGION is sourced from instance.env via common.sh.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
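For readers unfamiliar with the directive, a targeted disable applies only to the next command. The sketch below is illustrative rather than the PR's actual code: describe_region is a hypothetical function, and REGION is given a default here only so the snippet runs on its own (in the PR it is sourced from instance.env via common.sh).

```shell
# Sketch only: describe_region is hypothetical. REGION normally comes from a
# sourced file; the default below just keeps this snippet self-contained.
: "${REGION:=us-east-1}"

describe_region() {
    local region
    # REGION is the sourced global, not a misspelling of the local 'region'.
    # shellcheck disable=SC2153
    region="${REGION}"
    echo "Using region ${region}"
}
```

Scoping the directive to one line keeps SC2153 active for the rest of the script, where it can still catch real typos.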
Summary
Bug Fixes (follow-up commit)
- Red Hat Enterprise Linux platform for capacity reservations
- LaunchTemplate resource for CapacityReservationSpecification (not valid directly on EC2::Instance)
Capacity Reservation Lifecycle
The capacity reservation guarantees instance availability during creation and is then released:
Safety net: Reservations have a 30-minute time limit in case release fails, preventing orphaned reservations.
Instance Start Error Handling
When starting a stopped instance fails due to capacity issues:
- Detects InsufficientInstanceCapacity errors from AWS
- Points to the resolution path: make destroy && make create (will find AZ with capacity)
Why not auto-migrate? EC2 instances cannot be moved between AZs. "Migration" would require snapshotting volumes, creating a new instance in another AZ, and restoring, which is essentially a destroy/recreate with extra steps.
Testing
Tested successfully in us-east-1, eu-west-1, and eu-north-1 regions.
JIRA
OCPEDGE-2286
🤖 Generated with Claude Code