
Maintenance: Fix flaky MetricsE2ET E2E test caused by CloudWatch metrics propagation delays #2440

@phipag

Description


Summary

The MetricsE2ET E2E test (MetricsE2ET.java) fails intermittently due to CloudWatch metrics propagation delays. The test deploys a Lambda function, invokes it twice, then polls CloudWatch for the emitted metrics using MetricsFetcher. The retry configuration in RetryUtils allows 60 attempts at 5-second intervals (300 seconds total). When CloudWatch takes longer than 300 seconds to make metrics queryable, the test throws MetricDataNotFoundException or DataNotReadyException and fails.
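The retry budget arithmetic above can be sketched as follows. This is a hypothetical, self-contained illustration; `RetryBudgetSketch` and `totalBudget` are not the project's actual `RetryUtils` API, which is assumed here to use a fixed interval per the description.

```java
import java.time.Duration;

// Hypothetical sketch (not RetryUtils): total wait covered by a retry
// schedule, for a fixed interval vs. capped exponential backoff.
public final class RetryBudgetSketch {
    // Sum of delays for `attempts` retries, starting at `base` and
    // doubling up to `cap` (a fixed interval is just base == cap).
    static Duration totalBudget(int attempts, Duration base, Duration cap) {
        long totalMs = 0;
        long delayMs = base.toMillis();
        for (int i = 0; i < attempts; i++) {
            totalMs += delayMs;
            delayMs = Math.min(delayMs * 2, cap.toMillis());
        }
        return Duration.ofMillis(totalMs);
    }

    public static void main(String[] args) {
        // Current config: 60 attempts at a fixed 5 s interval = 300 s.
        System.out.println(totalBudget(60, Duration.ofSeconds(5), Duration.ofSeconds(5)).toSeconds());
        // Backoff capped at 30 s covers 605 s with only 22 attempts.
        System.out.println(totalBudget(22, Duration.ofSeconds(5), Duration.ofSeconds(30)).toSeconds());
    }
}
```

This is why a timeout-based budget (or backoff) is more robust than a fixed attempt count: the window CloudWatch needs is measured in wall-clock time, not in polls.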

Three of the last four commits on main (as of 2026-03-27) triggered this failure, each on a different Java version and with a different metric name:

| Commit | Java | Error | CI Run |
| --- | --- | --- | --- |
| 5753d70c | 17 | `MetricDataNotFoundException`: No data found for metric ColdStart | Run 23648995071 |
| becec931 | 25 | `DataNotReadyException`: Expected 2.0 orders but got 1.0 | Run 23648967704 |
| 4bef85e9 | 11 | `MetricDataNotFoundException`: No data found for metric products | Run 23648952341 |

The failures affect different metrics (ColdStart, orders, products) across different Java versions (11, 17, 25), which confirms this is a timing issue and not a code regression. The problem is amplified when multiple commits merge in quick succession, since each triggers a parallel E2E run that competes for CloudWatch API throughput.

Why is this needed?

Flaky E2E tests erode trust in CI signals. Maintainers cannot distinguish a real metrics regression from a CloudWatch propagation delay without manually inspecting logs. This costs maintainer time on every failure and creates a habit of ignoring red builds, which increases the risk of missing an actual regression.

The current retry budget of 300 seconds is insufficient for CloudWatch's eventual consistency model. The CloudWatch GetMetricData API can take 5-10 minutes or longer to return data for recently published metrics, especially under concurrent load.

Which area does this relate to?

Metrics, Tests

Solution

Relevant source files:

Possible approaches (not exhaustive):

  1. Increase retry budget. Raise MAX_ATTEMPTS from 60 to 120 (600 seconds total) in RetryUtils, or use exponential backoff to cover a longer window without doubling the attempt count.
  2. Widen the CloudWatch query time window. LambdaInvoker currently sets the query window to a 1-minute range (invocation minute to invocation minute + 1). Widening this window reduces the chance of missing metrics that land on a minute boundary.
  3. Add a configurable per-test retry config. MetricsE2ET already uses the default retry config. Allow the metrics test to pass a custom, longer retry config to MetricsFetcher.fetchMetrics() without affecting other E2E tests.
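Approach 2 can be sketched as pure time arithmetic. This is a minimal illustration, assuming the 1-minute window described above; `QueryWindowSketch`, `Window`, and `widened` are hypothetical names, not `LambdaInvoker`'s actual API.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of approach 2: pad the CloudWatch query window
// around the invocation minute. Names are illustrative only.
public final class QueryWindowSketch {
    record Window(Instant start, Instant end) {}

    // The issue describes the current window as [invocation minute,
    // invocation minute + 1). Padding both sides tolerates metrics that
    // land on a minute boundary or arrive after a propagation delay.
    static Window widened(Instant invocationTime, int padBeforeMinutes, int padAfterMinutes) {
        Instant minute = invocationTime.truncatedTo(ChronoUnit.MINUTES);
        return new Window(
                minute.minus(padBeforeMinutes, ChronoUnit.MINUTES),
                minute.plus(1 + padAfterMinutes, ChronoUnit.MINUTES));
    }

    public static void main(String[] args) {
        // An invocation at 10:15:30 padded by 1 minute before and
        // 5 minutes after yields the window 10:14:00 .. 10:21:00.
        Window w = widened(Instant.parse("2026-03-27T10:15:30Z"), 1, 5);
        System.out.println(w.start() + " .. " + w.end());
    }
}
```

A wider window is cheap (GetMetricData is queried per time range, not per minute) and composes with either retry change, so approaches 1-3 are not mutually exclusive.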

Acknowledgment
