Maintenance: Fix flaky MetricsE2ET E2E test caused by CloudWatch metrics propagation delays #2440
Description
Summary
The MetricsE2ET E2E test (MetricsE2ET.java) fails intermittently due to CloudWatch metrics propagation delays. The test deploys a Lambda function, invokes it twice, then polls CloudWatch for the emitted metrics using MetricsFetcher. The retry configuration in RetryUtils allows 60 attempts at 5-second intervals (300 seconds total). When CloudWatch takes longer than 300 seconds to make metrics queryable, the test throws MetricDataNotFoundException or DataNotReadyException and fails.
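The polling shape described above can be sketched as a fixed-interval retry loop. The constants mirror the described budget (60 attempts, 5-second interval); the helper itself is hypothetical and not the project's actual RetryUtils API:

```java
import java.util.function.Supplier;

public class FixedIntervalPoll {
    // Mirrors the described retry budget: 60 attempts x 5 s = 300 s total.
    static final int MAX_ATTEMPTS = 60;
    static final long INTERVAL_MS = 5_000;

    // Polls until fetch stops throwing or the attempt budget is spent;
    // rethrows the last failure once exhausted.
    static <T> T poll(Supplier<T> fetch, int maxAttempts, long intervalMs)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) { // e.g. a metric-not-found exception
                last = e;
                Thread.sleep(intervalMs);
            }
        }
        throw last; // budget exhausted: the test fails
    }

    // Total wall-clock budget in seconds for the default configuration.
    static long totalBudgetSeconds() {
        return MAX_ATTEMPTS * INTERVAL_MS / 1000; // 300 s
    }
}
```

Once CloudWatch needs longer than this fixed budget to make the metric queryable, every remaining attempt fails the same way and the test goes red.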
Three of the last four commits on main (2026-03-27) triggered this failure, each on a different Java version and with a different metric name:
| Commit | Java | Error | CI Run |
|---|---|---|---|
| 5753d70c | 17 | MetricDataNotFoundException: No data found for metric ColdStart | Run 23648995071 |
| becec931 | 25 | DataNotReadyException: Expected 2.0 orders but got 1.0 | Run 23648967704 |
| 4bef85e9 | 11 | MetricDataNotFoundException: No data found for metric products | Run 23648952341 |
The failures affect different metrics (ColdStart, orders, products) across different Java versions (11, 17, 25), which confirms this is a timing issue and not a code regression. The problem is amplified when multiple commits merge in quick succession, since each triggers a parallel E2E run that competes for CloudWatch API throughput.
Why is this needed?
Flaky E2E tests erode trust in CI signals. Maintainers cannot distinguish a real metrics regression from a CloudWatch propagation delay without manually inspecting logs. This costs maintainer time on every failure and creates a habit of ignoring red builds, which increases the risk of missing an actual regression.
The current retry budget of 300 seconds is insufficient for CloudWatch's eventual consistency model: the GetMetricData API can take 5 to 10 minutes or longer to return data for recently published metrics, especially under concurrent load.
Which area does this relate to?
Metrics, Tests
Solution
Relevant source files:
- MetricsE2ET.java - test class
- MetricsFetcher.java - CloudWatch polling logic
- RetryUtils.java - retry configuration (60 attempts, 5s interval)
- LambdaInvoker.java - invocation timestamp and query window logic
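The LambdaInvoker query-window logic can be checked in isolation. A minimal sketch of computing both the current 1-minute window and a padded variant (the class and method names here are hypothetical, not the project's actual code):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MetricQueryWindow {
    // Current behavior as described: a 1-minute window starting at the
    // invocation's minute (invocation minute to invocation minute + 1).
    static Instant[] narrowWindow(Instant invocation) {
        Instant start = invocation.truncatedTo(ChronoUnit.MINUTES);
        return new Instant[] { start, start.plus(Duration.ofMinutes(1)) };
    }

    // Hypothetical widened variant: pad both sides so metrics that land
    // on or near a minute boundary still fall inside the queried range.
    static Instant[] widenedWindow(Instant invocation, Duration pad) {
        Instant[] narrow = narrowWindow(invocation);
        return new Instant[] { narrow[0].minus(pad), narrow[1].plus(pad) };
    }
}
```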
Possible approaches (not exhaustive):
- Increase the retry budget. Raise MAX_ATTEMPTS from 60 to 120 (600 seconds total) in RetryUtils, or use exponential backoff to cover a longer window without doubling the attempt count.
- Widen the CloudWatch query time window. LambdaInvoker currently sets the query window to a 1-minute range (invocation minute to invocation minute + 1). Widening this window reduces the chance of missing metrics that land on a minute boundary.
- Add a configurable per-test retry config. MetricsE2ET already uses the default retry config. Allow the metrics test to pass a custom, longer retry config to MetricsFetcher.fetchMetrics() without affecting other E2E tests.
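The exponential-backoff option above can be sketched as a capped delay schedule that covers a longer window with far fewer attempts. The constants and names are illustrative, not the project's RetryUtils API:

```java
import java.time.Duration;

public class BackoffSchedule {
    // Capped exponential backoff: with base 5 s and cap 60 s the delays
    // are 5 s, 10 s, 20 s, 40 s, 60 s, 60 s, ...
    static Duration delay(int attempt, Duration base, Duration cap) {
        long shifted = base.toMillis() << Math.min(attempt, 30); // bound the shift
        return Duration.ofMillis(Math.min(shifted, cap.toMillis()));
    }

    // Total wall-clock time covered by the first n delays.
    static Duration budget(int n, Duration base, Duration cap) {
        Duration total = Duration.ZERO;
        for (int i = 0; i < n; i++) {
            total = total.plus(delay(i, base, cap));
        }
        return total;
    }
}
```

With a 5-second base and 60-second cap, 13 attempts already cover over 600 seconds of wall-clock time, versus 120 fixed 5-second attempts for the same window, which also lowers pressure on the CloudWatch API during parallel E2E runs.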
Acknowledgment
- This request meets Powertools for AWS Lambda (Java) Tenets
- Should this be considered in other Powertools for AWS Lambda languages? i.e. Python, TypeScript