
Maintenance: Fix flaky MetricsE2ET E2E test caused by CloudWatch metrics propagation delays #2440

@phipag

Description


Summary

The MetricsE2ET E2E test (MetricsE2ET.java) fails intermittently due to CloudWatch metrics propagation delays. The test deploys a Lambda function, invokes it twice, then polls CloudWatch for the emitted metrics using MetricsFetcher. The retry configuration in RetryUtils allows 60 attempts at 5-second intervals (300 seconds total). When CloudWatch takes longer than 300 seconds to make metrics queryable, the test throws MetricDataNotFoundException or DataNotReadyException and fails.
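The retry budget arithmetic above can be sketched as follows. This is a hypothetical, self-contained illustration; `RetryBudgetSketch` and `totalBudget` are not the project's actual `RetryUtils` API, which is assumed here to use a fixed interval per the description.

```java
import java.time.Duration;

// Hypothetical sketch (not RetryUtils): total wait covered by a retry
// schedule, for a fixed interval vs. capped exponential backoff.
public final class RetryBudgetSketch {
    // Sum of delays for `attempts` retries, starting at `base` and
    // doubling up to `cap` (a fixed interval is just base == cap).
    static Duration totalBudget(int attempts, Duration base, Duration cap) {
        long totalMs = 0;
        long delayMs = base.toMillis();
        for (int i = 0; i < attempts; i++) {
            totalMs += delayMs;
            delayMs = Math.min(delayMs * 2, cap.toMillis());
        }
        return Duration.ofMillis(totalMs);
    }

    public static void main(String[] args) {
        // Current config: 60 attempts at a fixed 5 s interval = 300 s.
        System.out.println(totalBudget(60, Duration.ofSeconds(5), Duration.ofSeconds(5)).toSeconds());
        // Backoff capped at 30 s covers 605 s with only 22 attempts.
        System.out.println(totalBudget(22, Duration.ofSeconds(5), Duration.ofSeconds(30)).toSeconds());
    }
}
```

This is why a timeout-based budget (or backoff) is more robust than a fixed attempt count: the window CloudWatch needs is measured in wall-clock time, not in polls.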

Three of the last four commits on main (as of 2026-03-27) triggered this failure, each on a different Java version and with a different metric name:

| Commit | Java | Error | CI Run |
| --- | --- | --- | --- |
| 5753d70c | 17 | `MetricDataNotFoundException`: No data found for metric ColdStart | Run 23648995071 |
| becec931 | 25 | `DataNotReadyException`: Expected 2.0 orders but got 1.0 | Run 23648967704 |
| 4bef85e9 | 11 | `MetricDataNotFoundException`: No data found for metric products | Run 23648952341 |

The failures affect different metrics (ColdStart, orders, products) across different Java versions (11, 17, 25), which confirms this is a timing issue and not a code regression. The problem is amplified when multiple commits merge in quick succession, since each triggers a parallel E2E run that competes for CloudWatch API throughput.

Why is this needed?

Flaky E2E tests erode trust in CI signals. Maintainers cannot distinguish a real metrics regression from a CloudWatch propagation delay without manually inspecting logs. This costs maintainer time on every failure and creates a habit of ignoring red builds, which increases the risk of missing an actual regression.

The current retry budget of 300 seconds is insufficient for CloudWatch's eventual consistency model. The CloudWatch GetMetricData API can take 5-10 minutes or longer to return data for recently published metrics, especially under concurrent load.

Which area does this relate to?

Metrics, Tests

Solution

Relevant source files:

Possible approaches (not exhaustive):

  1. Increase retry budget. Raise MAX_ATTEMPTS from 60 to 120 (600 seconds total) in RetryUtils, or use exponential backoff to cover a longer window without doubling the attempt count.
  2. Widen the CloudWatch query time window. LambdaInvoker currently sets the query window to a 1-minute range (invocation minute to invocation minute + 1). Widening this window reduces the chance of missing metrics that land on a minute boundary.
  3. Add a configurable per-test retry config. MetricsE2ET already uses the default retry config. Allow the metrics test to pass a custom, longer retry config to MetricsFetcher.fetchMetrics() without affecting other E2E tests.
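Approach 2 can be sketched as pure time arithmetic. This is a minimal illustration, assuming the 1-minute window described above; `QueryWindowSketch`, `Window`, and `widened` are hypothetical names, not `LambdaInvoker`'s actual API.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of approach 2: pad the CloudWatch query window
// around the invocation minute. Names are illustrative only.
public final class QueryWindowSketch {
    record Window(Instant start, Instant end) {}

    // The issue describes the current window as [invocation minute,
    // invocation minute + 1). Padding both sides tolerates metrics that
    // land on a minute boundary or arrive after a propagation delay.
    static Window widened(Instant invocationTime, int padBeforeMinutes, int padAfterMinutes) {
        Instant minute = invocationTime.truncatedTo(ChronoUnit.MINUTES);
        return new Window(
                minute.minus(padBeforeMinutes, ChronoUnit.MINUTES),
                minute.plus(1 + padAfterMinutes, ChronoUnit.MINUTES));
    }

    public static void main(String[] args) {
        // An invocation at 10:15:30 padded by 1 minute before and
        // 5 minutes after yields the window 10:14:00 .. 10:21:00.
        Window w = widened(Instant.parse("2026-03-27T10:15:30Z"), 1, 5);
        System.out.println(w.start() + " .. " + w.end());
    }
}
```

A wider window is cheap (GetMetricData is queried per time range, not per minute) and composes with either retry change, so approaches 1-3 are not mutually exclusive.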

Acknowledgment
