Skip to content

Cloud Topics: Deflake L0 GC integration tests and also consolidate them significantly#29978

Merged
oleiman merged 5 commits intodevfrom
ci/core-15765/repeater-timeout-cdt-consolidate
Mar 30, 2026
Merged

Cloud Topics: Deflake L0 GC integration tests and also consolidate them significantly#29978
oleiman merged 5 commits intodevfrom
ci/core-15765/repeater-timeout-cdt-consolidate

Conversation

@oleiman
Copy link
Copy Markdown
Member

@oleiman oleiman commented Mar 27, 2026

Test suite got a little hairy over time and was flaky on CDT.

Remove 7 tests redundant with C++ unit tests or low integration value:

  • test_l0_gc: subsumed by every other test that produces + waits for GC
  • test_concurrent_pause_start: state machine concurrency covered by
    level_zero_gc_mt_test.cc concurrent_reset_start_pause
  • test_reset_while_paused: covered by level_zero_gc_tests.cc
    ResetWhilePaused (exact same scenario)
  • test_reset_while_running: covered by level_zero_gc_tests.cc
    ResetWhileRunning (exact same scenario)
  • test_grace_period_enforcement: grace period filtering covered by
    level_zero_gc_tests.cc NoDeletesForYoungObjects
  • test_epoch_lag_and_catchup: tests internal epoch_lag metric response
    to hot-reconfigure of gc_interval; no cluster invariant at stake
  • test_leadership_transfer_during_gc: L0 GC scans the object store by
    prefix range, not by partition ownership; leadership is irrelevant

Combine 5 admin API tests (get_status, basic_pause_unpause,
single_node_pause_unpause, not_found, partial_failure) into one
test_admin_api that exercises all endpoints after a single produce.

Combine test_node_failure_mid_gc and test_topic_deletion_during_gc into
test_cluster_disruption_resilience.

Fold test_gc_metrics batching assertion into test_gc_stress.

Fixes

JIRA filter

  • fixes CORE-15919
  • fixes CORE-15863
  • fixes CORE-15845
  • fixes CORE-15809
  • fixes CORE-15783
  • fixes CORE-15773
  • fixes CORE-15772
  • fixes CORE-15765
  • fixes CORE-15674
  • fixes CORE-15590

Proof (CDT)

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@oleiman oleiman self-assigned this Mar 27, 2026
@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 27, 2026

/ci-repeat 1
skip-units
dt-repeat=10
tests/rptest/tests/cloud_topics/l0_gc_test.py

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Mar 27, 2026

Retry command for Build#82404

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCResilienceTest.test_cluster_disruption_resilience@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCResilienceTest.test_cluster_disruption_resilience@{"cloud_storage_type":2}

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Mar 27, 2026

CI test results

test results on build#82404
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
CloudTopicsL0GCResilienceTest test_cluster_disruption_resilience {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/82404#019d2db8-d20f-4b61-9387-42021f2fcdec FLAKY 18/20 The test was found to be new, and no failures are allowed https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCResilienceTest&test_method=test_cluster_disruption_resilience
CloudTopicsL0GCResilienceTest test_cluster_disruption_resilience {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/82404#019d2db5-0d4b-46ab-91d7-95011f813f65 FLAKY 19/20 The test was found to be new, and no failures are allowed https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCResilienceTest&test_method=test_cluster_disruption_resilience
test results on build#82436
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/82436#019d3128-8a76-4894-914c-9899de9b3bcb FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0908, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2485, p1=0.0575, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
test results on build#82455
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ControllerLogLimitMirrorMakerTests test_mirror_maker_with_limits null integration https://buildkite.com/redpanda/redpanda/builds/82455#019d38c8-0b88-48f5-a993-e26295556d7e FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0273, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits

@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from 1c4dfc1 to 2221481 Compare March 27, 2026 06:05
@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 27, 2026

/cdt
dt-repeat=20
tests/rptest/tests/cloud_topics/l0_gc_test.py

@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from 2221481 to 3db4658 Compare March 27, 2026 15:32
@oleiman oleiman marked this pull request as ready for review March 27, 2026 15:32
Copilot AI review requested due to automatic review settings March 27, 2026 15:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Cloud Topics L0 GC integration tests to reduce flakiness and significantly consolidate coverage (as described: 21 tests → 8), while adjusting test workloads and timeouts to better align with epoch-based GC behavior.

Changes:

  • Reworks produce_some to run for a minimum duration (based on the configured epoch increment interval) to ensure L0 objects span multiple epochs for GC eligibility.
  • Consolidates multiple admin API / pause-unpause / partial-failure checks into a single combined test flow.
  • Adds/adjusts longer timeouts and introduces a resilience test covering node failure + topic deletion, plus updates data-integrity/stress tests to be more tolerant to transient produce failures.

Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py Outdated
@oleiman oleiman changed the title WIP: Cloud Topics: Deflake L0 GC integration tests and also consolidate them significantly Cloud Topics: Deflake L0 GC integration tests and also consolidate them significantly Mar 27, 2026
@oleiman oleiman marked this pull request as draft March 27, 2026 18:02
@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from 3db4658 to a2fdbb1 Compare March 27, 2026 18:07
@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 27, 2026

/cdt
dt-repeat=50
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCDataIntegrityTest

@oleiman oleiman requested a review from Copilot March 27, 2026 18:53
@oleiman oleiman added the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Mar 27, 2026
@oleiman oleiman marked this pull request as ready for review March 27, 2026 18:53
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py Outdated
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py Outdated
@claude
Copy link
Copy Markdown

claude bot commented Mar 27, 2026

PR Review: Cloud Topics L0 GC test consolidation and deflaking

Overall: Good consolidation — 21 tests down to 8 is a meaningful reduction in CI cost and flake surface. The structural changes (time-based produce_some, wait_for_status helper, try/finally for producer/consumer cleanup, tolerate_failed_produce) are all solid deflaking patterns. The code reads well.

What looks good

  • produce_some rework: Switching from message-count-based progress to time-based "run across N epoch intervals" directly addresses the root cause of flakes — GC needs data spread across epochs, not just a count of messages. The wait_until_with_progress_check with progress_sec is a good safeguard.
  • wait_for_status helper: Cleanly handles the transient SAFETY_BLOCKED states that were likely causing flakes in direct check_statuses calls.
  • try/finally cleanup: Both CloudTopicsL0GCDataIntegrityTest and CloudTopicsL0GCStressTest now properly stop() and free() producer/consumer in finally blocks, preventing resource leaks on assertion failures.
  • Timeout increases (30s → 60s): Reasonable given these are integration tests running against real cloud storage.
  • Batching assertion moved to stress test: Makes sense — it needs enough data volume to be meaningful.

Concerns

  1. Coverage loss in test_admin_api Phase 4 (single-node pause): The old test_single_node_pause_unpause verified the behavioral effect of pausing — that the paused node stopped deleting, and that unpausing caused it to start again. The consolidated test only checks API status, not actual deletion behavior. See inline comment.

  2. Coverage loss in Phase 5 (partial failure recovery): After restarting the killed node, the test pauses and starts it but doesn't verify the node actually resumes GC work. The old test_partial_failure also had a gc_pause → restart → gc_pause(recovered_node) flow, but this new version doesn't confirm end-to-end recovery.

  3. Removed tests with no replacement:

    • test_concurrent_pause_start (rapid pause/start toggling with concurrent resets) — this was a concurrency stress test. Is this coverage still needed, or is it considered low-value?
    • test_reset_while_paused and test_reset_while_runninggc_reset() is no longer exercised anywhere in the file.
    • test_grace_period_enforcement — verified the minimum-object-age grace period. Is this covered elsewhere?
    • test_epoch_lag_and_catchup — verified hot-reconfiguration of GC intervals. Is this covered elsewhere?
    • test_leadership_transfer_during_gc — verified GC survives leadership transfers.
    • test_gc_metrics (dedicated metrics test) — some metrics checks moved to stress test, but epoch_lag, max_deleted_epoch, and objects_listed metrics are no longer tested.

    It would be helpful to note in the PR description which of these removals are intentional coverage reductions (low-value/flaky) vs. coverage that's been folded into the remaining tests.

  4. Minor: The test_l0_gc basic test was removed entirely — the simplest "produce, wait for deletes" smoke test is gone. The closest replacement is test_idle_housekeeping, but that has a different purpose (idle GC after repeater stops).

Nits

  • See inline comments for minor style/consistency observations.

No blocking issues — the deflaking improvements are valuable and the clean CDT run is encouraging. The main question is whether the removed concurrency/reset/grace-period/leadership-transfer coverage is intentionally dropped or should be tracked for follow-up.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py Outdated
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py Outdated
@oleiman oleiman marked this pull request as draft March 27, 2026 18:59
@oleiman oleiman marked this pull request as draft March 27, 2026 18:59
@oleiman oleiman removed the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Mar 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
@oleiman oleiman marked this pull request as ready for review March 27, 2026 21:08
@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 28, 2026

/cdt
rp_version=build
provider=azure
region=centralus
cdt_instance_type=Standard_L8as_v3
dt-repeat=20
tests/rptest/tests/cloud_topics/l0_gc_test.py

@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 28, 2026

/cdt
provider=gcp
region=us-west2
cdt_instance_type=n2d-standard-4
dt-repeat=10
tests/rptest/tests/cloud_topics/l0_gc_test.py

@oleiman oleiman added the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Mar 29, 2026
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
Comment thread tests/rptest/tests/cloud_topics/l0_gc_test.py
@claude
Copy link
Copy Markdown

claude bot commented Mar 29, 2026

PR Review: Cloud Topics L0 GC Test Consolidation

Overall this is a solid cleanup that significantly reduces test suite size (-893/+342 lines) while preserving meaningful integration coverage. The consolidation rationale in the PR description is well-justified — each removed test is mapped to its C++ unit test counterpart. The CDT proof across AWS/Azure/GCP with multiple runs is thorough.

What looks good

  • produce_some refactor: The time-based approach with wait_until_with_progress_check is much more robust than the old message-count approach. Ensuring production spans multiple epoch intervals directly addresses the root cause of flakiness (GC needs objects spread across epochs).
  • wait_for_status helper: Clean polling abstraction that replaces repeated inline check_statuses + wait_until patterns.
  • Resource cleanup: Adding try/finally with producer.free() / consumer.free() in both test_no_data_loss_under_gc and test_gc_stress fixes resource leaks on assertion failures — the old code didn't call free() at all.
  • Timeout bumps (30s → 60s): Sensible for CI environments; the old 30s timeouts were likely a flake source.
  • test_admin_api consolidation: Combining 5 small admin API tests into phased execution after a single produce cycle is a good trade-off — the phases are well-labeled and logically ordered.
  • test_cluster_disruption_resilience: Merging node failure + topic deletion into one test with a longer produce phase makes sense since both test GC's resilience to cluster disruptions.

Suggestions

  • Weak assertion in stress test (line 1024): assert acked > 0 is very permissive for a test that's supposed to push ~3 GiB. If the producer mostly fails, downstream GC/L1 assertions would fail with confusing errors. Consider a minimum threshold (e.g., MSG_COUNT // 2) like the data integrity test uses.
  • Edge case in produce_some: If _epoch_increment_interval_ms < 1000, integer division would yield epoch_interval_s=0 and min_runtime_s=0, making the time-gate trivially satisfied. Not a concern with current defaults (2000ms) but worth noting.

No concerns with

  • Removing tests covered by C++ unit tests (concurrent_pause_start, reset_while_paused/running, grace_period, epoch_lag, leadership_transfer) — the justification is clear and the unit test coverage is cited.
  • Adding tolerate_failed_produce=True — appropriate for integration tests where transient produce failures shouldn't cause flakes, especially with the 75% threshold guard in the data integrity test.

@oleiman oleiman marked this pull request as draft March 29, 2026 07:28
@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from 067e6d8 to 84cfda2 Compare March 29, 2026 08:39
@oleiman oleiman marked this pull request as ready for review March 29, 2026 08:39
oleiman added 3 commits March 29, 2026 16:48
produce_some used await_progress which waits for both produce AND
consume progress. In CDT the consume path through cloud topics
reconciliation is slow, and the produce path can also stall because
the L0 batcher's scheduling cycle occasionally exceeds the default
10s kafka produce timeout, causing request_timed_out errors.

Realistically all we care about is that GC has some work to do, so instead
of waiting on a specific amount of repeater progress, just let it run for
a configurable period of time, by default 6x the epoch increment interval.
As long as some L0 data goes in and the cluster epoch moves forward, GC
should have some work to do.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
wait_for_status waits for all shards on one node or the whole cluster to
converge on some expected status.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Remove 7 tests redundant with C++ unit tests or low integration value:
- test_l0_gc: subsumed by every other test that produces + waits for GC
- test_concurrent_pause_start: state machine concurrency covered by
  level_zero_gc_mt_test.cc concurrent_reset_start_pause
- test_reset_while_paused: covered by level_zero_gc_tests.cc
  ResetWhilePaused (exact same scenario)
- test_reset_while_running: covered by level_zero_gc_tests.cc
  ResetWhileRunning (exact same scenario)
- test_grace_period_enforcement: grace period filtering covered by
  level_zero_gc_tests.cc NoDeletesForYoungObjects
- test_epoch_lag_and_catchup: tests internal epoch_lag metric response
  to hot-reconfigure of gc_interval; no cluster invariant at stake
- test_leadership_transfer_during_gc: L0 GC scans the object store by
  prefix range, not by partition ownership; leadership is irrelevant
- test_gc_metrics: low value, batching assertion will fold into stress test

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from 84cfda2 to ff9c575 Compare March 29, 2026 23:49
oleiman added 2 commits March 29, 2026 17:07
Combine 5 admin API tests (get_status, basic_pause_unpause,
single_node_pause_unpause, not_found, partial_failure) into one
test_admin_api that exercises all endpoints after a single produce.

Combine test_node_failure_mid_gc and test_topic_deletion_during_gc into
test_cluster_disruption_resilience.

Fold test_gc_metrics batching assertion into test_gc_stress.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-15765/repeater-timeout-cdt-consolidate branch from ff9c575 to 460058e Compare March 30, 2026 00:10
@oleiman oleiman requested a review from a team March 30, 2026 03:42
Copy link
Copy Markdown
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deflakes LGTM, haven't spend a lot of time ensuring that we aren't losing test coverage in removal of the ducktape tests, but I'm willing to take the commit & cover letter at face value knowing that these things have been author & bot validated

@oleiman oleiman merged commit 4e00c12 into dev Mar 30, 2026
19 checks passed
@oleiman oleiman deleted the ci/core-15765/repeater-timeout-cdt-consolidate branch March 30, 2026 20:15
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v26.1.x

@oleiman
Copy link
Copy Markdown
Member Author

oleiman commented Mar 30, 2026

deflakes LGTM, haven't spend a lot of time ensuring that we aren't losing test coverage in removal of the ducktape tests, but I'm willing to take the commit & cover letter at face value knowing that these things have been author & bot validated

Thanks. Can always reintroduce less bad versions of these tests if needed.

Copy link
Copy Markdown
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-review Adding this label to a PR will trigger a workflow to review the code using claude.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants