Conversation
|
/ci-repeat 1 |
Retry command for Build#82404please wait until all jobs are finished before running the slash command |
CI test resultstest results on build#82404
test results on build#82436
test results on build#82455
|
1c4dfc1 to
2221481
Compare
|
/cdt |
2221481 to
3db4658
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the Cloud Topics L0 GC integration tests to reduce flakiness and significantly consolidate coverage (as described: 21 tests → 8), while adjusting test workloads and timeouts to better align with epoch-based GC behavior.
Changes:
- Reworks
produce_someto run for a minimum duration (based on the configured epoch increment interval) to ensure L0 objects span multiple epochs for GC eligibility. - Consolidates multiple admin API / pause-unpause / partial-failure checks into a single combined test flow.
- Adds/adjusts longer timeouts and introduces a resilience test covering node failure + topic deletion, plus updates data-integrity/stress tests to be more tolerant to transient produce failures.
3db4658 to
a2fdbb1
Compare
|
/cdt |
PR Review: Cloud Topics L0 GC test consolidation and deflakingOverall: Good consolidation — 21 tests down to 8 is a meaningful reduction in CI cost and flake surface. The structural changes (time-based What looks good
Concerns
Nits
No blocking issues — the deflaking improvements are valuable and the clean CDT run is encouraging. The main question is whether the removed concurrency/reset/grace-period/leadership-transfer coverage is intentionally dropped or should be tracked for follow-up. |
|
/cdt |
|
/cdt |
PR Review: Cloud Topics L0 GC Test ConsolidationOverall this is a solid cleanup that significantly reduces test suite size (-893/+342 lines) while preserving meaningful integration coverage. The consolidation rationale in the PR description is well-justified — each removed test is mapped to its C++ unit test counterpart. The CDT proof across AWS/Azure/GCP with multiple runs is thorough. What looks good
Suggestions
No concerns with
|
067e6d8 to
84cfda2
Compare
produce_some used await_progress which waits for both produce AND consume progress. In CDT the consume path through cloud topics reconciliation is slow, and the produce path can also stall because the L0 batcher's scheduling cycle occasionally exceeds the default 10s kafka produce timeout, causing request_timed_out errors. Realistically all we care about is that GC has some work to do, so instead of waiting on a specific amount of repeater progress, just let it run for a configurable period of time, by default 6x the epoch increment interval. As long as some L0 data goes in and the cluster epoch moves forward, GC should have some work to do. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
wait_for_status waits for all shards on one node or the whole cluster to converge on some expected status. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Remove 7 tests redundant with C++ unit tests or low integration value: - test_l0_gc: subsumed by every other test that produces + waits for GC - test_concurrent_pause_start: state machine concurrency covered by level_zero_gc_mt_test.cc concurrent_reset_start_pause - test_reset_while_paused: covered by level_zero_gc_tests.cc ResetWhilePaused (exact same scenario) - test_reset_while_running: covered by level_zero_gc_tests.cc ResetWhileRunning (exact same scenario) - test_grace_period_enforcement: grace period filtering covered by level_zero_gc_tests.cc NoDeletesForYoungObjects - test_epoch_lag_and_catchup: tests internal epoch_lag metric response to hot-reconfigure of gc_interval; no cluster invariant at stake - test_leadership_transfer_during_gc: L0 GC scans the object store by prefix range, not by partition ownership; leadership is irrelevant - test_gc_metrics: low value, batching assertion will fold into stress test Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
84cfda2 to
ff9c575
Compare
Combine 5 admin API tests (get_status, basic_pause_unpause, single_node_pause_unpause, not_found, partial_failure) into one test_admin_api that exercises all endpoints after a single produce. Combine test_node_failure_mid_gc and test_topic_deletion_during_gc into test_cluster_disruption_resilience. Fold test_gc_metrics batching assertion into test_gc_stress. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
ff9c575 to
460058e
Compare
WillemKauf
left a comment
There was a problem hiding this comment.
deflakes LGTM, haven't spend a lot of time ensuring that we aren't losing test coverage in removal of the ducktape tests, but I'm willing to take the commit & cover letter at face value knowing that these things have been author & bot validated
|
/backport v26.1.x |
Thanks. Can always reintroduce less bad versions of these tests if needed. |
Test suite got a little hairy over time and was flaky on CDT.
Remove 7 tests redundant with C++ unit tests or low integration value:
level_zero_gc_mt_test.cc concurrent_reset_start_pause
ResetWhilePaused (exact same scenario)
ResetWhileRunning (exact same scenario)
level_zero_gc_tests.cc NoDeletesForYoungObjects
to hot-reconfigure of gc_interval; no cluster invariant at stake
prefix range, not by partition ownership; leadership is irrelevant
Combine 5 admin API tests (get_status, basic_pause_unpause,
single_node_pause_unpause, not_found, partial_failure) into one
test_admin_api that exercises all endpoints after a single produce.
Combine test_node_failure_mid_gc and test_topic_deletion_during_gc into
test_cluster_disruption_resilience.
Fold test_gc_metrics batching assertion into test_gc_stress.
Fixes
JIRA filter
Proof (CDT)
Backports Required
Release Notes