[chore](compaction) remove single replica compaction#63771
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
61db3dd to
d97a724
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 31818 ms |
TPC-DS: Total hot run time: 172532 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
5a02820 to
65422ce
Compare
|
run buildall |
TPC-H: Total hot run time: 31501 ms |
TPC-DS: Total hot run time: 171768 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
skip check_coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
Write down the reason for deletion, and delete the corresponding doc as well. |
Remove the single replica compaction (SRC) feature end-to-end across BE, FE, protocol definitions, and regression tests. SRC let one replica run compaction and others fetch the result; this drops the code path along with its tablet property, thread pool, metrics, HTTP `remote=true` knob, and FE plumbing. - BE: delete `SingleReplicaCompaction` class and unit tests; drop the SRC thread pool, replica-info refresher, tablet hooks, schema field, metrics, and config (`max_single_replica_compaction_threads`, `update_replica_infos_interval_seconds`). - FE: drop `enable_single_replica_compaction` property handling (PropertyAnalyzer, TableProperty, OlapTable, Env, InternalCatalog, SchemaChangeHandler, ModifyTablePropertiesOp) and the corresponding CreateReplicaTask / UpdateTabletMetaInfoTask fields. - Proto/Thrift: mark the removed fields as `reserved` / `// deprecated` to preserve wire compatibility. - Tests: delete SRC-only suites and strip the property from ~130 regression-test groovy/.out files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
65422ce to
dfc334d
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29656 ms |
TPC-DS: Total hot run time: 170047 ms |
|
/review |
There was a problem hiding this comment.
I found one behavior regression that should be fixed before merging.
Critical checkpoint conclusions:
- Goal/test: The PR removes single-replica compaction and associated table/schema/protocol plumbing. The removal is broadly implemented and many obsolete regression cases/properties are updated, but the manual HTTP compaction endpoint still accepts the old remote request semantics silently, and there is no replacement negative test for that.
- Scope/focus: Most changes are direct removals and are focused on the feature deletion.
- Concurrency: No new compaction/thread lifecycle path was added; removed background thread and thread pool cleanup are consistent. No new lock-order issue found.
- Lifecycle/static initialization: No new static/global lifecycle dependency found.
- Configuration: Removed BE configs are no longer consumed; Doris BE config loading ignores unknown file keys, so I did not find a startup blocker from stale config entries.
- Compatibility/storage/protocol: Thrift/protobuf field numbers are left deprecated/reserved and old values are ignored, which is consistent with removing the feature.
- Parallel paths: FE create/alter/create-replica/update-meta paths and BE tablet schema/task paths were checked; no other active single-replica-compaction path found.
- Tests: Obsolete positive tests were removed and expected outputs updated, but the endpoint compatibility/negative behavior for remote=true is missing.
- Observability/performance: No new logging/metric/performance issue found beyond the endpoint behavior below.
- Transaction/persistence/data correctness: No transaction or visible-version correctness issue found in the reviewed removal. Existing persisted table/proto metadata for the removed flag is ignored.
- Other issues: See inline comment.
User focus points: No additional user-provided review focus was present.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests. Doc: apache/doris-website#3870 ### Why remove it The main reason is **correctness risk in peer selection**. A follower replica had to pick a peer holding a "proper" version (`_find_rowset_to_fetch`) and fetch its compacted result, based on replica info that was only refreshed periodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests. Doc: apache/doris-website#3870 ### Why remove it The main reason is **correctness risk in peer selection**. A follower replica had to pick a peer holding a "proper" version (`_find_rowset_to_fetch`) and fetch its compacted result, based on replica info that was only refreshed periodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests.
Doc: apache/doris-website#3870
Why remove it
The main reason is correctness risk in peer selection. A follower replica had to pick a peer holding a "proper" version (
_find_rowset_to_fetch) and fetch its compacted result, based on replica info that was only refreshedperiodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)