Proposed Change
Problem
In Flink CDC workloads, the same row keys are often deleted multiple times within a single checkpoint (a small demo follows this list):
- Each UPDATE generates a DELETE + INSERT pair in the CDC stream
- Flink processes records one by one, with no opportunity for batch deduplication
- This results in 5-10x more delete records than necessary
- The redundant deletes create excessive small delete files, putting RPC pressure on the NameNode
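To make the duplication concrete, here is a self-contained demo (the three-update sequence and the key value `42` are hypothetical, not from the proposal) showing how a checkpoint-scoped set collapses three identical delete records into one:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateDeleteDemo {
  public static void main(String[] args) {
    // Three UPDATEs to primary key 42 arrive within one checkpoint; each
    // UPDATE's DELETE half emits a delete record for the same key.
    List<Integer> deleteKeys = List.of(42, 42, 42);

    Set<Integer> pendingDeleteKeys = new HashSet<>();
    int written = 0;
    for (int key : deleteKeys) {
      if (pendingDeleteKeys.add(key)) { // true only on the first occurrence
        written++;                      // only this delete reaches a delete file
      }
    }
    System.out.println("deletes received=" + deleteKeys.size() + ", written=" + written);
    // prints: deletes received=3, written=1
  }
}
```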
Proposal document
Add checkpoint-scoped deduplication caches to skip redundant delete operations (a minimal sketch follows this list):
- `pendingDeleteKeys`: deduplicates `deleteKey()` calls (key-only deletes)
- `pendingDeleteRows`: deduplicates `delete()` calls (full-row deletes)
- Caches are cleared when the writer closes, so memory is bounded by checkpoint size
- A check-before-copy optimization minimizes object allocations
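A minimal sketch of the writer-side mechanics, assuming a generic equality-delete writer; `DedupDeleteWriter`, `DeleteSink`, `copyKey()`, and `copyRow()` are illustrative names, not an existing API:

```java
import java.util.HashSet;
import java.util.Set;

/** Checkpoint-scoped deduplicating delete writer (illustrative sketch). */
final class DedupDeleteWriter<K, R> {
  /** Hypothetical handle to the underlying delete-file writer. */
  interface DeleteSink<K, R> {
    void writeKeyDelete(K key);
    void writeRowDelete(R row);
  }

  private final DeleteSink<K, R> sink;
  private final Set<K> pendingDeleteKeys = new HashSet<>(); // dedups deleteKey() calls
  private final Set<R> pendingDeleteRows = new HashSet<>(); // dedups delete() calls

  DedupDeleteWriter(DeleteSink<K, R> sink) {
    this.sink = sink;
  }

  /** Key-only delete: skipped if the key was already deleted in this checkpoint. */
  void deleteKey(K key) {
    // Check-before-copy: probe with the caller's (possibly reused) object first
    // and copy it for retention only when it is new, so the common duplicate
    // case allocates nothing.
    if (pendingDeleteKeys.contains(key)) {
      return;
    }
    pendingDeleteKeys.add(copyKey(key));
    sink.writeKeyDelete(key);
  }

  /** Full-row delete: same pattern, keyed on the entire row. */
  void delete(R row) {
    if (pendingDeleteRows.contains(row)) {
      return;
    }
    pendingDeleteRows.add(copyRow(row));
    sink.writeRowDelete(row);
  }

  /** Called at the checkpoint boundary; bounds cache memory to one checkpoint. */
  void close() {
    pendingDeleteKeys.clear();
    pendingDeleteRows.clear();
  }

  // Engines commonly reuse mutable row objects, so retained entries must be
  // defensive copies. Real code would deep-copy; identity suffices for a sketch.
  private K copyKey(K key) { return key; }
  private R copyRow(R row) { return row; }
}
```

Because both caches live only for the duration of one checkpoint, the worst-case footprint is the set of distinct keys and rows deleted in that checkpoint, which is what makes the memory bound above hold.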