
Add delete deduplication optimization for equality deletes in CDC scenarios #15336

@wangyum

Description


Proposed Change

Problem

In Flink CDC workloads, the same row keys are often deleted multiple times within a single checkpoint:

  • Each UPDATE in the CDC stream is emitted as a DELETE + INSERT pair
  • Flink processes records one by one, with no opportunity for batch deduplication
  • The result is 5-10x more delete records than necessary
  • This creates excessive small delete files, putting RPC pressure on the NameNode

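To make the duplication concrete, here is a minimal, self-contained sketch (hypothetical names, not actual Flink or Iceberg APIs) of how repeated updates to one key within a checkpoint expand into identical delete keys:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: models how a CDC stream expands UPDATEs into
// DELETE + INSERT pairs, duplicating delete keys inside one checkpoint.
public class CdcDeleteDuplication {

  // Emit the operations one checkpoint sees for repeated updates to one key.
  static List<String> opsForUpdates(String key, int updates) {
    List<String> ops = new ArrayList<>();
    for (int i = 1; i <= updates; i++) {
      ops.add("DELETE " + key);            // retract the old row version
      ops.add("INSERT " + key + " v" + i); // write the new row version
    }
    return ops;
  }

  static List<String> deleteOps(List<String> ops) {
    return ops.stream().filter(op -> op.startsWith("DELETE")).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> ops = opsForUpdates("key=42", 3);
    List<String> deletes = deleteOps(ops);
    // Three delete records are emitted, but they all target the same key.
    System.out.println(deletes.size());                      // 3
    System.out.println(deletes.stream().distinct().count()); // 1
  }
}
```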
Proposal document

Add checkpoint-scoped deduplication caches to skip redundant delete operations:

  • pendingDeleteKeys: deduplicates deleteKey() calls (key-only deletes)
  • pendingDeleteRows: deduplicates delete() calls (full-row deletes)
  • Both caches are cleared when the writer closes, so memory is bounded by the size of a checkpoint
  • A check-before-copy optimization minimizes object allocations by testing the cache before copying the key or row
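A minimal sketch of the proposed caches, assuming a simplified writer keyed by String; the class and method names are illustrative, not the actual Iceberg/Flink writer API, which operates on row and struct types:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical checkpoint-scoped delete-deduplication cache.
public class DedupDeleteWriter {
  // Keys/rows already deleted in the current checkpoint; cleared on close().
  private final Set<String> pendingDeleteKeys = new HashSet<>();
  private final Set<String> pendingDeleteRows = new HashSet<>();
  private int deleteRecordsWritten = 0;

  /** Returns true if a delete record was actually written. */
  public boolean deleteKey(String key) {
    // Check-before-copy: Set.add() both tests and inserts in one call,
    // so a duplicate key is skipped without any extra allocation.
    if (!pendingDeleteKeys.add(key)) {
      return false; // already deleted within this checkpoint
    }
    deleteRecordsWritten++; // stand-in for writing an equality delete record
    return true;
  }

  /** Full-row variant, deduplicated via pendingDeleteRows. */
  public boolean delete(String row) {
    if (!pendingDeleteRows.add(row)) {
      return false;
    }
    deleteRecordsWritten++;
    return true;
  }

  public int deleteRecordsWritten() {
    return deleteRecordsWritten;
  }

  /** Called when the writer closes: memory stays bounded per checkpoint. */
  public void close() {
    pendingDeleteKeys.clear();
    pendingDeleteRows.clear();
  }

  public static void main(String[] args) {
    DedupDeleteWriter writer = new DedupDeleteWriter();
    // An UPDATE replayed three times in one checkpoint deletes the same key.
    writer.deleteKey("id=42");
    writer.deleteKey("id=42");
    writer.deleteKey("id=42");
    System.out.println(writer.deleteRecordsWritten()); // 1 record, not 3
  }
}
```

Because the caches live only for the span of one checkpoint, the trade-off is a bounded in-memory set in exchange for far fewer delete records and small delete files.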


Specifications

  • Table
  • View
  • REST
  • Puffin
  • Encryption
  • Other

Metadata


Labels: proposal (Iceberg Improvement Proposal: spec/major changes/etc)
