
Estimate aggregate output rows using existing NDV statistics #20926

Open
buraksenn wants to merge 2 commits into apache:main from buraksenn:top-k-aggregate-estimation


@buraksenn
Contributor

Which issue does this PR close?

Part of #20766

Rationale for this change

Grouped aggregations currently estimate output rows as input_rows, ignoring available NDV (number of distinct values) statistics. Spark's AggregateEstimation and Trino's AggregationStatsRule both use NDV products to tighten this estimate, and this PR draws heavily on both.

What changes are included in this PR?

  • Estimate aggregate output rows as min(input_rows, product(NDV_i + null_adj_i) * grouping_sets)
  • Cap the estimate at the Top K limit when one is active, since the output row count cannot exceed K
  • Propagate distinct_count from child stats to group-by output columns
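
The estimate described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `GroupColumnStats`, `estimate_group_rows`, and their fields are hypothetical names, and the sketch assumes that null_adj_i is 1 when a grouping column may contain nulls (NULL forms its own group) and 0 otherwise, and that the estimate falls back to input_rows when any NDV is unknown.

```rust
/// Hypothetical per-column statistics for a grouping column.
/// These are illustrative types, not DataFusion's actual statistics structs.
#[derive(Clone, Copy)]
struct GroupColumnStats {
    /// Number of distinct non-null values, if known.
    distinct_count: Option<u64>,
    /// Whether the column may contain NULLs (assumed to add one extra group).
    has_nulls: bool,
}

/// Sketch of min(input_rows, product(NDV_i + null_adj_i) * grouping_sets),
/// optionally capped by a Top K limit.
fn estimate_group_rows(
    input_rows: u64,
    cols: &[GroupColumnStats],
    grouping_sets: u64,
    top_k: Option<u64>,
) -> u64 {
    // Multiply the adjusted NDVs; give up (None) if any NDV is unknown.
    let mut product: Option<u64> = Some(1);
    for c in cols {
        product = match (product, c.distinct_count) {
            (Some(p), Some(ndv)) => {
                let null_adj = u64::from(c.has_nulls);
                // Saturating arithmetic avoids overflow on large NDV products.
                Some(p.saturating_mul(ndv.saturating_add(null_adj)))
            }
            _ => None,
        };
    }

    // Assumption: with a missing NDV, keep the old estimate of input_rows.
    let mut estimate = match product {
        Some(p) => input_rows.min(p.saturating_mul(grouping_sets)),
        None => input_rows,
    };

    // A Top K aggregation cannot emit more than K rows.
    if let Some(k) = top_k {
        estimate = estimate.min(k);
    }
    estimate
}
```

For example, with 1000 input rows, one non-nullable column with NDV 10, and one nullable column with NDV 5, the product is 10 * (5 + 1) = 60, so the sketch returns 60 instead of 1000; with a Top K limit of 10 it returns 10.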

Are these changes tested?

Yes, by existing and new tests that cover different scenarios and edge cases.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan label (Changes to the physical-plan crate) on Mar 13, 2026
@github-actions github-actions bot added the core label (Core DataFusion crate) on Mar 13, 2026
