feat: estimate cardinality for semi and anti-joins using distinct counts #20904
buraksenn wants to merge 5 commits into apache:main
Conversation
asolimando
left a comment
LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.
| /// Estimates the number of outer rows that have at least one matching |
The math looks sound to me, and coherent with that of #20846.
I was wondering whether you checked other notable systems with a CBO, such as Trino or Spark.
If so, consider adding a note; it will help reviewers trust the change as already battle-tested elsewhere.
This builds on the same assumption as the inner-join estimate in the same file (estimate_inner_join_cardinality). I saw a similar approach in Postgres: https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again; the epic mentions them, but I'm not sure about this specific estimate.
If you have any reservations, I can close this or try to be more conservative.
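For context, the Postgres-style estimate referenced above can be sketched as follows. This is an illustrative, self-contained sketch; the function and parameter names are made up for the example and are not the PR's actual code:

```rust
/// Illustrative sketch (not the PR's actual code) of the Postgres-style
/// semi-join selectivity, for the branch without most-common-value stats:
///     selectivity = min(1, inner_ndv / outer_ndv)
/// i.e. at most `inner_ndv` of the outer side's distinct keys can match.
fn semi_join_selectivity(outer_ndv: u64, inner_ndv: u64) -> f64 {
    if outer_ndv == 0 {
        // No usable distinct count on the outer side: bail out and
        // apply no reduction rather than guess.
        return 1.0;
    }
    (inner_ndv.min(outer_ndv) as f64) / (outer_ndv as f64)
}

fn main() {
    // 100 distinct outer keys, 25 distinct inner keys:
    // at most a quarter of the outer key values can find a match.
    assert_eq!(semi_join_selectivity(100, 25), 0.25);
    // More distinct keys on the inner side: selectivity caps at 1.0.
    assert_eq!(semi_join_selectivity(10, 40), 1.0);
    // Missing/zero outer NDV: no reduction applied.
    assert_eq!(semi_join_selectivity(0, 5), 1.0);
}
```

The estimated semi-join output is then the outer row count scaled by this factor.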
Yes, I think we can look into what Trino did; I believe they had something for this, but the Postgres approach makes sense.
No reservations, but I think it's valuable to point to the original implementation for multiple reasons: documentation, the ability to revisit the implementation if the original source improves, and recognition for the original author/project.
As I said, this looks reasonable and correct to me.
Thanks @asolimando, I understand and it totally makes sense. I could not find relevant estimation logic for this in Trino; maybe I've missed it. Given the Postgres approach and the reference in the comments, I think it makes sense to move forward with this.
| if let (Some(&o), Some(&i)) = (outer_ndv.get_value(), inner_ndv.get_value()) |
|     && o > 0 |
| { |
|     selectivity *= (o.min(i) as f64) / (o as f64); |
I took a look at the Postgres implementation (the branch without the "most common values", which we don't track), and I noticed a potential improvement over what we have now.
Postgres uses selec = min(1, inner_ndv/outer_ndv) * (1 - nullfrac) as its core formula (I have condensed the if/else into a single expression and renamed the variables ndv1 -> outer_ndv and ndv2 -> inner_ndv for clarity, but it's equivalent to the code).
We use selec = min(inner_ndv, outer_ndv) / outer_ndv (again renamed for readability), which is equivalent to Postgres' min(1, inner_ndv/outer_ndv).
However, we don't account for NULL values: we are missing the * (1 - nullfrac) part of their formula.
Postgres removes them, since NULL values can never match, and we could do the same when ColumnStatistics::null_count is available.
We could do it this way:
| selectivity *= (o.min(i) as f64) / (o as f64); |
| let null_frac = outer_stat.null_count |
|     .get_value() |
|     .map(|&nc| nc as f64 / outer_rows as f64) |
|     .unwrap_or(0.0); |
| selectivity *= (o.min(i) as f64) / (o as f64) * (1.0 - null_frac); |
WDYT?
If you accept the change, we would need a few additional test cases to exercise this part of the code. I can think of these cases:
- single column with nulls on outer side
- anti-join with nulls
- all outer rows are null
- multi-column with nulls on one column
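As a sanity check against cases like those above, here is a small numeric sketch of the null-adjusted formula. The names are made up for the example (not DataFusion's actual API), and the anti-join complement assumes NOT EXISTS semantics:

```rust
/// Illustrative sketch of the null-adjusted formula discussed above:
///     selectivity = min(1, inner_ndv / outer_ndv) * (1 - null_frac)
/// where null_frac is the fraction of NULLs in the outer join column.
fn selectivity(outer_ndv: u64, inner_ndv: u64, null_count: u64, outer_rows: u64) -> f64 {
    if outer_ndv == 0 {
        // No usable outer distinct count: bail out, no reduction.
        return 1.0;
    }
    let null_frac = if outer_rows > 0 {
        null_count as f64 / outer_rows as f64
    } else {
        0.0
    };
    let base = (inner_ndv.min(outer_ndv) as f64) / (outer_ndv as f64);
    base * (1.0 - null_frac)
}

fn main() {
    // Outer side: 100 rows, 40 NULLs in the join column, 10 distinct
    // non-null values; inner side: 5 distinct values.
    let semi = selectivity(10, 5, 40, 100);
    assert!((semi - 0.3).abs() < 1e-9); // 5/10 * (1 - 0.4) = 0.3

    // Anti-join (NOT EXISTS semantics) keeps the complement, so the
    // NULL outer rows come back: 1 - 0.3 = 0.7.
    assert!((1.0 - semi - 0.7).abs() < 1e-9);

    // All outer rows NULL: nothing can ever match in a semi-join.
    assert_eq!(selectivity(10, 5, 100, 100), 0.0);
}
```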
Thanks @asolimando, I've added your suggestion and will add the test cases now.
I've adjusted the comment and added the test cases.
Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>
asolimando
left a comment
LGTM, thanks for addressing all my comments fully!
Which issue does this PR close?
It does not close an issue, but it is part of #20766.
Rationale for this change
Details are in #20766, but the main idea is to use existing distinct-count information to optimize joins, similar to what Spark/Trino do.
What changes are included in this PR?
This PR extends cardinality estimation to semi/anti joins using distinct counts.
Are these changes tested?
I've added test cases, but I'm not sure whether I should have added benchmarks for this.
Are there any user-facing changes?
No