feat: estimate cardinality for semi and anti-joins using distinct counts #20904
buraksenn wants to merge 5 commits into apache:main
Conversation
asolimando
left a comment
LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.
| /// Estimates the number of outer rows that have at least one matching |
The math looks sound to me, and coherent with that of #20846.
I was wondering whether you checked other notable systems with a CBO, such as Trino or Spark.
If so, consider adding a note; it will help reviewers trust the change as already battle-tested elsewhere.
This builds on the same assumption as the inner-join estimate in the same file (estimate_inner_join_cardinality). I saw a similar approach in Postgres: https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again; the epic mentions them, but I'm not sure about this specific estimate.
If you have any reservations, I can close this or try to be more conservative.
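For context, the Postgres-style estimate referenced above can be sketched as follows. This is an illustrative, self-contained sketch; the function and parameter names are made up for the example and are not the PR's actual code:

```rust
/// Illustrative sketch (not the PR's actual code) of the Postgres-style
/// semi-join selectivity, for the branch without most-common-value stats:
///     selectivity = min(1, inner_ndv / outer_ndv)
/// i.e. at most `inner_ndv` of the outer side's distinct keys can match.
fn semi_join_selectivity(outer_ndv: u64, inner_ndv: u64) -> f64 {
    if outer_ndv == 0 {
        // No usable distinct count on the outer side: bail out and
        // apply no reduction rather than guess.
        return 1.0;
    }
    (inner_ndv.min(outer_ndv) as f64) / (outer_ndv as f64)
}

fn main() {
    // 100 distinct outer keys, 25 distinct inner keys:
    // at most a quarter of the outer key values can find a match.
    assert_eq!(semi_join_selectivity(100, 25), 0.25);
    // More distinct keys on the inner side: selectivity caps at 1.0.
    assert_eq!(semi_join_selectivity(10, 40), 1.0);
    // Missing/zero outer NDV: no reduction applied.
    assert_eq!(semi_join_selectivity(0, 5), 1.0);
}
```

The estimated semi-join output is then the outer row count scaled by this factor.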
Yes, I think we can look into what Trino did; I believe they had something for this, but the Postgres approach makes sense.
No reservations, but I think it's valuable to point to the original implementation for multiple reasons: documentation, the ability to revisit the implementation if the original source improves, and recognition for the original author/project.
As I said, this looks reasonable and correct to me.
Thanks @asolimando, I understand and it totally makes sense. I could not find relevant estimation logic for this in Trino; maybe I've missed it. Given the Postgres approach and the reference in the comments, I think it makes sense to move forward with this.
| if let (Some(&o), Some(&i)) = (outer_ndv.get_value(), inner_ndv.get_value()) |
|     && o > 0 |
| { |
|     selectivity *= (o.min(i) as f64) / (o as f64); |
I took a look at the Postgres implementation (the branch without the "most common values", which we don't track), and I noticed a potential improvement over what we have now.
Postgres uses selec = min(1, inner_ndv/outer_ndv) * (1 - nullfrac) as its core formula (I have condensed the if/else into a single expression and renamed the variables ndv1 -> outer_ndv and ndv2 -> inner_ndv for clarity, but it's equivalent to the code).
We use selec = min(inner_ndv, outer_ndv) / outer_ndv (again renamed for readability), which is equivalent to Postgres' min(1, inner_ndv/outer_ndv).
However, we don't account for NULL values: we are missing the * (1 - nullfrac) part of their formula.
Postgres removes them, since NULL values can never match, and we could do the same when ColumnStatistics::null_count is available.
We could do it this way:
| selectivity *= (o.min(i) as f64) / (o as f64); |
| let null_frac = outer_stat.null_count |
|     .get_value() |
|     .map(|&nc| nc as f64 / outer_rows as f64) |
|     .unwrap_or(0.0); |
| selectivity *= (o.min(i) as f64) / (o as f64) * (1.0 - null_frac); |
WDYT?
If you accept the change, we would need a few additional test cases to exercise this part of the code. I can think of these cases:
- single column with nulls on outer side
- anti-join with nulls
- all outer rows are null
- multi-column with nulls on one column
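As a sanity check against cases like those above, here is a small numeric sketch of the null-adjusted formula. The names are made up for the example (not DataFusion's actual API), and the anti-join complement assumes NOT EXISTS semantics:

```rust
/// Illustrative sketch of the null-adjusted formula discussed above:
///     selectivity = min(1, inner_ndv / outer_ndv) * (1 - null_frac)
/// where null_frac is the fraction of NULLs in the outer join column.
fn selectivity(outer_ndv: u64, inner_ndv: u64, null_count: u64, outer_rows: u64) -> f64 {
    if outer_ndv == 0 {
        // No usable outer distinct count: bail out, no reduction.
        return 1.0;
    }
    let null_frac = if outer_rows > 0 {
        null_count as f64 / outer_rows as f64
    } else {
        0.0
    };
    let base = (inner_ndv.min(outer_ndv) as f64) / (outer_ndv as f64);
    base * (1.0 - null_frac)
}

fn main() {
    // Outer side: 100 rows, 40 NULLs in the join column, 10 distinct
    // non-null values; inner side: 5 distinct values.
    let semi = selectivity(10, 5, 40, 100);
    assert!((semi - 0.3).abs() < 1e-9); // 5/10 * (1 - 0.4) = 0.3

    // Anti-join (NOT EXISTS semantics) keeps the complement, so the
    // NULL outer rows come back: 1 - 0.3 = 0.7.
    assert!((1.0 - semi - 0.7).abs() < 1e-9);

    // All outer rows NULL: nothing can ever match in a semi-join.
    assert_eq!(selectivity(10, 5, 100, 100), 0.0);
}
```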
Thanks @asolimando, I've added your suggestion and will add the test cases now.
I've adjusted the comment and added the test cases.
Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>
asolimando
left a comment
LGTM, thanks for addressing all my comments fully!
Which issue does this PR close?
It does not close an issue, but it is part of #20766.
Rationale for this change
Details are in #20766, but the main idea is to use existing distinct-count information to optimize joins, similar to what Spark/Trino do.
What changes are included in this PR?
This PR extends cardinality estimation to semi/anti joins using distinct counts.
Are these changes tested?
I've added test cases, but I'm not sure whether I should have added benchmarks for this.
Are there any user-facing changes?
No