-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Optimize the evaluation of date_part(<col>) == <constant> when pushed down #19733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 28 commits
Commits
Show all changes
34 commits
Select commit
Hold shift + click to select a range
d94889a
Add udf_preimage logic
sdf-jkl 4aa7f4e
Cargo fmt
sdf-jkl 2329c12
Fix err in rewrite_with_preimage
sdf-jkl 7ac8325
Rewrite the preimage_in_comparison
sdf-jkl 7a3e8b3
cargo fmt
sdf-jkl fbd5dcc
Fix ci
sdf-jkl d920735
Fix GtEq, Lt logic
sdf-jkl d3318ff
Add datepart preimage + tests
sdf-jkl 9fb245b
Fix asf header
sdf-jkl 5ffb704
Merge branch 'main' into smaller-preimage-pr-1
sdf-jkl 372f704
Merge branch 'main' into smaller-preimage-pr-2
sdf-jkl 2fdc14c
Merge branch 'main' of https://github.com/apache/datafusion into smal…
sdf-jkl c2b0cd3
Replace BinaryExpression with binary_expr() fn
sdf-jkl a0b6564
Add unit tests + add doc part about upper bound
sdf-jkl 0a24d60
Fix docs
sdf-jkl 86b7627
Add datepart preimage + tests
sdf-jkl 0158662
Fix asf header
sdf-jkl b491d4f
Merge branch 'smaller-preimage-pr-2' of https://github.com/sdf-jkl/da…
sdf-jkl 59235de
clippy
alamb 9ae434e
Merge remote-tracking branch 'apache/main' into smaller-preimage-pr-1
alamb 9f845e7
Make test field nullable
sdf-jkl 08ef1f1
Add datepart preimage + tests
sdf-jkl 57f6c4c
Fix asf header
sdf-jkl e4dc727
Merge branch 'smaller-preimage-pr-2' of https://github.com/sdf-jkl/da…
sdf-jkl f36257e
Merge branch 'main' of https://github.com/apache/datafusion into smal…
sdf-jkl 6bceb43
Fix date_part.rs
sdf-jkl 13f1164
Fix udf_preimage.slt
sdf-jkl d902f65
Small fix
sdf-jkl 6992c8f
fix typo
sdf-jkl b456a22
Add proper error handling
sdf-jkl 3669fb9
Add tz slt tests
sdf-jkl bb6625f
Add tz aware timestamp logic
sdf-jkl aeafe1a
Move tests to date_part.slt
sdf-jkl 62b0841
Use Arrow Type methods for date32/64
sdf-jkl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,6 +29,10 @@ use arrow::datatypes::TimeUnit::{Microsecond, Millisecond, Nanosecond, Second}; | |
| use arrow::datatypes::{ | ||
| DataType, Field, FieldRef, IntervalUnit as ArrowIntervalUnit, TimeUnit, | ||
| }; | ||
| use arrow::temporal_conversions::{ | ||
| MICROSECONDS_IN_DAY, MILLISECONDS_IN_DAY, NANOSECONDS_IN_DAY, SECONDS_IN_DAY, | ||
| }; | ||
| use chrono::{Datelike, NaiveDate}; | ||
| use datafusion_common::types::{NativeType, logical_date}; | ||
|
|
||
| use datafusion_common::{ | ||
|
|
@@ -44,9 +48,11 @@ use datafusion_common::{ | |
| types::logical_string, | ||
| utils::take_function_args, | ||
| }; | ||
| use datafusion_expr::preimage::PreimageResult; | ||
| use datafusion_expr::simplify::SimplifyContext; | ||
| use datafusion_expr::{ | ||
| ColumnarValue, Documentation, ReturnFieldArgs, ScalarUDFImpl, Signature, | ||
| TypeSignature, Volatility, | ||
| ColumnarValue, Documentation, Expr, ReturnFieldArgs, ScalarUDFImpl, Signature, | ||
| TypeSignature, Volatility, interval_arithmetic, | ||
| }; | ||
| use datafusion_expr_common::signature::{Coercion, TypeSignatureClass}; | ||
| use datafusion_macros::user_doc; | ||
|
|
@@ -237,6 +243,68 @@ impl ScalarUDFImpl for DatePartFunc { | |
| }) | ||
| } | ||
|
|
||
| // Only casting the year is supported since pruning other IntervalUnit is not possible | ||
| // date_part(col, YEAR) = 2024 => col >= '2024-01-01' and col < '2025-01-01' | ||
| // But for anything less than YEAR simplifying is not possible without specifying the bigger interval | ||
| // date_part(col, MONTH) = 1 => col = '2023-01-01' or col = '2024-01-01' or ... or col = '3000-01-01' | ||
| fn preimage( | ||
| &self, | ||
| args: &[Expr], | ||
| lit_expr: &Expr, | ||
| info: &SimplifyContext, | ||
| ) -> Result<PreimageResult> { | ||
| let [part, col_expr] = take_function_args(self.name(), args)?; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should be able to avoid writing all this code |
||
|
|
||
| // Get the interval unit from the part argument | ||
| let interval_unit = part | ||
| .as_literal() | ||
| .and_then(|sv| sv.try_as_str().flatten()) | ||
| .map(part_normalization) | ||
| .and_then(|s| IntervalUnit::from_str(s).ok()); | ||
|
|
||
| // only support extracting year | ||
| match interval_unit { | ||
| Some(IntervalUnit::Year) => (), | ||
| _ => return Ok(PreimageResult::None), | ||
| } | ||
|
|
||
| // Check if the argument is a literal (e.g. date_part(YEAR, col) = 2024) | ||
| let Some(argument_literal) = lit_expr.as_literal() else { | ||
| return Ok(PreimageResult::None); | ||
| }; | ||
|
|
||
| // Extract i32 year from Scalar value | ||
| let year = match argument_literal { | ||
| ScalarValue::Int32(Some(y)) => *y, | ||
| _ => return Ok(PreimageResult::None), | ||
| }; | ||
|
|
||
| // Can only extract year from Date32/64 and Timestamp column | ||
| let target_type = match info.get_data_type(col_expr)? { | ||
| Date32 | Date64 | Timestamp(_, _) => &info.get_data_type(col_expr)?, | ||
| _ => return Ok(PreimageResult::None), | ||
| }; | ||
|
|
||
| // Compute the Interval bounds | ||
| let start_time = | ||
| NaiveDate::from_ymd_opt(year, 1, 1).expect("Expect computed start time"); | ||
| let end_time = start_time | ||
| .with_year(year + 1) | ||
| .expect("Expect computed end time"); | ||
|
sdf-jkl marked this conversation as resolved.
Outdated
|
||
|
|
||
| // Convert to ScalarValues | ||
| let lower = date_to_scalar(start_time, target_type) | ||
| .expect("Expect preimage interval lower bound"); | ||
| let upper = date_to_scalar(end_time, target_type) | ||
| .expect("Expect preimage interval upper bound"); | ||
| let interval = Box::new(interval_arithmetic::Interval::try_new(lower, upper)?); | ||
|
|
||
| Ok(PreimageResult::Range { | ||
| expr: col_expr.clone(), | ||
| interval, | ||
| }) | ||
| } | ||
|
|
||
| fn aliases(&self) -> &[String] { | ||
| &self.aliases | ||
| } | ||
|
|
@@ -251,6 +319,35 @@ fn is_epoch(part: &str) -> bool { | |
| matches!(part.to_lowercase().as_str(), "epoch") | ||
| } | ||
|
|
||
| fn date_to_scalar(date: NaiveDate, target_type: &DataType) -> Option<ScalarValue> { | ||
| let days = date | ||
| .signed_duration_since(NaiveDate::from_epoch_days(0)?) | ||
| .num_days(); | ||
|
|
||
| Some(match target_type { | ||
| Date32 => ScalarValue::Date32(Some(days as i32)), | ||
| Date64 => ScalarValue::Date64(Some(days * MILLISECONDS_IN_DAY)), | ||
| Timestamp(unit, tz) => match unit { | ||
| Second => { | ||
| ScalarValue::TimestampSecond(Some(days * SECONDS_IN_DAY), tz.clone()) | ||
| } | ||
| Millisecond => ScalarValue::TimestampMillisecond( | ||
| Some(days * MILLISECONDS_IN_DAY), | ||
| tz.clone(), | ||
| ), | ||
| Microsecond => ScalarValue::TimestampMicrosecond( | ||
| Some(days * MICROSECONDS_IN_DAY), | ||
| tz.clone(), | ||
| ), | ||
| Nanosecond => ScalarValue::TimestampNanosecond( | ||
| Some(days * NANOSECONDS_IN_DAY), | ||
| tz.clone(), | ||
| ), | ||
| }, | ||
| _ => return None, | ||
| }) | ||
| } | ||
|
|
||
| // Try to remove quote if exist, if the quote is invalid, return original string and let the downstream function handle the error | ||
| fn part_normalization(part: &str) -> &str { | ||
| part.strip_prefix(|c| c == '\'' || c == '\"') | ||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a section to
docs/source/library-user-guide/functions/adding-udfs.mdexplaining:date_part)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a PR for
preimagedoc improvement here #20008.I am however, not sure that this doc needs to explain
preimage. I think the doc's goal is to be a very minimal guide on adding and registering a function. There is also no mention ofsimplifytoo.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the API docs is probably adequate. We could potentially add a note to
adding-udfs.mdthat says something generic like "The ScalarUDFImpl has additional methods that support specialized optimizations such aspreimage-- see the API documentation for additional details"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate PR should do.