-
Notifications
You must be signed in to change notification settings - Fork 2.1k
fix: avoid extraneous casts for equivalent nested types #20945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
b05d5e9
4af3b3e
9bec12c
dc8bfe6
1c5a339
538e515
8b2c60e
215a74e
efb9914
00d249e
e918fbe
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -130,7 +130,7 @@ pub fn fields_with_udf<F: UDFCoercionExt>( | |
| let valid_types = get_valid_types_with_udf(type_signature, ¤t_types, func)?; | ||
| if valid_types | ||
| .iter() | ||
| .any(|data_type| data_type == ¤t_types) | ||
| .any(|data_type| data_types_match(data_type, ¤t_types)) | ||
| { | ||
| return Ok(current_fields.to_vec()); | ||
| } | ||
|
|
@@ -236,7 +236,7 @@ pub fn data_types( | |
| get_valid_types(function_name.as_ref(), type_signature, current_types)?; | ||
| if valid_types | ||
| .iter() | ||
| .any(|data_type| data_type == current_types) | ||
| .any(|data_type| data_types_match(data_type, current_types)) | ||
| { | ||
| return Ok(current_types.to_vec()); | ||
| } | ||
|
|
@@ -307,6 +307,14 @@ fn try_coerce_types( | |
| ) | ||
| } | ||
|
|
||
| fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems like we aren't handling Map, Struct, or ListView -- is there a reason for that? In fact, the original bug report uses Map. I wonder if we can simplify this to use
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for mention. The original reproducer does go through a I did try a broader
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, got it. Just to help me understand, can you point at an SLT test (e.g., involving structs) that would regress if we used
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I locally verified that a broader In particular:
With the broader matching, both end up failing in array construction ( I agree it would be useful to make that boundary explicit, so I can also add a focused sanity-check regression test in this PR.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to It seems like |
||
| valid_types.len() == current_types.len() | ||
| && valid_types | ||
| .iter() | ||
| .zip(current_types) | ||
| .all(|(valid_type, current_type)| valid_type.equals_datatype(current_type)) | ||
| } | ||
|
|
||
| fn get_valid_types_with_udf<F: UDFCoercionExt>( | ||
| signature: &TypeSignature, | ||
| current_types: &[DataType], | ||
|
|
@@ -757,7 +765,7 @@ fn maybe_data_types( | |
| for (i, valid_type) in valid_types.iter().enumerate() { | ||
| let current_type = ¤t_types[i]; | ||
|
|
||
| if current_type == valid_type { | ||
| if current_type.equals_datatype(valid_type) { | ||
| new_type.push(current_type.clone()) | ||
| } else { | ||
| // attempt to coerce. | ||
|
|
@@ -789,7 +797,7 @@ fn maybe_data_types_without_coercion( | |
| for (i, valid_type) in valid_types.iter().enumerate() { | ||
| let current_type = ¤t_types[i]; | ||
|
|
||
| if current_type == valid_type { | ||
| if current_type.equals_datatype(valid_type) { | ||
| new_type.push(current_type.clone()) | ||
| } else if can_cast_types(current_type, valid_type) { | ||
| // validate the valid type is castable from the current type | ||
|
|
@@ -1223,6 +1231,36 @@ mod tests { | |
| Ok(()) | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_fields_with_udf_preserves_equivalent_nested_types() -> Result<()> { | ||
| let struct_fields = vec![ | ||
| Field::new("id", DataType::Utf8, true), | ||
| Field::new("prim", DataType::Boolean, true), | ||
| ]; | ||
| let current_type = DataType::List(Arc::new(Field::new( | ||
| "item", | ||
| DataType::Struct(struct_fields.clone().into()), | ||
| true, | ||
| ))); | ||
| let signature_type = DataType::List(Arc::new(Field::new( | ||
| "element", | ||
| DataType::Struct(struct_fields.into()), | ||
| true, | ||
| ))); | ||
|
|
||
| assert!(current_type.equals_datatype(&signature_type)); | ||
|
|
||
| let current_fields = vec![Arc::new(Field::new("field", current_type, true))]; | ||
| let coerced_fields = fields_with_udf( | ||
| ¤t_fields, | ||
| &MockUdf(Signature::exact(vec![signature_type], Volatility::Stable)), | ||
| )?; | ||
|
|
||
| assert_eq!(coerced_fields, current_fields); | ||
|
|
||
| Ok(()) | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_nested_wildcard_fixed_size_lists() -> Result<()> { | ||
| let type_into = DataType::FixedSizeList( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to change this function, it's deprecated and not used anywhere