fix: avoid extraneous casts for equivalent nested types#20945
fix: avoid extraneous casts for equivalent nested types#20945feichai0017 wants to merge 5 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates DataFusion’s function argument coercion to treat structurally equivalent nested DataTypes as matching (ignoring nested field names / metadata), preventing unnecessary CASTs from being inserted during analysis for UDF calls.
Changes:
- Use
DataType::equals_datatype(via a helper) when checking whether current argument types already satisfy a function/UDF signature. - Update coercion paths to avoid rewriting arguments when nested types are equivalent.
- Add regression tests in both
datafusion-expranddatafusion-optimizerto ensure equivalent nested types don’t trigger casts.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
datafusion/expr/src/type_coercion/functions.rs |
Switches type matching from strict equality to equals_datatype for function/UDF argument matching and related coercion logic; adds a regression test. |
datafusion/optimizer/src/analyzer/type_coercion.rs |
Adds an analyzer regression test ensuring scalar UDF argument coercion doesn’t insert casts for equivalent nested list/struct types. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| ) | ||
| } | ||
|
|
||
| fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool { |
There was a problem hiding this comment.
Seems like we aren't handling Map, Struct, or ListView -- is there a reason for that? In fact, the original bug report uses Map.
I wonder if we can simplify this to use equals_datatype from Arrow, as suggested by the bug reporter?
There was a problem hiding this comment.
Thanks for mention.
The original reproducer does go through a Map, but the actual mismatch at the coercion point is on the extracted map value (List<Struct<...>>), not on the Map type itself. That’s why I kept the fast-path relaxation narrow.
I did try a broader equals_datatype approach first, but it was too permissive in this path and regressed existing cases where runtime kernels still require exact type identity, especially around Struct. I agree ListView / LargeListView should be handled consistently with List / LargeList, and I’ve updated the matcher for that.
There was a problem hiding this comment.
Ah, got it.
Just to help me understand, can you point at an SLT test (e.g., involving structs) that would regress if we used equals_datatype? Or if such an SLT test doesn't already exist, it would probably be a good idea to add one as a sanity check.
There was a problem hiding this comment.
Yes, I locally verified that a broader equals_datatype-style matcher regresses existing SLTs.
In particular:
/datafusion/datafusion/sqllogictest/test_files/struct.slt
select [{a: 1, b: 2}, {b: 3, a: 4}];/datafusion/datafusion/sqllogictest/test_files/spark/array/array.slt
SELECT array(arrow_cast(array(1,2), 'LargeList(Int64)'), array(3));
With the broader matching, both end up failing in array construction (MutableArrayData) because those paths still require exact runtime type identity. That was the main reason I kept this matcher narrower than equals_datatype, especially around Struct.
I agree it would be useful to make that boundary explicit, so I can also add a focused sanity-check regression test in this PR.
There was a problem hiding this comment.
Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to data_type_matches to explain the rationale for the kinda-structural-equality we are implementing?
It seems like Map has the same behavior as the List variants: the "field name" does not impact the representation of the data. Should we handle that as well, for completeness?
| ) | ||
| } | ||
|
|
||
| fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool { |
There was a problem hiding this comment.
Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to data_type_matches to explain the rationale for the kinda-structural-equality we are implementing?
It seems like Map has the same behavior as the List variants: the "field name" does not impact the representation of the data. Should we handle that as well, for completeness?
Summary
This PR avoids inserting extraneous casts during function argument coercion when two nested types are structurally equivalent but differ only in nested field names or metadata.
Specifically, it:
DataTypes as matching during UDF argument coercionCASTsdatafusion-expranddatafusion-optimizerCloses #19943.
Tests
cargo test -p datafusion-optimizer./dev/rust_lint.shcargo test -p datafusion-exprcurrently fails on an existing snapshot mismatch inlogical_plan::plan::tests::test_display_pg_jsonon the currentmainbaseline, unrelated to this change