fix: avoid extraneous casts for equivalent nested types #20945

Open
feichai0017 wants to merge 5 commits into apache:main from feichai0017:fix/extraneous-casts-nested-types

@feichai0017

Summary

This PR avoids inserting extraneous casts during function argument coercion when two nested types are structurally equivalent but differ only in nested field names or metadata.

Specifically, it:

  • treats equivalent nested DataTypes as matching during UDF argument coercion
  • avoids rewriting such arguments with unnecessary CASTs
  • adds regression coverage in both datafusion-expr and datafusion-optimizer

Closes #19943.

Tests

  • cargo test -p datafusion-optimizer
  • ./dev/rust_lint.sh
  • cargo test -p datafusion-expr: currently fails on a pre-existing snapshot mismatch in logical_plan::plan::tests::test_display_pg_json, which also fails on the current main baseline and is unrelated to this change

Copilot AI review requested due to automatic review settings March 14, 2026 14:38
@github-actions bot added the logical-expr (Logical plan and expressions) and optimizer (Optimizer rules) labels on Mar 14, 2026
Copilot AI left a comment

Pull request overview

This PR updates DataFusion’s function argument coercion to treat structurally equivalent nested DataTypes as matching (ignoring nested field names / metadata), preventing unnecessary CASTs from being inserted during analysis for UDF calls.

Changes:

  • Use DataType::equals_datatype (via a helper) when checking whether current argument types already satisfy a function/UDF signature.
  • Update coercion paths to avoid rewriting arguments when nested types are equivalent.
  • Add regression tests in both datafusion-expr and datafusion-optimizer to ensure equivalent nested types don’t trigger casts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| datafusion/expr/src/type_coercion/functions.rs | Switches type matching from strict equality to equals_datatype for function/UDF argument matching and related coercion logic; adds a regression test. |
| datafusion/optimizer/src/analyzer/type_coercion.rs | Adds an analyzer regression test ensuring scalar UDF argument coercion doesn’t insert casts for equivalent nested list/struct types. |

fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool {
Contributor

Seems like we aren't handling Map, Struct, or ListView -- is there a reason for that? In fact, the original bug report uses Map.

I wonder if we can simplify this to use equals_datatype from Arrow, as suggested by the bug reporter?

Author

Thanks for pointing that out.

The original reproducer does go through a Map, but the actual mismatch at the coercion point is on the extracted map value (List<Struct<...>>), not on the Map type itself. That’s why I kept the fast-path relaxation narrow.

I did try a broader equals_datatype approach first, but it was too permissive in this path and regressed existing cases where runtime kernels still require exact type identity, especially around Struct. I agree ListView / LargeListView should be handled consistently with List / LargeList, and I’ve updated the matcher for that.
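
To make the "narrow fast path" idea concrete, here is a minimal sketch on a toy type model. Everything in it is an assumption for illustration: `ToyType`, `toy_type_matches`, `toy_types_match`, and `list` are hypothetical names, not Arrow's `DataType` and not the matcher in this PR. It only shows the shape of the rule described above: list variants ignore the element field name and recurse into the element type, while everything else, including Struct, still requires exact equality.

```rust
// Toy stand-in for a nested type system; NOT arrow_schema::DataType.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Utf8,
    // A list carries a name for its element field plus the element type.
    List { item_name: String, item: Box<ToyType> },
    LargeList { item_name: String, item: Box<ToyType> },
    // Struct fields are (name, type) pairs in declaration order.
    Struct(Vec<(String, ToyType)>),
}

// Hypothetical convenience constructor for a List with a named element field.
fn list(item_name: &str, item: ToyType) -> ToyType {
    ToyType::List { item_name: item_name.to_string(), item: Box::new(item) }
}

// Hypothetical narrow matcher: the element field name of a list variant is
// ignored, and the element types are compared recursively. Any other pair,
// including two Structs, must be exactly equal.
fn toy_type_matches(expected: &ToyType, actual: &ToyType) -> bool {
    use ToyType::*;
    match (expected, actual) {
        (List { item: e, .. }, List { item: a, .. })
        | (LargeList { item: e, .. }, LargeList { item: a, .. }) => toy_type_matches(e, a),
        _ => expected == actual,
    }
}

// Slice-level check: same arity, and every position matches under the rule above.
fn toy_types_match(expected: &[ToyType], actual: &[ToyType]) -> bool {
    expected.len() == actual.len()
        && expected.iter().zip(actual).all(|(e, a)| toy_type_matches(e, a))
}
```

Under this sketch, two lists of the same struct type that differ only in the element field name match, whereas Struct(a, b) and Struct(b, a) do not, and List never matches LargeList.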

Contributor

Ah, got it.

Just to help me understand, can you point at an SLT test (e.g., involving structs) that would regress if we used equals_datatype? Or if such an SLT test doesn't already exist, it would probably be a good idea to add one as a sanity check.

Author

Yes, I locally verified that a broader equals_datatype-style matcher regresses existing SLTs.

In particular:

  • /datafusion/datafusion/sqllogictest/test_files/struct.slt
    select [{a: 1, b: 2}, {b: 3, a: 4}];
  • /datafusion/datafusion/sqllogictest/test_files/spark/array/array.slt
    SELECT array(arrow_cast(array(1,2), 'LargeList(Int64)'), array(3));

With the broader matching, both end up failing in array construction (MutableArrayData) because those paths still require exact runtime type identity. That was the main reason I kept this matcher narrower than equals_datatype, especially around Struct.

I agree it would be useful to make that boundary explicit, so I can also add a focused sanity-check regression test in this PR.

Contributor

Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to data_type_matches to explain the rationale for the kinda-structural-equality we are implementing?

It seems like Map has the same behavior as the List variants: the "field name" does not impact the representation of the data. Should we handle that as well, for completeness?
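
A small toy model may help illustrate the ordering point. It is only a sketch under stated assumptions: `ToyStructColumn`, `append_positional`, and `field_values` are hypothetical and do not correspond to any Arrow API. The model stores struct children by position, with names kept only in the schema, so a positional append from a column declared as (b, a) into one declared as (a, b) puts values under the wrong field. In the cases discussed above the real failure surfaces in array construction; the toy only shows why a positional layout cannot treat the two orders as interchangeable.

```rust
// Toy columnar struct: child columns are stored by position, and the field
// names live only in the schema. Hypothetical model, not an Arrow type.
#[derive(Debug, Clone, PartialEq)]
struct ToyStructColumn {
    field_names: Vec<String>,
    // children[i] holds the values of field_names[i].
    children: Vec<Vec<i64>>,
}

// Appends `src` onto `dst` child by child, purely by position, without
// consulting the field names (hypothetical helper).
fn append_positional(dst: &mut ToyStructColumn, src: &ToyStructColumn) {
    for (d, s) in dst.children.iter_mut().zip(&src.children) {
        d.extend_from_slice(s);
    }
}

// Looks up the values stored under a field name, if that field exists.
fn field_values(col: &ToyStructColumn, name: &str) -> Option<Vec<i64>> {
    let idx = col.field_names.iter().position(|n| n == name)?;
    Some(col.children[idx].clone())
}
```

For example, appending a column holding the row {b: 3, a: 4} (declared as b, a) onto one holding {a: 1, b: 2} (declared as a, b) leaves 3 stored under a and 4 under b, which is why the two struct types cannot be treated as equivalent here. A list, by contrast, has a single child, so renaming its element field changes nothing about where values are stored.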

@neilconway left a comment

What about Struct(a: List(x: Int64)) vs Struct(a: List(y: Int64)) -- should we try to elide the cast in this case as well?

It might be helpful to add an SLT test to verify the end-user behavior that the extraneous casts are indeed omitted.
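
One possible shape for such a rule, sketched on a toy model: keep struct field names and order strict, but compare the child types with the relaxed rule, and treat the Map entries field name like a list element name. All names here (`ToyType`, `relaxed_match`) are hypothetical, the toy Map is much simpler than Arrow's, and whether this relaxation is safe for the runtime kernels is exactly the open question in this thread.

```rust
// Toy stand-in for a nested type system; NOT arrow_schema::DataType.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Utf8,
    List { item_name: String, item: Box<ToyType> },
    // `entries_name` models the name of the map's entries field.
    Map { entries_name: String, key: Box<ToyType>, value: Box<ToyType> },
    // Struct fields are (name, type) pairs in declaration order.
    Struct(Vec<(String, ToyType)>),
}

// Hypothetical relaxed matcher:
//   - List: ignore the element field name, recurse into the element type.
//   - Map: ignore the entries field name, recurse into key and value types.
//   - Struct: names and order must be identical; child types are relaxed.
//   - Anything else: exact equality.
fn relaxed_match(expected: &ToyType, actual: &ToyType) -> bool {
    use ToyType::*;
    match (expected, actual) {
        (List { item: e, .. }, List { item: a, .. }) => relaxed_match(e, a),
        (Map { key: ek, value: ev, .. }, Map { key: ak, value: av, .. }) => {
            relaxed_match(ek, ak) && relaxed_match(ev, av)
        }
        (Struct(exp_fields), Struct(act_fields)) => {
            exp_fields.len() == act_fields.len()
                && exp_fields
                    .iter()
                    .zip(act_fields)
                    .all(|((en, et), (an, at))| en == an && relaxed_match(et, at))
        }
        _ => expected == actual,
    }
}
```

With this rule, Struct(a: List(x: Int64)) and Struct(a: List(y: Int64)) match, while Struct(a, b) and Struct(b, a) still do not.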

Development

Successfully merging this pull request may close these issues.

Extraneous casts added due to overly strict type comparison

3 participants