Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##               main    #3052      +/-   ##
=============================================
- Coverage     56.12%   20.79%   -35.33%
+ Complexity      976      493      -483
=============================================
  Files           119      175       +56
  Lines         11743    16165     +4422
  Branches       2251     2681      +430
=============================================
- Hits           6591     3362     -3229
- Misses         4012    12273     +8261
+ Partials       1140      530      -610
```

☔ View full report in Codecov by Sentry.
```rust
        file_source,
    )
    .with_projection_indices(Some(projection_vector))
    .with_table_partition_cols(partition_fields)
```
`with_table_partition_cols` is now derived automatically; see the upgrade guide:
https://datafusion.apache.org/library-user-guide/upgrading.html#refactoring-of-filesource-constructors-and-filescanconfigbuilder-to-accept-schemas-upfront
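For reference, a minimal sketch of the DF52 shape the guide describes (the import path and the two-argument `FileScanConfigBuilder::new` follow the upgrade guide and should be treated as assumptions; `object_store_url`, `file_source`, and `projection_vector` come from the surrounding code):

```rust
// Sketch only: partition columns now travel with the schema handed to the
// file source, so the builder no longer needs with_table_partition_cols.
use datafusion::datasource::physical_plan::FileScanConfigBuilder;

let config = FileScanConfigBuilder::new(object_store_url, file_source)
    .with_projection_indices(Some(projection_vector))
    .build();
```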
```rust
        make_decimal_scalar(a, precision, scale, &f)
    }
    ScalarValue::Float32(_) | ScalarValue::Float64(_) => Ok(ColumnarValue::Scalar(
        ScalarValue::try_from_array(&round(&[a.to_array()?, args[1].to_array(1)?])?, 0)?,
```
The direct `round` function is private now, so this creates a UDF instead.
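A hedged sketch of that workaround: `datafusion::functions::math::round()` returns the `round` `ScalarUDF`, which can stand in for the now-private kernel (shown at the expression level here; at execution time the UDF would instead be invoked through `ScalarUDF::invoke_with_args`):

```rust
use datafusion::functions::math::round;
use datafusion::prelude::{col, lit};

// Build round(a, 2) through the public UDF instead of the private kernel.
let expr = round().call(vec![col("a"), lit(2_i64)]);
```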
```rust
    // Determine the schema to use for ParquetSource
    let table_schema = if let Some(ref data_schema) = data_schema {
```
There is also `TableSchema::with_table_partition_cols`, which might make this easier.
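Roughly what that suggestion could look like; `TableSchema::from_file_schema` follows the upgrade guide's naming, and `fallback_schema` is a hypothetical stand-in for whatever the else branch currently selects:

```rust
// Sketch only: carry the partition columns on the TableSchema itself
// instead of branching when picking the schema for ParquetSource.
let table_schema = TableSchema::from_file_schema(
    data_schema.clone().unwrap_or(fallback_schema),
)
.with_table_partition_cols(partition_fields.clone());
```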
```rust
    // Check for CastColumnExpr and replace with spark_expr::Cast
    // CastColumnExpr is in datafusion_physical_expr::expressions
    if let Some(cast) = expr
        .as_any()
        .downcast_ref::<datafusion::physical_expr::expressions::CastColumnExpr>()
    {
```
Note from call: we are trading a CastColumnExpr for a Cast. The latter doesn't have the ability to handle struct casts, so we should check whether we are casting from struct to struct and, if so, not replace it with the Spark-compatible cast.

I think this is another example of why we need to unify CastColumnExpr and Cast: ideally you'd be able to cast from struct<c1: int, c2: int>[] -> struct<c1: text>[] while applying your Spark casting rules to the c1: int -> c1: text cast, but having DataFusion handle the list -> struct -> literal part.
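A small sketch of the proposed guard (an illustration, not the final patch):

```rust
use arrow::datatypes::DataType;

/// True when both sides of the cast are structs; in that case keep the
/// CastColumnExpr, since the Spark-compatible Cast cannot handle
/// struct-to-struct casts.
fn is_struct_to_struct(from: &DataType, to: &DataType) -> bool {
    matches!(from, DataType::Struct(_)) && matches!(to, DataType::Struct(_))
}
```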
Force-pushed from 5dabf9f to e639805.
I'm seeing issues like this. Checking if this can be addressed in Comet or in DataFusion. Created a native test to reproduce.
hm, somehow with DF52 the data array comes to [...]

UPD: [...]
This builds on the DF52 migration work in PR apache#3052 with fixes for failing Rust tests. Pushing as a draft PR for CI validation.

Fixes:
- Date32 +/- Int8/Int16/Int32 arithmetic: use SparkDateAdd/SparkDateSub UDFs, since DF52's arrow-arith only supports Date32 +/- Interval types
- Schema adapter nested types: replace equals_datatype with PartialEq (==) so struct field name differences are detected and spark_parquet_convert is invoked for field-name-based selection
- Schema adapter complex nested casts: add a fallback path (wrap_all_type_mismatches) when the default adapter fails for complex nested type casts (List<Struct>, Map)
- Schema adapter CastColumnExpr replacement: route Struct/List/Map casts through CometCastColumnExpr with spark_parquet_convert, and simple scalar casts through the Spark Cast
- Dictionary unpack tests: restructure polling to handle DF52's FilterExec batch coalescer, which accumulates rows before returning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
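To illustrate the Date32 item above: Date32 stores days since the Unix epoch as an i32, so date-plus-integer can be computed on the raw day counts. A minimal sketch, not Comet's actual SparkDateAdd (which also follows Spark's null and overflow semantics):

```rust
use arrow::array::{Date32Array, Int32Array};

// Add a per-row day count to Date32 values via the underlying i32 days.
fn date32_add_days(dates: &Date32Array, days: &Int32Array) -> Date32Array {
    dates
        .iter()
        .zip(days.iter())
        .map(|(d, n)| match (d, n) {
            (Some(d), Some(n)) => Some(d + n),
            _ => None, // propagate nulls
        })
        .collect()
}
```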
Replaced by #3470
Which issue does this PR close?
Closes #3046.
Rationale for this change
What changes are included in this PR?
How are these changes tested?