From 8f3d2dea161cc7e104dacaa1d4009a8eed15cdc9 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 6 Feb 2026 07:08:41 -0700 Subject: [PATCH 01/15] docs: remove all mentions of native_comet scan --- docs/source/contributor-guide/ffi.md | 6 +-- .../source/contributor-guide/parquet_scans.md | 47 ++++--------------- docs/source/contributor-guide/roadmap.md | 17 +------ 3 files changed, 14 insertions(+), 56 deletions(-) diff --git a/docs/source/contributor-guide/ffi.md b/docs/source/contributor-guide/ffi.md index b1a51ecb2a..30f441c082 100644 --- a/docs/source/contributor-guide/ffi.md +++ b/docs/source/contributor-guide/ffi.md @@ -177,9 +177,9 @@ message Scan { #### When ownership is NOT transferred to native: -If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from -`native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the -underlying buffers in the future. +If the data originates from a scan that uses mutable buffers (such as `native_iceberg_compat` when reading partition +columns or adding missing columns) in some cases, then ownership is not transferred to native and the JVM may +re-use the underlying buffers in the future. It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by operators such as `SortExec` or `ShuffleWriterExec`, otherwise data corruption is likely to occur. diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index bbacff4d93..71a6e36862 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -19,30 +19,17 @@ under the License. # Comet Parquet Scan Implementations -Comet currently has three distinct implementations of the Parquet scan operator. The configuration property +Comet currently has two distinct implementations of the Parquet scan operator. 
The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting this configuration property to one of the following implementations. -| Implementation | Description | -| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `native_comet` | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. | -| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. | -| `native_datafusion` | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan. | - -The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet` -implementation: - -- Leverages the DataFusion community's ongoing improvements to `DataSourceExec` -- Provides support for reading complex types (structs, arrays, and maps) -- Delegates Parquet decoding to native Rust code rather than JVM-side decoding -- Improves performance - -> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers -> when transferring data from JVM to native code via Arrow FFI. 
The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption. -> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics. +The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's +`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and +`native_iceberg_compat` is a hybrid JVM/Rust implementation that can provide support some Spark features that +`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary. The `native_datafusion` and `native_iceberg_compat` scans share the following limitations: @@ -56,35 +43,21 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li - No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. - No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), - the `native_comet` implementation can detect these legacy values and either throw an exception or read them without - rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps - as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before - October 15, 1582. 
+ dates/timestamps will be read as if they were written using the Proleptic Gregorian calendar. This may produce + incorrect results for dates before October 15, 1582. - No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet - will fall back to `native_comet` when V2 is enabled. + will fall back to Spark when V2 is enabled. The `native_datafusion` scan has some additional limitations: - No support for row indexes -- `PARQUET_FIELD_ID_READ_ENABLED` is not respected [#1758] -- There are failures in the Spark SQL test suite [#1545] +- No support for reading Parquet field IDs - Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark +- There are failures in the Spark SQL test suite [#1545] ## S3 Support -There are some differences in S3 support between the scan implementations. - -### `native_comet` (Deprecated) - -> **Note:** The `native_comet` scan implementation is deprecated and will be removed in a future release. - -The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which -is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A -configurations works the same way as in vanilla Spark. - -### `native_datafusion` and `native_iceberg_compat` - The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading to native code. 
They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and support configuring S3 access using standard [Hadoop S3A configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration) by translating them to diff --git a/docs/source/contributor-guide/roadmap.md b/docs/source/contributor-guide/roadmap.md index 3abe6f3f13..0e1a07beed 100644 --- a/docs/source/contributor-guide/roadmap.md +++ b/docs/source/contributor-guide/roadmap.md @@ -27,8 +27,7 @@ helpful to have a roadmap for some of the major items that require coordination ### Iceberg Integration Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few -releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling -support for complex types. +releases. [#2060]: https://github.com/apache/datafusion-comet/issues/2060 @@ -40,20 +39,6 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor [#313]: https://github.com/apache/datafusion-comet/issues/313 [#1637]: https://github.com/apache/datafusion-comet/issues/1637 -### Removing the native_comet scan implementation - -The `native_comet` scan implementation is now deprecated and will be removed in a future release ([#2186], [#2177]). -This is the original scan implementation that uses mutable buffers (which is incompatible with best practices around -Arrow FFI) and does not support complex types. - -Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`), -we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of -Arrow FFI ([#2171]). 
- -[#2186]: https://github.com/apache/datafusion-comet/issues/2186 -[#2171]: https://github.com/apache/datafusion-comet/issues/2171 -[#2177]: https://github.com/apache/datafusion-comet/issues/2177 - ## Ongoing Improvements In addition to the major initiatives above, we have the following ongoing areas of work: From 87cd7944c4314c5278771f6a13412d56daee3e5a Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 6 Feb 2026 07:14:19 -0700 Subject: [PATCH 02/15] update --- docs/source/contributor-guide/parquet_scans.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 71a6e36862..4e60623b90 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -54,7 +54,6 @@ The `native_datafusion` scan has some additional limitations: - No support for row indexes - No support for reading Parquet field IDs - Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark -- There are failures in the Spark SQL test suite [#1545] ## S3 Support From e394f2a1b64f01f53722e204b8824f3022531d57 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 6 Feb 2026 07:19:51 -0700 Subject: [PATCH 03/15] prettier --- docs/source/contributor-guide/ffi.md | 4 ++-- docs/source/contributor-guide/parquet_scans.md | 10 +++++----- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/contributor-guide/ffi.md b/docs/source/contributor-guide/ffi.md index 30f441c082..e2ffad386b 100644 --- a/docs/source/contributor-guide/ffi.md +++ b/docs/source/contributor-guide/ffi.md @@ -177,8 +177,8 @@ message Scan { #### When ownership is NOT transferred to native: -If the data originates from a scan that uses mutable buffers (such as `native_iceberg_compat` when reading partition -columns or adding missing columns) in some cases, then ownership is not transferred to native and the JVM may +If the data 
originates from a scan that uses mutable buffers (such as `native_iceberg_compat` when reading partition
+columns or adding missing columns), then ownership is not transferred to native and the JVM may
 re-use the underlying buffers in the future.
 
 It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by
diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index 4e60623b90..acf2028c0e 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -26,10 +26,10 @@ settings. Most users should not need to change this setting. However, it is poss
 a particular implementation for all scan operations by setting this configuration property to one of the following
 implementations.
 
-The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
-`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
-`native_iceberg_compat` is a hybrid JVM/Rust implementation that can provide support some Spark features that
-`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary.
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can provide support some Spark features that
+`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary.
 
The `native_datafusion` and `native_iceberg_compat` scans share the following limitations: @@ -43,7 +43,7 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li - No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. - No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), - dates/timestamps will be read as if they were written using the Proleptic Gregorian calendar. This may produce + dates/timestamps will be read as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before October 15, 1582. - No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet From a3014ce4709826cc1f2644d402d1bccb7b9fb033 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Mon, 9 Feb 2026 08:04:32 -0700 Subject: [PATCH 04/15] docs: improve parquet_scans.md accuracy and completeness Fix grammar, add encryption fallback and native_iceberg_compat hard-coded config limitations, clarify S3 section applies to both scan implementations, and remove orphaned link references. Co-Authored-By: Claude Opus 4.6 --- .../source/contributor-guide/parquet_scans.md | 22 ++++++++++++++----- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index acf2028c0e..63401a2fe9 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -28,7 +28,7 @@ implementations. The two implementations are `native_datafusion` and `native_iceberg_compat`. 
They both delegate to DataFusion's `DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and -`native_iceberg_compat` is a hybrid JVM/Rust implementation that can provide support some Spark features that +`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that `native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary. The `native_datafusion` and `native_iceberg_compat` scans share the following limitations: @@ -48,6 +48,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li - No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet will fall back to Spark when V2 is enabled. +- No support for Parquet encryption. When Parquet encryption is enabled (i.e., `parquet.crypto.factory.class` is + set), Comet will fall back to Spark. The `native_datafusion` scan has some additional limitations: @@ -55,6 +57,14 @@ The `native_datafusion` scan has some additional limitations: - No support for reading Parquet field IDs - Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark +The `native_iceberg_compat` scan has some additional limitations: + +- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values. + The affected configurations are `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, + `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, and + `spark.sql.legacy.parquet.nanosAsLong`. See [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) + for more details. 
+ ## S3 Support The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading @@ -67,7 +77,8 @@ continue to work as long as the configurations are supported and can be translat #### Additional S3 Configuration Options -Beyond credential providers, the `native_datafusion` implementation supports additional S3 configuration options: +Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional +S3 configuration options: | Option | Description | | ------------------------------- | -------------------------------------------------------------------------------------------------- | @@ -80,7 +91,8 @@ All configuration options support bucket-specific overrides using the pattern `f #### Examples -The following examples demonstrate how to configure S3 access with the `native_datafusion` Parquet scan implementation using different authentication methods. +The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat` +Parquet scan implementations using different authentication methods. **Example 1: Simple Credentials** @@ -112,11 +124,9 @@ $SPARK_HOME/bin/spark-shell \ #### Limitations -The S3 support of `native_datafusion` has the following limitations: +The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations: 1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate. 2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. 
We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details. -[#1545]: https://github.com/apache/datafusion-comet/issues/1545 -[#1758]: https://github.com/apache/datafusion-comet/issues/1758 From fbe2f3392eaef9c399ed4560269109bd290ae4b1 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Mon, 9 Feb 2026 08:04:58 -0700 Subject: [PATCH 05/15] update config docs --- .../main/scala/org/apache/comet/CometConf.scala | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/common/src/main/scala/org/apache/comet/CometConf.scala b/common/src/main/scala/org/apache/comet/CometConf.scala index 522ccbc94c..e837263844 100644 --- a/common/src/main/scala/org/apache/comet/CometConf.scala +++ b/common/src/main/scala/org/apache/comet/CometConf.scala @@ -122,16 +122,14 @@ object CometConf extends ShimCometConf { val SCAN_AUTO = "auto" val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl") - .category(CATEGORY_SCAN) + .category(CATEGORY_PARQUET) .doc( - "The implementation of Comet Native Scan to use. Available modes are " + + "The implementation of Comet's Parquet scan to use. Available scans are " + s"`$SCAN_NATIVE_DATAFUSION`, and `$SCAN_NATIVE_ICEBERG_COMPAT`. " + - s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation of scan based on " + - "DataFusion. " + - s"`$SCAN_NATIVE_ICEBERG_COMPAT` is the recommended native implementation that " + - "exposes apis to read parquet columns natively and supports complex types. " + - s"`$SCAN_AUTO` (default) chooses the best scan.") - .internal() + s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation, and " + + s"`$SCAN_NATIVE_ICEBERG_COMPAT` is a hybrid implementation that supports some " + + "additional features, such as row indexes and field ids. 
" + + s"`$SCAN_AUTO` (default) chooses the best available scan based on the scan schema.") .stringConf .transform(_.toLowerCase(Locale.ROOT)) .checkValues(Set(SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT, SCAN_AUTO)) From c25a7cd421d8de4c4306ee2edbfdc7b0312cd7f7 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Mon, 9 Feb 2026 08:07:59 -0700 Subject: [PATCH 06/15] prettier --- docs/source/contributor-guide/parquet_scans.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 63401a2fe9..583786ee6a 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -129,4 +129,3 @@ The S3 support of `native_datafusion` and `native_iceberg_compat` has the follow 1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate. 2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details. - From 1baabfeb7e8af7079796a05896533341eaaa8551 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 13 Feb 2026 13:02:09 -0700 Subject: [PATCH 07/15] docs: clarify parquet scan limitations and fallback behavior Clarify which limitations fall back to Spark vs which may produce incorrect results. Add missing documented limitations for native_datafusion (DPP, input_file_name, metadata columns). Fix misleading wording for ignoreCorruptFiles/ignoreMissingFiles. 
Note that auto mode currently always selects native_iceberg_compat.

Co-Authored-By: Claude Opus 4.6
---
 .../source/contributor-guide/parquet_scans.md | 46 +++++++++++--------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index 583786ee6a..ab4b1b9b23 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -20,18 +20,18 @@ under the License.
 # Comet Parquet Scan Implementations
 
 Comet currently has two distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
+`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
+However, it is possible to force Comet to try to use a particular implementation for all scan operations by setting
+this configuration property to one of the following implementations.
 
 The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
 `DataSourceExec`. 
The main difference between these implementations is that `native_datafusion` runs fully natively, and `native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that `native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary. -The `native_datafusion` and `native_iceberg_compat` scans share the following limitations: +The `native_datafusion` and `native_iceberg_compat` scans share the following limitations. Unless otherwise noted, +unsupported features cause Comet to fall back to Spark, so queries will still return correct results. - When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but @@ -40,30 +40,40 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. -- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading - Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), - dates/timestamps will be read as if they were written using the Proleptic Gregorian calendar. This may produce - incorrect results for dates before October 15, 1582. +- No support for default values that are nested types (e.g., maps, arrays, structs). Comet falls back to Spark. + Literal default values are supported. 
+- **Potential incorrect results:** No support for datetime rebasing detection or the + `spark.comet.exceptionOnDatetimeRebase` configuration. When reading Parquet files containing dates or timestamps + written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they + were written using the Proleptic Gregorian calendar. This does not fall back to Spark and may produce incorrect + results for dates before October 15, 1582. - No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet will fall back to Spark when V2 is enabled. - No support for Parquet encryption. When Parquet encryption is enabled (i.e., `parquet.crypto.factory.class` is set), Comet will fall back to Spark. +- No support for Spark metadata columns (e.g., `_metadata.file_path`). When metadata columns are referenced in the + query, Comet falls back to Spark. -The `native_datafusion` scan has some additional limitations: +The `native_datafusion` scan has some additional limitations. All of these cause Comet to fall back to Spark. - No support for row indexes - No support for reading Parquet field IDs -- Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark +- No support for Dynamic Partition Pruning (DPP). When partition filters contain dynamic pruning subqueries, Comet + falls back to Spark. +- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions. + The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values. + When these functions appear anywhere in the query plan, Comet falls back to Spark. +- When `ignoreMissingFiles` or `ignoreCorruptFiles` is set to `true`, Comet falls back to Spark. 
The `native_iceberg_compat` scan has some additional limitations: -- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values. - The affected configurations are `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, - `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, and - `spark.sql.legacy.parquet.nanosAsLong`. See [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) - for more details. +- **Potential incorrect results:** Some Spark configuration values are hard-coded to their defaults rather than + respecting user-specified values. This does not fall back to Spark and may produce incorrect results when + non-default values are set. The affected configurations are `spark.sql.parquet.binaryAsString`, + `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, + and `spark.sql.legacy.parquet.nanosAsLong`. See + [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details. ## S3 Support From 32334bdf90b1d554cd75e80c50a055bf12dea52e Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 13 Feb 2026 13:03:48 -0700 Subject: [PATCH 08/15] docs: remove redundant fallback language in native_datafusion section The section intro already states all limitations fall back to Spark, so individual bullet points don't need to repeat it. Co-Authored-By: Claude Opus 4.6 --- docs/source/contributor-guide/parquet_scans.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index ab4b1b9b23..07c56f1ff3 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -59,12 +59,10 @@ The `native_datafusion` scan has some additional limitations. 
All of these cause - No support for row indexes - No support for reading Parquet field IDs -- No support for Dynamic Partition Pruning (DPP). When partition filters contain dynamic pruning subqueries, Comet - falls back to Spark. +- No support for Dynamic Partition Pruning (DPP) - No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions. The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values. - When these functions appear anywhere in the query plan, Comet falls back to Spark. -- When `ignoreMissingFiles` or `ignoreCorruptFiles` is set to `true`, Comet falls back to Spark. +- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true` The `native_iceberg_compat` scan has some additional limitations: From 15c3049dd08abd54017b629ddf5e0debb24c9015 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 13 Feb 2026 13:05:22 -0700 Subject: [PATCH 09/15] docs: separate fallback limitations from incorrect-results limitations Restructure shared and per-scan limitation lists into two clear categories: features that fall back to Spark (safe) and issues that may produce incorrect results without falling back. Remove redundant "Comet falls back to Spark" from individual bullets where the section intro already states it. Co-Authored-By: Claude Opus 4.6 --- .../source/contributor-guide/parquet_scans.md | 57 +++++++++---------- 1 file changed, 27 insertions(+), 30 deletions(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 07c56f1ff3..195545fe4c 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -30,30 +30,27 @@ The two implementations are `native_datafusion` and `native_iceberg_compat`. 
The `native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that `native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary. -The `native_datafusion` and `native_iceberg_compat` scans share the following limitations. Unless otherwise noted, -unsupported features cause Comet to fall back to Spark, so queries will still return correct results. - -- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8` - (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but - Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet - cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to - Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting - `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can - only come from signed `INT8`, where truncation preserves the signed value. -- No support for default values that are nested types (e.g., maps, arrays, structs). Comet falls back to Spark. - Literal default values are supported. -- **Potential incorrect results:** No support for datetime rebasing detection or the - `spark.comet.exceptionOnDatetimeRebase` configuration. When reading Parquet files containing dates or timestamps - written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they - were written using the Proleptic Gregorian calendar. This does not fall back to Spark and may produce incorrect - results for dates before October 15, 1582. -- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, - Spark uses the V2 API for Parquet scans. 
The DataFusion-based implementations only support the V1 API, so Comet - will fall back to Spark when V2 is enabled. -- No support for Parquet encryption. When Parquet encryption is enabled (i.e., `parquet.crypto.factory.class` is - set), Comet will fall back to Spark. -- No support for Spark metadata columns (e.g., `_metadata.file_path`). When metadata columns are referenced in the - query, Comet falls back to Spark. +The following unsupported features are shared by both scans and cause Comet to fall back to Spark: + +- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain + columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark. + Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect the unsigned type and read the data as + unsigned rather than signed. Since Comet cannot distinguish `ShortType` columns that came from `UINT_8` versus + signed `INT16`, by default Comet falls back to Spark when scanning Parquet files containing `ShortType` columns. + This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` + columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value. +- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the + V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API. +- Parquet encryption (i.e., when `parquet.crypto.factory.class` is set) +- Spark metadata columns (e.g., `_metadata.file_path`) + +The following shared limitation may produce incorrect results without falling back to Spark: + +- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. 
When + reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid + Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian + calendar. This may produce incorrect results for dates before October 15, 1582. The `native_datafusion` scan has some additional limitations. All of these cause Comet to fall back to Spark. @@ -64,13 +61,13 @@ The `native_datafusion` scan has some additional limitations. All of these cause The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values. - No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true` -The `native_iceberg_compat` scan has some additional limitations: +The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results +without falling back to Spark: -- **Potential incorrect results:** Some Spark configuration values are hard-coded to their defaults rather than - respecting user-specified values. This does not fall back to Spark and may produce incorrect results when - non-default values are set. The affected configurations are `spark.sql.parquet.binaryAsString`, - `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, - and `spark.sql.legacy.parquet.nanosAsLong`. See +- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values. + This may produce incorrect results when non-default values are set. The affected configurations are + `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`, + `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details. 
## S3 Support From 2789c36180ff42319c29e4bdd10df18a1995086b Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 13 Feb 2026 13:07:13 -0700 Subject: [PATCH 10/15] fix --- docs/source/contributor-guide/parquet_scans.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 195545fe4c..71e682dfea 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -19,17 +19,19 @@ under the License. # Comet Parquet Scan Implementations -Comet currently has two distinct implementations of the Parquet scan operator. The configuration property -`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which -currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting. -However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting -this configuration property to one of the following implementations. +Comet currently has two distinct implementations of the Parquet scan operator. The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's `DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and `native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that `native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary. +The configuration property +`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which +currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting. 
+However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting +this configuration property to one of the following implementations. + The following unsupported features are shared by both scans and cause Comet to fall back to Spark: - `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain From 0266613f37f6069f4d307302a46ea8c7b72c7449 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 13 Feb 2026 13:12:05 -0700 Subject: [PATCH 11/15] update --- docs/source/contributor-guide/ffi.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/source/contributor-guide/ffi.md b/docs/source/contributor-guide/ffi.md index e2ffad386b..5a4f572f81 100644 --- a/docs/source/contributor-guide/ffi.md +++ b/docs/source/contributor-guide/ffi.md @@ -177,9 +177,8 @@ message Scan { #### When ownership is NOT transferred to native: -If the data originates from a scan that uses mutable buffers (such as `native_iceberg_compat` when reading partition -columns or adding missing columns) in some cases, then ownership is not transferred to native and the JVM may -re-use the underlying buffers in the future. +If the data originates from a scan that uses mutable buffers (such Iceberg scans using the Iceberg Java integration path), +then ownership is not transferred to native and the JVM may re-use the underlying buffers in the future. It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by operators such as `SortExec` or `ShuffleWriterExec`, otherwise data corruption is likely to occur. 
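The ownership rule in the `ffi.md` hunk above can be illustrated with a small Python analogy. This is not Comet code (the real path is Rust code receiving Arrow arrays over FFI); it only sketches why an operator that buffers batches must deep-copy data backed by a reusable mutable buffer:

```python
import copy

# Simulated reusable batch buffer: the "JVM side" overwrites it in place
# each time it produces the next batch.
jvm_buffer = [1, 2, 3]

# The "native side" buffers the batch for a blocking operator (e.g. a sort).
shallow = jvm_buffer              # keeps a reference to the mutable buffer
deep = copy.deepcopy(jvm_buffer)  # the deep copy the docs require

# The JVM re-uses the same buffer for the next batch.
jvm_buffer[:] = [7, 8, 9]

print(shallow)  # [7, 8, 9] -- corrupted: the buffered batch now shows new data
print(deep)     # [1, 2, 3] -- safe: unaffected by buffer re-use
```

Roughly speaking, the deep copy on the native side corresponds to copying the underlying Arrow buffers; merely cloning an `ArrayRef` only increments a reference count and would behave like `shallow` above.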
From ba192a10d126b3295d55a9dcff1d4c130db84a3d Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Fri, 13 Feb 2026 13:53:34 -0700
Subject: [PATCH 12/15] remove encryption from unsupported list, move DPP to common list

---
 docs/source/contributor-guide/parquet_scans.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
index 71e682dfea..79187e5aa4 100644
--- a/docs/source/contributor-guide/parquet_scans.md
+++ b/docs/source/contributor-guide/parquet_scans.md
@@ -44,8 +44,8 @@ The following unsupported features are shared by both scans and cause Comet to f
 - Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
 - Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
   V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
-- Parquet encryption (i.e., when `parquet.crypto.factory.class` is set)
 - Spark metadata columns (e.g., `_metadata.file_path`)
+- Dynamic Partition Pruning (DPP)

 The following shared limitation may produce incorrect results without falling back to Spark:

@@ -58,7 +58,6 @@ The `native_datafusion` scan has some additional limitations. All of these cause

 - No support for row indexes
 - No support for reading Parquet field IDs
-- No support for Dynamic Partition Pruning (DPP)
 - No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
   The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true` From df37c226755243a3c74ad3ce3b3773af5ecdf41c Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 17 Feb 2026 11:21:27 -0700 Subject: [PATCH 13/15] Update docs/source/contributor-guide/parquet_scans.md Co-authored-by: Oleks V --- docs/source/contributor-guide/parquet_scans.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 79187e5aa4..987ce7ea69 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -30,7 +30,7 @@ The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting. However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting -this configuration property to one of the following implementations. +this configuration property to one of the following implementations. 
For example: `--conf spark.comet.scan.impl=native_datafusion` The following unsupported features are shared by both scans and cause Comet to fall back to Spark: From 7702b24d62b5c62219df5a5d7e1cc93ca978f643 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 17 Feb 2026 14:42:17 -0700 Subject: [PATCH 14/15] address feedback --- docs/source/contributor-guide/ffi.md | 4 +++- docs/source/contributor-guide/parquet_scans.md | 17 +++++++++-------- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/docs/source/contributor-guide/ffi.md b/docs/source/contributor-guide/ffi.md index 5a4f572f81..c40c189e99 100644 --- a/docs/source/contributor-guide/ffi.md +++ b/docs/source/contributor-guide/ffi.md @@ -177,9 +177,11 @@ message Scan { #### When ownership is NOT transferred to native: -If the data originates from a scan that uses mutable buffers (such Iceberg scans using the Iceberg Java integration path), +If the data originates from a scan that uses mutable buffers (such as Iceberg scans using the [hybrid Iceberg reader]), then ownership is not transferred to native and the JVM may re-use the underlying buffers in the future. +[hybrid Iceberg reader]: https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader + It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by operators such as `SortExec` or `ShuffleWriterExec`, otherwise data corruption is likely to occur. diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index 987ce7ea69..e462be2410 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -21,18 +21,18 @@ under the License. Comet currently has two distinct implementations of the Parquet scan operator. -The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's -`DataSourceExec`. 
The main difference between these implementations is that `native_datafusion` runs fully natively, and
-`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
-`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary.
+| Scan Implementation     | Notes             |
+| ----------------------- | ----------------- |
+| `native_datafusion`     | Fully native scan |
+| `native_iceberg_compat` | Hybrid JVM scan   |

 The configuration property
 `spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
 currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
-However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting
-this configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`
+However, it is possible to force Comet to use a particular implementation for all scan operations by setting
+this configuration property to one of the implementations listed above. For example: `--conf spark.comet.scan.impl=native_datafusion`.

-The following unsupported features are shared by both scans and cause Comet to fall back to Spark:
+The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:

 - `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
   columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark.
@@ -54,7 +54,8 @@ The following shared limitation may produce incorrect results without falling ba
   Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian
   calendar. This may produce incorrect results for dates before October 15, 1582.
-The `native_datafusion` scan has some additional limitations. All of these cause Comet to fall back to Spark. +The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these +cause Comet to fall back to Spark. - No support for row indexes - No support for reading Parquet field IDs From 955699b5055b4fdfc3a367eb5b9934ea8af2d4b9 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 17 Feb 2026 14:42:39 -0700 Subject: [PATCH 15/15] address feedback --- docs/source/contributor-guide/parquet_scans.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md index e462be2410..7df9394882 100644 --- a/docs/source/contributor-guide/parquet_scans.md +++ b/docs/source/contributor-guide/parquet_scans.md @@ -21,10 +21,10 @@ under the License. Comet currently has two distinct implementations of the Parquet scan operator. -| Scan Implementation | Notes | -| ----------------------- | ----------------- | -| `native_datafusion` | Fully native scan | -| `native_iceberg_compat` | Hybrid JVM scan | +| Scan Implementation | Notes | +| ----------------------- | ---------------------- | +| `native_datafusion` | Fully native scan | +| `native_iceberg_compat` | Hybrid JVM/native scan | The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which