feat: add support for parquet content defined chunking options #21110

Open
kszucs wants to merge 7 commits into apache:main from kszucs:cdc
kszucs (Member) commented Mar 23, 2026

Rationale for this change

Expose the new Content-Defined Chunking (CDC) feature from parquet-rs (apache/arrow-rs#9450).

What changes are included in this PR?

New parquet writer options for enabling CDC.

Are these changes tested?

In progress.

Are there any user-facing changes?

New config options.

Depends on the 58.1 arrow-rs release.

@github-actions github-actions bot added the core, common, proto, datasource, and sqllogictest labels Mar 23, 2026
@kszucs kszucs force-pushed the cdc branch 4 times, most recently from 47936bb to 9d757a2 Compare March 23, 2026 10:56
@github-actions github-actions bot removed the core label Mar 23, 2026
alamb (Contributor) commented Mar 24, 2026

> Depends on the 58.1 arrow-rs release.

This is done in

So once that PR is merged this one should be good to go

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 25, 2026
… clean up proto handling

- Change `CdcOptions::norm_level` from `i64` to `i32` to match parquet's type,
  replacing `config_field!(i64)` with `config_field!(i32)` and removing all
  now-unnecessary `as i32`/`as i64` casts
- Add validation in `into_writer_properties_builder` for invalid CDC config:
  `min_chunk_size == 0` and `max_chunk_size <= min_chunk_size`, returning a
  proper `DataFusionError::Configuration` instead of panicking in the parquet layer
- Add explanatory comments on why proto zero-value handling is asymmetric
  between chunk sizes (0 is invalid) and `norm_level` (0 is valid default)
- Remove extra blank line in `ParquetOptions` config block
- Remove unused `parquet` feature from `datafusion-proto-common/Cargo.toml`
- Add three validation tests for the new error paths
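
The validation and zero-value rules described in the commit message can be sketched roughly as follows. This is a minimal, standalone illustration and not the PR's actual code: `CdcOptions` here is a simplified stand-in using the field names from the commit message, `ConfigError` stands in for `DataFusionError::Configuration`, and `validate_cdc`, `cdc_from_proto`, and the two `DEFAULT_*` constants are hypothetical names and placeholder values. Treating a proto chunk size of 0 as "unset, fall back to a default" is also an assumption about how the asymmetry is resolved.

```rust
/// Simplified stand-in for the PR's `CdcOptions` (field names taken from the
/// commit message; the integer widths of the chunk sizes are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    /// `i32` to match parquet's type, so no `as i32`/`as i64` casts are needed.
    pub norm_level: i32,
}

/// Hypothetical stand-in for `DataFusionError::Configuration`.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Configuration(String),
}

/// Reject invalid CDC settings up front with a configuration error,
/// instead of letting the parquet layer panic later.
pub fn validate_cdc(opts: &CdcOptions) -> Result<(), ConfigError> {
    if opts.min_chunk_size == 0 {
        return Err(ConfigError::Configuration(
            "CDC min_chunk_size must be greater than 0".to_string(),
        ));
    }
    if opts.max_chunk_size <= opts.min_chunk_size {
        return Err(ConfigError::Configuration(format!(
            "CDC max_chunk_size ({}) must be greater than min_chunk_size ({})",
            opts.max_chunk_size, opts.min_chunk_size
        )));
    }
    Ok(())
}

// Placeholder defaults for illustration only (assumed values).
pub const DEFAULT_MIN_CHUNK_SIZE: usize = 256 * 1024;
pub const DEFAULT_MAX_CHUNK_SIZE: usize = 1024 * 1024;

/// Asymmetric proto zero-value handling: protobuf scalars default to 0, and a
/// chunk size of 0 is never valid, so 0 is read as "unset" (assumption) and
/// replaced by a default. `norm_level` 0 is a valid value and is kept as-is.
pub fn cdc_from_proto(min_chunk_size: u64, max_chunk_size: u64, norm_level: i32) -> CdcOptions {
    CdcOptions {
        min_chunk_size: if min_chunk_size == 0 {
            DEFAULT_MIN_CHUNK_SIZE
        } else {
            min_chunk_size as usize
        },
        max_chunk_size: if max_chunk_size == 0 {
            DEFAULT_MAX_CHUNK_SIZE
        } else {
            max_chunk_size as usize
        },
        norm_level,
    }
}
```

In this sketch, a caller would run `validate_cdc` before building writer properties, so that both error paths (`min_chunk_size == 0` and `max_chunk_size <= min_chunk_size`) surface as configuration errors rather than panics.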
Labels: common, datasource, documentation, proto, sqllogictest
