Spark, Parquet: honor Hadoop vectored IO read config by venkateshwaracholan · Pull Request #16614 · apache/iceberg

venkateshwaracholan · 2026-05-29T19:43:14Z

Summary

Iceberg currently forces Hadoop vectored I/O on when building Parquet read options by calling withUseHadoopVectoredIo(true). As a result, users cannot disable vectored I/O with Parquet’s existing parquet.hadoop.vectored.io.enabled=false setting.

This change:

Removes the hardcoded vectored I/O override.
Preserves Parquet’s default vectored I/O behavior.
Propagates parquet.hadoop.vectored.io.enabled through Spark read planning so executor-side Parquet readers can honor it.
Adds Parquet and Spark tests for the config behavior.

Why

For S3FileIO, Spark executor-side reads use Iceberg input files such as S3InputFile, not HadoopInputFile. That means Hadoop configuration is not automatically available when Parquet read options are built on executors.

By carrying parquet.hadoop.vectored.io.enabled from Spark read options, Spark Hadoop configuration, or table properties into the Parquet reader, users can disable vectored I/O for environments where it causes excessive on-heap allocation.
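A minimal sketch of how the flag could be resolved from those three sources. The class and method names are hypothetical and are not Iceberg APIs, and the precedence order (read option, then Spark Hadoop configuration, then table property) is an assumption; the description above only lists the sources. An empty result means no source set the flag, so Parquet's own default should apply.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: this class is hypothetical and is not part of Iceberg.
final class VectoredIoConfSketch {
  static final String KEY = "parquet.hadoop.vectored.io.enabled";

  // Assumed precedence (not stated in the PR description): read option first,
  // then Spark Hadoop configuration, then table property.
  static Optional<Boolean> resolve(
      Map<String, String> readOptions,
      Map<String, String> hadoopConf,
      Map<String, String> tableProperties) {
    for (Map<String, String> source : List.of(readOptions, hadoopConf, tableProperties)) {
      String value = source.get(KEY);
      if (value != null) {
        return Optional.of(Boolean.parseBoolean(value));
      }
    }
    // Nothing set anywhere: leave the decision to Parquet's own default.
    return Optional.empty();
  }

  public static void main(String[] args) {
    Map<String, String> none = Map.of();
    // Only the Hadoop configuration sets the flag.
    System.out.println(resolve(none, Map.of(KEY, "false"), none));
    // A read option overrides the Hadoop configuration under the assumed precedence.
    System.out.println(resolve(Map.of(KEY, "true"), Map.of(KEY, "false"), none));
    // No source sets the flag.
    System.out.println(resolve(none, none, none));
  }
}
```

Returning an empty Optional rather than a hardcoded boolean mirrors the intent of the change: nothing is overridden unless a user actually configured the flag.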

Testing

./gradlew :iceberg-parquet:test --tests org.apache.iceberg.parquet.TestParquet
./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:test --tests org.apache.iceberg.spark.TestSparkReadConf
./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava
./gradlew spotlessCheck

alexandrefimov · 2026-05-30T15:03:14Z

@@ -125,6 +148,10 @@ protected NameMapping nameMapping() {
    return nameMapping;
  }


Thanks for working on this. I was looking at whether the setting reaches all Parquet read paths and noticed one path that may still be uncovered.

BaseRowReader and BaseBatchReader now call setAll(parquetReadProperties()) for Parquet data files, but SparkDeleteFilter still creates BaseDeleteLoader/CachingDeleteLoader with only loadInputFile. BaseDeleteLoader.openDeletes then builds the reader without any read properties:

FormatModelRegistry.readBuilder(format, Record.class, inputFile)

For S3FileIO-backed equality/position delete files, the InputFile is not a HadoopInputFile, so Parquet.buildReadOptions may not see the Spark Hadoop configuration and may fall back to the default vectored IO setting. That would mean spark.hadoop.parquet.hadoop.vectored.io.enabled=false may still be ignored while applying Parquet delete files.

Could you check whether delete-file reads need the same propagated parquetReadProperties? If so, a small regression test with a Parquet position/equality delete file would make this easier to keep covered.

Small note on the anchor: GitHub only allowed me to attach this to the added parquetReadProperties() getter. The path I mean is a bit lower in this file: SparkDeleteFilter.newDeleteLoader() creates BaseDeleteLoader/CachingDeleteLoader without passing these properties through.

venkateshwaracholan · 2026-05-31

Thanks for catching this. The delete-file path wasn't getting parquetReadProperties().

I've propagated the same Parquet read properties through BaseDeleteLoader, including the cached delete loader path, and added regression tests covering Parquet position deletes with vectored I/O disabled.

Please take another look when you have a chance.
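To illustrate the gap discussed in this thread, here is a small sketch of a read builder that only honors the flag when the propagated properties are passed to it. ReadBuilderSketch is a hypothetical stand-in, not the Iceberg or Parquet builder, and the fallback value is a parameter because the sketch does not assume what the library default is.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical stand-in, not the Iceberg read builder.
final class ReadBuilderSketch {
  static final String KEY = "parquet.hadoop.vectored.io.enabled";

  private final Map<String, String> properties = new HashMap<>();

  // Mirrors the idea of passing propagated read properties to a builder.
  ReadBuilderSketch setAll(Map<String, String> readProperties) {
    properties.putAll(readProperties);
    return this;
  }

  // Falls back to the supplied default when the flag was never propagated.
  boolean vectoredIoEnabled(boolean libraryDefault) {
    String value = properties.get(KEY);
    return value == null ? libraryDefault : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> propagated = Map.of(KEY, "false");
    boolean libraryDefault = true; // assumed value, for the demo only

    // Delete path without propagation: the user's setting never reaches the builder.
    boolean withoutProps = new ReadBuilderSketch().vectoredIoEnabled(libraryDefault);
    // Delete path with propagation: same properties as the data-file readers.
    boolean withProps =
        new ReadBuilderSketch().setAll(propagated).vectoredIoEnabled(libraryDefault);

    System.out.println("without properties: " + withoutProps);
    System.out.println("with properties: " + withProps);
  }
}
```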

alexandrefimov

I had one small source-compatibility question below.

alexandrefimov · 2026-06-02T11:37:38Z

    this(loadInputFile, ThreadPools.getDeleteWorkerPool());
  }

+  public BaseDeleteLoader(


Small source-compatibility question: adding this overload makes existing source calls like new BaseDeleteLoader(loadInputFile, null) ambiguous with the existing (Function, ExecutorService) overload. This PR had to update the JMH callers for that reason.

Since BaseDeleteLoader is public in iceberg-data, would it be better to avoid this two-arg overload, for example by keeping only the three-arg variant for callers that need read properties? That would keep the read-properties propagation while avoiding a source-compat break for existing two-arg callers.

venkateshwaracholan · 2026-06-02

Good point. I removed the (Function<DeleteFile, InputFile>, Map<String, String>) overload and updated the new call sites to use the existing worker-pool constructor plus read properties. This keeps the read-properties propagation while avoiding ambiguity for existing callers that pass null to (Function<DeleteFile, InputFile>, ExecutorService).
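The ambiguity can be reproduced outside Iceberg. The sketch below uses a hypothetical DeleteLoaderSketch class with String standing in for DeleteFile and InputFile; it is not the real BaseDeleteLoader. With only the two-arg (Function, ExecutorService) and three-arg constructors present, a two-arg call that passes null still compiles. The commented-out constructor shows the overload shape that would break it.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Illustrative only: hypothetical stand-in for a loader with the constructor shapes
// discussed above. String replaces DeleteFile and InputFile to stay self-contained.
final class DeleteLoaderSketch {
  final Function<String, String> loadInputFile;
  final ExecutorService workerPool;
  final Map<String, String> readProperties;

  // Existing two-arg shape: delegates with empty read properties.
  DeleteLoaderSketch(Function<String, String> loadInputFile, ExecutorService workerPool) {
    this(loadInputFile, workerPool, Map.of());
  }

  // If the overload below existed, `new DeleteLoaderSketch(fn, null)` would no longer
  // compile: null is assignable to both ExecutorService and Map, and neither type is
  // more specific than the other, so javac reports an ambiguous constructor reference.
  //
  // DeleteLoaderSketch(Function<String, String> loadInputFile, Map<String, String> props)

  // Three-arg shape: callers that need read properties name the worker pool explicitly.
  DeleteLoaderSketch(
      Function<String, String> loadInputFile,
      ExecutorService workerPool,
      Map<String, String> readProperties) {
    this.loadInputFile = loadInputFile;
    this.workerPool = workerPool;
    this.readProperties = readProperties;
  }

  public static void main(String[] args) {
    Function<String, String> loadInputFile = path -> path;

    // Existing caller style: compiles because only one two-arg constructor exists.
    DeleteLoaderSketch legacy = new DeleteLoaderSketch(loadInputFile, null);

    // New caller style: read properties go through the three-arg constructor.
    DeleteLoaderSketch withProps =
        new DeleteLoaderSketch(
            loadInputFile, null, Map.of("parquet.hadoop.vectored.io.enabled", "false"));

    System.out.println(legacy.readProperties);
    System.out.println(withProps.readProperties);
  }
}
```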

alexandrefimov

Thanks, this addresses my concern. Removing the two-arg read-properties overload avoids the source ambiguity while still allowing the delete-file read path to receive the propagated Parquet read properties.

I also re-ran the focused checks for the updated paths:

:iceberg-data:test --tests org.apache.iceberg.data.TestBaseDeleteLoader
:iceberg-parquet:test --tests org.apache.iceberg.parquet.TestParquet
:iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava
:iceberg-spark:iceberg-spark-4.1_2.13:test --tests org.apache.iceberg.spark.source.TestSparkReaderDeletes.testParquetDeletesWithVectoredIoDisabled

github-actions bot added the spark and parquet labels May 29, 2026

venkateshwaracholan force-pushed the parquet-vectored-io-config branch from f81d217 to 1fed7bd May 29, 2026 19:46

Spark, Parquet: honor Hadoop vectored IO read config

b4f4091

venkateshwaracholan force-pushed the parquet-vectored-io-config branch from 1fed7bd to b4f4091 May 29, 2026 19:52

alexandrefimov reviewed May 30, 2026

github-actions bot added the data label May 31, 2026

venkateshwaracholan requested a review from alexandrefimov May 31, 2026 15:11

Spark: propagate Parquet read properties to delete readers

93ed1e6

venkateshwaracholan force-pushed the parquet-vectored-io-config branch from f97f994 to 93ed1e6 May 31, 2026 15:15

venkateshwaracholan added 4 commits June 2, 2026 15:57

Test: clear TestTables state after delete loader tests

462985f

Fix benchmark compilation after delete loader changes

3697abc

Merge branch 'main' into parquet-vectored-io-config

bfe9d0d

Parquet: Fix import ordering in TestParquet

84eb144

alexandrefimov reviewed Jun 2, 2026

Data: avoid BaseDeleteLoader constructor ambiguity

d738d82

venkateshwaracholan requested a review from alexandrefimov June 2, 2026 12:27

alexandrefimov approved these changes Jun 2, 2026

venkateshwaracholan wants to merge 7 commits into apache:main from venkateshwaracholan:parquet-vectored-io-config
