Parquet: Keep FileSystem reachable during writes when Hadoop FS cache is disabled by wombatu-kun · Pull Request #16641 · apache/iceberg

wombatu-kun · 2026-06-01T04:45:37Z

Problem

When the Hadoop FileSystem cache is disabled (for example fs.abfs.impl.disable.cache=true), Parquet writes through Iceberg can fail mid-write. The reporter hit this on Spark writing to Azure ADLS Gen2 over abfs:// with a Hadoop catalog: the job fails with "Could not submit task to executor ... ThreadPoolExecutor [Terminated]", and the debug log shows AzureBlobFileSystem.finalize() running while the file is still being written.

Root cause

ParquetWriter#ensureWriterInitialized builds new ParquetFileWriter(ParquetIO.file(output, conf), ...) and does not keep the Parquet OutputFile it passes in. For a HadoopOutputFile, ParquetIO.file(...) returns a Parquet-native org.apache.parquet.hadoop.util.HadoopOutputFile that resolves its own FileSystem through path.getFileSystem(conf); with the cache disabled this is a fresh instance. Because ParquetFileWriter retains only the output stream and not the OutputFile, nothing keeps that FileSystem reachable once the writer has been constructed, so it can be garbage-collected while the write is still in progress.

On Azure this is fatal. AbfsOutputStream references the store's bounded thread pool (through a SemaphoredDelegatingExecutor) and the AbfsClient, but never the AzureBlobFileSystem wrapper. When the wrapper becomes unreachable, AzureBlobFileSystem.finalize() calls close(), which shuts down AzureBlobFileSystemStore's boundedThreadPool; the still-open stream's next asynchronous flush then submits to that terminated pool and fails, which is exactly the reported log sequence.
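The failure mode can be reproduced in miniature with plain JDK classes. This is only a sketch: SketchFileSystem and SketchOutputStream are hypothetical stand-ins for AzureBlobFileSystem and AbfsOutputStream, and an explicit close() call stands in for the finalize() that the garbage collector would trigger.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for the FileSystem wrapper: it owns the thread pool
// and shuts it down when closed.
final class SketchFileSystem implements AutoCloseable {
  private final ExecutorService pool = Executors.newFixedThreadPool(2);

  SketchOutputStream create() {
    // The stream receives the pool only, never a reference back to this wrapper.
    return new SketchOutputStream(pool);
  }

  @Override
  public void close() {
    pool.shutdown();
  }
}

// Hypothetical stand-in for the output stream: it flushes through the shared pool.
final class SketchOutputStream {
  private final ExecutorService pool;

  SketchOutputStream(ExecutorService pool) {
    this.pool = pool;
  }

  void flushAsync() {
    try {
      // submit() throws RejectedExecutionException once the pool is shut down.
      pool.submit(() -> { }).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }
}

final class FinalizerShutdownSketch {
  static boolean flushFailsAfterOwnerClosed() {
    SketchFileSystem fs = new SketchFileSystem();
    SketchOutputStream out = fs.create();
    out.flushAsync(); // succeeds while the pool is running
    fs.close();       // stands in for finalize() calling close() on the unreachable wrapper
    try {
      out.flushAsync();
      return false;
    } catch (RejectedExecutionException e) {
      return true;    // the still-open stream can no longer submit work
    }
  }

  public static void main(String[] args) {
    System.out.println("flush rejected after owner closed: " + flushFailsAfterOwnerClosed());
  }
}
```

In the real failure nobody calls close() explicitly; the wrapper simply becomes unreachable and the finalizer does it, which is why the error appears at an arbitrary point during the write.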

The read path is unaffected because Parquet's ParquetFileReader retains its InputFile, and therefore the FileSystem, for the reader's lifetime. This change makes the write side behave the same way.

Change

Keep the Parquet OutputFile on ParquetWriter so its FileSystem stays reachable for the writer's lifetime, including through close() and the footer flush. This keeps alive the exact FileSystem instance that backs the write until the write finishes.
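The shape of the change, as a minimal sketch rather than the actual diff: all class and field names below are hypothetical, and the integer id exists only so the example can check which instance created the stream without the stream holding a reference to it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a FileSystem created with the cache disabled:
// every instance is distinct and carries its own id.
final class FsSketch {
  private static final AtomicInteger NEXT_ID = new AtomicInteger();
  private final int id = NEXT_ID.incrementAndGet();

  int id() {
    return id;
  }

  StreamSketch create() {
    // The stream keeps no reference to the FileSystem, only a plain id.
    return new StreamSketch(id);
  }
}

final class StreamSketch {
  private final int creatorId;

  StreamSketch(int creatorId) {
    this.creatorId = creatorId;
  }

  int creatorId() {
    return creatorId;
  }
}

// Hypothetical stand-in for the Parquet-native OutputFile, which resolves its own FileSystem.
final class OutputFileSketch {
  private final FsSketch fs = new FsSketch(); // fresh instance, as with the cache disabled

  FsSketch fileSystem() {
    return fs;
  }

  StreamSketch create() {
    return fs.create();
  }
}

final class RetainingWriterSketch {
  // Retaining the OutputFile gives a strong chain writer -> outputFile -> fs,
  // so the FileSystem stays reachable for as long as the writer does.
  private final OutputFileSketch outputFile;
  private final StreamSketch stream;

  RetainingWriterSketch(OutputFileSketch outputFile) {
    this.outputFile = outputFile;
    this.stream = outputFile.create();
  }

  // True when the retained FileSystem is the same instance that created the stream.
  boolean retainsBackingFileSystem() {
    return outputFile.fileSystem().id() == stream.creatorId();
  }

  public static void main(String[] args) {
    RetainingWriterSketch writer = new RetainingWriterSketch(new OutputFileSketch());
    System.out.println("retains backing FileSystem: " + writer.retainsBackingFileSystem());
  }
}
```

Without the outputFile field the writer would hold only the stream, and nothing in the object graph would lead back to the FileSystem.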

Tests

Added TestParquetWriterFileSystemReachability. It writes through a ParquetWriter backed by a HadoopOutputFile with the Hadoop FileSystem cache disabled, using a local FileSystem whose output stream does not hold a back-reference to the FileSystem (mirroring AbfsOutputStream). It asserts that every FileSystem resolved for the write stays reachable while the writer is open and becomes collectible after the writer is dropped. The test fails without the production change and passes with it.
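The reachability assertions can be approximated with weak references. The sketch below is not the test from this PR: ProbeResource and ProbeHolder are hypothetical stand-ins for the FileSystem and the writer, it assumes Java 9+ for Reference.reachabilityFence, and the "collectible after drop" half relies on System.gc(), which is only a hint to the JVM.

```java
import java.lang.ref.Reference;
import java.lang.ref.WeakReference;

// Hypothetical stand-in for the FileSystem.
final class ProbeResource { }

// Hypothetical stand-in for the writer: it holds the resource strongly.
final class ProbeHolder {
  private final ProbeResource resource = new ProbeResource();

  ProbeResource resource() {
    return resource;
  }
}

final class ReachabilityTestSketch {
  // Requests GC repeatedly until the referent is cleared or the timeout elapses.
  static boolean awaitCollected(WeakReference<?> ref, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (ref.get() != null && System.nanoTime() < deadline) {
      System.gc();
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return ref.get() == null;
  }

  // The resource must survive a GC request while its holder is still live.
  static boolean staysReachableWhileHeld() {
    ProbeHolder holder = new ProbeHolder();
    WeakReference<ProbeResource> ref = new WeakReference<>(holder.resource());
    System.gc();
    boolean alive = ref.get() != null;
    // Keeps the holder strongly reachable up to this point, even under JIT optimization.
    Reference.reachabilityFence(holder);
    return alive;
  }

  // Creates a holder in a frame that returns immediately, leaving only a weak reference.
  static WeakReference<ProbeResource> dropHolder() {
    ProbeHolder holder = new ProbeHolder();
    return new WeakReference<>(holder.resource());
  }

  public static void main(String[] args) {
    System.out.println("reachable while held: " + staysReachableWhileHeld());
    System.out.println("collected after drop: " + awaitCollected(dropHolder(), 5_000));
  }
}
```

Only the first check is guaranteed by the language; the second depends on the collector honoring System.gc(), so a test built this way needs a generous timeout.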

Closes #16640


steveloughran · 2026-06-01T18:57:09Z

This isn't needed

It is fixed in Hadoop 3.4.0 at the level it should be: in the abfs output stream.
It's only going to surface on Spark 3, which on Iceberg 1.11.0 means Spark 3.5,
and then only when fs caching is disabled.

Disabling abfs caching is a performance killer as you lose the thread pool, the http connection pool and the http prefetch buffer pool. Unless you have some fundamental reason, fix that.

...but it's not needed in iceberg main branch once spark 3.5 is cut; it's a very transient workaround for the situation "caching disabled". Once the spark-3.5 branch goes this PR just becomes superfluous and should really be rolled back to keep the codebase leaner.

Can't you just enable caching and speed all your work up at the same time?

Or, if there is some abfs issue which requires caching to be turned off, why not discuss it here or create a HADOOP- JIRA on fs/azure? Though as usual: test against 3.4.3/3.5.0 before reporting.

wombatu-kun · 2026-06-02T01:27:46Z

@steveloughran I opened this only to help the reporter of #16640, who hit it on the current Spark 3.5 runtime (Spark bundles Hadoop 3.3.x) with the abfs cache disabled - not as a feature I'm attached to. If you don't think it's worth carrying as a transitional workaround until the spark-3.5 tree is dropped, feel free to close it - no objection from me.

steveloughran · 2026-06-02T10:07:40Z

I'll ask why they can't turn caching on; this is such a transient issue. Always frustrating to see something fixed years ago still not being picked up ... we don't want to taint other projects just because of a bug which was fixed within a week of being discovered in 2023.

Commit f084ac9: Parquet: Keep FileSystem reachable during writes when Hadoop FS cache is disabled

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the parquet label Jun 1, 2026

wombatu-kun mentioned this pull request Jun 1, 2026

Core: Keep FileSystem reachable during Avro writes when Hadoop FS cache is disabled #16642

Open

wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16640-parquet-writer-fs-reachable
