
Parquet: Keep FileSystem reachable during writes when Hadoop FS cache is disabled #16641

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16640-parquet-writer-fs-reachable

@wombatu-kun
Contributor

wombatu-kun commented Jun 1, 2026

Problem

When the Hadoop FileSystem cache is disabled (for example fs.abfs.impl.disable.cache=true), Parquet writes through Iceberg can fail mid-write. The reporter hit this on Spark writing to Azure ADLS Gen2 over abfs:// with a Hadoop catalog: the job fails with Could not submit task to executor ... ThreadPoolExecutor [Terminated], and the debug log shows AzureBlobFileSystem.finalize() running while the file is still being written.

Root cause

ParquetWriter#ensureWriterInitialized builds new ParquetFileWriter(ParquetIO.file(output, conf), ...) and does not keep the Parquet OutputFile it passes in. For a HadoopOutputFile, ParquetIO.file(...) returns a Parquet-native org.apache.parquet.hadoop.util.HadoopOutputFile that resolves its own FileSystem through path.getFileSystem(conf); with the cache disabled this is a fresh instance. Because ParquetFileWriter retains only the output stream and not the OutputFile, nothing keeps that FileSystem reachable once the writer has been constructed, so it can be garbage-collected while the write is still in progress.

On Azure this is fatal. AbfsOutputStream references the store's bounded thread pool (through a SemaphoredDelegatingExecutor) and the AbfsClient, but never the AzureBlobFileSystem wrapper. When the wrapper becomes unreachable, AzureBlobFileSystem.finalize() calls close(), which shuts down AzureBlobFileSystemStore's boundedThreadPool; the still-open stream's next asynchronous flush then submits to that terminated pool and fails, which is exactly the reported log sequence.
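The failure sequence above can be modelled with plain JDK classes. This is only a sketch under stated assumptions: `OwnerFs` and `StreamLike` are hypothetical stand-ins for the FileSystem wrapper and its output stream, not the real Azure classes, and `close()` is called explicitly where the real bug reaches it through `finalize()`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class TerminatedPoolSketch {
    /** Hypothetical stand-in for the FileSystem wrapper that owns the thread pool. */
    static final class OwnerFs implements AutoCloseable {
        final ExecutorService pool = Executors.newFixedThreadPool(2);

        StreamLike create() {
            return new StreamLike(pool);
        }

        @Override
        public void close() {
            pool.shutdownNow();
        }
    }

    /** Hypothetical stand-in for the output stream: it holds the pool, never the owner. */
    static final class StreamLike {
        private final ExecutorService pool;

        StreamLike(ExecutorService pool) {
            this.pool = pool;
        }

        /** Returns true if the pool accepted the flush task, false if it rejected it. */
        boolean flushAsync() {
            try {
                pool.submit(() -> { });
                return true;
            } catch (RejectedExecutionException e) {
                return false;
            }
        }
    }

    /** Returns {accepted before close, accepted after close}. */
    static boolean[] run() {
        OwnerFs fs = new OwnerFs();
        StreamLike stream = fs.create();
        boolean before = stream.flushAsync();
        fs.close(); // in the reported bug, finalize() triggers this while the stream is open
        boolean after = stream.flushAsync();
        return new boolean[] {before, after};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("flush accepted before close: " + result[0]);
        System.out.println("flush accepted after close: " + result[1]);
    }
}
```

The stream keeps working only as long as something else keeps the owner open, which is why the reachability of the wrapper matters even though the stream never touches it directly.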

The read path is unaffected because Parquet's ParquetFileReader retains its InputFile, and therefore the FileSystem, for the reader's lifetime. This change gives the write path the same guarantee.

Change

Keep the Parquet OutputFile on ParquetWriter so its FileSystem stays reachable for the writer's lifetime, including through close() and the footer flush. This keeps alive the exact FileSystem instance that backs the write until the write finishes.
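The shape of the change can be sketched with plain JDK types. All names here (`FsLike`, `OutputFileLike`, `RetainingWriter`) are hypothetical illustrations, not the actual Iceberg or Parquet classes; the only point is that the writer stores the output file in a field instead of using it once to open the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class RetainOutputFileSketch {
    /** Hypothetical stand-in for a FileSystem resolved with the cache disabled. */
    static final class FsLike {
        OutputStream create() {
            return new ByteArrayOutputStream();
        }
    }

    /** Hypothetical stand-in for the output file, which owns a fresh FsLike. */
    static final class OutputFileLike {
        private final FsLike fs = new FsLike();

        FsLike fs() {
            return fs;
        }

        OutputStream create() {
            return fs.create();
        }
    }

    /** Hypothetical writer that retains the output file, not only the stream. */
    static final class RetainingWriter implements AutoCloseable {
        private final OutputFileLike file; // keeps the FsLike reachable until the writer is dropped
        private final OutputStream stream;

        RetainingWriter(OutputFileLike file) {
            this.file = file;
            this.stream = file.create();
        }

        FsLike backingFs() {
            return file.fs();
        }

        void write(byte[] bytes) throws IOException {
            stream.write(bytes);
        }

        @Override
        public void close() throws IOException {
            stream.close();
        }
    }

    public static void main(String[] args) throws IOException {
        OutputFileLike file = new OutputFileLike();
        FsLike resolved = file.fs();
        try (RetainingWriter writer = new RetainingWriter(file)) {
            writer.write(new byte[] {1, 2, 3});
            System.out.println("same FsLike instance: " + (writer.backingFs() == resolved));
        }
    }
}
```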

Tests

Added TestParquetWriterFileSystemReachability. It writes through a ParquetWriter backed by a HadoopOutputFile with the Hadoop FileSystem cache disabled, using a local FileSystem whose output stream does not hold a back-reference to the FileSystem (mirroring AbfsOutputStream). It asserts that every FileSystem resolved for the write stays reachable while the writer is open and becomes collectible after the writer is dropped. The test fails without the production change and passes with it.
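A reachability assertion of this kind can be sketched with weak references. This is an illustration of the technique only, not the code of TestParquetWriterFileSystemReachability: `FsLike`, `TrackingFactory` and `WriterLike` are hypothetical, and the "collected after drop" check is best-effort because `System.gc()` is only a hint to the JVM.

```java
import java.lang.ref.Reference;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class ReachabilityCheckSketch {
    /** Hypothetical stand-in for a FileSystem instance. */
    static final class FsLike { }

    /** Hypothetical factory that records a weak reference to every FsLike it hands out. */
    static final class TrackingFactory {
        final List<WeakReference<FsLike>> tracked = new ArrayList<>();

        FsLike newFs() {
            FsLike fs = new FsLike();
            tracked.add(new WeakReference<>(fs));
            return fs;
        }
    }

    /** Hypothetical writer that keeps its FsLike in a field. */
    static final class WriterLike {
        private final FsLike fs;

        WriterLike(FsLike fs) {
            this.fs = fs;
        }
    }

    /** Requests a GC, then reports whether every tracked instance is still alive. */
    static boolean allReachable(TrackingFactory factory) {
        System.gc();
        for (WeakReference<FsLike> ref : factory.tracked) {
            if (ref.get() == null) {
                return false;
            }
        }
        return true;
    }

    /** While the writer is strongly reachable, its FsLike must not be collected. */
    static boolean reachableWhileOpen() {
        TrackingFactory factory = new TrackingFactory();
        WriterLike writer = new WriterLike(factory.newFs());
        try {
            return allReachable(factory);
        } finally {
            // Stops the JIT from treating the writer as dead before the check has run.
            Reference.reachabilityFence(writer);
        }
    }

    /** Best effort: polls until the FsLike of a dropped writer is collected or time runs out. */
    static boolean collectedAfterDrop() throws InterruptedException {
        TrackingFactory factory = new TrackingFactory();
        new WriterLike(factory.newFs()); // dropped immediately, nothing retains it
        for (int attempt = 0; attempt < 50; attempt++) {
            System.gc();
            if (factory.tracked.get(0).get() == null) {
                return true;
            }
            Thread.sleep(20);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reachable while open: " + reachableWhileOpen());
        System.out.println("collected after drop: " + collectedAfterDrop());
    }
}
```

Only the first check is deterministic; a test that also asserts collection after the drop needs a polling loop with a timeout, as above, and can still be flaky on JVMs that ignore explicit GC requests.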

Closes #16640

… is disabled

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@steveloughran
Contributor

This isn't needed

  1. it is fixed on Hadoop 3.4.0 at the level it should be: in the abfs output stream
  2. it's only going to surface on spark 3, which on 1.11.0 means 3.5
  3. and then only when fs caching is disabled

Disabling abfs caching is a performance killer as you lose the thread pool, the http connection pool and the http prefetch buffer pool. Unless you have some fundamental reason to disable it, fix that.

...but it's not needed in the iceberg main branch once spark 3.5 is cut; it's a very transient workaround for the "caching disabled" situation. Once the spark-3.5 branch goes, this PR just becomes superfluous and should really be rolled back to keep the codebase leaner.

Can't you just enable caching and speed all your work up at the same time?

Or, if there is some abfs issue which requires caching to be turned off, why not discuss it here or create a HADOOP- jira on fs/azure? Though as usual: test against 3.4.3/3.5.0 before reporting.

@wombatu-kun
Contributor Author

@steveloughran I opened this only to help the reporter of #16640, who hit it on the current Spark 3.5 runtime (Spark bundles Hadoop 3.3.x) with the abfs cache disabled - not as a feature I'm attached to. If you don't think it's worth carrying as a transitional workaround until the spark-3.5 tree is dropped, feel free to close it - no objection from me.

@steveloughran
Contributor

I'll ask why they can't turn caching on; this is such a transient issue. Always frustrating to see something fixed years ago still not being picked up ... we don't want to taint other projects just because of a bug which was fixed within a week of being discovered in 2023.

Development

Successfully merging this pull request may close these issues.

[Spark] parquet ingestion to azure gets stuck when hadoop fs cache is disabled