Optimize repeated shouldIgnoreStatistics calls during footer reading in ParquetMetadataConverter #3601

@asifsmohammed

Describe the enhancement requested

CorruptStatistics.shouldIgnoreStatistics(String createdBy, PrimitiveTypeName columnType) performs VersionParser.parse() and SemanticVersion.parse() on every invocation. The createdBy string is constant per file (from FileMetaData.created_by), but this method is called once per column chunk per row group during ParquetMetadataConverter.fromParquetMetadata().

For a file with R row groups and C columns, the version string is parsed R×C times during file open — all yielding the same result.

Impact
High CPU usage during ParquetFileReader construction on files with many row groups and columns.

Proposed fix
Compute shouldIgnoreStatistics once per file before the row group loop in fromParquetMetadata, and pass the pre-computed boolean through buildColumnChunkMetaData to fromParquetStatisticsInternal.

Since buildColumnChunkMetaData and fromParquetStatisticsInternal are public/package-level API methods whose compatibility is checked by japicmp-maven-plugin, we will add overloaded methods that accept a boolean shouldIgnoreBinaryStats parameter rather than changing existing signatures. The existing String createdBy signatures remain for backward compatibility and delegate to the new overloads.
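A minimal, self-contained sketch of the hoist-and-overload pattern described above. It does not use the real Parquet classes: `FooterStatsSketch`, `ColumnType`, `writerHasCorruptBinaryStats`, `ignoreStats`, and the `"old-writer"` rule are all hypothetical stand-ins chosen for illustration, and the counter exists only to make the number of version parses visible. It assumes the `createdBy`-dependent part of the check can be separated from the column-type part, which the proposed `shouldIgnoreBinaryStats` parameter suggests.

```java
import java.util.concurrent.atomic.AtomicInteger;

class FooterStatsSketch {
    // Hypothetical, simplified column types; the real code uses PrimitiveTypeName.
    enum ColumnType { INT32, BINARY }

    // Counts how often the stand-in version parsing runs.
    static final AtomicInteger versionParses = new AtomicInteger();

    // Stand-in for the createdBy-dependent part of the check.
    // The placeholder rule below is NOT Parquet's actual logic.
    static boolean writerHasCorruptBinaryStats(String createdBy) {
        versionParses.incrementAndGet();
        return createdBy != null && createdBy.startsWith("old-writer");
    }

    // Existing-style signature, kept for compatibility; delegates to the new overload.
    static boolean ignoreStats(String createdBy, ColumnType type) {
        return ignoreStats(writerHasCorruptBinaryStats(createdBy), type);
    }

    // New overload: takes the per-file result computed once by the caller.
    static boolean ignoreStats(boolean shouldIgnoreBinaryStats, ColumnType type) {
        return shouldIgnoreBinaryStats && type == ColumnType.BINARY;
    }

    // Before: the version string is parsed once per column chunk (R x C times).
    static int countIgnoredPerCall(String createdBy, int rowGroups, ColumnType[] columns) {
        int ignored = 0;
        for (int r = 0; r < rowGroups; r++) {
            for (ColumnType c : columns) {
                if (ignoreStats(createdBy, c)) ignored++;
            }
        }
        return ignored;
    }

    // After: the version string is parsed once, before the row group loop.
    static int countIgnoredHoisted(String createdBy, int rowGroups, ColumnType[] columns) {
        boolean shouldIgnoreBinaryStats = writerHasCorruptBinaryStats(createdBy);
        int ignored = 0;
        for (int r = 0; r < rowGroups; r++) {
            for (ColumnType c : columns) {
                if (ignoreStats(shouldIgnoreBinaryStats, c)) ignored++;
            }
        }
        return ignored;
    }

    public static void main(String[] args) {
        ColumnType[] columns = {ColumnType.INT32, ColumnType.BINARY, ColumnType.BINARY};
        int perCall = countIgnoredPerCall("old-writer 1.0", 4, columns);
        int parsesPerCall = versionParses.getAndSet(0);
        int hoisted = countIgnoredHoisted("old-writer 1.0", 4, columns);
        int parsesHoisted = versionParses.get();
        System.out.println("same result: " + (perCall == hoisted));
        System.out.println("version parses: " + parsesPerCall + " vs " + parsesHoisted);
    }
}
```

Both paths return the same per-column decisions; only the number of version parses differs. Keeping the `String createdBy` overload as a thin delegate means existing callers compile and behave as before.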

The page-level path in ParquetFileReader.Chunk.readAllPages() also calls shouldIgnoreStatistics via fromParquetStatisticsInternal, but is not in scope for this fix — at read time the cost is masked by I/O, decompression, and decoding. The footer path during file open is where R×C calls happen in a tight loop with no I/O to amortize the cost.

Profiling data
shouldIgnoreStatistics accounts for ~65% of fromParquetMetadata CPU time across multiple samples.

Component(s)

Core
