parquet: Add tests for disabling statistics on multiple columns #16639
Open
algojogacor wants to merge 1 commit into
Add two new test cases to verify that per-column statistics configuration works correctly when stats are disabled on more than one column:
- testColumnStatisticsDisabledMultipleColumns: verifies that when stats are enabled on one column and disabled on two others, both disabled columns correctly omit statistics in the Parquet file.
- testColumnStatisticsDisabledAllColumns: verifies that when stats are disabled on every column, all columns correctly omit statistics.
These tests address the edge case reported in issue apache#15347, where disabling statistics across multiple columns could result in some columns still writing statistics.
This is an independent contribution and is not associated with any hackathon or competition.
Summary
Adds two new test methods to TestParquet.java covering the edge case where per-column statistics are disabled on more than one column simultaneously.
Root Cause
The existing test testColumnStatisticsEnabled only validates the scenario where statistics are disabled on a single column. Issue #15347 reported that when statistics are disabled on multiple columns (e.g., write.parquet.stats-enabled.column.foo=false and write.parquet.stats-enabled.column.bar=false), the second column may still write statistics. These tests provide coverage for that scenario.
Changes
- testColumnStatisticsDisabledMultipleColumns: Schema with 3 columns (int, string, double); stats enabled on col 1, disabled on cols 2 and 3. Verifies both disabled columns correctly omit statistics.
- testColumnStatisticsDisabledAllColumns: Schema with 2 columns; stats disabled on both. Verifies every column omits statistics.
Testing
Both new tests follow the same pattern as the existing testColumnStatisticsEnabled: write Parquet data with the given properties, read back the footer, and assert that each ColumnChunkMetaData has the expected statistics state (empty or non-empty).
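For readers unfamiliar with the per-column properties, the expectation these tests encode can be sketched without any Parquet dependency. The following is a minimal, hypothetical model and not the code in this PR or Iceberg's implementation: the class StatsConfigSketch, the method resolve, and the column names id, name, and score are invented for illustration, and the default of "enabled" for columns with no explicit property is an assumption. Only the property prefix comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: models how per-column stats flags should resolve.
class StatsConfigSketch {
  // Prefix taken from the property names quoted in the PR description.
  static final String PREFIX = "write.parquet.stats-enabled.column.";

  // Assumption: a column with no explicit property keeps statistics enabled.
  static final boolean DEFAULT_ENABLED = true;

  // Resolves each column independently, so disabling one column
  // cannot mask or override the setting of another.
  static Map<String, Boolean> resolve(List<String> columns, Map<String, String> props) {
    Map<String, Boolean> enabled = new LinkedHashMap<>();
    for (String column : columns) {
      String value = props.get(PREFIX + column);
      enabled.put(column, value == null ? DEFAULT_ENABLED : Boolean.parseBoolean(value));
    }
    return enabled;
  }

  public static void main(String[] args) {
    // Mirrors the three-column case: stats on for col 1, off for cols 2 and 3.
    Map<String, String> props = new LinkedHashMap<>();
    props.put(PREFIX + "id", "true");
    props.put(PREFIX + "name", "false");
    props.put(PREFIX + "score", "false");

    System.out.println(resolve(List.of("id", "name", "score"), props));
  }
}
```

The actual tests go further than this model: they write a file and inspect each ColumnChunkMetaData in the footer, which is what catches a writer that honors only the first disabled column.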