Skip to content

parquet: Add tests for disabling statistics on multiple columns#16639

Open
algojogacor wants to merge 1 commit into
apache:mainfrom
algojogacor:fix/multi-column-stats-disabled
Open

parquet: Add tests for disabling statistics on multiple columns#16639
algojogacor wants to merge 1 commit into
apache:mainfrom
algojogacor:fix/multi-column-stats-disabled

Conversation

@algojogacor
Copy link
Copy Markdown

This is an independent contribution and is not associated with any hackathon or competition.

Summary

Adds two new test methods to TestParquet.java covering the edge case where per-column statistics are disabled on more than one column simultaneously.

Root Cause

The existing test testColumnStatisticsEnabled only validates the single-column-disabled scenario. Issue #15347 reported that when statistics are disabled on multiple columns (e.g., write.parquet.stats-enabled.column.foo=false and write.parquet.stats-enabled.column.bar=false), the second column may still write statistics. These tests provide coverage for that scenario.

Changes

  • testColumnStatisticsDisabledMultipleColumns: Schema with 3 columns (int, string, double); stats enabled on col 1, disabled on cols 2 and 3. Verifies both disabled columns correctly omit statistics.
  • testColumnStatisticsDisabledAllColumns: Schema with 2 columns; stats disabled on both. Verifies every column omits statistics.

Testing

Both new tests follow the same pattern as the existing testColumnStatisticsEnabled: write Parquet data with the given properties, read back the footer, and assert that each ColumnChunkMetaData has the expected statistics state (empty or non-empty).

Add two new test cases to verify per-column statistics configuration
works correctly when stats are disabled on more than one column:

- testColumnStatisticsDisabledMultipleColumns: verifies that when stats
  are enabled on one column and disabled on two others, both disabled
  columns correctly omit statistics in the Parquet file.

- testColumnStatisticsDisabledAllColumns: verifies that when stats are
  disabled on every column, all columns correctly omit statistics.

These tests address the edge case reported in issue apache#15347, where
disabling statistics across multiple columns could result in some
columns still writing statistics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant