GH-3598: Expose `getRowRanges(int)` and add `getCompressedBytesForRowRanges` by peter-toth · Pull Request #3599 · apache/parquet-java

peter-toth · 2026-06-05T13:57:19Z

Rationale for this change

This PR is based on @mbutrovich's previous work.

This PR opens up APIs needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need two things to plan I/O without reading column data: (a) the column-index-derived row ranges that may pass the configured filter for a row group, and (b) a metadata-only estimate of the on-disk compressed bytes that those ranges correspond to for the currently requested columns.

What changes are included in this PR?

- `getRowRanges(int blockIndex)`: made public; returns the row ranges that may pass the configured filter. With no filter, it shortcuts to all rows of the row group.
- `getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges)`: metadata-only sum of the compressed sizes of the pages that overlap the given row ranges, across the reader's currently requested columns. Dictionary pages are not represented in `OffsetIndex` and are therefore excluded.
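The byte estimate described above can be illustrated with a minimal, self-contained sketch. It is not the PR's implementation and does not use parquet-java classes: `Page`, `Range`, `allRows`, and `compressedBytes` are hypothetical stand-ins for an offset-index entry, a row range, the no-filter shortcut, and the per-column page-size sum. It assumes inclusive row ranges and models a single column chunk.

```java
class RowRangeBytesSketch {
    /** Hypothetical stand-in for one data-page entry of an offset index. */
    static final class Page {
        final long firstRow;       // index of the page's first row within the row group
        final long compressedSize; // on-disk size of the page in bytes
        Page(long firstRow, long compressedSize) {
            this.firstRow = firstRow;
            this.compressedSize = compressedSize;
        }
    }

    /** Hypothetical stand-in for one row range; both bounds are inclusive. */
    static final class Range {
        final long from;
        final long to;
        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** No-filter case: a single range covering every row of the row group. */
    static Range[] allRows(long rowCount) {
        return rowCount == 0 ? new Range[0] : new Range[] {new Range(0, rowCount - 1)};
    }

    /** Sums the compressed sizes of the pages that overlap at least one range. */
    static long compressedBytes(Page[] pages, long rowCount, Range[] ranges) {
        long total = 0;
        for (int i = 0; i < pages.length; i++) {
            long first = pages[i].firstRow;
            // A page ends just before the next page starts, or at the end of the row group.
            long last = (i + 1 < pages.length ? pages[i + 1].firstRow : rowCount) - 1;
            for (Range r : ranges) {
                if (r.from <= last && r.to >= first) {
                    total += pages[i].compressedSize;
                    break; // count each page once, even if several ranges overlap it
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three data pages covering rows 0-49, 50-99, and 100-149.
        Page[] pages = {new Page(0, 100), new Page(50, 120), new Page(100, 80)};
        long rowCount = 150;
        System.out.println(compressedBytes(pages, rowCount, allRows(rowCount)));                  // 300
        System.out.println(compressedBytes(pages, rowCount, new Range[] {new Range(40, 60)}));    // 220
        System.out.println(compressedBytes(pages, rowCount, new Range[0]));                       // 0
    }
}
```

In this model, a range of rows 40 to 60 touches the first two pages, so the estimate is 100 + 120 = 220 bytes even though only 21 rows are requested: the estimate is page-granular. If the same chunk also had, say, a 30-byte dictionary page, the column-chunk total would be 330 while the full-range sum stays at 300, because the dictionary page has no entry in the page list. For several requested columns, the per-column results would simply be added together.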

Are these changes tested?

Yes. `TestParquetFileReaderRowRanges` covers four cases: with no filter, the row ranges cover all rows; empty ranges short-circuit to 0; full ranges equal the per-page `OffsetIndex` sum and are strictly less than the column-chunk total (which demonstrates that dictionary pages are excluded); and partial ranges fall between 0 and the full total.

Are there any user-facing changes?

No.

Closes #3598

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>

peter-toth changed the title ~~GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges~~ GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges Jun 5, 2026

peter-toth force-pushed the GH-3598-parquet-reader-row-range-apis branch from dc0e426 to 6d3427b Compare June 5, 2026 14:05

peter-toth wants to merge 1 commit into apache:master from peter-toth:GH-3598-parquet-reader-row-range-apis
