GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges#3599

Open
peter-toth wants to merge 1 commit into apache:master from peter-toth:GH-3598-parquet-reader-row-range-apis

@peter-toth

Rationale for this change

This PR is based on @mbutrovich's previous work.

The change opens up APIs needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need (a) the column-index-derived row ranges that may pass the configured filter for a row group, and (b) a metadata-only estimate of the on-disk compressed bytes that those ranges correspond to for the currently requested columns, so that they can plan I/O without reading column data.

What changes are included in this PR?

  • getRowRanges(int blockIndex): made public; returns the row ranges that may pass the configured filter. With no filter, it short-circuits to all rows of the row group.
  • getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges): metadata-only sum of the compressed sizes of the pages that overlap the given row ranges, taken across the reader's currently requested columns. Dictionary pages are not represented in OffsetIndex and are therefore excluded.
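
To make the byte estimate concrete, here is a minimal, self-contained sketch of the page-overlap sum for a single column. It is not the code in this PR and does not use parquet-java types: `RowRangeBytesSketch`, `Page`, `Range`, and `compressedBytes` are hypothetical names introduced only for illustration. `Page` stands in for one OffsetIndex entry (first row index plus compressed page size), and only data pages are modeled, which mirrors the dictionary-page exclusion described above. The real method would additionally sum over all currently requested columns.

```java
import java.util.List;

public class RowRangeBytesSketch {
    /** Hypothetical stand-in for one OffsetIndex entry of a data page. */
    public static final class Page {
        final long firstRow;        // index of the first row stored in the page
        final long compressedSize;  // on-disk compressed size of the page in bytes

        public Page(long firstRow, long compressedSize) {
            this.firstRow = firstRow;
            this.compressedSize = compressedSize;
        }
    }

    /** Hypothetical row range with inclusive bounds [from, to]. */
    public static final class Range {
        final long from;
        final long to;

        public Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Sums the compressed sizes of the pages that overlap at least one range.
     * Pages must be ordered by firstRow; rowCount is the row group's row count.
     */
    public static long compressedBytes(List<Page> pages, long rowCount, List<Range> ranges) {
        long total = 0;
        for (int i = 0; i < pages.size(); i++) {
            long first = pages.get(i).firstRow;
            // A page ends just before the next page starts, or at the last row of the row group.
            long last = (i + 1 < pages.size() ? pages.get(i + 1).firstRow : rowCount) - 1;
            for (Range r : ranges) {
                if (r.from <= last && first <= r.to) {
                    total += pages.get(i).compressedSize;
                    break; // count each page once, even if several ranges touch it
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three pages of 10 rows each in a 30-row row group (made-up sizes).
        List<Page> pages = List.of(new Page(0, 100), new Page(10, 120), new Page(20, 90));
        long rowCount = 30;

        System.out.println(compressedBytes(pages, rowCount, List.of()));                   // 0
        System.out.println(compressedBytes(pages, rowCount, List.of(new Range(0, 29))));   // 310
        System.out.println(compressedBytes(pages, rowCount, List.of(new Range(12, 14))));  // 120
    }
}
```

A whole page is counted as soon as any of its rows falls inside a range, so the result is an upper bound on the bytes needed for the selected rows of that column rather than an exact figure.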

Are these changes tested?

Yes. TestParquetFileReaderRowRanges covers: no-filter row ranges cover all rows, empty ranges short-circuit to 0, full ranges equal the per-page OffsetIndex sum and are strictly less than the column-chunk total (proving dictionary-page exclusion), and partial ranges fall between 0 and the full total.

Are there any user-facing changes?

No.

Closes #3598

@peter-toth peter-toth changed the title GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges Jun 5, 2026
…RowRanges

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
@peter-toth peter-toth force-pushed the GH-3598-parquet-reader-row-range-apis branch from dc0e426 to 6d3427b Compare June 5, 2026 14:05
Development

Successfully merging this pull request may close these issues.

Expose row-range planning APIs on ParquetFileReader (getRowRanges, getCompressedBytesForRowRanges)