Expose row-range planning APIs on ParquetFileReader (getRowRanges, getCompressedBytesForRowRanges) #3598

@peter-toth

Describe the enhancement requested

ParquetFileReader already computes, per row group, the RowRanges that may pass the configured filter — but only as a private step on the internal read path (getRowRanges(int) feeding readFilteredRowGroup). There is also no supported way to ask how many compressed bytes a given set of row ranges corresponds to without actually reading the column data.

This makes it hard for an external reader to plan I/O up front. A concrete motivating case is a materialization path (e.g. a Spark-side scanner) that wants to (a) obtain the column-index-derived row ranges that may pass the filter for a row group, and (b) get a metadata-only estimate of the on-disk compressed bytes those ranges correspond to for the currently requested columns — so it can size buffers and schedule reads without first touching column data.

Proposed change

Two additive APIs on ParquetFileReader, both metadata-only (they consult the column/offset indexes from the footer; no column data is read):

  • public RowRanges getRowRanges(int blockIndex) — promote the existing private method to public. It returns the row ranges within the row group that may pass the configured filter. When no filter is configured it short-circuits to a RowRanges covering all rows in the row group (the previous private version asserted a filter was present, since it was only ever reached on the filtering path).

  • public long getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges) — sum of compressed page sizes (OffsetIndex.getCompressedPageSize, which includes the page header) across the reader's currently requested columns, for pages whose row range intersects rowRanges. Returns 0 for empty ranges or when no columns are requested. Throws MissingOffsetIndexException if a requested column lacks an offset index.

    Note: dictionary pages are not represented in OffsetIndex, so they are excluded from the sum. The result is therefore a lower bound on the actual on-disk bytes for those columns/rows, short by exactly the dictionary-page contribution.
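
To make the intended semantics of getCompressedBytesForRowRanges concrete, here is a minimal, self-contained sketch of the summation described above. It is not the proposed implementation: PageEntry, Range, and compressedBytesFor are hypothetical stand-ins for the per-page information an OffsetIndex provides and for RowRanges, and the sketch assumes inclusive row bounds.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; none of these types are parquet-java classes.
final class RowRangeBytesSketch {

    /** One data page as an offset index might describe it (assumed shape). */
    static final class PageEntry {
        final long firstRow;       // first row of the page within the row group
        final long lastRow;        // last row of the page, inclusive
        final long compressedSize; // compressed page size, page header included

        PageEntry(long firstRow, long lastRow, long compressedSize) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.compressedSize = compressedSize;
        }
    }

    /** Inclusive row range [from, to]; a stand-in for one range inside RowRanges. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** True if the page shares at least one row with any of the ranges. */
    static boolean overlaps(PageEntry page, List<Range> ranges) {
        for (Range r : ranges) {
            if (page.firstRow <= r.to && r.from <= page.lastRow) {
                return true;
            }
        }
        return false;
    }

    /**
     * Sums compressed sizes of all pages, across the requested columns, that
     * intersect the ranges. Returns 0 for empty ranges or no columns.
     * Dictionary pages are not modeled, mirroring the note above.
     */
    static long compressedBytesFor(List<List<PageEntry>> columns, List<Range> ranges) {
        long total = 0;
        for (List<PageEntry> pages : columns) {
            for (PageEntry page : pages) {
                if (overlaps(page, ranges)) {
                    total += page.compressedSize; // each page counted once per column
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<PageEntry> colA = Arrays.asList(
            new PageEntry(0, 99, 1000), new PageEntry(100, 199, 1200), new PageEntry(200, 299, 900));
        List<PageEntry> colB = Arrays.asList(
            new PageEntry(0, 149, 500), new PageEntry(150, 299, 700));
        List<Range> ranges = Arrays.asList(new Range(120, 130), new Range(250, 260));

        // colA contributes 1200 + 900, colB contributes 500 + 700.
        System.out.println(compressedBytesFor(Arrays.asList(colA, colB), ranges)); // prints 3300
    }
}
```

Note that a page is counted in full even when only a few of its rows fall inside a range, since a page is the smallest unit that can be fetched and decompressed.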

Scope

  • Additive, Core only. The only behavioral change is that getRowRanges now handles the no-filter case (returning all rows) instead of asserting; all existing callers already guard with a filter check, so they are unaffected.
  • No user-facing API removal.
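
The no-filter behavior change can be illustrated with a small sketch. This is not the ParquetFileReader code: rowRanges and its filterOrNull parameter are hypothetical, with the filter modeled as a plain function and row ranges as inclusive [from, to] pairs. Returning an empty list for an empty row group is likewise an assumption of the sketch.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.LongFunction;

// Hypothetical stand-in for illustration; not the actual reader implementation.
final class NoFilterRowRangesSketch {

    /**
     * Returns inclusive [from, to] row ranges for a row group with rowCount rows.
     * filterOrNull models the configured filter; null means no filter is configured.
     */
    static List<long[]> rowRanges(long rowCount, LongFunction<List<long[]>> filterOrNull) {
        if (filterOrNull == null) {
            // No filter: every row may pass, so return one range covering the whole
            // row group instead of asserting that a filter exists.
            return rowCount == 0
                ? Collections.<long[]>emptyList() // assumption: empty row group yields no ranges
                : Collections.singletonList(new long[] {0, rowCount - 1});
        }
        // Filter present: delegate to the index-based computation (modeled as a function).
        return filterOrNull.apply(rowCount);
    }

    public static void main(String[] args) {
        List<long[]> all = rowRanges(300, null);
        System.out.println(all.get(0)[0] + ".." + all.get(0)[1]); // prints 0..299
    }
}
```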

This is the second of two related enhancements opening up RowRanges/reader APIs needed by the materialization feature described above. The first (#3596) added RowRanges.Builder for incremental construction from selected row indices.

Component(s):

Core
