Skip to content

feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads#3461

Open
GayathriSrividya wants to merge 3 commits into
apache:mainfrom
GayathriSrividya:feat/issue-3170-dictionary-columns-scan
Open

feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads#3461
GayathriSrividya wants to merge 3 commits into
apache:mainfrom
GayathriSrividya:feat/issue-3170-dictionary-columns-scan

Conversation

@GayathriSrividya
Copy link
Copy Markdown

Closes #3170

Rationale

Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its dictionary_columns kwarg, which deduplicates values and can dramatically reduce peak memory usage.

This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

Changes

  • Added dictionary_columns: tuple[str, ...] = () to Table.scan(), TableScan.__init__, and StagedTable.scan().
  • Forwarded through DataScan.to_arrow() and to_arrow_batch_reader()ArrowScan.__init___task_to_record_batches_get_file_format().
  • Only applied when task.file.file_format == FileFormat.PARQUET; silently ignored for ORC (which does not support this kwarg).

Usage

# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()

Verification

  • Added test_dictionary_columns_produces_dict_encoded_output — confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
  • make lint
  • pytest tests/table/ tests/io/test_pyarrow.py

…icient reads

Columns that contain large or frequently repeated strings (e.g. JSON
blobs, low-cardinality categoricals) can exhaust memory when PyArrow
loads them as plain string arrays.  PyArrow's Parquet reader supports
reading such columns as dictionary-encoded arrays, which deduplicates
values and can dramatically reduce memory usage.

Add a dictionary_columns: tuple[str, ...] parameter to Table.scan()
(and the underlying TableScan / ArrowScan classes) that is forwarded
to _get_file_format() as PyArrow's dictionary_columns kwarg.  Only
applies to Parquet files; silently ignored for ORC.

Usage:
    table.scan(dictionary_columns=("payload",)).to_arrow()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a meta question, but I don't think this helps for large JSON blobs since it would directly exceed the dictionary limit and fall back to plain encoding. I do think this helps a lot for low-cardinality strings.

Do we know how Arrow decodes the data? For example, if the Parquet column is dictionary encoded, would Arrow do something smart with the buffers to not repeat this value many times?

Comment thread pyiceberg/table/__init__.py Outdated
An integer representing the number of rows to
return in the scan result. If None, fetches all
matching rows.
dictionary_columns:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to add Arrow specific things to the public API

Copy link
Copy Markdown
Author

@GayathriSrividya GayathriSrividya Jun 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @Fokko I have moved dictionary_columns off the public scan() API and onto the Arrow-specific output methods instead:

table.scan(...).to_arrow(dictionary_columns=("payload",))
table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",))

That way it doesn't pollute the general scan interface. ArrowScan still accepts it for lower-level use. Pushed in the latest commit. Let me know if you have any further suggestions.

…w_batch_reader()

Addresses reviewer feedback that dictionary_columns is an Arrow-specific
concern and should not be part of the general-purpose scan() public API.

The parameter is now accepted directly by the Arrow output methods:
  table.scan(...).to_arrow(dictionary_columns=("payload",))
  table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",))

ArrowScan still accepts dictionary_columns for lower-level use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@GayathriSrividya GayathriSrividya changed the title feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature request: pass optional parameters to DataScan/pyarrow

2 participants