feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads by GayathriSrividya · Pull Request #3461 · apache/iceberg-python

GayathriSrividya · 2026-06-05T07:11:48Z

Rationale

Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its dictionary_columns kwarg, which deduplicates values and can dramatically reduce peak memory usage.

This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

Changes

Added dictionary_columns: tuple[str, ...] = () to Table.scan(), TableScan.__init__, and StagedTable.scan().
Forwarded through DataScan.to_arrow() and to_arrow_batch_reader() → ArrowScan.__init__ → _task_to_record_batches → _get_file_format().
Only applied when task.file.file_format == FileFormat.PARQUET; silently ignored for ORC (which does not support this kwarg).

Usage

# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()

Verification

Added test_dictionary_columns_produces_dict_encoded_output — confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
make lint ✓
pytest tests/table/ tests/io/test_pyarrow.py ✓

…icient reads Columns that contain large or frequently repeated strings (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader supports reading such columns as dictionary-encoded arrays, which deduplicates values and can dramatically reduce memory usage. Add a dictionary_columns: tuple[str, ...] parameter to Table.scan() (and the underlying TableScan / ArrowScan classes) that is forwarded to _get_file_format() as PyArrow's dictionary_columns kwarg. Only applies to Parquet files; silently ignored for ORC. Usage: table.scan(dictionary_columns=("payload",)).to_arrow() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fokko

More of a meta question, but I don't think this helps for large JSON blobs since it would directly exceed the dictionary limit and fall back to plain encoding. I do think this helps a lot for low-cardinality strings.

Do we know how Arrow decodes the data? For example, if the Parquet column is dictionary encoded, would Arrow do something smart with the buffers to not repeat this value many times?

Fokko · 2026-06-06T07:18:43Z

                An integer representing the number of rows to
                return in the scan result. If None, fetches all
                matching rows.
+            dictionary_columns:


I'm hesitant to add Arrow specific things to the public API

Good point @Fokko I have moved dictionary_columns off the public scan() API and onto the Arrow-specific output methods instead:

table.scan(...).to_arrow(dictionary_columns=("payload",)) table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",))

That way it doesn't pollute the general scan interface. ArrowScan still accepts it for lower-level use. Pushed in the latest commit. Let me know if you have any further suggestions.

…w_batch_reader() Addresses reviewer feedback that dictionary_columns is an Arrow-specific concern and should not be part of the general-purpose scan() public API. The parameter is now accepted directly by the Arrow output methods: table.scan(...).to_arrow(dictionary_columns=("payload",)) table.scan(...).to_arrow_batch_reader(dictionary_columns=("payload",)) ArrowScan still accepts dictionary_columns for lower-level use. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fokko reviewed Jun 6, 2026

View reviewed changes

GayathriSrividya changed the title ~~feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads~~ feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads Jun 6, 2026

ci: re-trigger CI after transient runner cache failure

e429e48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads#3461

feat: add dictionary_columns to to_arrow() / to_arrow_batch_reader() for memory-efficient reads#3461
GayathriSrividya wants to merge 3 commits into
apache:mainfrom
GayathriSrividya:feat/issue-3170-dictionary-columns-scan

GayathriSrividya commented Jun 5, 2026

Uh oh!

Fokko left a comment

Uh oh!

Fokko Jun 6, 2026

Uh oh!

GayathriSrividya Jun 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

GayathriSrividya commented Jun 5, 2026

Rationale

Changes

Usage

Verification

Uh oh!

Fokko left a comment

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

GayathriSrividya Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

GayathriSrividya Jun 6, 2026 •

edited

Loading