Skip to content

feat: Add support for RollingManifestWriter#3009

Open
geruh wants to merge 1 commit intoapache:mainfrom
geruh:rollman
Open

feat: Add support for RollingManifestWriter#3009
geruh wants to merge 1 commit intoapache:mainfrom
geruh:rollman

Conversation

@geruh
Copy link
Member

@geruh geruh commented Feb 8, 2026

Rationale for this change

Adds support for the RollingManifestWriter to split large commits into multiple manifest files instead of one giant file. This PR wraps the ManifestWriter and follows the Java implemenation of checking the size every 250 entries. The bulk of this work was done in #650.

Next step is wiring this into fast_append, but it's also useful for manifest repair operations, like deduplicating entries and rewriting manifests without blowing up the output sizes

Are these changes tested?

Yes

Are there any user-facing changes?

No

Copy link
Contributor

@jayceslesar jayceslesar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches the java it looks like to me

class RollingManifestWriter:
"""As opposed to ManifestWriter, a rolling writer could produce multiple manifest files."""

_ROWS_DIVISOR = 250
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java we target at the size, controlled by commit.manifest.target-size-bytes. Any chance we can do something similar here? Or at least have some rationale behind this number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a direct port of the java implementations RollingmanifestWriter.ROW_DIVISOR.

this added divisor is basically a interval where we check the file size against target-size-bytes every 250 entries rather than on every write. But after some digging it looks like in PyIceberg tell() on both the PyArrow and fsspec OutputStream implementations is essentially free and use their own counters so maybe we can simplify this check and ignore the ROW_DIVISOR, but still doesn't hurt to have wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants