[python] Add StreamReadBuilder and AsyncStreamingTableScan#7350
Closed
tub wants to merge 5 commits into apache:master from
This was referenced Mar 5, 2026
tub added a commit to tub/paimon that referenced this pull request, Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and SnapshotManager — they have near-zero hit rates in practice (batch reads create new manager instances; streaming reads see unique manifest names per snapshot). Caching will be re-added in PR apache#7350 where streaming actually benefits. Remove find_next_scannable and get_snapshots_batch from SnapshotManager as they have zero callers on this branch. They will be added where needed in downstream PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
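The near-zero hit rate described in the commit message above can be reproduced with a toy model. This is only a sketch: `ToyManifestManager`, `batch_read_hits`, and `streaming_read_hits` are hypothetical names, not pypaimon classes, and the per-instance cache is an assumption about how the removed caches were scoped.

```python
from functools import lru_cache


class ToyManifestManager:
    """Hypothetical stand-in for a manager that caches reads per instance."""

    def __init__(self):
        # Each instance gets its own cache, so nothing is shared across instances.
        self.read = lru_cache(maxsize=128)(self._read_uncached)

    def _read_uncached(self, manifest_name):
        # Placeholder for an expensive file read.
        return f"contents of {manifest_name}"


def batch_read_hits(num_reads):
    """Batch pattern: a fresh manager per read, so the cache always starts empty."""
    hits = 0
    for _ in range(num_reads):
        manager = ToyManifestManager()
        manager.read("manifest-0")
        hits += manager.read.cache_info().hits
    return hits


def streaming_read_hits(num_snapshots):
    """Streaming pattern: one manager, but every snapshot names a new manifest."""
    manager = ToyManifestManager()
    for snapshot_id in range(num_snapshots):
        manager.read(f"manifest-{snapshot_id}")
    return manager.read.cache_info().hits
```

In both patterns the cache is populated but never consulted for a key it already holds, which is the situation the commit message gives as the reason for removing it.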
a1d9372 to dfbdb01
dfbdb01 to d987c80
9119001 to 834c885
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit d2e5cdc.
…uilder, and acceptance tests

Add core streaming read support for Paimon Python:
- AsyncStreamingTableScan with async/sync iterators, prefetching, lookahead, and diff-based catch-up
- StreamReadBuilder for configuring streaming reads with filter, projection, bucket sharding
- IncrementalDiffScanner acceptance tests verifying diff vs delta equivalence
- Streaming docs section in python-api.md
- Table integration: new_stream_read_builder() on FileStoreTable, FormatTable, IcebergTable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
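For readers unfamiliar with the "async/sync iterators, prefetching" pattern named above, here is a minimal sketch using only `asyncio`. It is not the PR's implementation: `fetch_plan`, `stream_plans`, and `collect_plans` are hypothetical names, and the bounded-queue design is an assumption about one common way to prefetch.

```python
import asyncio


async def fetch_plan(snapshot_id):
    """Hypothetical stand-in for planning one snapshot's splits."""
    await asyncio.sleep(0)  # simulate I/O
    return f"plan-for-snapshot-{snapshot_id}"


async def stream_plans(start_id, end_id, prefetch=2):
    """Yield plans in snapshot order while a background task fetches ahead."""
    queue = asyncio.Queue(maxsize=prefetch)  # bounded: caps how far we read ahead
    done = object()  # sentinel marking the end of the stream

    async def producer():
        for snapshot_id in range(start_id, end_id):
            await queue.put(await fetch_plan(snapshot_id))
        await queue.put(done)

    task = asyncio.create_task(producer())
    try:
        while True:
            item = await queue.get()
            if item is done:
                break
            yield item
    finally:
        task.cancel()  # stop fetching if the consumer exits early


def collect_plans(start_id, end_id):
    """Synchronous wrapper: drive the async iterator to completion."""
    async def run():
        return [plan async for plan in stream_plans(start_id, end_id)]
    return asyncio.run(run())
```

The bounded queue is what limits the lookahead: the producer blocks once `prefetch` plans are waiting, so memory use stays flat when the consumer is slow. Error propagation from the producer is left out of this sketch.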
Connect ConsumerManager to AsyncStreamingTableScan so that streaming read progress is persisted and restored across restarts. Add with_consumer_id() to StreamReadBuilder and wire it through to the scan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
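The persist-and-restore behaviour described above can be illustrated with a small file-backed store. This is a sketch under stated assumptions: `ToyConsumerStore`, `consume`, the JSON layout, and the file naming are all hypothetical and are not taken from ConsumerManager.

```python
import json
import os
import tempfile


class ToyConsumerStore:
    """Hypothetical file-backed store: one JSON file per consumer ID."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, consumer_id):
        return os.path.join(self.directory, f"consumer-{consumer_id}.json")

    def load_next_snapshot(self, consumer_id, default):
        """Return the saved position, or `default` when the consumer is new."""
        try:
            with open(self._path(consumer_id)) as f:
                return json.load(f)["next_snapshot"]
        except FileNotFoundError:
            return default

    def save_next_snapshot(self, consumer_id, next_snapshot):
        """Write to a temp file, then rename, so a crash never leaves a partial file."""
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"next_snapshot": next_snapshot}, f)
        os.replace(tmp_path, self._path(consumer_id))


def consume(store, consumer_id, latest_snapshot):
    """Process snapshots from the saved position up to `latest_snapshot` inclusive."""
    processed = []
    next_snapshot = store.load_next_snapshot(consumer_id, default=1)
    while next_snapshot <= latest_snapshot:
        processed.append(next_snapshot)
        next_snapshot += 1
        store.save_next_snapshot(consumer_id, next_snapshot)
    return processed
```

A second call with the same consumer ID picks up where the first stopped, even from a new store object, which models a restart. A different consumer ID starts from the default position.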
744aa4d to d8ecc0d
Contributor, Author:
Splitting this into smaller PRs. Starting with #7417
Summary
- StreamReadBuilder for configuring streaming reads with consumer registration and sharding
- AsyncStreamingTableScan for async iteration over incremental table changes
- FileStoreTable, Table, FormatTable, IcebergTable
- python-api.md
- setup.py

Stacked PR series
This is PR 2/5 in the Python streaming read series:
paimon tail)

Incremental diff (vs 1c): python-streaming-1c-consumer...tub:paimon:python-streaming-2-core-v2
Test plan
- flake8 passes on all changed files
- python -m pytest passes
- streaming_table_scan_test.py, stream_read_builder_sharding_test.py, incremental_diff_acceptance_test.py
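As background for the sharding tests listed above, bucket sharding can be sketched as a simple partition of bucket IDs across parallel readers. The modulo rule and the name `buckets_for_shard` are assumptions for illustration; the PR may assign buckets differently.

```python
def buckets_for_shard(num_buckets, shard_index, num_shards):
    """Return the bucket IDs one reader owns under a modulo assignment (assumed rule)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    # Every bucket lands in exactly one shard because b % num_shards is unique per b.
    return [b for b in range(num_buckets) if b % num_shards == shard_index]
```

Under this rule the shards are disjoint and together cover every bucket, so running one reader per shard index reads each bucket exactly once.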