[python] Add StreamReadBuilder and AsyncStreamingTableScan #7350

Closed
tub wants to merge 5 commits into apache:master from tub:python-streaming-2-core-v2

tub commented Mar 5, 2026

Summary

  • Add StreamReadBuilder for configuring streaming reads with consumer registration and sharding
  • Add AsyncStreamingTableScan for async iteration over incremental table changes
  • Add streaming read support to FileStoreTable, Table, FormatTable, IcebergTable
  • Add acceptance test for incremental diff reads
  • Add streaming documentation to python-api.md
  • Add oss/lance extras to setup.py
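
The async iteration pattern named in the summary can be illustrated with a minimal, self-contained sketch. It does not use the classes from this PR: `InMemorySnapshotLog`, `stream_changes`, and the parameters `start_after`, `poll_interval`, and `max_idle_polls` are hypothetical names chosen for illustration, and the row strings are placeholders. The sketch only shows the general shape of an async generator that yields each new snapshot's changes in order and polls when it has caught up.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


class InMemorySnapshotLog:
    """Hypothetical stand-in for a table's snapshot history (not a Paimon class)."""

    def __init__(self) -> None:
        self._changes: Dict[int, List[str]] = {}

    def commit(self, rows: List[str]) -> int:
        # Snapshot ids are assigned sequentially, starting at 1.
        snapshot_id = len(self._changes) + 1
        self._changes[snapshot_id] = rows
        return snapshot_id

    def latest_id(self) -> int:
        return len(self._changes)

    def changes(self, snapshot_id: int) -> List[str]:
        return self._changes[snapshot_id]


async def stream_changes(
    log: InMemorySnapshotLog,
    start_after: int = 0,
    poll_interval: float = 0.01,
    max_idle_polls: int = 3,
) -> AsyncIterator[Tuple[int, List[str]]]:
    """Yield (snapshot_id, rows) for every snapshot after `start_after`.

    A real streaming scan would poll indefinitely; this sketch stops after
    `max_idle_polls` empty polls so that the example terminates.
    """
    next_id = start_after + 1
    idle = 0
    while idle < max_idle_polls:
        if next_id <= log.latest_id():
            idle = 0
            yield next_id, log.changes(next_id)
            next_id += 1
        else:
            idle += 1
            await asyncio.sleep(poll_interval)


async def collect(log: InMemorySnapshotLog) -> List[Tuple[int, List[str]]]:
    seen = []
    async for snapshot_id, rows in stream_changes(log):
        seen.append((snapshot_id, rows))
    return seen


log = InMemorySnapshotLog()
log.commit(["+I a"])
log.commit(["+I b", "-D a"])
result = asyncio.run(collect(log))
print(result)
```

The consumer sees the two snapshots in commit order, then the generator exits once it has polled three times without finding new data.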

Stacked PR series

This is PR 2/5 in the Python streaming read series:

  • PR 1a: Caching infrastructure + utilities
  • PR 1b: Scanners, sharding, row kind
  • PR 1c: Consumer management
  • PR 2 (this): Core streaming (~2717 lines)
  • PR 3: CLI (paimon tail)

Incremental diff (vs 1c): python-streaming-1c-consumer...tub:paimon:python-streaming-2-core-v2

Test plan

  • flake8 passes on all changed files
  • python -m pytest passes
  • New tests: streaming_table_scan_test.py, stream_read_builder_sharding_test.py, incremental_diff_acceptance_test.py

tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
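
The hit-rate argument in this commit message can be reproduced with the standard library alone. The sketch below is an illustration, not code from the PR: `read_manifest` and the `manifest-N` names are hypothetical. It shows that an LRU cache keyed by names that never repeat records only misses, while the same cache pays off when a key is requested repeatedly.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def read_manifest(name: str) -> str:
    # Hypothetical loader; a real one would read a manifest file from storage.
    return f"contents of {name}"


# Case 1: every lookup uses a new name, as when each snapshot has unique manifests.
for snapshot_id in range(1, 6):
    read_manifest(f"manifest-{snapshot_id}")
unique_info = read_manifest.cache_info()

# Case 2: the same name is looked up repeatedly.
read_manifest.cache_clear()  # resets both the cached entries and the statistics
for _ in range(5):
    read_manifest("manifest-1")
repeated_info = read_manifest.cache_info()

print(unique_info.hits, unique_info.misses)      # 0 5
print(repeated_info.hits, repeated_info.misses)  # 4 1
```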
tub force-pushed the python-streaming-2-core-v2 branch from a1d9372 to dfbdb01 on March 10, 2026 15:53
tub added a commit to tub/paimon that referenced this pull request Mar 10, 2026
tub force-pushed the python-streaming-2-core-v2 branch from dfbdb01 to d987c80 on March 11, 2026 11:56
tub added a commit to tub/paimon that referenced this pull request Mar 11, 2026
tub force-pushed the python-streaming-2-core-v2 branch 2 times, most recently from 9119001 to 834c885 on March 11, 2026 18:14
tub and others added 5 commits March 12, 2026 13:54
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uilder, and acceptance tests

Add core streaming read support for Paimon Python:
- AsyncStreamingTableScan with async/sync iterators, prefetching, lookahead, and diff-based catch-up
- StreamReadBuilder for configuring streaming reads with filter, projection, bucket sharding
- IncrementalDiffScanner acceptance tests verifying diff vs delta equivalence
- Streaming docs section in python-api.md
- Table integration: new_stream_read_builder() on FileStoreTable, FormatTable, IcebergTable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
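
Bucket sharding, mentioned in the commit message above, is commonly done by assigning each bucket to exactly one reader. The sketch below assumes a simple modulo rule; the PR text does not state which rule `StreamReadBuilder` uses, so `buckets_for_shard` and its parameters are hypothetical and only illustrate the idea that shards are disjoint and together cover every bucket.

```python
from typing import List


def buckets_for_shard(total_buckets: int, shard_index: int, shard_count: int) -> List[int]:
    """Return the buckets owned by one shard under an assumed modulo assignment."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return [b for b in range(total_buckets) if b % shard_count == shard_index]


# Three readers splitting a table with eight buckets.
assignments = [buckets_for_shard(8, i, 3) for i in range(3)]
print(assignments)  # [[0, 3, 6], [1, 4, 7], [2, 5]]

# Every bucket is read by exactly one shard.
covered = sorted(b for shard in assignments for b in shard)
print(covered == list(range(8)))  # True
```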
Connect ConsumerManager to AsyncStreamingTableScan so that streaming
read progress is persisted and restored across restarts. Add
with_consumer_id() to StreamReadBuilder and wire it through to the scan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
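
Persisting and restoring read progress, as this commit describes, amounts to recording the next snapshot to read under a consumer id and loading it on startup. The sketch below is not the `ConsumerManager` from this PR: `FileConsumerProgress`, the `consumer-<id>.json` file name, and the `next_snapshot` field are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class FileConsumerProgress:
    """Hypothetical progress store: one small JSON file per consumer id."""

    def __init__(self, directory: Path, consumer_id: str) -> None:
        self._path = directory / f"consumer-{consumer_id}.json"

    def load(self) -> Optional[int]:
        # None means this consumer has never recorded progress.
        if not self._path.exists():
            return None
        return json.loads(self._path.read_text())["next_snapshot"]

    def save(self, next_snapshot: int) -> None:
        # Write to a temporary file, then rename over the target, so a crash
        # mid-write does not leave a truncated progress file.
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_text(json.dumps({"next_snapshot": next_snapshot}))
        tmp.replace(self._path)


with tempfile.TemporaryDirectory() as d:
    first_run = FileConsumerProgress(Path(d), "demo")
    start = first_run.load()   # no progress recorded yet
    first_run.save(4)          # snapshots 1 to 3 were processed

    # A new object with the same consumer id stands in for a restarted process.
    restarted = FileConsumerProgress(Path(d), "demo")
    resumed = restarted.load()

print(start, resumed)  # None 4
```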
tub force-pushed the python-streaming-2-core-v2 branch from 744aa4d to d8ecc0d on March 12, 2026 14:17

tub commented Mar 12, 2026

Splitting this into smaller PRs. Starting with #7417

tub closed this Mar 12, 2026