
perf: Vectorize get_chunk_slice for faster sharded writes#3713

Open
mkitti wants to merge 21 commits into zarr-developers:main from mkitti:mkitti-get-chunk-slice-vectorization

Conversation

Contributor

@mkitti mkitti commented Feb 17, 2026

Summary

This PR adds vectorized methods to _ShardIndex and _ShardReader for batch chunk slice lookups, significantly reducing per-chunk function call overhead when writing to shards.

Changes

New Methods

_ShardIndex.get_chunk_slices_vectorized: Batch lookup of chunk slices using NumPy vectorized operations instead of per-chunk Python calls.

_ShardReader.to_dict_vectorized: Build a chunk dictionary using vectorized lookup instead of iterating with individual get() calls.
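The batch lookup can be illustrated with a minimal NumPy sketch. The `MAX_UINT_64` sentinel for empty chunk entries follows zarr's sharding-index convention, but the function below is a hypothetical stand-in, not the PR's actual implementation:

```python
import numpy as np

MAX_UINT_64 = np.uint64(2**64 - 1)  # sentinel marking an empty chunk entry

def get_chunk_slices_vectorized(offsets_and_lengths, chunk_coords):
    """Look up (offset, length) entries for many chunks with one NumPy
    fancy-index operation instead of one Python call per chunk.

    offsets_and_lengths: uint64 array of shape chunks_per_shard + (2,)
    chunk_coords: integer array of shape (n, ndim)
    """
    entries = offsets_and_lengths[tuple(chunk_coords.T)]  # shape (n, 2)
    offsets, lengths = entries[:, 0], entries[:, 1]
    empty = offsets == MAX_UINT_64
    return [
        None if is_empty else slice(int(o), int(o + l))
        for o, l, is_empty in zip(offsets, lengths, empty)
    ]
```

The single fancy-index lookup replaces n separate `__getitem__` calls, which is where the per-chunk overhead disappears.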

Modified Code Path

In _encode_partial_single, replaced:

shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}

With vectorized approach:

morton_coords = _morton_order(chunks_per_shard)
chunk_coords_array = np.array(morton_coords, dtype=np.uint64)
shard_dict = shard_reader.to_dict_vectorized(chunk_coords_array, morton_coords)
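A self-contained sketch of the dictionary-building step (hypothetical names and signature; the real method lives on `_ShardReader` and operates on the shard's buffer):

```python
import numpy as np

MAX_UINT_64 = np.uint64(2**64 - 1)  # sentinel marking an empty chunk entry

def to_dict_vectorized(shard_bytes, index, chunk_coords):
    """Build {chunk_coord: bytes_or_None} with one batched index lookup
    instead of iterating with individual get() calls."""
    coords = np.asarray(chunk_coords, dtype=np.uint64)
    entries = index[tuple(coords.T)]  # (n, 2) rows of (offset, length)
    return {
        tuple(int(x) for x in c): (
            None if entry[0] == MAX_UINT_64
            else shard_bytes[int(entry[0]) : int(entry[0] + entry[1])]
        )
        for c, entry in zip(chunk_coords, entries)
    }
```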

Benchmark Results

Single Chunk Write to Large Shard

Writing a single 1x1x1 chunk to a shard with 32³ chunks (using test_sharded_morton_write_single_chunk from PR #3712):

| Optimization | Time | Speedup vs main |
| --- | --- | --- |
| Main branch (original) | 422 ms | - |
| + Morton optimization (PR #3708) | 261 ms | 1.6x |
| + Vectorized get_chunk_slice | 95 ms | 4.4x |

Profile Breakdown

| Function | Before | After |
| --- | --- | --- |
| get_chunk_slice + _localize_chunk | 215 ms | 3 ms |
| to_dict_vectorized loop | 81 ms | 9 ms |
| Total function calls | 299k | 37k |

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

mkitti and others added 21 commits February 13, 2026 00:15
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add vectorized methods to _ShardIndex and _ShardReader for batch
chunk slice lookups, reducing per-chunk function call overhead
when writing to shards.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

mkitti commented Feb 18, 2026

@d-v-b add the benchmark tag here as well please.

@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Feb 18, 2026

codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×7.2

⚡ 4 improved benchmarks
✅ 52 untouched benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| WallTime | test_morton_order_iter[(8, 8, 8)] | 6.2 ms | 1.1 ms | ×5.5 |
| WallTime | test_morton_order_iter[(16, 16, 16)] | 56.2 ms | 8.1 ms | ×6.9 |
| WallTime | test_morton_order_iter[(32, 32, 32)] | 502.6 ms | 70.2 ms | ×7.2 |
| WallTime | test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] | 953.3 ms | 164.4 ms | ×5.8 |

Comparing mkitti:mkitti-get-chunk-slice-vectorization (157d283) with main (36caf1f)


Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. Benchmarks that were deleted from the codebase can be archived in CodSpeed to remove them from the performance reports.

# Use vectorized lookup for better performance
morton_coords = _morton_order(chunks_per_shard)
chunk_coords_array = np.array(morton_coords, dtype=np.uint64)
shard_dict = shard_reader.to_dict_vectorized(chunk_coords_array, morton_coords)
Contributor
why does to_dict_vectorized take the same values as 2 different types? Can't we pass in morton_coords and have the array version created internally as an implementation detail?
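A minimal sketch of the simplification being suggested (hypothetical wrapper name and signature): accept the coordinates once and derive the array form internally:

```python
import numpy as np

def to_dict_from_coords(shard_reader, chunk_coords):
    """Single-argument convenience form: callers pass one coordinate
    iterable; the uint64 array used for fancy indexing is built
    internally as an implementation detail."""
    coords_list = list(chunk_coords)
    coords_array = np.array(coords_list, dtype=np.uint64)
    return shard_reader.to_dict_vectorized(coords_array, coords_list)
```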

