feat: add IndexTransform library for composable, lazy coordinate mappings by d-v-b · Pull Request #3906 · zarr-developers/zarr-python

d-v-b · 2026-04-14T12:53:31Z

My summary:

With dask maintenance on the decline, it's more important than ever that we give zarr-python users a dask-free way to do something very intuitive: index large zarr arrays without turning the whole thing into a numpy array first. This was discussed at length in #1603.

This PR, done with Claude, makes regular indexing go through a lazy indexing layer. The lazy indexing layer is based on abstractions defined in tensorstore. The basic idea is to explicitly model indexing an array as a transformation from some input coordinates to output coordinates, and to bind such a representation to our Array classes.

Regular indexing via .__getitem__ is still immediate, but arrays have a new .z attribute that exposes the lazy indexing layer:

>>> from zarr import create_array                                                        
>>> arr = create_array(store={}, shape=(100,), dtype="uint8")
>>> arr[0] # immediate
np.uint8(0)
>>> arr.z[0] # lazy
<Array memory://136981112028224 shape=() dtype=uint8 domain={ 0 }>
>>> arr.z[0] = 10 # __setitem__ is immediate, no lazy writes yet
>>> arr.z.resolve() # call .resolve to make a numpy array
np.uint8(10)
>>> arr.z[-1] # for lazy indexing, -1 is a real index!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/array.py", line 4639, in __getitem__
    new_t = selection_to_transform(selection, self._array._async_array._transform, "basic")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 924, in selection_to_transform
    return transform[selection]
           ~~~~~~~~~^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 176, in __getitem__
    return _apply_basic_indexing(self, selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 482, in _apply_basic_indexing
    raise IndexError(
IndexError: index -1 is out of bounds for dimension 0 with domain [0, 100)

Goals here:

- harmonize our indexing internals. The data structures in main are disparate ad-hoc copies of stuff from zarr-python 2.x. We can do better.
- use data structures compatible with tensorstore. tensorstore's indexing machinery serializes to JSON. if we adopt the same patterns, we are closer to using tensorstore as an optional backend.
- add a useful lazy indexing API for users who need to lazily index zarr arrays without using dask or xarray.
- prepare our codebase for chunk encoding / decoding improvements, such as pushing array indexing down into the chunk encoding / decoding process
- no breaking changes. old indexing routines should still work, even if they are not load-bearing any more.

Non-goals:

- make the codec pipeline lazy-indexing-aware. future work.
- support lazy writing. that's for future work.

Claude's summary.

Add a new src/zarr/core/transforms/ package implementing TensorStore-inspired
index transforms. The core idea: every indexing operation (slicing, fancy indexing,
etc.) produces a coordinate mapping from user space to storage space. These mappings
compose lazily — no I/O until explicitly resolved.

Key types:

- IndexDomain: rectangular region in N-dimensional integer space
- ConstantMap, DimensionMap, ArrayMap: three representations of a set of storage coordinates (singleton, arithmetic progression, explicit enumeration)
- IndexTransform: pairs an input domain with output maps (one per storage dim)
- compose(outer, inner): chain two transforms

Key operations on IndexTransform:

- __getitem__, .oindex[], .vindex[]: indexing produces new transforms
- .intersect(domain): restrict to coordinates within a region (chunk resolution)
- .translate(shift): shift coordinates (make chunk-local)

The transform library is standalone with no dependency on Array.

Includes comprehensive test suite (143 tests covering all types, operations,
composition, chunk resolution, and edge cases).
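
To make the "coordinate mapping that composes lazily" idea concrete, here is a minimal one-dimensional sketch. `StridedMap1D` is a hypothetical stand-in, not a type from this PR, and it borrows `slice.indices` for brevity, so unlike the lazy accessor described in this thread it accepts NumPy-style negative bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedMap1D:
    """Hypothetical 1-D transform: input i in [0, size) maps to offset + stride * i."""

    offset: int
    stride: int
    size: int

    def __getitem__(self, sel: slice) -> "StridedMap1D":
        # Composing a slice only updates three integers; no data is read.
        start, stop, step = sel.indices(self.size)
        if step <= 0:
            raise ValueError("this sketch only supports positive steps")
        new_size = len(range(start, stop, step))
        return StridedMap1D(
            offset=self.offset + self.stride * start,
            stride=self.stride * step,
            size=new_size,
        )

    def storage_coords(self) -> list[int]:
        # "Resolving" the transform: enumerate the storage coordinates it denotes.
        return [self.offset + self.stride * i for i in range(self.size)]


identity = StridedMap1D(offset=0, stride=1, size=100)
view = identity[10:50:2][5:10]  # two indexing steps, still just a mapping
```

Chaining two slices yields a single arithmetic progression, which is why an I/O layer only needs to look at the final mapping to decide which stored elements to read.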

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

d-v-b · 2026-04-14T12:58:55Z

cc @vincentsarago

codecov · 2026-04-14T13:05:02Z

Codecov Report

❌ Patch coverage is 87.67228% with 161 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.90%. Comparing base (1ab9953) to head (ba97603).

Files with missing lines	Patch %	Lines
src/zarr/core/transforms/transform.py	86.28%	76 Missing ⚠️
src/zarr/core/array.py	83.28%	60 Missing ⚠️
src/zarr/core/transforms/chunk_resolution.py	94.00%	6 Missing ⚠️
src/zarr/core/transforms/composition.py	87.75%	6 Missing ⚠️
src/zarr/core/chunk_partition.py	90.62%	3 Missing ⚠️
src/zarr/core/indexing.py	50.00%	3 Missing ⚠️
src/zarr/core/transforms/domain.py	96.80%	3 Missing ⚠️
src/zarr/testing/strategies.py	88.88%	3 Missing ⚠️
src/zarr/core/transforms/json.py	98.30%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3906      +/-   ##
==========================================
- Coverage   93.50%   92.90%   -0.61%     
==========================================
  Files          90       98       +8     
  Lines       11981    13250    +1269     
==========================================
+ Hits        11203    12310    +1107     
- Misses        778      940     +162

Files with missing lines	Coverage Δ
src/zarr/__init__.py	`100.00% <100.00%> (ø)`
src/zarr/api/synchronous.py	`92.95% <ø> (ø)`
src/zarr/core/group.py	`95.20% <ø> (ø)`
src/zarr/core/transforms/__init__.py	`100.00% <100.00%> (ø)`
src/zarr/core/transforms/output_map.py	`100.00% <100.00%> (ø)`
src/zarr/errors.py	`100.00% <100.00%> (ø)`
src/zarr/core/transforms/json.py	`98.30% <98.30%> (ø)`
src/zarr/core/chunk_partition.py	`90.62% <90.62%> (ø)`
src/zarr/core/indexing.py	`95.57% <50.00%> (-0.72%)`	⬇️
src/zarr/core/transforms/domain.py	`96.80% <96.80%> (ø)`
... and 5 more

... and 1 file with indirect coverage changes

vincentsarago · 2026-04-14T13:07:50Z

@d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?

My use case is mainly if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[] ?

Add TypedDict definitions and conversion functions for serializing IndexDomain, OutputIndexMap, and IndexTransform to/from JSON. The JSON format follows TensorStore's conventions for interoperability:

- IndexDomain: input_inclusive_min, input_exclusive_max, input_labels
- OutputIndexMap: offset + optional stride/input_dimension/index_array
- IndexTransform: domain fields + output array

TypedDicts: IndexDomainJSON, OutputIndexMapJSON, IndexTransformJSON

Functions: index_domain_to_json, index_domain_from_json, index_transform_to_json, index_transform_from_json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
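
As a rough illustration of the field layout listed in this commit message, the sketch below writes out by hand a TensorStore-style JSON document for "index 0 on the first storage dimension, everything on the second" of a (100, 100) array. The field names come from the list above; whether the PR's `index_transform_to_json` emits exactly this document is an assumption.

```python
import json

# Hand-written, hypothetical example; not produced by the PR's serializer.
transform_json = {
    # Input domain: one remaining dimension covering [0, 100).
    "input_inclusive_min": [0],
    "input_exclusive_max": [100],
    "input_labels": [""],
    # One output map per storage dimension.
    "output": [
        {"offset": 0},  # constant map: always storage index 0
        {"offset": 0, "stride": 1, "input_dimension": 0},  # strided map
    ],
}

encoded = json.dumps(transform_json)
decoded = json.loads(encoded)
```

A plain dict like this survives a JSON round trip unchanged, which is what makes it usable as an interchange format between libraries.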

d-v-b · 2026-04-14T13:12:50Z

> @d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?
>
> My use case is mainly if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[] ?

on this branch:

>>> arr = create_array(store={}, shape=(100,100), dtype="uint8")
>>> arr.z[0]      
<Array memory://136980013828096 shape=(100,) dtype=uint8 domain={ 0, [0, 100) }>
>>> arr.z[0].shape                 
(100,)

the domain={ 0, [0, 100) } means "this array is what you get when you take the first index for the first dimension, and all values along the other dimension", which I think is what you want?

vincentsarago · 2026-04-14T13:16:03Z

What if I have

# create a 4d array (time, band, y, x)
arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")

and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

d-v-b · 2026-04-14T13:19:32Z

> What if I have
>
> # create a 4d array (time, band, y, x)
> arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")
>
> and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

yeah it should!
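
For reference, this is what the same basic selection does on a plain NumPy array of that shape. It is a NumPy stand-in only; that `arr.z[:, 0]` yields a lazy view with the matching shape is what the reply above asserts, and it is not exercised here.

```python
import numpy as np

# NumPy stand-in for the (time, band, y, x) array from the question.
arr = np.zeros((2, 3, 100, 100), dtype="uint8")

# Keep every time step, take the first band (index 0 on the second dimension).
first_band = arr[:, 0]
remaining_shape = first_band.shape
```

The integer index drops the band dimension, so the result keeps the time, y, and x axes.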

codspeed-hq · 2026-04-14T13:21:06Z

Merging this PR will degrade performance by 17.94%

⚡ 4 improved benchmarks
❌ 14 regressed benchmarks
✅ 48 untouched benchmarks
⏩ 6 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]`	1.6 s	1.3 s	+25.3%
❌	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	1 s	1.2 s	-12.69%
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	2.1 s	1.9 s	+11.24%
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]`	2.8 s	2 s	+36.42%
❌	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	275.7 ms	336 ms	-17.94%
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	3.2 s	2.6 s	+25.43%
❌	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	578.8 ms	652.6 ms	-11.31%
❌	WallTime	`test_sharded_morton_single_chunk[(32, 32, 32)-memory]`	1.8 ms	2 ms	-10.6%
❌	WallTime	`test_morton_order_iter[(20, 20, 20)]`	35.8 ms	41.6 ms	-13.84%
❌	WallTime	`test_morton_order_iter[(33, 33, 33)]`	163.9 ms	188.3 ms	-12.94%
❌	WallTime	`test_sharded_morton_write_single_chunk[(33, 33, 33)-memory]`	193.4 ms	224.4 ms	-13.84%
❌	WallTime	`test_morton_order_iter[(10, 10, 10)]`	4.7 ms	5.4 ms	-12.85%
❌	WallTime	`test_sharded_morton_write_single_chunk[(30, 30, 30)-memory]`	146.5 ms	168.8 ms	-13.19%
❌	WallTime	`test_morton_order_iter[(8, 8, 8)]`	2.4 ms	2.8 ms	-13.75%
❌	WallTime	`test_sharded_morton_write_single_chunk[(32, 32, 32)-memory]`	173.6 ms	202.8 ms	-14.37%
❌	WallTime	`test_morton_order_iter[(16, 16, 16)]`	18.2 ms	21 ms	-13.38%
❌	WallTime	`test_morton_order_iter[(30, 30, 30)]`	122.5 ms	143.7 ms	-14.71%
❌	WallTime	`test_morton_order_iter[(32, 32, 32)]`	147.1 ms	172.2 ms	-14.54%

Comparing d-v-b:refactor/simplify-indexing (732dddd) with main (7c78574)²

¹ 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived on CodSpeed to remove them from the performance reports.
² No successful run was found on main (0ea15fd) during the generation of this report, so 7c78574 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

psobolewskiPhD · 2026-04-21T16:58:23Z

I played with this a bit today and this type of slicing is really great:
data.z[:,:,:,0:100,0:100]

One thing that tripped me up -- probably just shows my naivety (PBCAK) -- is that e.g. data.z.shape is AttributeError. So you need to get your attributes from data, but do your slicing with data.z.

d-v-b · 2026-04-21T17:05:04Z

> I played with this a bit today and this type of slicing is really great:
>
> data.z[:,:,:,0:100,0:100]
>
> One thing that tripped me up -- probably just shows my naivety (PBCAK) -- is that e.g. data.z.shape is AttributeError. So you need to get your attributes from data, but do your slicing with data.z.

We should probably have a shape attribute there. Thanks for reporting that, and thanks for trying this branch out. Let me know if you find anything else we need to fix!

psobolewskiPhD · 2026-04-21T17:25:45Z

I guess from my PoV it would be nice if I could just use data.z for everything, so it was basically array-like.
Or to flip it around: if data acted like data.z -- but that would be a big breaking change, so I understand :P

Edit: Don't get me wrong, this is pretty great -- as an example, with this I can remove the dask.array wrapper here:
https://github.com/napari/napari/pull/8260/changes#diff-418087900d622739ab769c612c09fbf012e7555d28d0e75840d145eb055785f0

instead, just check if my array is zarr and if so use array.z for slicing.

d-v-b · 2026-04-21T19:04:29Z

> I guess from my PoV it would be nice if I could just use data.z for everything, so it was basically array-like. Or to flip it around: if data acted like data.z -- but that would be a big breaking change, so I understand :P
>
> Edit: Don't get me wrong, this is pretty great -- as an example, with this I can remove the dask.array wrapper here: https://github.com/napari/napari/pull/8260/changes#diff-418087900d622739ab769c612c09fbf012e7555d28d0e75840d145eb055785f0
>
> instead, just check if my array is zarr and if so use array.z for slicing.

I'm glad it's useful! I agree that this is probably how slicing should work by default. But that would be a big breaking change, so it's far down the road.

The .z thing, however, is purely additive so I hope we can ship that in a minor release soon! Ping @zarr-developers/python-core-devs, it would be good to get some feedback on this PR.

psobolewskiPhD · 2026-04-21T19:23:30Z

Makes sense. So maybe in followup _LazyIndexAccessor could have more of the array properties and support more of the numpy protocol, so that array.z could be used as a drop-in for array, but be lazy about slicing, etc.

normanrz · 2026-04-22T12:29:31Z

I like this a lot!
Why did you pick .z as a name for the key? I think something like .lazy would be more descriptive, but 3 more chars.
tensorstore uses .result() to materialize a read, because it models the lazy slicing as futures. Maybe it makes sense to use the same terms? Or even making this API based on futures as well?

d-v-b · 2026-04-22T12:36:02Z

> I like this a lot!
> Why did you pick .z as a name for the key? I think something like .lazy would be more descriptive, but 3 more chars.

I just wanted something really short. Since this is (IMO) a strictly better slicing API, I wanted to make accessing it as low-friction as possible. But I will totally bow to the will of the crowd here, we can use whatever people find intuitive.

> tensorstore uses .result() to materialize a read, because it models the lazy slicing as futures. Maybe it makes sense to use the same terms? Or even making this API based on futures as well?

I use resolve here but I don't feel strongly about it. Changing to .result() would be fine! And I think using a futures-based API is a great idea, especially when we get into lazy writes, which are currently not supported in this PR. If we do go for a futures API, we should think about how to use this comprehensively, e.g. with our stores as well.

vincentsarago · 2026-04-22T12:36:17Z

I agree with @normanrz, the .z key feels a bit weird to me. I think something like .lazy_select(...) -> _LazyIndexAccessor makes things a bit more descriptive.

The length of .z makes me feel like it is an attribute, not a method.

d-v-b · 2026-04-22T12:37:49Z

what about .select or .sel?

jsignell · 2026-04-22T13:55:50Z

Just for reference: xarray uses .sel, .isel and .loc (https://docs.xarray.dev/en/latest/user-guide/indexing.html#quick-overview). I guess in that naming scheme what you have in this PR would be more like .iloc

d-v-b · 2026-04-22T15:56:16Z

ping @ilan-gold for visibility

ilan-gold

> use data structures compatible with tensorstore. tensorstore's indexing machinery serializes to JSON. if we adopt the same patterns, we are closer to using tensorstore as an optional backend.

I'm sure zarrs could also support this, but I wonder about JSON performance for things like vindex/oindex, where you could easily have tens of thousands of individual coordinates.

ilan-gold · 2026-04-27T09:39:56Z

+- **translate(shift)** — shift all output coordinates. This makes coordinates
+  chunk-local: "express my coordinates relative to the chunk origin."
+
+- **compose(outer, inner)** — chain two transforms. See ``composition.py``.


I think it could be cool to have different kinds of composition, like union, and I don't see that anywhere in the PR. It seems that only array.z[my_transform][my_other_transform_relative_to_the_first] is supported, whereas it could be cool to have array.z[my_transform + my_other_transform]. Concretely, if I wanted (slice(0, 10), slice(0, 10)) and then also (slice(50, 60), slice(50, 60)) from an array, how would I do that? np.concatenate + np.arange to generate outer indexers is the only option AFAICT but that kind of stinks.

that's a very cool idea, not sure about overloading the addition operator, but some method for combining indexing expressions seems nice. I'll see what we need for this
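
The workaround mentioned in the question can be sketched with plain NumPy, using `np.ix_` as a stand-in for an orthogonal (outer) selection. This illustrates the complaint only; it does not use any API from this PR.

```python
import numpy as np

data = np.arange(100 * 100).reshape(100, 100)

# Build explicit per-axis index arrays covering [0, 10) and [50, 60).
rows = np.concatenate([np.arange(0, 10), np.arange(50, 60)])
cols = np.concatenate([np.arange(0, 10), np.arange(50, 60)])

# Outer indexing takes the cartesian product of rows and cols: a 20 x 20 result.
outer = data[np.ix_(rows, cols)]

# The two requested blocks sit on the diagonal of that result.
block_a = outer[:10, :10]
block_b = outer[10:, 10:]
```

The outer selection also reads the two off-diagonal blocks (rows 0 to 9 with columns 50 to 59, and the reverse), which were never asked for. A dedicated union of transforms could avoid that extra work.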

Add test_indexing_parity: a NumPy-oracle property test over the cartesian product {basic, oindex, vindex, mask} × {eager array, lazy view} × {out=None, out=buffer}. Every lazy/view indexing bug found in review was "the lazy/view path diverges from NumPy/eager for some (method, parameter) combination"; enumerating that surface catches the whole class instead of one case per review round. Doubles as a guard for the existing eager indexing paths (incl. out=), independent of lazy indexing.

The harness immediately surfaced two fixes:

- _get_selection_via_transform validated a provided out= buffer against the multi-dimensional out_shape, but vectorized reads scatter through a flat index; validate against the (flat) buffer_shape instead, matching the eager contract.
- _is_vectorized_transform required >= 2 coordinate ArrayMaps, missing a single coordinate map with a multi-dimensional index array (vindex[idx_2d] on a 1-D view), which also scatters flat. Flag a flat buffer whenever any coordinate (input_dimension is None) ArrayMap is present.

Also adds a deterministic view-method out= regression test.

Assisted-by: ClaudeCode:claude-opus-4.8

…dows) Promote the inlined test-data generators from the indexing parity test into zarr.testing.strategies as documented, reusable strategies: - indexers(mode, shape): one strategy returning a (zarr_selection, numpy_selection) pair for any of basic / oindex / vindex / mask, so indexing tests are written once and parametrized over mode instead of re-deriving selection setup. - windows(shape): non-negative, full-rank slice windows for building lazy views. test_indexing_parity now draws from these shared strategies. The reusable data-generation building blocks — not any single test — are the durable asset: new indexing tests (and downstream users of zarr.testing) compose them rather than reinventing setup. Assisted-by: ClaudeCode:claude-opus-4.8

…pting get_block_selection / set_block_selection were not updated for non-identity (lazy view) arrays: BlockIndexer used the view shape against the storage chunk grid, so v.blocks[i] silently read/wrote raw storage blocks, ignoring the view's offset/stride. Block selection addresses the storage chunk grid and is ill-defined on a view, so raise NotImplementedError (pointing users to materialize the view) rather than return/write the wrong region. Element-level lazy indexing (basic/oindex/vindex/mask) is unaffected. Assisted-by: ClaudeCode:claude-opus-4.8

…eable) Split the cross-path indexing oracle so the test improvements can land ahead of lazy indexing: - assert_indexing_matches_numpy(target, ref, mode, data): a consumer-agnostic helper — `target` is any zarr.Array (eager array or lazy view), checked against a NumPy reference for a drawn selection, with and without out=. - test_indexing_parity: eager-only, no reference to Array.lazy. Comprehensively guards the existing eager indexing API ({basic,oindex,vindex,mask} × out=) and can merge on its own. - test_lazy_view_indexing_parity: reuses the same helper for a lazy view, and is skipped unless Array.lazy exists — so it's inert on a tree without the feature. The lazy view is "just another zarr.Array" flowing through the same oracle, so adding the lazy consumer is additive rather than a rewrite. windows() is reworded as a generic sub-region strategy (not lazy-specific). Assisted-by: ClaudeCode:claude-opus-4.8

Replace test_basic_indexing / test_oindex / test_vindex / test_mask_indexing with one enumerated oracle, test_indexing_parity, covering every mode against a NumPy reference: sync read, read into an out= buffer, async read, and write. The historical per-mode write guards are preserved in one place (vindex setitem is chunk-dependent for repeated coords; oindex/mask writes to sharded arrays are unsupported per GH2834; oindex repeated-index writes are unspecified). Selection generation comes from the shared indexers() strategy, and the read half is shared with the lazy-view test via assert_read_matches_numpy. Dedicated tests remain for inputs the oracle's strategies don't generate: complex rectilinear arrays and block indexing. Assisted-by: ClaudeCode:claude-opus-4.8

…verage) Low-severity follow-ups from the branch review (jobs 303/304), no behavior change: - array.py: correct the stale "needs_flat_buffer" comment (any coordinate ArrayMap triggers the flat buffer, not just >= 2). - array.py set_coordinate_selection: normalize empty fields with an explicit `== [] or == ()` check instead of a falsy `if not fields`. - transform._intersect_vectorized: drop the unused n_points variable. - test_properties: restore the vindex[mask] read+write equivalence coverage that the per-mode fold dropped (the two mask spellings must agree). - strategies.indexers: fix the docstring's zero-size-axis claim to match _eligible. Assisted-by: ClaudeCode:claude-opus-4.8

The lazy accessor copies TensorStore: index-array values are literal coordinates (origin 0), not NumPy from-the-end. Out-of-domain values (negative, or >= the axis length) previously resolved to silently-wrong data (e.g. lazy.oindex[[-1]] returned element 0); now they raise BoundsCheckError at transform construction, matching the eager bounds check. Applies to both the literal accessor and the NumPy-normalizing view methods (which normalize first, then bounds-check the result). Basic accessor negatives already raise as literal out-of-range. Keeps the intentional accessor-vs-method asymmetry (the accessor is the TensorStore-style transform builder; the Array methods normalize NumPy-style); documented and pinned by test_accessor_negative_index_is_literal. Assisted-by: ClaudeCode:claude-opus-4.8
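
The literal-coordinate check for index arrays described in this commit can be sketched as follows. `check_literal_indices` is a hypothetical helper, and it raises a plain `IndexError` where the commit describes a `BoundsCheckError`.

```python
def check_literal_indices(indices: list[int], length: int) -> list[int]:
    """Hypothetical sketch: treat every value as a literal coordinate in [0, length)."""
    bad = [i for i in indices if i < 0 or i >= length]
    if bad:
        # No from-the-end wraparound: -1 is simply outside the domain.
        raise IndexError(
            f"indices {bad} are outside the valid indices [0, {length}); "
            "negative values are not interpreted from the end"
        )
    return list(indices)


accepted = check_literal_indices([0, 3, 99], length=100)

try:
    check_literal_indices([-1], length=100)
    rejected = False
except IndexError:
    rejected = True
```

Raising when the transform is constructed, rather than at read time, is what prevents the silently-wrong result mentioned above, where `[-1]` returned element 0.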

Mirror the review fixes from the eager test-infra PR onto the lazy branch: - type `mode` as the `IndexMode` literal across the indexing test helpers - `_VECTORIZED_MODES` is a tuple like `_INDEX_MODES` - `indexers(...)` returns `tuple[Selection, Selection]` (real type) Assisted-by: ClaudeCode:claude-opus-4.8

…ted write Replace the brittle silent-`return` write guards in the indexing oracle with a principled split: - test_indexing_read_parity: reads (sync/out=/async) for every mode, no guards — reads are well-defined for all selections. - test_indexing_write_parity: writes, filtering with `assume` only the cases that have no well-defined oracle — repeated coordinates (order-dependent, a fundamental limitation shared with NumPy, not a bug). This now actually exercises vindex writes, sharded single-axis oindex/mask writes, and unsharded multi-axis oindex writes, which the old blanket `assume(shards is None)` skipped. - test_oindex_sharded_multiaxis_write_xfail: pins the one genuinely-unsupported write (orthogonal across >=2 array axes on a sharded array, GH2834) as a strict xfail, so it is visible at the pytest level and flips to a failure when fixed. Assisted-by: ClaudeCode:claude-opus-4.8

…dex normalization CI (slow-hypothesis, ci profile) found test_indexing_write_parity falsified by a vindex write with selection [2, -2, 0, 1] on length 4: -2 normalizes to 2, so cell 2 is written twice. That is the undefined repeated-target case (order- dependent, NumPy has the same ambiguity) — but the old filter checked *raw* index values and missed the post-normalization collision. Replace _has_repeated_indices with _write_is_unambiguous(mode, npsel, shape), which normalizes negative indices first and is mode-aware: per-axis uniqueness for oindex (independent axes), unique coordinate tuples for vindex (correlated axes). Not a zarr bug — a test-oracle gap. Assisted-by: ClaudeCode:claude-opus-4.8
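
The oracle gap described here is easy to reproduce in a few lines. The helpers below are hypothetical, one-dimensional simplifications and not the `_write_is_unambiguous` function from the commit.

```python
def normalize(indices: list[int], length: int) -> list[int]:
    # NumPy-style wraparound: a negative value counts from the end.
    return [i + length if i < 0 else i for i in indices]


def targets_are_unique(indices: list[int], length: int) -> bool:
    # Uniqueness must be judged after normalization, not on raw values.
    norm = normalize(indices, length)
    return len(set(norm)) == len(norm)


raw = [2, -2, 0, 1]
raw_values_look_distinct = len(set(raw)) == len(raw)
write_is_well_defined = targets_are_unique(raw, length=4)
```

On a length-4 axis, -2 and 2 name the same cell, so a filter that compares raw values lets an order-dependent write through.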

… + dtype-strict oracle

Borrow the strongest ideas from TensorStore's and ndindex's indexing test suites:

- Stateful harness (TensorStore driver_testutil pattern, NumPy oracle): _IndexingStateMachine applies many random reads/writes (basic/oindex/vindex/mask) to one array while a NumPy model tracks it in lockstep, asserting equality after every step. Catches read-modify-write, chunk-boundary, and shard-merge bugs the single-shot parity tests cannot. Runs over regular/sharded/3D geometries, gated behind --run-slow-hypothesis like the store stateful tests.
- Error parity (ndindex check_same): test_indexing_bounds_error_parity generates possibly-out-of-bounds integer selections and asserts zarr raises iff NumPy raises (zarr's bounds errors subclass IndexError). Catches the silent-wrong-data class: returning garbage where NumPy would raise.
- Strict read oracle: assert dtype, not just values/shape, for ndim>=1 results.
- Regression anchor: test_write_is_unambiguous pins the negative-index collision fix ([2,-2,0,1] -> cell 2 written twice).

Assisted-by: ClaudeCode:claude-opus-4.8

…cope The strategy covers only the element-space indexing modes that have a direct NumPy-array oracle (basic/oindex/vindex/mask). Rename it so the name states that, and document that block indexing is deliberately excluded — it addresses the chunk grid (parametrized by the grid, not shape) and has no NumPy equivalent, so it lives in the separate block_indices strategy. Assisted-by: ClaudeCode:claude-opus-4.8

mkdocs renders single backticks for inline code; the docstrings used RST-style double backticks. Normalize to single backticks across the indexing strategies and property tests. Assisted-by: ClaudeCode:claude-opus-4.8

…ambiguous NumPy's deterministic last-write-wins is an implementation detail of serial, C-order index iteration, not a guarantee. zarr writes scalars into chunks and cannot in general promise a write order, so it cannot be expected to match NumPy's choice — hence no well-defined oracle for duplicate-target writes. Assisted-by: ClaudeCode:claude-opus-4.8

…ases Replace raw pytest.param tuples with the conftest Expect dataclass (input / output / id), matching the established idiom in test_chunk_grids etc. Assisted-by: ClaudeCode:claude-opus-4.8

Layer C of the lazy-view chunk-layout design. Members that assume the array fills its chunk grid now raise LazyViewError (a NotImplementedError subclass) on a non-identity lazy view instead of silently describing the backing grid: chunks, shards, read_chunk_sizes, write_chunk_sizes, cdata_shape, nchunks, nchunks_initialized, nbytes_stored, resize, append, info, info_complete. Normal (identity) arrays are unaffected, and views are new, so nothing breaks. Guards go through a single AsyncArray._require_identity helper (sync members delegate). Also fix size/nbytes, which are logical "keep" members: they now derive from the view's shape rather than the backing metadata shape (a.lazy[2:10].size == 8, not 12). cdata_shape's sync accessor is redirected to the guarded public property. Design: docs/superpowers/specs/2026-07-01-lazy-view-chunk-layout-design.md (gist: https://gist.github.com/d-v-b/f50c93bc5f365e6986dadba6e4aa1902). Next: Layer A (ChunkLayout, zarr-developers#4040) and Layer B (chunk_projections partition API). Assisted-by: ClaudeCode:claude-opus-4.8

Layer B of the lazy-view chunk-layout design: expose the L4 partition that a lazy view already resolves. `Array.chunk_projections(unit="read"|"write")` yields a `ChunkProjection(coord, key, shape, chunk_selection, array_selection, is_partial)` per stored chunk the array/view projects onto, reusing the I/O path's `iter_chunk_transforms` + `sub_transform_to_selections`. `is_chunk_aligned()` is the thin L3 wrapper.

- Identity arrays: projections tile the whole domain (none partial), recovering the chunk grid. Views: only touched chunks, with partial-coverage flagged (RMW).
- Arbitrary selections compose through the lazy accessor: `array.lazy[sel].chunk_projections()` (no `selection=` param).
- `unit="write"` is the store-object (shard, else chunk) grid; `unit="read"` equals it unless sharded. Inner-chunk read partitioning of sharded arrays is deferred (raises NotImplementedError, pointing at unit="write").
- New public `zarr.ChunkProjection`; `shape` is extent-clipped via the grid's chunk_sizes so boundary chunks are not misflagged partial.

Layer A (the static ChunkLayout, zarr-developers#4040) is intentionally left to maxrjones's PR. Design: gist f50c93bc5f365e6986dadba6e4aa1902.

Assisted-by: ClaudeCode:claude-opus-4.8
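
A one-dimensional sketch of what such a partition computes, assuming a regular chunk grid and a contiguous window. `Projection1D` and `project_1d` are hypothetical names. They mirror a few fields of the `ChunkProjection` described above but are not the PR's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection1D:
    coord: int               # chunk index along the axis
    chunk_selection: slice   # region inside the chunk
    array_selection: slice   # region inside the view
    is_partial: bool         # True if the view covers only part of the chunk


def project_1d(start: int, stop: int, chunk_size: int, length: int) -> list[Projection1D]:
    """Project the window [start, stop) of a length-`length` axis onto its chunks.

    Assumes 0 <= start < stop <= length.
    """
    projections = []
    for coord in range(start // chunk_size, (stop - 1) // chunk_size + 1):
        chunk_start = coord * chunk_size
        # Clip the last chunk to the array extent so it is not misflagged partial.
        chunk_stop = min(chunk_start + chunk_size, length)
        lo, hi = max(start, chunk_start), min(stop, chunk_stop)
        projections.append(
            Projection1D(
                coord=coord,
                chunk_selection=slice(lo - chunk_start, hi - chunk_start),
                array_selection=slice(lo - start, hi - start),
                is_partial=(hi - lo) < (chunk_stop - chunk_start),
            )
        )
    return projections


# A view [2, 10) over a length-12 axis with chunks of 4 touches three chunks.
projections = project_1d(2, 10, chunk_size=4, length=12)
```

Only the first and last chunks are flagged partial here, which is the information a writer needs to decide where a read-modify-write cycle is required.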

@dcherian

…view feedback)

Address @dcherian's review on zarr-developers#4109: the single mode-dispatched oracle (test_indexing_read_parity / _write_parity going through _get / _setitem / assert_read_matches_numpy) was hard to read, and the stacked `assume()` write filters made it impossible to tell which selections were actually exercised.

- Replace the two folded tests with explicit per-mode read/write tests (test_basic_read, test_oindex_read, ... test_mask_write). Each is self-contained, calls the real accessor directly, and still covers sync + async + out= reads.
- Writes are now CONSTRUCTIVE instead of generate-then-`assume`-reject: oindex draws unique per-axis indices, vindex draws distinct coordinates via unravel_index, so every write is well-defined by construction (nothing hidden behind assume).
- Drop the now-unused assert_read_matches_numpy / _async_get / _VECTORIZED_MODES. The mode-dispatch helpers that remain are used only by the stateful machine (which is inherently random-mode); noted as such.
- State machine: set model from the array (`self.model = self.zarray[:]`) per review.

The consumer-agnostic shared harness stays on the lazy PR (zarr-developers#3906), where a lazy view actually reuses it, introducing that abstraction at the point of second use.

Assisted-by: ClaudeCode:claude-opus-4.8
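
The constructive approach for vindex writes can be sketched with NumPy alone: draw distinct flat positions, then convert them to per-axis coordinate arrays. The commit's tests use hypothesis strategies. The seeded NumPy generator here is a substitution made to keep the sketch self-contained.

```python
import math

import numpy as np

shape = (4, 5)
n_points = 6

# Sampling flat positions without replacement guarantees distinct targets.
rng = np.random.default_rng(0)
flat = rng.choice(math.prod(shape), size=n_points, replace=False)

# unravel_index turns flat positions into one index array per axis,
# which is the form a vectorized (vindex-style) selection takes.
coords = np.unravel_index(flat, shape)
coordinate_tuples = set(zip(*(axis.tolist() for axis in coords)))
```

Because distinct flat positions map to distinct coordinate tuples, no generated write needs to be filtered out afterwards.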

…n (roborev zarr-developers#341) roborev flagged that _write_is_unambiguous / _n_array_axes were fed the broadcast np.ix_-style `npsel` (what orthogonal_indices returns for NumPy), not the raw per-axis `zsel`. For oindex the broadcast form tiles every component across all axes, so the per-axis uniqueness check tripped on *tiling* not duplicates: it rejected essentially every multi-axis orthogonal write. Effect: test_indexing_write_parity `assume`d them away and the state machine's write rule returned early for them, so the stateful harness performed no multi-axis writes at all — silently gutting the coverage it exists for. (The feature itself is fine; covered by TestLazyOIndex.test_multi_axis_write.) Fix: pass the raw `zsel` to both helpers (correct on per-axis arrays; slices are skipped, so _n_array_axes counts genuine fancy axes for the sharded guard). Clarify the helper docstrings, rename the misleading `npsel` params, and add test_write_filter_rejects_broadcast_oindex pinning the raw-vs-broadcast invariant (the old regression test masked the bug by feeding raw arrays callers never sent). Assisted-by: ClaudeCode:claude-opus-4.8

…nate rule

User-probe feedback showed the lazy path's negative-index handling split three ways: integers literal (plain IndexError, "domain" phrasing), index arrays literal (BoundsCheckError, "length" phrasing), but slice bounds silently wrapped from-the-end. Special-casing slice bounds is not consistent; the rule is now one sentence: a lazy selection is a declaration in literal coordinates, domains start at 0, so a negative value is never in bounds regardless of syntactic form.

- Transform layer: reject negative slice start/stop (_check_slice_bounds) at all three lowering sites (basic/oindex/vindex). Positive overflow still clamps: a slice denotes a range and ranges intersect the domain; clamping can only shorten, never silently select different data the way wraparound does.
- Eager dialect preserved: _normalize_negative_indices now wraps negative slice bounds NumPy-style before lowering, so eager methods on views keep full NumPy semantics (v[-3:] still works); only the .lazy accessor is literal.
- Unified error type and phrasing: all literal-coordinate rejections raise BoundsCheckError (an IndexError) saying "valid indices [lo, hi)" (dropping the repr/error "domain" wording collision), plus a hint that negatives are not from-the-end and what to use instead.
- Fix selection_repr crash on 0-d index arrays (len -> size), so integer-indexed fancy views can at least be displayed.
- LazyViewError message no longer recommends the not-yet-existing chunk_layout; points at chunk_projections + metadata/chunk_grid.
- Tests: literal slice bounds (reads/writes/views/oindex), positive-clamp retention, eager-dialect retention, unified error type, guard message, repr regression; strict-xfail pins for the known int-on-fancy-picked-dim defect (read crash; vindex scalar shape) to be fixed separately.

Assisted-by: ClaudeCode:claude-opus-4.8

Assisted-by: ClaudeCode:claude-opus-4.8

New docs/user-guide/lazy_indexing.md, registered in the nav after arrays.md. All examples are live (markdown-exec) so the docs build fails if behavior drifts.

Structure:

- Theory: indexing as a *declaration* (an index transform) vs an action; views are zero-based windows like NumPy views; the repr's domain is provenance, not the indexing coordinate system.
- The literal-coordinate rule: one sentence, consistent across integers, slice bounds, and index arrays (negative is never in bounds), demonstrated with the unified BoundsCheckError; positive slice overflow clamps (range intersection).
- The two dialects on one object: lazy = declare (literal), eager = access (NumPy semantics, negatives wrap), with a table.
- Patterns, all executed: crop/compose, write-through (region, strided, composed), oindex/vindex/mask selection, materialization, chunk-aware processing via chunk_projections/is_chunk_aligned, LazyViewError guards.
- "Coming from NumPy" checklist and current limitations (fancy-int defect, sharded read-unit projections, no newaxis / negative steps).

Content grounded in two executed probe reports (semantics probe + naive-user journey), which also drove the preceding consistency fix.

Assisted-by: ClaudeCode:claude-opus-4.8

…rigin

Probing "can a domain contain a negative number?" (IndexDomain.translate makes one in two lines) exposed that the literal-coordinate rule was only implemented for zero-origin domains: on domain [-10, 2), integers were already literal (t[-5] valid, t[-11] rejected with the true bounds) but slice bounds were interpreted as POSITIONS within the domain (t[0:2] silently selected storage [-10, -8)), and the new bounds check hard-coded `< 0` instead of `< lo`, the same ints-vs-slices inconsistency just removed at origin 0, one level down.

Complete the rule uniformly: slice bounds name coordinates, shifted into the domain's 0-based range (_resolve_slice_literal; an identity for lo == 0, so every public view is byte-for-byte unchanged) after _check_slice_bounds rejects bounds below inclusive_min, the any-origin generalization of "negative raises". Bounds past exclusive_max still clamp (ranges intersect the domain).

Not publicly reachable today: all .lazy lowering re-zeroes domains, now pinned by test_public_view_domains_are_zero_origin so a future translate-style API cannot silently invalidate the documented origin-0 behavior. New TestNegativeOriginDomains covers int/slice literal semantics on [-10, 2). Index-array paths on non-zero-origin domains remain to be audited when a translate API is exposed.

Assisted-by: ClaudeCode:claude-opus-4.8
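
As a rough illustration of the rule in this commit, the sketch below resolves literal slice bounds against a domain with an arbitrary origin. The function name and signature are hypothetical and do not reproduce the PR's `_resolve_slice_literal` or `_check_slice_bounds`; only the stated behavior is modelled: bounds below `inclusive_min` are rejected, bounds past `exclusive_max` clamp, and the result is shifted into the domain's 0-based range.

```python
def resolve_slice_bounds(start, stop, inclusive_min, exclusive_max):
    """Map literal slice bounds to 0-based positions within a domain.

    Hypothetical sketch: rejects bounds below the domain origin, clamps
    bounds past the upper edge, then subtracts the origin.
    """
    if start < inclusive_min or stop < inclusive_min:
        raise IndexError(f"valid indices [{inclusive_min}, {exclusive_max})")
    start = min(start, exclusive_max)
    stop = min(stop, exclusive_max)
    return start - inclusive_min, stop - inclusive_min


# On the domain [-10, 2), coordinates -10 and -9 are the first two cells.
first_two = resolve_slice_bounds(-10, -8, -10, 2)

# Coordinates 0 and 1 are the last two cells, not the first two.
last_two = resolve_slice_bounds(0, 2, -10, 2)

# For a zero-origin domain the shift is an identity.
unchanged = resolve_slice_bounds(3, 7, 0, 12)
```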

…dialect, translate

Match TensorStore's indexing model exactly, grounded in an executed behavioral oracle (tensorstore 0.1.84; every rule below verified empirically and ported to tests/test_transforms/test_tensorstore_parity.py):

- Domain preservation: a step-1 slice keeps its literal coordinates as the new domain (a.lazy[2:10] has domain [2,10); v[3] is coordinate 3 = base cell 3; v[0] is out of bounds). Nothing re-zeros implicitly.
- Strided-domain rule: origin = trunc(start/step) (toward zero), size = ceil((stop-start)/step), coordinate origin+k -> base start + k*step.
- Strict containment: non-empty slice intervals must lie within the domain (no clamping, no negative wrapping); empty intervals are valid anywhere; reversed bounds are an invalid-interval error, not an empty result.
- One dialect: views use domain coordinates on EVERY path (lazy accessor and eager methods alike); the legacy wraparound/boundscheck pre-validation on the view branches is removed, as is _normalize_negative_indices (identity arrays still take the unchanged NumPy-dialect legacy fast path).
- translate_domain_by/translate_domain_to on IndexTransform, exposed as Array.translate_by / Array.translate_to (TensorStore's translate_*): shift a domain, or re-zero a view explicitly, preserving which cells are addressed.
- Index-array values are literal domain coordinates; fancy dims keep fresh [0, n) zero-origin domains (already matching TensorStore).
- Views are not iterable (TypeError, like TensorStore): the getitem-from-0 iteration protocol would silently yield nothing on a preserved domain.
- The I/O layer and chunk_projections normalize to zero-origin via translate_domain_to at their boundaries, so chunk resolution and buffer placement are unchanged; ChunkProjection.array_selection stays positional.

Negative steps (supported by TensorStore) remain rejected for now; the new resolver's arithmetic already generalizes, and enablement is a follow-up.

Tests: oracle-parity module (47 cases incl. negative-origin domains where -1 is just another index); all lazy/view/composition/error tests rewritten to domain coordinates; property parity re-zeroes views via translate_to (exercising it on every example); superseded zero-origin/clamp-era tests removed.

Assisted-by: ClaudeCode:claude-opus-4.8
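
The strided-domain rule stated in this commit can be worked through in a few lines. The sketch below is an illustration under stated assumptions, not the PR's resolver: the function names are hypothetical, only positive steps are handled (matching the commit's note that negative steps remain rejected), and integer arithmetic stands in for trunc and ceil.

```python
def strided_domain(start, stop, step):
    """Return (origin, size) of the view produced by start:stop:step.

    origin = trunc(start / step) toward zero; size = ceil((stop - start) / step).
    """
    if step <= 0:
        raise ValueError("only positive steps are sketched here")
    if stop < start:
        raise ValueError("reversed bounds are an invalid interval")
    # Truncate toward zero using floor division on the magnitude.
    origin = -((-start) // step) if start < 0 else start // step
    # Ceiling division via negated floor division.
    size = -((start - stop) // step)
    return origin, size


def base_cell(coord, start, stop, step):
    """Map a view coordinate origin + k to the base cell start + k * step."""
    origin, size = strided_domain(start, stop, step)
    k = coord - origin
    if not 0 <= k < size:
        raise IndexError(f"valid indices [{origin}, {origin + size})")
    return start + k * step


# 3:10:2 yields a view with domain [1, 5) addressing base cells 3, 5, 7, 9.
origin, size = strided_domain(3, 10, 2)
cells = [base_cell(c, 3, 10, 2) for c in range(origin, origin + size)]
```

With step 1 the same arithmetic reduces to domain preservation: `strided_domain(2, 10, 1)` gives origin 2 and size 8, that is, the domain [2, 10).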

… semantics

The theory section now teaches the real model: domains are preserved (an index is a stable name for a cell, not a position), -1 is just another coordinate (demonstrably valid after translate_by((-10,))), strict containment replaces clamping, strided views renumber by trunc-division, and every path on a view shares one coordinate system (base arrays keep NumPy semantics unchanged). Adds translate_to/translate_by patterns, eager iter() rejection on views (made eager: a generator __iter__ deferred the raise to first next()), and the positional contract of ChunkProjection.array_selection.

All examples remain live (markdown-exec) and were executed end-to-end.

Assisted-by: ClaudeCode:claude-opus-4.8
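
The deferred-raise behavior mentioned above is a general Python property of generator functions and can be shown without zarr. The class names below are hypothetical and only illustrate the difference between the two ways of writing `__iter__`.

```python
class DeferredRejection:
    def __iter__(self):
        # The `yield` below makes this a generator function, so calling it
        # only creates a generator; the raise runs on the first next().
        raise TypeError("views are not iterable")
        yield


class EagerRejection:
    def __iter__(self):
        # A plain method: iter() runs the body and raises immediately.
        raise TypeError("views are not iterable")


deferred_iterator = iter(DeferredRejection())  # no error at this point
try:
    next(deferred_iterator)
    deferred_raised_on_next = False
except TypeError:
    deferred_raised_on_next = True

try:
    iter(EagerRejection())
    eager_raised_on_iter = False
except TypeError:
    eager_raised_on_iter = True
```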

…arr-developers#358)

roborev flagged mask-on-non-zero-origin-view as reading "wrong" cells, assuming NumPy positional semantics. Verified against the executed tensorstore 0.1.84 oracle: TensorStore treats mask True-positions as absolute coordinates counted from 0, origin-blind (a mask is sugar for the coordinate array of its True positions), so a True at position 3 on a domain-[2,10) view addresses cell 3, exactly our behavior; a True below the origin is out of domain (we reject eagerly where TensorStore defers to read; this is the documented timing deviation).

The real gap was coverage: nothing pinned this on a translated view, so the reviewer could not infer intent. Add the oracle-pinning regression test (read + write + below-origin rejection) and sharpen the docs callout.

Assisted-by: ClaudeCode:claude-opus-4.8
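
The "mask is sugar for a coordinate array" reading can be sketched with NumPy for the one-dimensional case. `mask_to_coordinates` is a hypothetical name, and the eager bounds check models only the behavior described in this commit, not the PR's implementation.

```python
import numpy as np


def mask_to_coordinates(mask, inclusive_min, exclusive_max):
    """Convert a 1-D boolean mask into literal domain coordinates.

    True positions are counted from 0 regardless of the view's origin;
    any coordinate outside [inclusive_min, exclusive_max) is rejected.
    """
    coords = np.flatnonzero(mask)
    out_of_domain = (coords < inclusive_min) | (coords >= exclusive_max)
    if out_of_domain.any():
        raise IndexError(f"valid indices [{inclusive_min}, {exclusive_max})")
    return coords


# On a view with domain [2, 10), True at positions 3 and 7 addresses cells 3 and 7.
mask = np.zeros(10, dtype=bool)
mask[[3, 7]] = True
coords = mask_to_coordinates(mask, 2, 10)

# A True below the origin (position 0) is out of domain.
bad_mask = np.zeros(10, dtype=bool)
bad_mask[0] = True
try:
    mask_to_coordinates(bad_mask, 2, 10)
    below_origin_rejected = False
except IndexError:
    below_origin_rejected = True
```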

github-actions Bot added the needs release notes label (automatically applied to PRs which haven't added release notes) Apr 14, 2026

Merge branch 'main' into refactor/simplify-indexing

732dddd

d-v-b added the benchmark label (code will be benchmarked in a CI job) Apr 14, 2026

d-v-b mentioned this pull request Apr 14, 2026

feat/lazy indexing #3678

Closed

d-v-b added 2 commits April 15, 2026 11:58

Merge branch 'main' into refactor/simplify-indexing

db31317

Merge branch 'main' into refactor/simplify-indexing

797b25e

Merge branch 'main' into refactor/simplify-indexing

9f50a4c

Merge branch 'main' into refactor/simplify-indexing

53c0042

ilan-gold reviewed Apr 27, 2026

jsignell mentioned this pull request Apr 28, 2026

Improve Cubed support in xarray NASA-IMPACT/science-support#30

Open

d-v-b added 7 commits June 30, 2026 10:39

d-v-b mentioned this pull request Jun 30, 2026

test: comprehensive eager indexing parity oracle + reusable indexing strategies #4109

Open

d-v-b added 9 commits June 30, 2026 17:01

docs: use single backticks in indexing test-infra docstrings

8153d48

mkdocs renders single backticks for inline code; the docstrings used RST-style double backticks. Normalize to single backticks across the indexing strategies and property tests.

Assisted-by: ClaudeCode:claude-opus-4.8

test(indexing): use the Expect helper for test_write_is_unambiguous c…

a192ca4

…ases

Replace raw pytest.param tuples with the conftest Expect dataclass (input / output / id), matching the established idiom in test_chunk_grids etc.

Assisted-by: ClaudeCode:claude-opus-4.8
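
For readers unfamiliar with the idiom, a case dataclass of this kind might look roughly like the sketch below. Only the three field names (input / output / id) come from the commit message; the class body, the `has_no_duplicates` function under test, and the cases are hypothetical and are not the contents of zarr-python's conftest.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Expect:
    """One parametrized test case: an input, its expected output, and a test id."""

    input: Any
    output: Any
    id: str


def has_no_duplicates(indices):
    """Hypothetical function under test."""
    return len(set(indices)) == len(indices)


CASES = [
    Expect(input=[0, 1, 2], output=True, id="distinct"),
    Expect(input=[0, 0, 2], output=False, id="duplicate"),
]

# Under pytest the list could be passed to
# pytest.mark.parametrize("case", CASES, ids=lambda case: case.id);
# here the cases are simply evaluated in a loop.
results = {case.id: has_no_duplicates(case.input) == case.output for case in CASES}
```

Compared with bare tuples, named fields make each case self-describing and keep the test id next to the data it labels.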

maxrjones reviewed Jul 1, 2026

Comment thread .claude/hooks/.logs/hook-log.jsonl Outdated

d-v-b force-pushed the refactor/simplify-indexing branch from f6bc1bb to 16e96dc July 1, 2026 13:56

Merge branch 'main' into refactor/simplify-indexing

9b8da76

d-v-b added 8 commits July 1, 2026 18:18

test(lazy): type the bounds-error trigger list for mypy

02a5c67

Assisted-by: ClaudeCode:claude-opus-4.8

Conversation

d-v-b commented Apr 14, 2026 • edited

d-v-b commented Apr 14, 2026

codecov Bot commented Apr 14, 2026 • edited

Codecov Report

vincentsarago commented Apr 14, 2026

d-v-b commented Apr 14, 2026

vincentsarago commented Apr 14, 2026 • edited

d-v-b commented Apr 14, 2026

codspeed-hq Bot commented Apr 14, 2026

Merging this PR will degrade performance by 17.94%

Performance Changes

Footnotes

psobolewskiPhD commented Apr 21, 2026

d-v-b commented Apr 21, 2026

psobolewskiPhD commented Apr 21, 2026 • edited

d-v-b commented Apr 21, 2026

psobolewskiPhD commented Apr 21, 2026

normanrz commented Apr 22, 2026

d-v-b commented Apr 22, 2026

vincentsarago commented Apr 22, 2026

d-v-b commented Apr 22, 2026

jsignell commented Apr 22, 2026

d-v-b commented Apr 22, 2026

ilan-gold left a comment

ilan-gold Apr 27, 2026 • edited

d-v-b Apr 27, 2026

7 participants
