feat: add IndexTransform library for composable, lazy coordinate mappings#3906
d-v-b wants to merge 52 commits into
…ings

Add a new `src/zarr/core/transforms/` package implementing TensorStore-inspired index transforms. The core idea: every indexing operation (slicing, fancy indexing, etc.) produces a coordinate mapping from user space to storage space. These mappings compose lazily: no I/O until explicitly resolved.

Key types:
- `IndexDomain`: rectangular region in N-dimensional integer space
- `ConstantMap`, `DimensionMap`, `ArrayMap`: three representations of a set of storage coordinates (singleton, arithmetic progression, explicit enumeration)
- `IndexTransform`: pairs an input domain with output maps (one per storage dim)
- `compose(outer, inner)`: chain two transforms

Key operations on IndexTransform:
- `__getitem__`, `.oindex[]`, `.vindex[]`: indexing produces new transforms
- `.intersect(domain)`: restrict to coordinates within a region (chunk resolution)
- `.translate(shift)`: shift coordinates (make chunk-local)

The transform library is standalone with no dependency on Array.

Includes comprehensive test suite (143 tests covering all types, operations, composition, chunk resolution, and edge cases).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
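To make the lazy-composition idea concrete, here is a minimal sketch under stated assumptions; it is not the PR's implementation. `StridedMap` is a hypothetical 1-D stand-in for `DimensionMap` that sends input `i` to `offset + stride * i`, and the argument order of `compose` (the result applies `inner` first, then `outer`) is an assumption made for this example. Composing two such maps only does arithmetic on offsets and strides, so no data is read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedMap:
    """Hypothetical 1-D stand-in for a DimensionMap: i -> offset + stride * i."""

    offset: int
    stride: int

    def __call__(self, i: int) -> int:
        return self.offset + self.stride * i


def compose(outer: StridedMap, inner: StridedMap) -> StridedMap:
    """Return the map i -> outer(inner(i)) without evaluating any coordinates."""
    return StridedMap(
        offset=outer.offset + outer.stride * inner.offset,
        stride=outer.stride * inner.stride,
    )


# Equivalent of arr[10::2] followed by [3::5] on the result.
first = StridedMap(offset=10, stride=2)
second = StridedMap(offset=3, stride=5)
combined = compose(first, second)

print(combined)     # StridedMap(offset=16, stride=10)
print(combined(2))  # 36, the same cell as arr[10::2][3::5][2]
```

The same closure property is what lets a chain of slices collapse into a single mapping before any I/O happens.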
Codecov Report
❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| | main | #3906 | +/- |
|---|---|---|---|
| Coverage | 93.50% | 92.90% | -0.61% |
| Files | 90 | 98 | +8 |
| Lines | 11981 | 13250 | +1269 |
| Hits | 11203 | 12310 | +1107 |
| Misses | 778 | 940 | +162 |
@d-v-b I'm new to zarr-python indexing, does My use case is mainly if I have an array of shape |
Add TypedDict definitions and conversion functions for serializing
IndexDomain, OutputIndexMap, and IndexTransform to/from JSON.
The JSON format follows TensorStore's conventions for interoperability:
- IndexDomain: input_inclusive_min, input_exclusive_max, input_labels
- OutputIndexMap: offset + optional stride/input_dimension/index_array
- IndexTransform: domain fields + output array
TypedDicts: IndexDomainJSON, OutputIndexMapJSON, IndexTransformJSON
Functions: index_domain_to_json, index_domain_from_json,
index_transform_to_json, index_transform_from_json
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
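As a rough illustration of the JSON layout described above, the sketch below builds a document with the field names listed in the commit message. The helper names (`domain_to_json`, `domain_from_json`), their tuple-based signatures, and the `DomainJSON` type are hypothetical; the PR's own functions operate on its `IndexDomain` and `IndexTransform` objects and may differ.

```python
import json
from typing import TypedDict


class DomainJSON(TypedDict):
    """Hypothetical mirror of the IndexDomain JSON fields named in the commit."""

    input_inclusive_min: list[int]
    input_exclusive_max: list[int]
    input_labels: list[str]


def domain_to_json(inclusive_min, exclusive_max, labels) -> DomainJSON:
    """Serialize a rectangular domain given as plain tuples."""
    return {
        "input_inclusive_min": list(inclusive_min),
        "input_exclusive_max": list(exclusive_max),
        "input_labels": list(labels),
    }


def domain_from_json(doc):
    """Inverse of domain_to_json; extra keys such as "output" are ignored."""
    return (
        tuple(doc["input_inclusive_min"]),
        tuple(doc["input_exclusive_max"]),
        tuple(doc["input_labels"]),
    )


doc = domain_to_json((0, 0), (100, 100), ("y", "x"))

# One output map per storage dimension: offset + stride * input[input_dimension].
output = [
    {"offset": 10, "stride": 2, "input_dimension": 0},
    {"offset": 0, "stride": 1, "input_dimension": 1},
]
transform_doc = {**doc, "output": output}

text = json.dumps(transform_doc)
restored = domain_from_json(json.loads(text))
print(restored)  # ((0, 0), (100, 100), ('y', 'x'))
```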
on this branch:

>>> from zarr import create_array
>>> arr = create_array(store={}, shape=(100,100), dtype="uint8")
>>> arr.z[0]
<Array memory://136980013828096 shape=(100,) dtype=uint8 domain={ 0, [0, 100) }>
>>> arr.z[0].shape
(100,)

the
What is I have and I want to select the arrays for band 1 (second dim) but for all the times, would |
yeah it should!
Merging this PR will degrade performance by 17.94%
I played with this a bit today and this type of slicing is really great: One thing that tripped me up -- probably just shows my naivety (PBCAK) -- is that e.g. data.z.shape is
We should probably have a shape attribute there. Thanks for reporting that, and thanks for trying this branch out. Let me know if you find anything else we need to fix!
I guess from my PoV it would be nice if I could just use
Edit: Don't get me wrong, this is pretty great -- as an example, with this I can remove the dask.array wrapper here: instead, just check if my array is zarr and if so use
I'm glad it's useful! I agree that this is probably how slicing should work by default. But that would be a big breaking change, so it's far down the road. The
Makes sense. So maybe in followup _LazyIndexAccessor could have more of the array properties and support more of the numpy protocol, so that array.z could be used as a drop-in for array, but be lazy about slicing, etc.
I like this a lot!
I just wanted something really short. Since this is (IMO) a strictly better slicing API, I wanted to make accessing it as low-friction as possible. But I will totally bow to the will of the crowd here, we can use whatever people find intuitive.
I use
I agree with @normanrz, the length of
what about
Just for reference: xarray uses
ping @ilan-gold for visibility
ilan-gold left a comment
use data structures compatible with tensorstore. tensorstore's indexing machinery serializes to JSON. if we adopt the same patterns, we are closer to using tensorstore as an optional backend.
I'm sure zarrs could also support this, but I wonder about JSON performance for things like vindex/oindex, where you could easily have tens of thousands of individual coordinates.
- **translate(shift)**: shift all output coordinates. This makes coordinates chunk-local: "express my coordinates relative to the chunk origin."
- **compose(outer, inner)**: chain two transforms. See ``composition.py``.
I think it could be cool to have different kinds of composition, like union, and I don't see that anywhere in the PR. It seems that only array.z[my_transform][my_other_transform_relative_to_the_first] is supported, whereas it could be cool to have array.z[my_transform + my_other_transform]. Concretely, if I wanted (slice(0, 10), slice(0, 10)) and then also (slice(50, 60), slice(50, 60)) from an array, how would I do that? np.concatenate + np.arange to generate outer indexers is the only option AFAICT, but that kind of stinks.
that's a very cool idea, not sure about overloading the addition operator, but some method for combining indexing expressions seems nice. I'll see what we need for this
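For reference, the workaround mentioned in this thread can be sketched with plain NumPy, using `np.ix_` as a stand-in for outer (orthogonal) indexing. This is only an illustration of the concatenated-indexer approach, not a proposed API: it shows that the outer product selects a 20×20 region, of which only the two diagonal blocks are the ones that were asked for.

```python
import numpy as np

data = np.arange(100 * 100).reshape(100, 100)

# One outer (orthogonal) indexer per axis, covering both requested ranges.
rows = np.concatenate([np.arange(0, 10), np.arange(50, 60)])
cols = np.concatenate([np.arange(0, 10), np.arange(50, 60)])

# np.ix_ gives NumPy the outer-product semantics of orthogonal indexing.
outer = data[np.ix_(rows, cols)]
print(outer.shape)  # (20, 20)

# The two wanted blocks sit on the diagonal of the result.
block_a = outer[:10, :10]
block_b = outer[10:, 10:]
assert np.array_equal(block_a, data[0:10, 0:10])
assert np.array_equal(block_b, data[50:60, 50:60])
```

The off-diagonal blocks (rows 0-9 with columns 50-59, and the reverse) are selected as well, so this reads 400 elements to obtain 200, which is part of why a dedicated way to combine indexing expressions would be attractive.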
Add test_indexing_parity: a NumPy-oracle property test over the cartesian product
{basic, oindex, vindex, mask} × {eager array, lazy view} × {out=None, out=buffer}.
Every lazy/view indexing bug found in review was "the lazy/view path diverges from
NumPy/eager for some (method, parameter) combination"; enumerating that surface
catches the whole class instead of one case per review round. Doubles as a guard
for the existing eager indexing paths (incl. out=), independent of lazy indexing.
The harness immediately surfaced two fixes:
- _get_selection_via_transform validated a provided out= buffer against the
multi-dimensional out_shape, but vectorized reads scatter through a flat index;
validate against the (flat) buffer_shape instead, matching the eager contract.
- _is_vectorized_transform required >= 2 coordinate ArrayMaps, missing a single
coordinate map with a multi-dimensional index array (vindex[idx_2d] on a 1-D
view), which also scatters flat. Flag a flat buffer whenever any coordinate
(input_dimension is None) ArrayMap is present.
Also adds a deterministic view-method out= regression test.
Assisted-by: ClaudeCode:claude-opus-4.8
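The second fix above concerns a single multi-dimensional index array on a 1-D view. The NumPy-only sketch below shows the relationship the commit relies on: gathering through the flattened index into a flat buffer and then reshaping gives the same result as indexing with the 2-D array directly. Variable names are illustrative, and this is not zarr's internal code path.

```python
import numpy as np

storage = np.arange(10, 20)          # contents of a 1-D array or view
idx_2d = np.array([[0, 3], [9, 3]])  # one coordinate array, two-dimensional

# What the caller expects: the result takes the shape of the index array.
expected = storage[idx_2d]

# Flat gather: fill a buffer of idx_2d.size elements, then reshape at the end.
flat_buffer = np.empty(idx_2d.size, dtype=storage.dtype)
flat_buffer[:] = storage[idx_2d.ravel()]
result = flat_buffer.reshape(idx_2d.shape)

assert np.array_equal(result, expected)
print(result.tolist())  # [[10, 13], [19, 13]]
```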
…dows)

Promote the inlined test-data generators from the indexing parity test into zarr.testing.strategies as documented, reusable strategies:
- indexers(mode, shape): one strategy returning a (zarr_selection, numpy_selection) pair for any of basic / oindex / vindex / mask, so indexing tests are written once and parametrized over mode instead of re-deriving selection setup.
- windows(shape): non-negative, full-rank slice windows for building lazy views.

test_indexing_parity now draws from these shared strategies. The reusable data-generation building blocks, not any single test, are the durable asset: new indexing tests (and downstream users of zarr.testing) compose them rather than reinventing setup.

Assisted-by: ClaudeCode:claude-opus-4.8
…pting get_block_selection / set_block_selection were not updated for non-identity (lazy view) arrays: BlockIndexer used the view shape against the storage chunk grid, so v.blocks[i] silently read/wrote raw storage blocks, ignoring the view's offset/stride. Block selection addresses the storage chunk grid and is ill-defined on a view, so raise NotImplementedError (pointing users to materialize the view) rather than return/write the wrong region. Element-level lazy indexing (basic/oindex/vindex/mask) is unaffected. Assisted-by: ClaudeCode:claude-opus-4.8
…eable)
Split the cross-path indexing oracle so the test improvements can land ahead of
lazy indexing:
- assert_indexing_matches_numpy(target, ref, mode, data): a consumer-agnostic
helper — `target` is any zarr.Array (eager array or lazy view), checked against
a NumPy reference for a drawn selection, with and without out=.
- test_indexing_parity: eager-only, no reference to Array.lazy. Comprehensively
guards the existing eager indexing API ({basic,oindex,vindex,mask} × out=) and
can merge on its own.
- test_lazy_view_indexing_parity: reuses the same helper for a lazy view, and is
skipped unless Array.lazy exists — so it's inert on a tree without the feature.
The lazy view is "just another zarr.Array" flowing through the same oracle, so
adding the lazy consumer is additive rather than a rewrite. windows() is reworded
as a generic sub-region strategy (not lazy-specific).
Assisted-by: ClaudeCode:claude-opus-4.8
Replace test_basic_indexing / test_oindex / test_vindex / test_mask_indexing with one enumerated oracle, test_indexing_parity, covering every mode against a NumPy reference: sync read, read into an out= buffer, async read, and write. The historical per-mode write guards are preserved in one place (vindex setitem is chunk-dependent for repeated coords; oindex/mask writes to sharded arrays are unsupported per GH2834; oindex repeated-index writes are unspecified). Selection generation comes from the shared indexers() strategy, and the read half is shared with the lazy-view test via assert_read_matches_numpy. Dedicated tests remain for inputs the oracle's strategies don't generate: complex rectilinear arrays and block indexing. Assisted-by: ClaudeCode:claude-opus-4.8
…verage)

Low-severity follow-ups from the branch review (jobs 303/304), no behavior change:
- array.py: correct the stale "needs_flat_buffer" comment (any coordinate ArrayMap triggers the flat buffer, not just >= 2).
- array.py set_coordinate_selection: normalize empty fields with an explicit `== [] or == ()` check instead of a falsy `if not fields`.
- transform._intersect_vectorized: drop the unused n_points variable.
- test_properties: restore the vindex[mask] read+write equivalence coverage that the per-mode fold dropped (the two mask spellings must agree).
- strategies.indexers: fix the docstring's zero-size-axis claim to match _eligible.

Assisted-by: ClaudeCode:claude-opus-4.8
The lazy accessor copies TensorStore: index-array values are literal coordinates (origin 0), not NumPy from-the-end. Out-of-domain values (negative, or >= the axis length) previously resolved to silently-wrong data (e.g. lazy.oindex[[-1]] returned element 0); now they raise BoundsCheckError at transform construction, matching the eager bounds check. Applies to both the literal accessor and the NumPy-normalizing view methods (which normalize first, then bounds-check the result). Basic accessor negatives already raise as literal out-of-range. Keeps the intentional accessor-vs-method asymmetry (the accessor is the TensorStore-style transform builder; the Array methods normalize NumPy-style); documented and pinned by test_accessor_negative_index_is_literal. Assisted-by: ClaudeCode:claude-opus-4.8
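A minimal sketch of the literal-coordinate check described in this commit, assuming a zero-origin axis of known length. The function name `check_literal_indices` and the local `BoundsCheckError` class are stand-ins written for this example; only the behavior (negative or too-large values raise instead of wrapping) is taken from the commit message.

```python
import numpy as np


class BoundsCheckError(IndexError):
    """Local stand-in for zarr's error type, which is described as an IndexError."""


def check_literal_indices(indices, length: int) -> np.ndarray:
    """Validate index-array values as literal coordinates in [0, length)."""
    idx = np.asarray(indices, dtype=np.intp)
    bad = (idx < 0) | (idx >= length)
    if bad.any():
        raise BoundsCheckError(
            f"indices {idx[bad].tolist()} are outside the valid range [0, {length}); "
            "negative values are not interpreted from the end"
        )
    return idx


print(check_literal_indices([0, 3], 4).tolist())  # [0, 3]

try:
    check_literal_indices([-1], 4)
except BoundsCheckError as err:
    print("rejected:", err)
```

NumPy would resolve `-1` to the last element here; the point of the commit is that the lazy accessor refuses it at construction time rather than silently reading a different cell.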
Mirror the review fixes from the eager test-infra PR onto the lazy branch:
- type `mode` as the `IndexMode` literal across the indexing test helpers
- `_VECTORIZED_MODES` is a tuple like `_INDEX_MODES`
- `indexers(...)` returns `tuple[Selection, Selection]` (real type)

Assisted-by: ClaudeCode:claude-opus-4.8
…ted write

Replace the brittle silent-`return` write guards in the indexing oracle with a principled split:
- test_indexing_read_parity: reads (sync/out=/async) for every mode, no guards; reads are well-defined for all selections.
- test_indexing_write_parity: writes, filtering with `assume` only the cases that have no well-defined oracle: repeated coordinates (order-dependent, a fundamental limitation shared with NumPy, not a bug). This now actually exercises vindex writes, sharded single-axis oindex/mask writes, and unsharded multi-axis oindex writes, which the old blanket `assume(shards is None)` skipped.
- test_oindex_sharded_multiaxis_write_xfail: pins the one genuinely-unsupported write (orthogonal across >=2 array axes on a sharded array, GH2834) as a strict xfail, so it is visible at the pytest level and flips to a failure when fixed.

Assisted-by: ClaudeCode:claude-opus-4.8
…dex normalization

CI (slow-hypothesis, ci profile) found test_indexing_write_parity falsified by a vindex write with selection [2, -2, 0, 1] on length 4: -2 normalizes to 2, so cell 2 is written twice. That is the undefined repeated-target case (order-dependent, NumPy has the same ambiguity), but the old filter checked *raw* index values and missed the post-normalization collision.

Replace _has_repeated_indices with _write_is_unambiguous(mode, npsel, shape), which normalizes negative indices first and is mode-aware: per-axis uniqueness for oindex (independent axes), unique coordinate tuples for vindex (correlated axes). Not a zarr bug; a test-oracle gap.

Assisted-by: ClaudeCode:claude-opus-4.8
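The filter described above can be sketched as follows. This is an independent reconstruction from the commit message, not the test suite's `_write_is_unambiguous`; it assumes the selection is a tuple of per-axis integer arrays (no slices) and that negative values follow NumPy's from-the-end convention.

```python
import numpy as np


def normalize(indices, length: int) -> np.ndarray:
    """Map NumPy-style negative indices onto [0, length)."""
    idx = np.asarray(indices, dtype=np.intp)
    return np.where(idx < 0, idx + length, idx)


def write_is_unambiguous(mode: str, selection, shape) -> bool:
    """True if no target cell is written more than once (sketch only)."""
    normalized = [normalize(sel, n) for sel, n in zip(selection, shape)]
    if mode == "oindex":
        # Axes are independent: a repeat on any single axis repeats whole cells.
        return all(np.unique(ax).size == ax.size for ax in normalized)
    if mode == "vindex":
        # Axes are correlated: only repeated coordinate tuples collide.
        columns = [ax.ravel() for ax in np.broadcast_arrays(*normalized)]
        coords = np.stack(columns, axis=1)
        return np.unique(coords, axis=0).shape[0] == coords.shape[0]
    raise ValueError(f"unsupported mode: {mode!r}")


# The falsifying example from CI: -2 normalizes to 2, so cell 2 is hit twice.
print(write_is_unambiguous("vindex", ([2, -2, 0, 1],), (4,)))    # False
# Same per-axis arrays, different verdicts depending on the mode.
print(write_is_unambiguous("vindex", ([0, 0], [0, 1]), (4, 4)))  # True
print(write_is_unambiguous("oindex", ([0, 0], [0, 1]), (4, 4)))  # False
```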
… + dtype-strict oracle

Borrow the strongest ideas from TensorStore's and ndindex's indexing test suites:
- Stateful harness (TensorStore driver_testutil pattern, NumPy oracle): _IndexingStateMachine applies many random reads/writes (basic/oindex/vindex/mask) to one array while a NumPy model tracks it in lockstep, asserting equality after every step. Catches read-modify-write, chunk-boundary, and shard-merge bugs the single-shot parity tests cannot. Runs over regular/sharded/3D geometries, gated behind --run-slow-hypothesis like the store stateful tests.
- Error parity (ndindex check_same): test_indexing_bounds_error_parity generates possibly-out-of-bounds integer selections and asserts zarr raises iff NumPy raises (zarr's bounds errors subclass IndexError). Catches the silent-wrong-data class: returning garbage where NumPy would raise.
- Strict read oracle: assert dtype, not just values/shape, for ndim>=1 results.
- Regression anchor: test_write_is_unambiguous pins the negative-index collision fix ([2,-2,0,1] -> cell 2 written twice).

Assisted-by: ClaudeCode:claude-opus-4.8
…cope The strategy covers only the element-space indexing modes that have a direct NumPy-array oracle (basic/oindex/vindex/mask). Rename it so the name states that, and document that block indexing is deliberately excluded — it addresses the chunk grid (parametrized by the grid, not shape) and has no NumPy equivalent, so it lives in the separate block_indices strategy. Assisted-by: ClaudeCode:claude-opus-4.8
mkdocs renders single backticks for inline code; the docstrings used RST-style double backticks. Normalize to single backticks across the indexing strategies and property tests. Assisted-by: ClaudeCode:claude-opus-4.8
…ambiguous NumPy's deterministic last-write-wins is an implementation detail of serial, C-order index iteration, not a guarantee. zarr writes scalars into chunks and cannot in general promise a write order, so it cannot be expected to match NumPy's choice — hence no well-defined oracle for duplicate-target writes. Assisted-by: ClaudeCode:claude-opus-4.8
…ases Replace raw pytest.param tuples with the conftest Expect dataclass (input / output / id), matching the established idiom in test_chunk_grids etc. Assisted-by: ClaudeCode:claude-opus-4.8
Layer C of the lazy-view chunk-layout design. Members that assume the array fills its chunk grid now raise LazyViewError (a NotImplementedError subclass) on a non-identity lazy view instead of silently describing the backing grid: chunks, shards, read_chunk_sizes, write_chunk_sizes, cdata_shape, nchunks, nchunks_initialized, nbytes_stored, resize, append, info, info_complete. Normal (identity) arrays are unaffected, and views are new, so nothing breaks. Guards go through a single AsyncArray._require_identity helper (sync members delegate). Also fix size/nbytes, which are logical "keep" members: they now derive from the view's shape rather than the backing metadata shape (a.lazy[2:10].size == 8, not 12). cdata_shape's sync accessor is redirected to the guarded public property. Design: docs/superpowers/specs/2026-07-01-lazy-view-chunk-layout-design.md (gist: https://gist.github.com/d-v-b/f50c93bc5f365e6986dadba6e4aa1902). Next: Layer A (ChunkLayout, zarr-developers#4040) and Layer B (chunk_projections partition API). Assisted-by: ClaudeCode:claude-opus-4.8
Layer B of the lazy-view chunk-layout design: expose the L4 partition that a lazy view already resolves. `Array.chunk_projections(unit="read"|"write")` yields a `ChunkProjection(coord, key, shape, chunk_selection, array_selection, is_partial)` per stored chunk the array/view projects onto, reusing the I/O path's `iter_chunk_transforms` + `sub_transform_to_selections`. `is_chunk_aligned()` is the thin L3 wrapper.
- Identity arrays: projections tile the whole domain (none partial), recovering the chunk grid. Views: only touched chunks, with partial-coverage flagged (RMW).
- Arbitrary selections compose through the lazy accessor: `array.lazy[sel].chunk_projections()` (no `selection=` param).
- `unit="write"` is the store-object (shard, else chunk) grid; `unit="read"` equals it unless sharded. Inner-chunk read partitioning of sharded arrays is deferred (raises NotImplementedError, pointing at unit="write").
- New public `zarr.ChunkProjection`; `shape` is extent-clipped via the grid's chunk_sizes so boundary chunks are not misflagged partial.

Layer A (the static ChunkLayout, zarr-developers#4040) is intentionally left to maxrjones's PR. Design: gist f50c93bc5f365e6986dadba6e4aa1902.

Assisted-by: ClaudeCode:claude-opus-4.8
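To illustrate what such a partition looks like, here is a 1-D sketch under simplifying assumptions: a contiguous view `[start, stop)` over a regular chunk grid, no sharding, and no store keys. The `Projection` dataclass and `project_1d` function are hypothetical and carry only a subset of the fields listed for `ChunkProjection`; they are not the PR's code.

```python
from dataclasses import dataclass


@dataclass
class Projection:
    """Hypothetical 1-D subset of the ChunkProjection fields."""

    coord: int               # chunk index along the axis
    chunk_selection: slice   # region inside the chunk
    array_selection: slice   # positional region inside the view
    is_partial: bool         # True if the view covers only part of the chunk


def project_1d(start: int, stop: int, chunk_size: int, array_len: int) -> list[Projection]:
    """Partition the view [start, stop) by the chunks it touches."""
    if start >= stop:
        return []
    projections = []
    for coord in range(start // chunk_size, (stop - 1) // chunk_size + 1):
        chunk_lo = coord * chunk_size
        # Clip the last chunk to the array extent so it is not misflagged partial.
        chunk_hi = min(chunk_lo + chunk_size, array_len)
        lo, hi = max(start, chunk_lo), min(stop, chunk_hi)
        projections.append(
            Projection(
                coord=coord,
                chunk_selection=slice(lo - chunk_lo, hi - chunk_lo),
                array_selection=slice(lo - start, hi - start),
                is_partial=(hi - lo) < (chunk_hi - chunk_lo),
            )
        )
    return projections


# A view [2, 10) of a length-12 array with chunks of 5 touches chunks 0 and 1.
for p in project_1d(2, 10, chunk_size=5, array_len=12):
    print(p)

# The identity view tiles the whole grid with no partial chunks.
print([p.is_partial for p in project_1d(0, 12, chunk_size=5, array_len=12)])
# [False, False, False]
```

A partially covered chunk is the case that needs a read-modify-write on writes, which is why the flag is worth surfacing.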
Force-pushed from f6bc1bb to 16e96dc
…view feedback)

Address @dcherian's review on zarr-developers#4109: the single mode-dispatched oracle (test_indexing_read_parity / _write_parity going through _get / _setitem / assert_read_matches_numpy) was hard to read, and the stacked `assume()` write filters made it impossible to tell which selections were actually exercised.
- Replace the two folded tests with explicit per-mode read/write tests (test_basic_read, test_oindex_read, ... test_mask_write). Each is self-contained, calls the real accessor directly, and still covers sync + async + out= reads.
- Writes are now CONSTRUCTIVE instead of generate-then-`assume`-reject: oindex draws unique per-axis indices, vindex draws distinct coordinates via unravel_index, so every write is well-defined by construction (nothing hidden behind assume).
- Drop the now-unused assert_read_matches_numpy / _async_get / _VECTORIZED_MODES. The mode-dispatch helpers that remain are used only by the stateful machine (which is inherently random-mode); noted as such.
- State machine: set model from the array (`self.model = self.zarray[:]`) per review.

The consumer-agnostic shared harness stays on the lazy PR (zarr-developers#3906), where a lazy view actually reuses it, introducing that abstraction at the point of second use.

Assisted-by: ClaudeCode:claude-opus-4.8
…n (roborev zarr-developers#341) roborev flagged that _write_is_unambiguous / _n_array_axes were fed the broadcast np.ix_-style `npsel` (what orthogonal_indices returns for NumPy), not the raw per-axis `zsel`. For oindex the broadcast form tiles every component across all axes, so the per-axis uniqueness check tripped on *tiling* not duplicates: it rejected essentially every multi-axis orthogonal write. Effect: test_indexing_write_parity `assume`d them away and the state machine's write rule returned early for them, so the stateful harness performed no multi-axis writes at all — silently gutting the coverage it exists for. (The feature itself is fine; covered by TestLazyOIndex.test_multi_axis_write.) Fix: pass the raw `zsel` to both helpers (correct on per-axis arrays; slices are skipped, so _n_array_axes counts genuine fancy axes for the sharded guard). Clarify the helper docstrings, rename the misleading `npsel` params, and add test_write_filter_rejects_broadcast_oindex pinning the raw-vs-broadcast invariant (the old regression test masked the bug by feeding raw arrays callers never sent). Assisted-by: ClaudeCode:claude-opus-4.8
…nate rule

User-probe feedback showed the lazy path's negative-index handling split three ways: integers literal (plain IndexError, "domain" phrasing), index arrays literal (BoundsCheckError, "length" phrasing), but slice bounds silently wrapped from-the-end. Special-casing slice bounds is not consistent; the rule is now one sentence: a lazy selection is a declaration in literal coordinates, domains start at 0, so a negative value is never in bounds regardless of syntactic form.
- Transform layer: reject negative slice start/stop (_check_slice_bounds) at all three lowering sites (basic/oindex/vindex). Positive overflow still clamps: a slice denotes a range and ranges intersect the domain; clamping can only shorten, never silently select different data the way wraparound does.
- Eager dialect preserved: _normalize_negative_indices now wraps negative slice bounds NumPy-style before lowering, so eager methods on views keep full NumPy semantics (v[-3:] still works); only the .lazy accessor is literal.
- Unified error type and phrasing: all literal-coordinate rejections raise BoundsCheckError (an IndexError) saying "valid indices [lo, hi)", dropping the repr/error "domain" wording collision, plus a hint that negatives are not from-the-end and what to use instead.
- Fix selection_repr crash on 0-d index arrays (len -> size), so integer-indexed fancy views can at least be displayed.
- LazyViewError message no longer recommends the not-yet-existing chunk_layout; points at chunk_projections + metadata/chunk_grid.
- Tests: literal slice bounds (reads/writes/views/oindex), positive-clamp retention, eager-dialect retention, unified error type, guard message, repr regression; strict-xfail pins for the known int-on-fancy-picked-dim defect (read crash; vindex scalar shape) to be fixed separately.

Assisted-by: ClaudeCode:claude-opus-4.8
Assisted-by: ClaudeCode:claude-opus-4.8
New docs/user-guide/lazy_indexing.md, registered in the nav after arrays.md. All examples are live (markdown-exec) so the docs build fails if behavior drifts.

Structure:
- Theory: indexing as a *declaration* (an index transform) vs an action; views are zero-based windows like NumPy views; the repr's domain is provenance, not the indexing coordinate system.
- The literal-coordinate rule: one sentence, consistent across integers, slice bounds, and index arrays (negative is never in bounds), demonstrated with the unified BoundsCheckError; positive slice overflow clamps (range intersection).
- The two dialects on one object: lazy = declare (literal), eager = access (NumPy semantics, negatives wrap), with a table.
- Patterns, all executed: crop/compose, write-through (region, strided, composed), oindex/vindex/mask selection, materialization, chunk-aware processing via chunk_projections/is_chunk_aligned, LazyViewError guards.
- "Coming from NumPy" checklist and current limitations (fancy-int defect, sharded read-unit projections, no newaxis / negative steps).

Content grounded in two executed probe reports (semantics probe + naive-user journey), which also drove the preceding consistency fix.

Assisted-by: ClaudeCode:claude-opus-4.8
…rigin Probing "can a domain contain a negative number?" (IndexDomain.translate makes one in two lines) exposed that the literal-coordinate rule was only implemented for zero-origin domains: on domain [-10, 2), integers were already literal (t[-5] valid, t[-11] rejected with the true bounds) but slice bounds were interpreted as POSITIONS within the domain (t[0:2] silently selected storage [-10, -8)), and the new bounds check hard-coded `< 0` instead of `< lo` — the same ints-vs-slices inconsistency just removed at origin 0, one level down. Complete the rule uniformly: slice bounds name coordinates, shifted into the domain's 0-based range (_resolve_slice_literal; an identity for lo == 0, so every public view is byte-for-byte unchanged) after _check_slice_bounds rejects bounds below inclusive_min — the any-origin generalization of "negative raises". Bounds past exclusive_max still clamp (ranges intersect the domain). Not publicly reachable today: all .lazy lowering re-zeroes domains, now pinned by test_public_view_domains_are_zero_origin so a future translate-style API cannot silently invalidate the documented origin-0 behavior. New TestNegativeOriginDomains covers int/slice literal semantics on [-10, 2). Index-array paths on non-zero-origin domains remain to be audited when a translate API is exposed. Assisted-by: ClaudeCode:claude-opus-4.8
…dialect, translate

Match TensorStore's indexing model exactly, grounded in an executed behavioral oracle (tensorstore 0.1.84; every rule below verified empirically and ported to tests/test_transforms/test_tensorstore_parity.py):
- Domain preservation: a step-1 slice keeps its literal coordinates as the new domain (a.lazy[2:10] has domain [2,10); v[3] is coordinate 3 = base cell 3; v[0] is out of bounds). Nothing re-zeros implicitly.
- Strided-domain rule: origin = trunc(start/step) (toward zero), size = ceil((stop-start)/step), coordinate origin+k -> base start + k*step.
- Strict containment: non-empty slice intervals must lie within the domain (no clamping, no negative wrapping); empty intervals are valid anywhere; reversed bounds are an invalid-interval error, not an empty result.
- One dialect: views use domain coordinates on EVERY path (lazy accessor and eager methods alike); the legacy wraparound/boundscheck pre-validation on the view branches is removed, as is _normalize_negative_indices (identity arrays still take the unchanged NumPy-dialect legacy fast path).
- translate_domain_by/translate_domain_to on IndexTransform, exposed as Array.translate_by / Array.translate_to (TensorStore's translate_*): shift a domain, or re-zero a view explicitly, preserving which cells are addressed.
- Index-array values are literal domain coordinates; fancy dims keep fresh [0, n) zero-origin domains (already matching TensorStore).
- Views are not iterable (TypeError, like TensorStore): the getitem-from-0 iteration protocol would silently yield nothing on a preserved domain.
- The I/O layer and chunk_projections normalize to zero-origin via translate_domain_to at their boundaries, so chunk resolution and buffer placement are unchanged; ChunkProjection.array_selection stays positional.

Negative steps (supported by TensorStore) remain rejected for now; the new resolver's arithmetic already generalizes, enablement is a follow-up.

Tests: oracle-parity module (47 cases incl. negative-origin domains where -1 is just another index); all lazy/view/composition/error tests rewritten to domain coordinates; property parity re-zeroes views via translate_to (exercising it on every example); superseded zero-origin/clamp-era tests removed.

Assisted-by: ClaudeCode:claude-opus-4.8
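The strided-domain rule stated above can be worked through in a few lines. The helper names (`trunc_div`, `strided_domain`, `to_base`) are made up for this sketch, and only positive steps are handled, matching the note that negative steps remain rejected.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division rounding toward zero (Python's // rounds toward -inf)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def strided_domain(start: int, stop: int, step: int) -> tuple[int, int]:
    """Return (origin, size) of the view domain produced by start:stop:step."""
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    origin = trunc_div(start, step)
    size = max(0, -(-(stop - start) // step))  # ceil((stop - start) / step)
    return origin, size


def to_base(coord: int, start: int, step: int, origin: int) -> int:
    """Map a view coordinate origin + k back to base cell start + k * step."""
    return start + (coord - origin) * step


# 3:10:2 -> origin trunc(3/2) = 1, size ceil(7/2) = 4
origin, size = strided_domain(3, 10, 2)
print(origin, size)  # 1 4
print([to_base(c, 3, 2, origin) for c in range(origin, origin + size)])
# [3, 5, 7, 9]

# On a negative-origin domain, -5:4:2 -> origin trunc(-2.5) = -2, size 5
origin, size = strided_domain(-5, 4, 2)
print(origin, size)  # -2 5
print([to_base(c, -5, 2, origin) for c in range(origin, origin + size)])
# [-5, -3, -1, 1, 3]
```

The second case is where truncation toward zero matters: floor division would give an origin of -3 instead of -2.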
… semantics The theory section now teaches the real model: domains are preserved (an index is a stable name for a cell, not a position), -1 is just another coordinate — demonstrably valid after translate_by((-10,)) — strict containment replaces clamping, strided views renumber by trunc-division, and every path on a view shares one coordinate system (base arrays keep NumPy semantics unchanged). Adds translate_to/translate_by patterns, eager iter() rejection on views (made eager — a generator __iter__ deferred the raise to first next()), and the positional contract of ChunkProjection.array_selection. All examples remain live (markdown-exec) and were executed end-to-end. Assisted-by: ClaudeCode:claude-opus-4.8
…arr-developers#358) roborev flagged mask-on-non-zero-origin-view as reading "wrong" cells, assuming NumPy positional semantics. Verified against the executed tensorstore 0.1.84 oracle: TensorStore treats mask True-positions as absolute coordinates counted from 0, origin-blind (a mask is sugar for the coordinate array of its True positions), so a True at position 3 on a domain-[2,10) view addresses cell 3 — exactly our behavior; a True below the origin is out of domain (we reject eagerly where TensorStore defers to read — the documented timing deviation). The real gap was coverage: nothing pinned this on a translated view, so the reviewer could not infer intent. Add the oracle-pinning regression test (read + write + below-origin rejection) and sharpen the docs callout. Assisted-by: ClaudeCode:claude-opus-4.8
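The mask behavior described in this commit can be sketched for the 1-D case as follows. `mask_to_coordinates` is a hypothetical helper written for this example; it treats True positions as absolute coordinates counted from 0 and rejects any that fall outside the view's domain, which is the eager-rejection behavior the commit describes.

```python
import numpy as np


def mask_to_coordinates(mask, inclusive_min: int, exclusive_max: int) -> np.ndarray:
    """Turn a 1-D boolean mask into literal coordinates, ignoring the view origin."""
    coords = np.flatnonzero(np.asarray(mask, dtype=bool))  # positions counted from 0
    outside = (coords < inclusive_min) | (coords >= exclusive_max)
    if outside.any():
        raise IndexError(
            f"mask selects {coords[outside].tolist()}, "
            f"outside the domain [{inclusive_min}, {exclusive_max})"
        )
    return coords


# On a view with domain [2, 10), a True at position 3 addresses cell 3.
mask = np.zeros(10, dtype=bool)
mask[[3, 7]] = True
print(mask_to_coordinates(mask, 2, 10))  # [3 7]

# A True below the origin is out of domain and is rejected.
mask[0] = True
try:
    mask_to_coordinates(mask, 2, 10)
except IndexError as err:
    print("rejected:", err)
```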
My summary:
With dask maintenance on the decline, it's more important than ever that we give zarr-python users a dask-free way to do something very intuitive: index large zarr arrays without turning the whole thing into a numpy array first. This was discussed at length in #1603.
This PR, done with Claude, makes regular indexing go through a lazy indexing layer. The lazy indexing layer is based on abstractions defined in tensorstore. The basic idea is to explicitly model indexing an array as a transformation from some input coordinates to output coordinates, and to bind such a representation to our `Array` classes.

Regular indexing via `.__getitem__` is still immediate, but arrays have a new `.z` attribute that exposes the lazy indexing layer:

Goals here:
`main` are disparate ad-hoc copies of stuff from zarr-python 2.x. We can do better.

Non-goals:
Claude's summary.
Add a new `src/zarr/core/transforms/` package implementing TensorStore-inspired index transforms. The core idea: every indexing operation (slicing, fancy indexing, etc.) produces a coordinate mapping from user space to storage space. These mappings compose lazily: no I/O until explicitly resolved.

Key types:
- `IndexDomain`: rectangular region in N-dimensional integer space
- `ConstantMap`, `DimensionMap`, `ArrayMap`: three representations of a set of storage coordinates (singleton, arithmetic progression, explicit enumeration)
- `IndexTransform`: pairs an input domain with output maps (one per storage dim)
- `compose(outer, inner)`: chain two transforms

Key operations on IndexTransform:
- `__getitem__`, `.oindex[]`, `.vindex[]`: indexing produces new transforms
- `.intersect(domain)`: restrict to coordinates within a region (chunk resolution)
- `.translate(shift)`: shift coordinates (make chunk-local)

The transform library is standalone with no dependency on Array.

Includes comprehensive test suite (143 tests covering all types, operations, composition, chunk resolution, and edge cases).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>