Conversation
```python
else:
    arr = self._ds.data
if isinstance(arr, np.ndarray):
    return arr
```
OK so, to be sure, this was the problem - we sent the whole array. I suppose we could have figured out which chunk to send at this point? But maybe indeed better to use Dask's internal logic to do it for us.
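For reference, a rough sketch (not what this PR does) of the kind of manual block extraction alluded to above: given the per-axis chunk sizes, the slice for a block index can be computed directly on the in-memory numpy array. The helper name here is made up for illustration.

```python
import numpy as np

def get_block(arr, chunks, index):
    """Slice block ``index`` out of ``arr`` for the given per-axis ``chunks``.

    ``chunks`` is a tuple of per-axis chunk-size tuples, e.g. ((2, 2), (3,)),
    and ``index`` is a block index such as (1, 0).
    """
    slices = []
    for sizes, i in zip(chunks, index):
        offsets = np.cumsum((0,) + tuple(sizes))
        slices.append(slice(offsets[i], offsets[i + 1]))
    return arr[tuple(slices)]

arr = np.arange(12).reshape(4, 3)
print(get_block(arr, ((2, 2), (3,)), (1, 0)))  # rows 2-3, all columns
```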
There might be a good semi-internal dask function to be used here, to extract a block from an array that is already in memory. I think we might be paying a significant tokenization cost by using dask.array(...) and should avoid that, but not sure.
That's a good point, actually. Dask will assign a token based on the content of the data, which would be slow for a big array, but you can specify the token so that it doesn't do this. I doubt there is a function for quite what is needed here.
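For illustration, a small sketch of the tokenization point: by default `dask.array.from_array` hashes the array contents to build a task name, and passing `name=False` (or an explicit string) skips that hash. The array size and chunking here are arbitrary.

```python
import numpy as np
import dask.array as da

arr = np.ones((5000, 5000))

# Default: the array contents are hashed ("tokenized") to derive the task
# name, which can be noticeably slow for large in-memory arrays.
a1 = da.from_array(arr, chunks=(1000, 1000))

# Supplying a name (or name=False for a random one) avoids hashing the data.
a2 = da.from_array(arr, chunks=(1000, 1000), name=False)
```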
```python
# dask array
return arr.blocks[i].compute()
# Make a dask.array so that we can return the appropriate block.
arr = dask.array.from_array(arr, chunks=self._chunks[variable])
```
Even if we have a dask array here, can we be sure that it has the same chunks as the one that the client will have? It looks like encoding a numpy-backed `xarray.Dataset` with `to_zarr(...)` prompts zarr to automatically choose a chunking for us. If we serialized a dask-backed `xarray.Dataset`, will `to_zarr(...)` respect the existing chunking? If yes, then we have no problem here.
to_zarr uses the current chunking; rechunking is optional.
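A hedged sketch of the two cases being discussed (store paths and sizes are arbitrary): with a dask-backed dataset, `to_zarr` writes one zarr chunk per dask chunk, while with a numpy-backed dataset zarr chooses the chunking itself unless an explicit encoding is supplied.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": (("a", "b"), np.zeros((100, 100)))})

# dask-backed: the existing dask chunking becomes the zarr chunking.
ds.chunk({"a": 50, "b": 50}).to_zarr("dask_backed.zarr", mode="w")

# numpy-backed: zarr picks chunks automatically unless told otherwise.
ds.to_zarr("numpy_backed.zarr", mode="w",
           encoding={"x": {"chunks": (50, 50)}})
```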
I haven't had a chance to investigate the failure.

The subclasses that override ...
This fix affects access via the server.

The client side constructs an `xarray.Dataset` backed by dask arrays with some chunking. When it loads data, it requests partitions specified by a variable name and a block "part", as in `('x', 0, 0, 1)`. If, on the server side, the `DataSourceMixin` subclass is holding a plain numpy array rather than a dask array, it ignores the "part" and always sends the whole array for the requested variable.

On the client side, this manifests as a mismatch between the dask array's shape (the shape of the data it expects) and the shape of the numpy array that it receives, leading to errors where the data that arrives is larger than the data expected.
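For concreteness, a condensed, hypothetical sketch of the server-side handling this PR moves towards; attribute names loosely follow the diff above, and the real method signature may differ.

```python
import dask.array
import numpy as np

def _get_partition(self, i):
    """Return only the requested block, e.g. i = ('x', 0, 0, 1)."""
    variable, part = i[0], tuple(i[1:])
    arr = self._ds[variable].data
    if isinstance(arr, np.ndarray):
        # Wrap the in-memory array with the chunking the client was told
        # about, so dask can hand back just the requested block.
        arr = dask.array.from_array(arr, chunks=self._chunks[variable])
    return arr.blocks[part].compute()
```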
I expect it's worth refining this to make it more efficient before merging, and
it needs a test. This is just a request for comments and suggestions.