Conversation
```python
else:
    arr = self._ds.data
if isinstance(arr, np.ndarray):
    return arr
```
OK so, to be sure, this was the problem - we sent the whole array. I suppose we could have figured out which chunk to send at this point? But maybe indeed better to use Dask's internal logic to do it for us.
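For reference, a rough sketch (not what this PR does) of the kind of manual block extraction alluded to above: given the per-axis chunk sizes, the slice for a block index can be computed directly on the in-memory numpy array. The helper name here is made up for illustration.

```python
import numpy as np

def get_block(arr, chunks, index):
    """Slice block ``index`` out of ``arr`` for the given per-axis ``chunks``.

    ``chunks`` is a tuple of per-axis chunk-size tuples, e.g. ((2, 2), (3,)),
    and ``index`` is a block index such as (1, 0).
    """
    slices = []
    for sizes, i in zip(chunks, index):
        offsets = np.cumsum((0,) + tuple(sizes))
        slices.append(slice(offsets[i], offsets[i + 1]))
    return arr[tuple(slices)]

arr = np.arange(12).reshape(4, 3)
print(get_block(arr, ((2, 2), (3,)), (1, 0)))  # rows 2-3, all columns
```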
There might be a good semi-internal dask function to be used here, to extract a block from an array that is already in memory. I think we might be paying a significant tokenization cost by using dask.array(...) and should avoid that, but not sure.
That's a good point, actually. Dask will assign a token based on the content of the data, which would be slow for a big array, but you can specify the token so that it doesn't do this. I doubt there is a function for quite what is needed here.
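For illustration, a small sketch of the tokenization point: by default `dask.array.from_array` hashes the array contents to build a task name, and passing `name=False` (or an explicit string) skips that hash. The array size and chunking here are arbitrary.

```python
import numpy as np
import dask.array as da

arr = np.ones((5000, 5000))

# Default: the array contents are hashed ("tokenized") to derive the task
# name, which can be noticeably slow for large in-memory arrays.
a1 = da.from_array(arr, chunks=(1000, 1000))

# Supplying a name (or name=False for a random one) avoids hashing the data.
a2 = da.from_array(arr, chunks=(1000, 1000), name=False)
```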
```python
# dask array
return arr.blocks[i].compute()
# Make a dask.array so that we can return the appropriate block.
arr = dask.array.from_array(arr, chunks=self._chunks[variable])
```
Even if we have a dask array here, can we be sure that it has the same chunks as the one that the client will have? It looks like encoding a numpy-backed `xarray.Dataset` with `to_zarr(...)` prompts zarr to automatically choose a chunking for us. If we serialized a dask-backed `xarray.Dataset`, will `to_zarr(...)` respect the existing chunking? If yes, then we have no problem here.
to_zarr uses the current chunking; rechunking is optional.
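A hedged sketch of the two cases being discussed (store paths and sizes are arbitrary): with a dask-backed dataset, `to_zarr` writes one zarr chunk per dask chunk, while with a numpy-backed dataset zarr chooses the chunking itself unless an explicit encoding is supplied.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": (("a", "b"), np.zeros((100, 100)))})

# dask-backed: the existing dask chunking becomes the zarr chunking.
ds.chunk({"a": 50, "b": 50}).to_zarr("dask_backed.zarr", mode="w")

# numpy-backed: zarr picks chunks automatically unless told otherwise.
ds.to_zarr("numpy_backed.zarr", mode="w",
           encoding={"x": {"chunks": (50, 50)}})
```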
I haven't had a chance to investigate the failure.

The subclasses that override ...
This fix affects access via the server.

The client side constructs an `xarray.Dataset` backed by dask arrays with some chunking. When it loads data, it requests partitions specified by a variable name and a block "part", as in `('x', 0, 0, 1)`. If, on the server side, the `DataSourceMixin` subclass is holding a plain numpy array rather than a dask array, it ignores the "part" and always sends the whole array for the requested variable.

On the client side, this manifests as a mismatch between the dask array's shape (the shape of the data it expects) and the shape of the numpy array that it receives, leading to errors where the data that arrives is larger than the data expected.
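For concreteness, a condensed, hypothetical sketch of the server-side handling this PR moves towards; attribute names loosely follow the diff above, and the real method signature may differ.

```python
import dask.array
import numpy as np

def _get_partition(self, i):
    """Return only the requested block, e.g. i = ('x', 0, 0, 1)."""
    variable, part = i[0], tuple(i[1:])
    arr = self._ds[variable].data
    if isinstance(arr, np.ndarray):
        # Wrap the in-memory array with the chunking the client was told
        # about, so dask can hand back just the requested block.
        arr = dask.array.from_array(arr, chunks=self._chunks[variable])
    return arr.blocks[part].compute()
```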
I expect it's worth refining this to make it more efficient before merging, and
it needs a test. This is just a request for comments and suggestions.