Adding to_dask/from_dask #198
Comments
Should add that this is not urgent; just something that might be interesting to think about later.
Yes, this would be useful, although I reckon these should go into dask as da.from_zarr and da.to_zarr; I think of dask as sitting one level above zarr in the dependency stack.
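For concreteness, a minimal sketch of what such wrappers could look like, simply delegating to existing dask and zarr APIs. The names from_zarr/to_zarr are what is being proposed in this thread rather than functions that existed at the time, and the store/create keyword passthrough is an assumption about how the convenience might be spelled:

```python
import dask.array as da
import zarr

def from_zarr(z):
    # Wrap an existing zarr.Array as a dask array, reusing zarr's chunk grid.
    return da.from_array(z, chunks=z.chunks)

def to_zarr(d, store=None, **create_kws):
    # Materialise a dask array into a newly created zarr array, block by block.
    z = zarr.create(shape=d.shape, chunks=d.chunksize, dtype=d.dtype,
                    store=store, **create_kws)
    da.store(d, z)
    return z
```

Routing the write through da.store means the result streams into the target one block at a time instead of being materialised in memory first.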
That sounds like a reasonable alternative. Thoughts @mrocklin?
What would the In the
Well, it would be able to extract the dtype and chunks from the Zarr Array as well. This crosses over a bit with issue (dask/dask#1983).
Yeah, I agree that this is the same as with HDF5 and other array objects. The .dtype, .chunks, and .shape attributes effectively form a protocol. The issue here is that it's not always clear that dask.array should use the chunksizes in the storage format.
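As a small illustration of that protocol point (a sketch only; MyStorage is a made-up stand-in, not a real class): da.from_array works with anything exposing shape, dtype, ndim and numpy-style slicing, and the chunks argument is where the caller decides whether or not to mirror the storage layout.

```python
import numpy as np
import dask.array as da

class MyStorage:
    # Made-up array-like: anything exposing shape/dtype/ndim and basic
    # slicing can be wrapped by da.from_array; chunks is extra metadata
    # the caller may or may not choose to honour.
    shape = (1_000, 1_000)
    dtype = np.dtype("f8")
    ndim = 2
    chunks = (100, 100)

    def __getitem__(self, key):
        return np.zeros(self.shape, self.dtype)[key]

s = MyStorage()
d_native = da.from_array(s, chunks=s.chunks)      # follow the storage chunking
d_coarser = da.from_array(s, chunks=(500, 500))   # or pick something else
```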
FWIW I always do d = da.from_array(z, chunks=z.chunks). I'd appreciate a convenience, e.g. da.from_zarr(z). It would also be useful to have da.to_zarr(d, **create_kws), which is basically a convenient way to materialise the result of a computation into a new zarr array. Happy to provide further details if you think this is worth considering.
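A rough sketch of the pattern being described, using only APIs that already exist (the zarr.create call marks where create_kws would be passed through by a to_zarr convenience):

```python
import numpy as np
import zarr
import dask.array as da

# The manual route a da.to_zarr convenience would wrap: build the dask array
# on zarr's own chunk grid, describe a computation lazily, then materialise
# the result into a freshly created zarr array block by block.
z = zarr.array(np.random.random((2_000, 2_000)), chunks=(500, 500))
d = da.from_array(z, chunks=z.chunks)

result = (d - d.mean(axis=0)) / d.std(axis=0)       # any lazy computation

out = zarr.create(shape=result.shape, chunks=result.chunksize,
                  dtype=result.dtype)                # create_kws would go here
da.store(result, out)                                # blockwise write
```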
One path: if we want to support the convenience of matching the dask array's chunks to the zarr array's chunks, then I would suggest that we just bake this into da.from_array.
I'm ok with that. IOW, just go with the solution proposed in issue (dask/dask#1983)?
Separately, would add that part of the reason I had been thinking about having to_dask in Zarr is that we could bypass some of the indexing logic if the chunks in the Dask Array are the same as those in the Zarr Array. Namely, we could pull directly from the underlying store and decompress the contents instead of worrying about how to handle slicings that overlap multiple chunks. If we want to get even more clever for this case, we don't even have to do the decompression initially at all. Instead we just return an in-memory Zarr Array for the chunk, allowing it to be decompressed when used.
FWIW the overhead of the indexing logic is very minimal compared to the time taken in decompression, even with the fastest compressor, so I don't think you would gain much performance-wise by providing a more direct route to access a chunk. Out of interest, what is the motivation for thinking to return an in-memory zarr array for each chunk?
The main benefit is that we delay decompression and keep the memory footprint small in Dask. Depending on the operations performed, it may not be necessary to decompress at all. The tail end of this comment provides one such use case.
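A hypothetical illustration of that idea, not anything zarr or dask provides: pull the raw compressed bytes for each chunk straight out of the underlying store and wrap the decoding in a delayed task, so nothing is decompressed until compute time. This assumes zarr v2 key naming, a standalone 1-D array with no filters, and chunks that evenly divide the shape; load_chunk is a made-up helper.

```python
import numpy as np
import zarr
import dask
import dask.array as da

z = zarr.array(np.arange(1_000_000), chunks=250_000)

def load_chunk(zarr_array, chunk_index):
    # Fetch one chunk's compressed bytes and only decode on demand.
    key = ".".join(str(i) for i in chunk_index)    # "0", "1", ... for a 1-D array
    raw = zarr_array.store[key]                    # compressed bytes, small in memory
    return np.frombuffer(zarr_array.compressor.decode(raw),
                         dtype=zarr_array.dtype).reshape(zarr_array.chunks)

# One delayed task per chunk; the bytes stay compressed until each task runs.
blocks = [da.from_delayed(dask.delayed(load_chunk)(z, (i,)),
                          shape=z.chunks, dtype=z.dtype)
          for i in range(z.cdata_shape[0])]
d = da.concatenate(blocks)
print(d.sum().compute())   # decompression happens here, chunk by chunk
```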
Should add discussion about adding
This is now irrelevant, as Dask can convert to/from Zarr thanks to @martindurant's PR (dask/dask#3460).
Was thinking the other day that it might be nice to have some convenience methods on Zarr's Array for converting it to a Dask Array, and for storing a Dask Array to Zarr. May also make sense to have such methods on Zarr Groups (thinking of cases where the array has not been created yet). For the most part these are pretty straightforward to do outside of Zarr. That said, they would be convenient and maybe cut some boilerplate for end users.