Method to find Zarr Array checksum #203

Merged: 6 commits merged into zarr-developers:master on Nov 28, 2017
Conversation

@jakirkham (Member)

Fixes https://github.com/alimanfoo/zarr/issues/98

Provides a method for Zarr Arrays to compute their checksum. It uses the sha1 hash by default thanks to its speed and low probability of collisions, though the user can change the hash to any algorithm supported by their version of OpenSSL (per the hashlib docs) should they desire to. The hashing is done on the data in the chunk store (i.e., the encoded data) and on the metadata (shape, dtype, order, filters, and attributes).
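For readers skimming the thread, here is a minimal sketch of the general approach (not the exact code in this PR); the helper name array_hexdigest and the specific store keys hashed are assumptions for illustration:

import hashlib


def array_hexdigest(array, hashname="sha1"):
    """Sketch: hash an array's encoded chunks plus its metadata/attributes.

    Assumes ``array.store`` is a mapping of key -> encoded bytes, with the
    array metadata under the '.zarray' key and user attributes under
    '.zattrs' when present.
    """
    h = hashlib.new(hashname)

    # Hash the array metadata (shape, dtype, order, filters, ...).
    h.update(array.store[".zarray"])

    # Hash user attributes if any were written.
    if ".zattrs" in array.store:
        h.update(array.store[".zattrs"])

    # Hash the encoded (compressed) chunk data in a deterministic key order.
    for key in sorted(array.store.keys()):
        if key not in (".zarray", ".zattrs"):
            h.update(array.store[key])

    return h.hexdigest()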

@alimanfoo (Member) left a comment:

Nice. There are a couple of possible simplifications but otherwise LGTM.

zarr/core.py Outdated
try:
h.update(r)
except TypeError:
h.update(r.encode())
@alimanfoo (Member):

Instead of this you could just pull out '.zarray' from the store and hash that.

@jakirkham (Member, Author):

Sounds like a much better idea. Can we rely on .zarray always being there or should we add some logic for when it is missing?

@alimanfoo (Member):

It will always be there; an array isn't an array if it doesn't have a .zarray.

zarr/core.py Outdated
h.update(r.encode())

for k, v in self.attrs.items():
h.update((k, v))
@alimanfoo (Member):

You could just pull out '.zattrs' from the store and hash that.

@jakirkham (Member, Author) commented Nov 27, 2017:

However, that is not always there, right? Though I suppose we could handle that case.

That said, I was trying to work around issues where .zattrs may be too large to fit into memory all at once. In that case, each key-value pair should still fit in memory comfortably.

@alimanfoo added this to the v2.2 milestone on Nov 27, 2017
@jakirkham (Member, Author):

Thanks for taking a look. Happy to make tweaks. I added a couple of comments inline.

As a side note, I would like to use this to implement __dask_tokenize__ as well.

@alimanfoo (Member):

Re .zattrs, FWIW I think you can rely on that being small enough to pull into memory; any attribute key update requires loading the whole JSON document into memory anyway.

@alimanfoo (Member):

Re __dask_tokenize__, I was just looking at the deterministic hashing docs, but it doesn't actually say what the hashing is used for. Is this to do with caching intermediate values in a dask computation?

@jakirkham (Member, Author) commented Nov 27, 2017:

It's used for things like constructing the name of the Dask Array created by from_array, though I agree that should be more explicit in the docs (particularly for from_array). Raised as an issue (dask/dask#2930).

Really any time something calls tokenize on an object it ends up searching for __dask_tokenize__ or a function registered with normalize_token.
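As a hedged sketch of that mechanism (the use of hexdigest() here is an assumption about how this PR's method could be wired in, not code from the PR):

import dask.base
import zarr.core


# Register a tokenizer externally, without touching zarr itself.  (The
# alternative is for zarr.core.Array to define __dask_tokenize__ directly,
# returning the same value.)
@dask.base.normalize_token.register(zarr.core.Array)
def tokenize_zarr_array(array):
    # Equal encoded content and metadata -> equal token, stable across runs.
    return array.hexdigest()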

@jakirkham (Member, Author):

@jcrist can probably clarify this better than I can, though. 😉

@jakirkham (Member, Author) commented Nov 27, 2017:

Hmm...the checksums I get locally do not match with those the CIs seem to be getting. 😟

Also I'm a little surprised that the checksums of the TestArrayWithPath and TestArrayWithChunkStore cases do not match the TestArray case for me locally. I wouldn't expect storage location or format to have an effect, but it seems to. 😕 Edit: Both of these use LZ4 compression, unlike the other test cases, which use Zlib. So the content is actually different and should hash differently.

Any thoughts on either of these (edit: the former) would be appreciated.

Provides a new `hexdigest` method for Zarr Arrays that creates a
checksum from the array based on its encoded data and its metadata. Uses
the sha1 checksum by default due to its speed and reliability.
@jakirkham (Member, Author):

Looks like the checksum mismatch issues are restricted to Python 2 only. Not sure of the cause yet, but it could be some subtlety in how the data is represented in Python 2 as opposed to Python 3. (Edit: these concerns were determined to be non-issues.)

@jakirkham (Member, Author):

Turns out that an extra space is added right before newlines (after commas) in the JSON for things like .zarray on Python 2, but this does not occur with Python 3. I verified this is not a CR character, but actually a space (at least on my Mac).

@alimanfoo (Member):

Hm, the PY2/PY3 JSON issue is annoying. It would be ideal to get the same hexdigest on all platforms. Not sure what to suggest, will give it some thought.

Re __dask_tokenize__ I wonder how much it matters if the value is potentially expensive to compute. E.g., if it gets called every time you do da.from_array(z) and z is a large array and/or an array with relatively slow storage (e.g., S3), the delay to compute the hexdigest could make it unusable.

@alimanfoo (Member):

If you set indent=None for all calls to json.dump, does this give consistent JSON in PY2 and PY3? It's supposed to be the most compact representation. I'd be happy to sacrifice readability for consistency.

@jakirkham (Member, Author):

Found the magic sauce as far as Python 2/3 JSON goes. Turns out there was a subtle change in Python 3.4 to the default separator behavior (referenced below). If we just override the separators to match the Python 3.4+ behavior, the issue goes away for Python 2. Pushed a commit to do exactly that. Happy to make the commit message more expressive given this subtlety.

> Changed in version 3.4: Use (',', ': ') as default if indent is not None.

ref: https://docs.python.org/3/library/json.html#json.dump
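For illustration (a sketch, not the actual call site in zarr; the indent and sort_keys values shown here are assumptions), the separator override in question looks like this:

import json

meta = {"zarr_format": 2, "order": "C"}

# Python 2's default item separator is ', ' even when indent is given,
# which leaves a trailing space before each newline.  Passing separators
# explicitly matches the Python 3.4+ default and makes the serialized
# JSON (and hence the hexdigest) identical on both versions.
text = json.dumps(meta, indent=4, sort_keys=True, separators=(',', ': '))
print(text)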

@jakirkham changed the title from "WIP: Method to find Zarr Array checksum" to "Method to find Zarr Array checksum" on Nov 28, 2017
@jakirkham (Member, Author):

I should add: IDK whether we want something like digest, as provided by hashlib algorithms, but it is probably worth considering at this point. Basically it provides the binary-string equivalent of the more verbose hexdigest.
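To illustrate the relationship between the two (plain hashlib, not this PR's API):

import binascii
import hashlib

h = hashlib.sha1(b"example bytes")

digest = h.digest()        # raw 20-byte binary digest
hexdigest = h.hexdigest()  # 40-character hex string

# The two are interconvertible, so offering only hexdigest() loses nothing.
assert binascii.hexlify(digest).decode() == hexdigest
assert binascii.unhexlify(hexdigest) == digest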

@jakirkham (Member, Author):

> Re __dask_tokenize__ I wonder how much it matters if the value is potentially expensive to compute. E.g., if it gets called every time you do da.from_array(z) and z is a large array and/or an array with relatively slow storage (e.g., S3), the delay to compute the hexdigest could make it unusable.

So I haven't done any benchmarking yet; it would be good to do that for sure. I have access to slow storage (e.g. NFS) to do some experimenting on, and can explore that more tomorrow.

For more context, I want to explore caching long-running computational results to disk. As part of this, I'd like the .name attribute of Dask Arrays to be consistent between runs (if no parameters or inputs have changed), so using some sort of hashing for __dask_tokenize__ would be ideal. Since Dask chose SHA1 as the default for NumPy arrays, I figured it was a safe bet here. They also try using things like CityHash and MurmurHash, which could be worth looking at if we don't mind another dependency. That said, I am open to other stable hash options if you have any in mind.
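As a hedged sketch of that caching use case (the store path and chunking choices below are hypothetical, not part of this PR):

import dask.array as da
import zarr

z = zarr.open_array("data.zarr", mode="r")  # hypothetical on-disk array

# Deriving the dask name from the array's checksum makes graph keys
# reproducible across runs, so caches keyed by name stay valid.
x = da.from_array(z, chunks=z.chunks, name="zarr-" + z.hexdigest())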

@alimanfoo (Member):

Looks good.

@alimanfoo (Member):

FWIW SHA1 seems fine to me.

If you'd like to add digest() for completeness that's OK by me, fine either way (I'll probably only use hexdigest()).

Re __dask_tokenize__ I think in the general case we have to assume that computing a full hash may be slow simply due to I/O, either because of very large data or slow storage or both. By slow I mean too slow for interactive work. So I'm all for supporting caching, as long as we don't interrupt the flow for the average dask user.

@jakirkham (Member, Author) commented Nov 28, 2017:

Ok, I'm not planning to use digest either, so let's leave it out. After all, people can just call binascii.unhexlify to convert anyway.

That may be true, though I was hoping that since we avoid decompression this might be less of an issue. I plan on playing with this more tomorrow. In any event, __dask_tokenize__ is not pressing, as one can still use dask.base.normalize_token.register to define their own way to do this for Zarr (or anything else, for that matter).

@jakirkham (Member, Author):

Added in a changelog entry as well. This is good to go on my end.

@jakirkham (Member, Author):

FWIW I tried running hexdigest with the default sha1 on some real data to get an idea of how long this might take. Information about the dataset and the time taken to compute the hash are included below. This was run on a Scientific Linux 7.3 cluster using NFSv3.

>>> imgs.info_items()
[('Name', '/images'),
 ('Type', 'zarr.core.Array'),
 ('Data type', 'uint16'),
 ('Shape', '(1636, 1016, 1262)'),
 ('Chunk shape', '(103, 127, 158)'),
 ('Order', 'C'),
 ('Read-only', 'True'),
 ('Compressor', "Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"),
 ('Store type', 'zarr.storage.DirectoryStore'),
 ('No. bytes', '4195332224 (3.9G)'),
 ('No. bytes stored', '2569038619 (2.4G)'),
 ('Storage ratio', '1.6'),
 ('Chunks initialized', '1024/1024')]
>>> %timeit -n 10 imgs.hexdigest()
4.66 s ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@alimanfoo merged commit 4941f97 into zarr-developers:master on Nov 28, 2017
@alimanfoo (Member):

Many thanks, nice feature.

@jakirkham deleted the support_hexdigest branch on November 28, 2017 at 17:44
@jakirkham (Member, Author) commented Nov 28, 2017:

Thanks. We might want to take a look at BLAKE2 down the road as that would provide the option for tree hashing, which seems very nice to use with something like Zarr.

ref: https://docs.python.org/3/library/hashlib.html#creating-hash-objects
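For reference, a small taste of the hashlib BLAKE2 interface (Python 3.6+; not wired into zarr here):

import hashlib

# blake2b supports keyed hashing, adjustable digest sizes, and tree-hashing
# parameters (fanout, depth, leaf_size, ...) via its constructor.
h = hashlib.blake2b(b"encoded chunk bytes", digest_size=20)
print(h.hexdigest())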

@jakirkham (Member, Author):

Found some value in having a digest method as well, so I put together an implementation in PR (https://github.com/alimanfoo/zarr/pull/213).
