Method to find Zarr Array checksum #203

Merged: 6 commits merged into zarr-developers:master on Nov 28, 2017
Conversation

@jakirkham (Member)

Fixes https://github.com/alimanfoo/zarr/issues/98

Provides a method for Zarr Arrays to compute their checksum. It uses the sha1 hash by default thanks to its speed and low probability of collisions, though the user can change the hash to any algorithm supported by their version of OpenSSL (per the hashlib docs) should they desire to. The hashing is done on the data in the chunk store (i.e., the encoded data) and on the metadata (shape, dtype, order, filters, and attributes).
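For readers skimming the thread, here is a minimal sketch of the general approach (not the exact code in this PR); the helper name array_hexdigest and the specific store keys hashed are assumptions for illustration:

import hashlib


def array_hexdigest(array, hashname="sha1"):
    """Sketch: hash an array's encoded chunks plus its metadata/attributes.

    Assumes ``array.store`` is a mapping of key -> encoded bytes, with the
    array metadata under the '.zarray' key and user attributes under
    '.zattrs' when present.
    """
    h = hashlib.new(hashname)

    # Hash the array metadata (shape, dtype, order, filters, ...).
    h.update(array.store[".zarray"])

    # Hash user attributes if any were written.
    if ".zattrs" in array.store:
        h.update(array.store[".zattrs"])

    # Hash the encoded (compressed) chunk data in a deterministic key order.
    for key in sorted(array.store.keys()):
        if key not in (".zarray", ".zattrs"):
            h.update(array.store[key])

    return h.hexdigest()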

@alimanfoo (Member) left a comment:

Nice. There are a couple of possible simplifications but otherwise LGTM.

zarr/core.py Outdated
try:
h.update(r)
except TypeError:
h.update(r.encode())
@alimanfoo (Member):

Instead of this you could just pull out '.zarray' from the store and hash that.

@jakirkham (Member, Author):

Sounds like a much better idea. Can we rely on .zarray always being there or should we add some logic for when it is missing?

@alimanfoo (Member):

It will always be there; an array isn't an array if it doesn't have a .zarray.

zarr/core.py Outdated
h.update(r.encode())

for k, v in self.attrs.items():
h.update((k, v))
@alimanfoo (Member):

You could just pull out '.zattrs' from the store and hash that.

@jakirkham (Member, Author) commented Nov 27, 2017:

However, that is not always there, right? Though I suppose we could handle that case.

That said, I was trying to work around issues where .zattrs may be too large to fit into memory all at once. In that case, each key-value pair should still fit in memory comfortably.

@alimanfoo added this to the v2.2 milestone on Nov 27, 2017
@jakirkham (Member, Author):

Thanks for taking a look. Happy to make tweaks. I added a couple of comments inline.

As a side note, I would like to use this to implement __dask_tokenize__ as well.

@alimanfoo (Member):

Re .zattrs, FWIW I think you can rely on that being small enough to pull into memory; any attribute key update requires loading the whole JSON document into memory anyway.

@alimanfoo (Member):

Re __dask_tokenize__, I was just looking at the deterministic hashing docs, but it doesn't actually say what the hashing is used for. Is this to do with caching intermediate values in a dask computation?

@jakirkham (Member, Author) commented Nov 27, 2017:

It's used for things like constructing the name of the Dask Array created by from_array, though I agree that should be more explicit in the docs (particularly for from_array). Raised as an issue (dask/dask#2930).

Really any time something calls tokenize on an object it ends up searching for __dask_tokenize__ or a function registered with normalize_token.
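As a hedged sketch of that mechanism (the use of hexdigest() here is an assumption about how this PR's method could be wired in, not code from the PR):

import dask.base
import zarr.core


# Register a tokenizer externally, without touching zarr itself.  (The
# alternative is for zarr.core.Array to define __dask_tokenize__ directly,
# returning the same value.)
@dask.base.normalize_token.register(zarr.core.Array)
def tokenize_zarr_array(array):
    # Equal encoded content and metadata -> equal token, stable across runs.
    return array.hexdigest()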

@jakirkham (Member, Author):

@jcrist can probably clarify this better than I can, though. 😉

@jakirkham (Member, Author) commented Nov 27, 2017:

Hmm...the checksums I get locally do not match with those the CIs seem to be getting. 😟

Also I'm a little surprised that the checksums of the TestArrayWithPath and TestArrayWithChunkStore cases do not match the TestArray case for me locally. I wouldn't expect storage location or format to have an effect, but it seems to. 😕 Edit: Both of these use LZ4 compression, unlike the other test cases, which use Zlib. So the content is actually different and should hash differently.

Any thoughts on either of these (edit: the former) would be appreciated.

Provides a new `hexdigest` method for Zarr Arrays that creates a
checksum from the array based on its encoded data and its metadata. Uses
the sha1 checksum by default due to its speed and reliability.
@jakirkham (Member, Author):

Looks like the checksum mismatch issues are restricted to Python 2 only. Not sure of the cause yet, but it could be some subtlety in how the data is represented in Python 2 as opposed to Python 3. (Edit: these concerns were determined to be non-issues.)

@jakirkham (Member, Author):

Turns out that an extra space is added right before newlines (after commas) in the JSON for things like .zarray on Python 2, but this does not occur with Python 3. I verified this is not a CR character, but actually a space (at least on my Mac).

@alimanfoo (Member):

Hm, the PY2/PY3 JSON issue is annoying. It would be ideal to get the same hexdigest on all platforms. Not sure what to suggest, will give it some thought.

Re __dask_tokenize__ I wonder how much it matters if the value is potentially expensive to compute. E.g., if it gets called every time you do da.from_array(z) and z is a large array and/or an array with relatively slow storage (e.g., S3), the delay to compute the hexdigest could make it unusable.

@alimanfoo (Member):

If you set indent=None for all calls to json.dump, does this give consistent JSON in PY2 and PY3? It's supposed to be the most compact representation. I'd be happy to sacrifice readability for consistency.

@jakirkham (Member, Author):

Found the magic sauce as far as Python 2/3 JSON goes. Turns out there was a subtle change in Python 3.4 to the default separator behavior (referenced below). If we just override the separators to match the Python 3.4+ behavior, the issue goes away for Python 2. Pushed a commit to do exactly that. Happy to make the commit message more expressive given this subtlety.

> Changed in version 3.4: Use (',', ': ') as default if indent is not None.

ref: https://docs.python.org/3/library/json.html#json.dump
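For illustration (a sketch, not the actual call site in zarr; the indent and sort_keys values shown here are assumptions), the separator override in question looks like this:

import json

meta = {"zarr_format": 2, "order": "C"}

# Python 2's default item separator is ', ' even when indent is given,
# which leaves a trailing space before each newline.  Passing separators
# explicitly matches the Python 3.4+ default and makes the serialized
# JSON (and hence the hexdigest) identical on both versions.
text = json.dumps(meta, indent=4, sort_keys=True, separators=(',', ': '))
print(text)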

@jakirkham changed the title from "WIP: Method to find Zarr Array checksum" to "Method to find Zarr Array checksum" on Nov 28, 2017
@jakirkham (Member, Author):

I should add: IDK whether we want something like digest, as provided by hashlib algorithms, but it is probably worth considering at this point. Basically it provides the binary-string equivalent of the more verbose hexdigest.
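To illustrate the relationship between the two (plain hashlib, not this PR's API):

import binascii
import hashlib

h = hashlib.sha1(b"example bytes")

digest = h.digest()        # raw 20-byte binary digest
hexdigest = h.hexdigest()  # 40-character hex string

# The two are interconvertible, so offering only hexdigest() loses nothing.
assert binascii.hexlify(digest).decode() == hexdigest
assert binascii.unhexlify(hexdigest) == digest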

@jakirkham (Member, Author):

> Re __dask_tokenize__ I wonder how much it matters if the value is potentially expensive to compute. E.g., if it gets called every time you do da.from_array(z) and z is a large array and/or an array with relatively slow storage (e.g., S3), the delay to compute the hexdigest could make it unusable.

So I haven't done any benchmarking yet; it would be good to do that for sure. I have access to slow storage (e.g. NFS) to do some experimenting on, and can explore that more tomorrow.

For more context, I want to explore caching long-running computational results to disk. As part of this, I'd like the .name attribute of Dask Arrays to be consistent between runs (if no parameters or inputs have changed), so using some sort of hashing for __dask_tokenize__ would be ideal. Since Dask chose SHA1 as the default for NumPy arrays, I figured it was a safe bet here. They also try using things like CityHash and MurmurHash, which could be worth looking at if we don't mind another dependency. That said, I am open to other stable hash options if you have any in mind.
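As a hedged sketch of that caching use case (the store path and chunking choices below are hypothetical, not part of this PR):

import dask.array as da
import zarr

z = zarr.open_array("data.zarr", mode="r")  # hypothetical on-disk array

# Deriving the dask name from the array's checksum makes graph keys
# reproducible across runs, so caches keyed by name stay valid.
x = da.from_array(z, chunks=z.chunks, name="zarr-" + z.hexdigest())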

@alimanfoo (Member):

Looks good.

@alimanfoo (Member):

FWIW SHA1 seems fine to me.

If you'd like to add digest() for completeness that's OK by me, fine either way (I'll probably only use hexdigest()).

Re __dask_tokenize__ I think in the general case we have to assume that computing a full hash may be slow simply due to I/O, either because of very large data or slow storage or both. By slow I mean too slow for interactive work. So I'm all for supporting caching, as long as we don't interrupt the flow for the average dask user.

@jakirkham (Member, Author) commented Nov 28, 2017:

Ok, I'm not planning to use digest either, so let's leave it out. After all, people can just call binascii.unhexlify to convert anyway.

That may be true, though I was hoping that since we avoid decompression this might be less of an issue. I plan on playing with this more tomorrow. In any event, __dask_tokenize__ is not pressing, as one can still use dask.base.normalize_token.register to define their own way to do this for Zarr (or anything else, for that matter).

@jakirkham (Member, Author):

Added in a changelog entry as well. This is good to go on my end.

@jakirkham (Member, Author):

FWIW I tried running hexdigest with the default sha1 on some real data to get an idea of how long this might take. Information about the dataset and the time taken to compute the hash are included below. This was run on a Scientific Linux 7.3 cluster using NFSv3.

>>> imgs.info_items()
[('Name', '/images'),
 ('Type', 'zarr.core.Array'),
 ('Data type', 'uint16'),
 ('Shape', '(1636, 1016, 1262)'),
 ('Chunk shape', '(103, 127, 158)'),
 ('Order', 'C'),
 ('Read-only', 'True'),
 ('Compressor', "Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"),
 ('Store type', 'zarr.storage.DirectoryStore'),
 ('No. bytes', '4195332224 (3.9G)'),
 ('No. bytes stored', '2569038619 (2.4G)'),
 ('Storage ratio', '1.6'),
 ('Chunks initialized', '1024/1024')]
>>> %timeit -n 10 imgs.hexdigest()
4.66 s ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@alimanfoo merged commit 4941f97 into zarr-developers:master on Nov 28, 2017
@alimanfoo (Member):

Many thanks, nice feature.

@jakirkham deleted the support_hexdigest branch on November 28, 2017 at 17:44
@jakirkham (Member, Author) commented Nov 28, 2017:

Thanks. We might want to take a look at BLAKE2 down the road as that would provide the option for tree hashing, which seems very nice to use with something like Zarr.

ref: https://docs.python.org/3/library/hashlib.html#creating-hash-objects
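For reference, a small taste of the hashlib BLAKE2 interface (Python 3.6+; not wired into zarr here):

import hashlib

# blake2b supports keyed hashing, adjustable digest sizes, and tree-hashing
# parameters (fanout, depth, leaf_size, ...) via its constructor.
h = hashlib.blake2b(b"encoded chunk bytes", digest_size=20)
print(h.hexdigest())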

@jakirkham (Member, Author):

Found some value in having a digest method as well, so I put together an implementation in PR (https://github.com/alimanfoo/zarr/pull/213).
