Method to find Zarr Array checksum #203
Conversation
Nice. There are a couple of possible simplifications but otherwise LGTM.
zarr/core.py
```python
try:
    h.update(r)
except TypeError:
    # h.update() only accepts bytes-like objects; fall back to
    # encoding when r is a str.
    h.update(r.encode())
```
Instead of this you could just pull out '.zarray' from the store and hash that.
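A minimal sketch of that suggestion, assuming the store behaves like a mapping from keys to the raw bytes of each document (the function name and signature here are illustrative, not the PR's actual code):

```python
import hashlib

def hash_array_metadata(store, hashname='sha1'):
    # Hash the '.zarray' metadata document exactly as stored,
    # rather than re-serializing individual fields.
    h = hashlib.new(hashname)
    h.update(store['.zarray'])
    return h.hexdigest()
```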
Sounds like a much better idea. Can we rely on `.zarray` always being there, or should we add some logic for when it is missing?
It will always be there; an array isn't an array if it doesn't have a `.zarray`.
zarr/core.py
```python
h.update(r.encode())

for k, v in self.attrs.items():
    h.update((k, v))
```
You could just pull out '.zattrs' from the store and hash that.
However, that is not always there, right? Though I suppose we could handle this case.

This said, I was trying to work around issues where `.zattrs` may be too large to fit into memory all at once. In that case, each key-value pair should still fit in memory comfortably.
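A hedged sketch of handling the missing case (illustrative, not the PR's final code):

```python
# '.zattrs' is only written once attributes are set, so fall back to
# empty bytes instead of raising KeyError for attribute-less arrays.
h.update(store.get('.zattrs', b''))
```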
Thanks for taking a look. Happy to make tweaks. Added a couple of comments inline. As a side note, I would like to use this to implement …
Re `.zattrs`, FWIW I think you can rely on that being small enough to pull into memory; any attribute key update requires loading the whole JSON document into memory anyway.
Re …

It's used for things like constructing the … Really, any time something calls …

@jcrist can probably clarify this better than I, though. 😉
Hmm... the checksums I get locally do not match those the CIs seem to be getting. 😟 Any thoughts on …?
Provides a new `hexdigest` method for Zarr Arrays that creates a checksum from the array based on its encoded data and its metadata. Uses the sha1 checksum by default due to its speed and reliability.
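For orientation, a rough sketch of the overall approach (a hedged approximation, not the PR's exact code; `store` and `chunk_keys` are illustrative names):

```python
import hashlib

def array_hexdigest(store, chunk_keys, hashname='sha1'):
    h = hashlib.new(hashname)
    # Feed in the encoded chunk bytes directly -- no decompression needed.
    for key in chunk_keys:
        h.update(store.get(key, b''))
    # Then the metadata: '.zarray' is always present for an array,
    # while '.zattrs' only appears once attributes have been set.
    h.update(store['.zarray'])
    h.update(store.get('.zattrs', b''))
    return h.hexdigest()
```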
Looks like the checksum mismatch issues are restricted to Python 2 only. Not sure of the cause yet, but it could be some subtlety with how the data is represented in Python 2 as opposed to Python 3.
Turns out that an extra space is added right before newlines (after commas) in the JSON for things like `.zarray`.
Hm, the PY2/PY3 JSON issue is annoying. It would be ideal to get the same hexdigest on all platforms. Not sure what to suggest; will give it some thought. Re …

If you set …
Found the magic sauce as far as Python 2/3 JSON goes. Turns out there was a subtle change in Python 3.4 to the default `separators` that `json.dump` uses when `indent` is specified.
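Concretely: with `indent` set, Python 2's default item separator is `', '`, which leaves a trailing space before each newline, while Python >= 3.4 uses `','`. Passing `separators` explicitly sidesteps the version difference, so the serialized bytes, and hence the checksums, match (a small illustration, not the PR's exact code):

```python
import json

meta = {'shape': [1024], 'chunks': [256], 'dtype': '<f8'}

# Explicit separators override the version-dependent defaults, so this
# produces byte-identical JSON on Python 2 and Python 3.
s = json.dumps(meta, indent=4, sort_keys=True, separators=(',', ': '))
```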
Should add … IDK if we want something like …
So I haven't done any benchmarking yet. It would be good to do this for sure. I have access to slow storage (e.g. NFS) to do some experimenting on, and can explore that more tomorrow. For more context, I am wanting to explore caching long-running computational results to disk. As part of this, I'd like the …
Looks good.
FWIW SHA1 seems fine to me. If you'd like to add …

Re …
Ok. I'm not planning to use …

That may be true. Though I was hoping that, since we avoid decompression, this might be less of an issue. Planning on playing with this more tomorrow. In any event, …
Added a changelog entry as well. This is good to go on my end.
FWIW, tried running `hexdigest` on some real data:

```python
>>> imgs.info_items()
[('Name', '/images'),
 ('Type', 'zarr.core.Array'),
 ('Data type', 'uint16'),
 ('Shape', '(1636, 1016, 1262)'),
 ('Chunk shape', '(103, 127, 158)'),
 ('Order', 'C'),
 ('Read-only', 'True'),
 ('Compressor', "Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)"),
 ('Store type', 'zarr.storage.DirectoryStore'),
 ('No. bytes', '4195332224 (3.9G)'),
 ('No. bytes stored', '2569038619 (2.4G)'),
 ('Storage ratio', '1.6'),
 ('Chunks initialized', '1024/1024')]
>>> %timeit -n 10 imgs.hexdigest()
4.66 s ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Many thanks, nice feature.
Thanks. We might want to take a look at BLAKE2 down the road, as that would provide the option for tree hashing, which seems very nice to use with something like Zarr. ref: https://docs.python.org/3/library/hashlib.html#creating-hash-objects
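To make the tree-hashing idea concrete, here is a minimal two-leaf example adapted from the BLAKE2 tree-mode sketch in the `hashlib` docs; mapping Zarr chunks onto leaves is speculation, not anything this PR implements:

```python
from hashlib import blake2b

FANOUT, DEPTH = 2, 2             # fixed two-leaf, two-level tree
LEAF_SIZE, INNER_SIZE = 4096, 64

buf = bytearray(6000)            # stand-in for two chunks' worth of bytes

# Hash each leaf with its position encoded in the tree parameters.
h00 = blake2b(buf[:LEAF_SIZE], fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=0, node_depth=0, last_node=False)
h01 = blake2b(buf[LEAF_SIZE:], fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=1, node_depth=0, last_node=True)

# The root node hashes the concatenated leaf digests.
h10 = blake2b(digest_size=32, fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=0, node_depth=1, last_node=True)
h10.update(h00.digest())
h10.update(h01.digest())
print(h10.hexdigest())
```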
Found some value in having a …
Fixes https://github.com/alimanfoo/zarr/issues/98

Provides a method for Zarr Arrays to compute their checksum. Uses the `sha1` checksum by default thanks to its speed and low probability of collisions, though the user can change the checksum to any supported by their version of `openssl` (per the `hashlib` docs) should they desire to. The hashing is done on the data in the chunk store (so encoded) and on the metadata (shape, dtype, order, filtering, and attributes).
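A quick usage sketch (the array here is made up for illustration; the printed digests depend on the data and metadata):

```python
import zarr

z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
z[:] = 42

print(z.hexdigest())        # sha1 by default
print(z.hexdigest('md5'))   # or any other algorithm hashlib supports
```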