-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Special case chunk encoding for dict
chunk store
#359
Special case chunk encoding for dict
chunk store
#359
Conversation
This is ready for review. Would be curious to hear thoughts on this. Basically this is meant to make working with |
Would be great to hear your thoughts on this when you have a chance, @alimanfoo. |
Adding to the v2.3 milestone so we can keep track of it. Happy to change that as needed. |
Thoughts @alimanfoo? |
Thoughts @zarr-developers/core-devs? |
Hello @jakirkham! Thanks for updating the PR.
Comment last updated on February 14, 2019 at 06:40 Hours UTC |
For doing a quick pass through the values within a `MutableMapping`, it is helpful to have the `values` method. This will be needed in some of the tests that we are adding. So add a quick implementation of `values` in the `CustomMapping`.
Currently this is a somewhat loose requirement that not all stores enforce. However it is a useful check to have in some cases. Particularly this is a useful constraint with in-memory stores where there is a concern that the data in the store might be manipulated externally due to ndarray views. The bytes instances don't have this problem as they are immutable and own their data. While views can be take onto bytes instances, they will be read-only views and thus are not a concern. For other stores that place their data in some other storage backend (e.g. on disk), this is less of a concern. Also other stores may choose to represent their data in other ways (e.g. LMDB with `memoryview`s). Manipulating the loaded data from these stores is less of a concern since as it doesn't affect their actually contents, only an in-memory representation. In these cases, it may make sense to disable this test.
It's possible to request that `LMDBStore` return `buffer`s/`memoryview`s instead of returning `bytes` for values. This is incredibly useful as LMDB is memory-mapped. So this avoids a copy when accessing the data. However this means it will fail `test_store_has_bytes_values`. Though this is ok as noted previously since that is not a hard requirement of stores. So we just disable that test for this `LMDBStore` case. The other `LMDBStore` case returns `bytes` instances instead (copying the data); so, it passes this test without issues.
Since people will coming looking to see how a store should be implemented, we should show them good behavior. Namely we should ensure the values provided to `__setitem__` are `bytes`-like and immutable (once stored). By coercing the data to `bytes` we ensure that the data is `bytes`-like and we ensure the data is immutable since `bytes` are immutable.
As the point of the `LRUStoreCache` is to merely hold onto values retrieved from the underlying store and keep them around in memory unaltered, the caching layer doesn't have any control over what type of values are returned to it. Thus it doesn't make much sense to test whether the values it returns are of `bytes` type or not. Though this is fine as there is not strict requirement that values of `bytes` type be returned by stores, so simply disable `test_store_has_bytes_values` for the `LRUStoreCache` test case with a note explaining this.
In `Array` when running `_encode_chunk` and using a `dict`-based chunk store, ensure that the chunk data is of `bytes` type. This is done to convert the underlying data to an immutable value to protect against external views onto the data from changing it (as is the case with NumPy arrays). Also this is done to ensure it is possible to compare `dict`-based stores easily. While coercing to a `bytes` object can introduce a copying, this generally won't happen if a compressor is involved as it will usually have returned a `bytes` object. Though filters may not, which could cause this to introduce an extra copy if only filters and no compressors are used. However this is an unlikely case and is not as important as guaranteeing the values own their own data and are read-only. Plus this should allow us to drop the preventive copy that happens earlier when storing values as this neatly handles that case of no filters and no compressors.
This copy was taken primarily to protect in-memory stores from being mutated by external views of the array. However all stores we define (including in-memory ones like `DictStore`) already perform this safeguarding themselves on the values they store. For the builtin `dict` store, we perform this safeguarding for it (since `dict`'s don't do this themselves) by ensuring we only store `bytes` objects into them. As these are immutable and own their own data, there isn't a way to mutate their content after storing them. Thus this preventive copy here is not needed and can be dropped.
Planning on merging end of day Friday if no comments. |
This removes the special case copying of chunks in commit ( 3c00d52 ). Instead this checks to see if a
dict
instance is being used to back chunk storage. If it is, this coerces the chunk data tobytes
(if it is not alreadybytes
). This avoids a copy for well-defined backends that already protect their data against unwanted tampering. Also continues to avoid copies when compressors and/or filters already return the data asbytes
. However this will copy in any case where a non-bytes
object (e.g. anndarray
) is placed in the chunk store (e.g. no filters/compressors are used or some specific filter is used at the end). This behavior fordict
-based chunk stores is analogous to whatDictStore
does currently for stored values.Note: This also extends to subclasses of
dict
, which includedefaultdict
andOrderedDict
. It does not includeChainMap
orUserDict
; however, it's unclear whether we should be special casing those two asChainMap
is meant to provide a unified interface onto manyMutableMapping
s of various kinds andUserDict
is intended as a clean and simple alternative for subclassingdict
.TODO:
tox -e docs
)Fixes #348