Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special case chunk encoding for dict chunk store #359

Merged
merged 10 commits into from
Feb 16, 2019
Merged

Special case chunk encoding for dict chunk store #359

merged 10 commits into from
Feb 16, 2019

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Dec 7, 2018

This removes the special case copying of chunks in commit ( 3c00d52 ). Instead this checks to see if a dict instance is being used to back chunk storage. If it is, this coerces the chunk data to bytes (if it is not already bytes). This avoids a copy for well-defined backends that already protect their data against unwanted tampering. Also continues to avoid copies when compressors and/or filters already return the data as bytes. However this will copy in any case where a non-bytes object (e.g. an ndarray) is placed in the chunk store (e.g. no filters/compressors are used or some specific filter is used at the end). This behavior for dict-based chunk stores is analogous to what DictStore does currently for stored values.

Note: This also extends to subclasses of dict, which include defaultdict and OrderedDict. It does not include ChainMap or UserDict; however, it's unclear whether we should be special casing those two as ChainMap is meant to provide a unified interface onto many MutableMappings of various kinds and UserDict is intended as a clean and simple alternative for subclassing dict.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

Fixes #348

@jakirkham
Copy link
Member Author

This is ready for review. Would be curious to hear thoughts on this. Basically this is meant to make working with dict objects as an Array chunk store a bit better behaved.

@jakirkham
Copy link
Member Author

Would be great to hear your thoughts on this when you have a chance, @alimanfoo.

@jakirkham
Copy link
Member Author

Adding to the v2.3 milestone so we can keep track of it. Happy to change that as needed.

@jakirkham
Copy link
Member Author

Thoughts @alimanfoo?

@jakirkham
Copy link
Member Author

Thoughts @zarr-developers/core-devs?

@pep8speaks
Copy link

pep8speaks commented Feb 13, 2019

Hello @jakirkham! Thanks for updating the PR.

Line 14:80: E501 line too long (82 > 79 characters)
Line 15:80: E501 line too long (81 > 79 characters)
Line 24:80: E501 line too long (84 > 79 characters)
Line 242:80: E501 line too long (81 > 79 characters)
Line 387:80: E501 line too long (85 > 79 characters)
Line 399:80: E501 line too long (93 > 79 characters)
Line 405:80: E501 line too long (90 > 79 characters)
Line 452:80: E501 line too long (82 > 79 characters)
Line 544:80: E501 line too long (84 > 79 characters)
Line 549:80: E501 line too long (81 > 79 characters)
Line 557:80: E501 line too long (82 > 79 characters)
Line 560:80: E501 line too long (85 > 79 characters)
Line 565:80: E501 line too long (89 > 79 characters)
Line 566:80: E501 line too long (85 > 79 characters)
Line 580:80: E501 line too long (85 > 79 characters)
Line 581:80: E501 line too long (90 > 79 characters)
Line 585:80: E501 line too long (86 > 79 characters)
Line 651:80: E501 line too long (83 > 79 characters)
Line 658:80: E501 line too long (84 > 79 characters)
Line 663:80: E501 line too long (81 > 79 characters)
Line 671:80: E501 line too long (82 > 79 characters)
Line 673:80: E501 line too long (84 > 79 characters)
Line 674:80: E501 line too long (83 > 79 characters)
Line 680:80: E501 line too long (85 > 79 characters)
Line 735:80: E501 line too long (81 > 79 characters)
Line 743:80: E501 line too long (83 > 79 characters)
Line 744:80: E501 line too long (84 > 79 characters)
Line 745:80: E501 line too long (86 > 79 characters)
Line 746:80: E501 line too long (82 > 79 characters)
Line 752:80: E501 line too long (87 > 79 characters)
Line 757:80: E501 line too long (86 > 79 characters)
Line 773:80: E501 line too long (89 > 79 characters)
Line 800:80: E501 line too long (89 > 79 characters)
Line 831:80: E501 line too long (82 > 79 characters)
Line 835:80: E501 line too long (89 > 79 characters)
Line 836:80: E501 line too long (85 > 79 characters)
Line 864:80: E501 line too long (86 > 79 characters)
Line 885:80: E501 line too long (89 > 79 characters)
Line 893:80: E501 line too long (89 > 79 characters)
Line 896:80: E501 line too long (87 > 79 characters)
Line 899:80: E501 line too long (86 > 79 characters)
Line 900:80: E501 line too long (84 > 79 characters)
Line 901:80: E501 line too long (86 > 79 characters)
Line 906:80: E501 line too long (89 > 79 characters)
Line 907:80: E501 line too long (85 > 79 characters)
Line 934:80: E501 line too long (88 > 79 characters)
Line 935:80: E501 line too long (85 > 79 characters)
Line 941:80: E501 line too long (89 > 79 characters)
Line 946:80: E501 line too long (86 > 79 characters)
Line 970:80: E501 line too long (83 > 79 characters)
Line 978:80: E501 line too long (85 > 79 characters)
Line 979:80: E501 line too long (81 > 79 characters)
Line 985:80: E501 line too long (85 > 79 characters)
Line 1004:80: E501 line too long (86 > 79 characters)
Line 1005:80: E501 line too long (85 > 79 characters)
Line 1006:80: E501 line too long (83 > 79 characters)
Line 1008:80: E501 line too long (87 > 79 characters)
Line 1009:80: E501 line too long (86 > 79 characters)
Line 1027:80: E501 line too long (82 > 79 characters)
Line 1041:80: E501 line too long (84 > 79 characters)
Line 1085:80: E501 line too long (83 > 79 characters)
Line 1090:80: E501 line too long (81 > 79 characters)
Line 1099:80: E501 line too long (82 > 79 characters)
Line 1102:80: E501 line too long (82 > 79 characters)
Line 1103:80: E501 line too long (80 > 79 characters)
Line 1108:80: E501 line too long (89 > 79 characters)
Line 1109:80: E501 line too long (85 > 79 characters)
Line 1123:80: E501 line too long (84 > 79 characters)
Line 1128:80: E501 line too long (90 > 79 characters)
Line 1171:80: E501 line too long (85 > 79 characters)
Line 1177:80: E501 line too long (81 > 79 characters)
Line 1186:80: E501 line too long (88 > 79 characters)
Line 1187:80: E501 line too long (83 > 79 characters)
Line 1193:80: E501 line too long (85 > 79 characters)
Line 1208:80: E501 line too long (80 > 79 characters)
Line 1210:80: E501 line too long (80 > 79 characters)
Line 1218:80: E501 line too long (87 > 79 characters)
Line 1223:80: E501 line too long (90 > 79 characters)
Line 1264:80: E501 line too long (88 > 79 characters)
Line 1279:80: E501 line too long (82 > 79 characters)
Line 1283:80: E501 line too long (89 > 79 characters)
Line 1284:80: E501 line too long (85 > 79 characters)
Line 1303:80: E501 line too long (89 > 79 characters)
Line 1313:80: E501 line too long (90 > 79 characters)
Line 1334:80: E501 line too long (88 > 79 characters)
Line 1347:80: E501 line too long (89 > 79 characters)
Line 1350:80: E501 line too long (87 > 79 characters)
Line 1355:80: E501 line too long (89 > 79 characters)
Line 1356:80: E501 line too long (85 > 79 characters)
Line 1381:80: E501 line too long (86 > 79 characters)
Line 1382:80: E501 line too long (85 > 79 characters)
Line 1388:80: E501 line too long (89 > 79 characters)
Line 1393:80: E501 line too long (90 > 79 characters)
Line 1417:80: E501 line too long (88 > 79 characters)
Line 1430:80: E501 line too long (85 > 79 characters)
Line 1431:80: E501 line too long (81 > 79 characters)
Line 1437:80: E501 line too long (85 > 79 characters)
Line 1505:80: E501 line too long (86 > 79 characters)
Line 1506:80: E501 line too long (84 > 79 characters)
Line 1509:80: E501 line too long (87 > 79 characters)
Line 1510:80: E501 line too long (87 > 79 characters)
Line 1550:80: E501 line too long (90 > 79 characters)
Line 1610:80: E501 line too long (83 > 79 characters)
Line 1635:80: E501 line too long (80 > 79 characters)
Line 1661:80: E501 line too long (87 > 79 characters)
Line 1675:80: E501 line too long (84 > 79 characters)
Line 1698:80: E501 line too long (88 > 79 characters)
Line 1701:80: E501 line too long (88 > 79 characters)
Line 1703:80: E501 line too long (88 > 79 characters)
Line 1705:80: E501 line too long (88 > 79 characters)
Line 1716:80: E501 line too long (86 > 79 characters)
Line 1751:80: E501 line too long (82 > 79 characters)
Line 1753:80: E501 line too long (81 > 79 characters)
Line 1754:80: E501 line too long (82 > 79 characters)
Line 1773:80: E501 line too long (80 > 79 characters)
Line 1814:80: E501 line too long (87 > 79 characters)
Line 1875:80: E501 line too long (86 > 79 characters)
Line 1894:80: E501 line too long (81 > 79 characters)
Line 1925:80: E501 line too long (81 > 79 characters)
Line 2076:80: E501 line too long (89 > 79 characters)
Line 2077:80: E501 line too long (88 > 79 characters)
Line 2219:80: E501 line too long (87 > 79 characters)
Line 2245:80: E501 line too long (86 > 79 characters)

Line 17:80: E501 line too long (87 > 79 characters)
Line 18:80: E501 line too long (89 > 79 characters)
Line 24:80: E501 line too long (82 > 79 characters)
Line 90:80: E501 line too long (83 > 79 characters)
Line 106:80: E501 line too long (83 > 79 characters)
Line 110:80: E501 line too long (80 > 79 characters)
Line 250:80: E501 line too long (81 > 79 characters)
Line 392:80: E501 line too long (89 > 79 characters)
Line 686:80: E501 line too long (86 > 79 characters)
Line 760:80: E501 line too long (88 > 79 characters)
Line 793:80: E501 line too long (88 > 79 characters)
Line 889:80: E501 line too long (100 > 79 characters)
Line 899:80: E501 line too long (100 > 79 characters)
Line 908:80: E501 line too long (97 > 79 characters)
Line 955:80: E501 line too long (87 > 79 characters)
Line 1002:80: E501 line too long (85 > 79 characters)
Line 1005:80: E501 line too long (84 > 79 characters)
Line 1008:80: E501 line too long (87 > 79 characters)
Line 1023:80: E501 line too long (86 > 79 characters)
Line 1038:80: E501 line too long (84 > 79 characters)
Line 1056:80: E501 line too long (86 > 79 characters)
Line 1075:80: E501 line too long (85 > 79 characters)
Line 1079:80: E501 line too long (82 > 79 characters)
Line 1083:80: E501 line too long (84 > 79 characters)
Line 1097:80: E501 line too long (87 > 79 characters)
Line 1116:80: E501 line too long (84 > 79 characters)
Line 1137:80: E501 line too long (85 > 79 characters)
Line 1147:80: E501 line too long (87 > 79 characters)
Line 1156:80: E501 line too long (86 > 79 characters)
Line 1182:80: E501 line too long (82 > 79 characters)
Line 1811:80: E501 line too long (86 > 79 characters)
Line 1851:80: E501 line too long (83 > 79 characters)

Comment last updated on February 14, 2019 at 06:40 Hours UTC

For doing a quick pass through the values within a `MutableMapping`, it
is helpful to have the `values` method. This will be needed in some of
the tests that we are adding. So add a quick implementation of `values`
in the `CustomMapping`.
Currently this is a somewhat loose requirement that not all stores
enforce. However it is a useful check to have in some cases.
Particularly this is a useful constraint with in-memory stores where
there is a concern that the data in the store might be manipulated
externally due to ndarray views. The bytes instances don't have this
problem as they are immutable and own their data. While views can be
take onto bytes instances, they will be read-only views and thus are not
a concern.

For other stores that place their data in some other storage backend
(e.g. on disk), this is less of a concern. Also other stores may choose
to represent their data in other ways (e.g. LMDB with `memoryview`s).
Manipulating the loaded data from these stores is less of a concern
since as it doesn't affect their actually contents, only an in-memory
representation. In these cases, it may make sense to disable this test.
It's possible to request that `LMDBStore` return `buffer`s/`memoryview`s
instead of returning `bytes` for values. This is incredibly useful as
LMDB is memory-mapped. So this avoids a copy when accessing the data.
However this means it will fail `test_store_has_bytes_values`. Though
this is ok as noted previously since that is not a hard requirement of
stores. So we just disable that test for this `LMDBStore` case. The
other `LMDBStore` case returns `bytes` instances instead (copying the
data); so, it passes this test without issues.
Since people will coming looking to see how a store should be
implemented, we should show them good behavior. Namely we should ensure
the values provided to `__setitem__` are `bytes`-like and immutable
(once stored). By coercing the data to `bytes` we ensure that the data
is `bytes`-like and we ensure the data is immutable since `bytes` are
immutable.
As the point of the `LRUStoreCache` is to merely hold onto values
retrieved from the underlying store and keep them around in memory
unaltered, the caching layer doesn't have any control over what type of
values are returned to it. Thus it doesn't make much sense to test
whether the values it returns are of `bytes` type or not. Though this is
fine as there is not strict requirement that values of `bytes` type be
returned by stores, so simply disable `test_store_has_bytes_values` for
the `LRUStoreCache` test case with a note explaining this.
In `Array` when running `_encode_chunk` and using a `dict`-based chunk
store, ensure that the chunk data is of `bytes` type. This is done to
convert the underlying data to an immutable value to protect against
external views onto the data from changing it (as is the case with NumPy
arrays). Also this is done to ensure it is possible to compare
`dict`-based stores easily.

While coercing to a `bytes` object can introduce a copying, this
generally won't happen if a compressor is involved as it will usually
have returned a `bytes` object. Though filters may not, which could
cause this to introduce an extra copy if only filters and no compressors
are used. However this is an unlikely case and is not as important as
guaranteeing the values own their own data and are read-only. Plus this
should allow us to drop the preventive copy that happens earlier when
storing values as this neatly handles that case of no filters and no
compressors.
This copy was taken primarily to protect in-memory stores from being
mutated by external views of the array. However all stores we define
(including in-memory ones like `DictStore`) already perform this
safeguarding themselves on the values they store. For the builtin `dict`
store, we perform this safeguarding for it (since `dict`'s don't do this
themselves) by ensuring we only store `bytes` objects into them. As
these are immutable and own their own data, there isn't a way to mutate
their content after storing them. Thus this preventive copy here is not
needed and can be dropped.
@jakirkham
Copy link
Member Author

Planning on merging end of day Friday if no comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants