Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68
Conversation
All the pathological tests now pass though two new lower priority issues have been generated.
VLEN strings
(I see we're failing the checks due to some build dependencies. Will sort that shortly.)
I've switched to …
```
.idea
.DS_Store
test-reports/
<_io.Bytes*>
```
If filenames with < in them are generated, I'd like to see them.
This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but it also addresses our need to a) have a pure Python backend reader for `h5netcdf`, and b) expose variable b-trees to other downstream software. Our motivations included thread-safety and performance at scale in a cloud environment.

To do this we have implemented versions of some more components of the `h5py` stack, in particular a version of the `h5d.DatasetID` class, which now holds all the code used for data access (as opposed to attribute access, which still lives in `dataobjects`). There are a couple of extra methods for exposing the chunk index directly rather than via an iterator, and for accessing chunk info using the zarr indexing scheme rather than the `h5py` indexing scheme.

The code also includes an implementation of what we have called *pseudochunking*, which is used for accessing a contiguous array that is larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.

There are many small bug fixes and optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open `pyfive.File` instance `f`, when you do `v = f['variable_name']`), the attributes and b-tree are read, and it is then possible to close the parent file (`f`) but continue to use the variable (`v`). We have test coverage that shows this usage of `v` is thread-safe (the test which demonstrates this is slow, but it needs to be, as shorter tests were sporadically passing). The test harness now includes all the components necessary for testing pyfive accessing data via both POSIX and S3.

As well as closing #6, this pull request would close #41, #59, #60, and #64.
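The pseudochunking idea described above can be sketched roughly as follows. This is an illustrative sketch only, not pyfive's actual implementation or API: the function names, the 4 MiB default, and the helper structure are all hypothetical. The point is that for a contiguous C-order array, keeping the trailing dimensions whole and splitting only the leading dimension yields "pseudo" chunks that each correspond to one contiguous byte range on disk (or in S3), which can then be fetched independently.

```python
# Illustrative sketch of pseudochunking; names and defaults are
# hypothetical, not pyfive's real API.
from itertools import product
from math import prod


def default_pseudochunks(shape, itemsize, target_bytes=4 * 2**20):
    """Choose a chunk shape aligned with C (row-major) order on disk.

    Trailing dimensions are kept whole, so each chunk covers one
    contiguous run of bytes; only the leading dimension is split.
    """
    # Bytes in one "row block" (everything but the leading dimension).
    row_bytes = prod(shape[1:]) * itemsize if len(shape) > 1 else itemsize
    # Split the leading dimension so each chunk is roughly target_bytes.
    rows = max(1, min(shape[0], target_bytes // max(row_bytes, 1)))
    return (rows,) + tuple(shape[1:])


def chunk_slices(shape, chunks):
    """Yield the tuple of slices covering each pseudo-chunk in turn."""
    starts = [range(0, s, c) for s, c in zip(shape, chunks)]
    for origin in product(*starts):
        yield tuple(slice(o, min(o + c, s))
                    for o, c, s in zip(origin, chunks, shape))
```

A reader can then loop over `chunk_slices(...)`, turning each slice tuple into a single contiguous byte-range request, which is what makes out-of-memory access to a contiguous dataset practical over S3.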