
Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68

Open · wants to merge 121 commits into master

Conversation

bnlawrence (Collaborator)

This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but it also addresses our need to (a) have a pure Python backend reader for h5netcdf, and (b) expose variable b-trees to other downstream software.

Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the h5py stack, in particular a version of the h5d.DatasetID class, which now holds all the code used for data access (as opposed to attribute access, which still lives in dataobjects). There are a couple of extra methods: one exposes the chunk index directly rather than via an iterator, and another accesses chunk info using the zarr indexing scheme rather than the h5py indexing scheme.
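For illustration, a minimal sketch of how such a DatasetID-style interface might be driven. The method names below mirror the h5py low-level API and are assumptions about the pyfive spelling, not a statement of the actual new API:

```python
# Hedged sketch only: filenames and method names are illustrative, and the
# real pyfive signatures may differ from the h5py ones echoed here.
import pyfive

f = pyfive.File('example.h5')
v = f['variable_name']
did = v.id                         # DatasetID-style object for data access

n = did.get_num_chunks()           # chunk count (h5py-style, assumed name)
info = did.get_chunk_info(0)       # offset/size of one chunk (assumed name)
# and, per the description above, an additional lookup keyed by zarr-style
# chunk-grid coordinates rather than h5py element coordinates
```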

The code also includes an implementation of what we have called pseudochunking, which is used for accessing a contiguous array larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.
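A minimal sketch of the idea, not the PR's actual implementation: for a C-ordered contiguous array, default chunks can be whole slabs along the leading dimension, so each pseudo-chunk is a single contiguous byte range on disk (the function name and target_bytes parameter are illustrative):

```python
import numpy as np

def pseudo_chunks(shape, itemsize, target_bytes=4 * 2**20):
    # Bytes in one "row" (one slab of the trailing dimensions).
    row_bytes = int(np.prod(shape[1:], dtype=np.int64)) * itemsize
    # Take as many whole rows as fit in the target chunk size.
    rows = max(1, target_bytes // row_bytes)
    return (min(rows, shape[0]),) + tuple(shape[1:])

# A (10000, 720, 1440) float32 array gets (1, 720, 1440) pseudo-chunks,
# i.e. one ~4 MiB contiguous slab per read.
print(pseudo_chunks((10000, 720, 1440), 4))
```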

There are many small bug fixes and optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do v = f['variable_name']), the attributes and b-tree are read; it is then possible to close the parent file (f) but continue to use the variable (v). We have test coverage showing that this usage of v is thread-safe (the test that demonstrates this is slow, but it needs to be, as shorter tests were sporadically passing). The test harness now includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
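For example, the usage pattern described above looks roughly like this (a sketch assuming h5py-like slicing and close() semantics; the filename and variable name are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import pyfive

f = pyfive.File('example.h5')
v = f['variable_name']        # attributes and b-tree are read here
f.close()                     # the parent file can now be closed...

# ...while v remains usable, including concurrently from several threads
def read_slice(i):
    return v[i, ...]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_slice, range(8)))
```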

As well as closing #6, this pull request would close #41, #59, #60, and #64.

Bryan Lawrence and others added 30 commits February 22, 2024 12:20
…changes to actually using the filter pipeline. At this point it is failing test_reference.
…lso remove list definition which breaks references.
… a pseudo chunked read. Lots of things to do around optimising that read, but let's test this more widely first.
Adding support for reading only chunks and various pieces of the H5Py lower level interface
Bryan Lawrence and others added 23 commits January 20, 2025 14:57
…the dataset itself, which needs to be a new issue. Also the caching stuff needs to be a new issue.
All the pathological tests now pass, though two new lower-priority issues have been generated.
@bnlawrence (Collaborator, Author)

(I see we're failing the checks due to some build dependencies. Will sort that shortly.)

@valeriupredoi (Collaborator) commented Jan 30, 2025

> (I see we're failing the checks due to some build dependencies. Will sort that shortly.)

I've switched to `pip install .[testing]` in the GA workflow so we get the deps for testing (only) 👍

Review comment on the following ignore-list entries:

```
.idea
.DS_Store
test-reports/
<_io.Bytes*>
```
If filenames with < in them are generated, I'd like to see them.

Successfully merging this pull request may close these issues:

- unable to read attributes when the size of attribute lists is relatively big