
Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68

Open · wants to merge 121 commits into master

Conversation

bnlawrence (Collaborator)

This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but it also addresses our need to (a) have a pure Python backend reader for h5netcdf, and (b) expose variable b-trees to other downstream software.

Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the h5py stack, in particular a version of the h5d.DatasetID class, which now holds all the code used for data access (as opposed to attribute access, which still lives in dataobjects). There are a couple of extra methods: one exposes the chunk index directly rather than via an iterator, and another accesses chunk info using the zarr indexing scheme rather than the h5py indexing scheme.
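For illustration, a minimal sketch of how such a DatasetID-style interface might be driven. The method names below mirror the h5py low-level API and are assumptions about the pyfive spelling, not a statement of the actual new API:

```python
# Hedged sketch only: filenames and method names are illustrative, and the
# real pyfive signatures may differ from the h5py ones echoed here.
import pyfive

f = pyfive.File('example.h5')
v = f['variable_name']
did = v.id                         # DatasetID-style object for data access

n = did.get_num_chunks()           # chunk count (h5py-style, assumed name)
info = did.get_chunk_info(0)       # offset/size of one chunk (assumed name)
# and, per the description above, an additional lookup keyed by zarr-style
# chunk-grid coordinates rather than h5py element coordinates
```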

The code also includes an implementation of what we have called pseudochunking, which is used for accessing a contiguous array larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.
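A minimal sketch of the idea, not the PR's actual implementation: for a C-ordered contiguous array, default chunks can be whole slabs along the leading dimension, so each pseudo-chunk is a single contiguous byte range on disk (the function name and target_bytes parameter are illustrative):

```python
import numpy as np

def pseudo_chunks(shape, itemsize, target_bytes=4 * 2**20):
    # Bytes in one "row" (one slab of the trailing dimensions).
    row_bytes = int(np.prod(shape[1:], dtype=np.int64)) * itemsize
    # Take as many whole rows as fit in the target chunk size.
    rows = max(1, target_bytes // row_bytes)
    return (min(rows, shape[0]),) + tuple(shape[1:])

# A (10000, 720, 1440) float32 array gets (1, 720, 1440) pseudo-chunks,
# i.e. one ~4 MiB contiguous slab per read.
print(pseudo_chunks((10000, 720, 1440), 4))
```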

There are many small bug fixes and optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do v = f['variable_name']), the attributes and b-tree are read; it is then possible to close the parent file (f) but continue to use the variable (v). We have test coverage showing that this usage of v is thread-safe (the test that demonstrates this is slow, but it needs to be, as shorter tests were sporadically passing). The test harness now includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
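For example, the usage pattern described above looks roughly like this (a sketch assuming h5py-like slicing and close() semantics; the filename and variable name are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import pyfive

f = pyfive.File('example.h5')
v = f['variable_name']        # attributes and b-tree are read here
f.close()                     # the parent file can now be closed...

# ...while v remains usable, including concurrently from several threads
def read_slice(i):
    return v[i, ...]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_slice, range(8)))
```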

As well as closing #6, this pull request would close #41, #59, #60, and #64.

Bryan Lawrence and others added 30 commits February 22, 2024 12:20
…changes to actually using the filter pipeline. At this point it is failing test_reference.
…lso remove list definition which breaks references.
… a pseudo chunked read. Lots of things to do around optimising that read, but let's test this more widely first.
Adding support for reading only chunks and various pieces of the H5Py lower level interface
Bryan Lawrence and others added 23 commits January 20, 2025 14:57
…the dataset itself, which needs to be a new issue. Also the caching stuff needs to be a new issue.
All the pathological tests now pass, though two new lower-priority issues have been generated.
@bnlawrence (Collaborator, Author)

(I see we're failing the checks due to some build dependencies. Will sort that shortly.)

@valeriupredoi (Collaborator) commented Jan 30, 2025

> (I see we're failing the checks due to some build dependencies. Will sort that shortly.)

I've switched to `pip install .[testing]` in the GA workflow so we get the deps for testing (only) 👍

Review comment on the following ignore-list entries:

```
.idea
.DS_Store
test-reports/
<_io.Bytes*>
```
If filenames with < in them are generated, I'd like to see them.

Successfully merging this pull request may close these issues:

- unable to read attributes when the size of attribute lists is relatively big