Problem reading VLEN string variable in branch "h5netcdf" #29

Closed
davidhassell opened this issue Jan 20, 2025 · 29 comments · Fixed by #37

Comments

@davidhassell
Collaborator

Hi @bnlawrence, possible incorrect treatment of string variables and/or attributes? As this looks like it might be in HDF5 "message" territory (?), I thought I'd hold off looking further, but am happy to dig if you want.

I am using the "h5netcdf" branch.

I have a file with a VLEN string variable called auxiliary: test_file.nc.gz, for which I get:

In [30]: import pyfive

In [31]: p = pyfive.File('test_file.nc')

In [32]: !ncdump -h test_file.nc | grep 'string auxiliary'   # check that 'auxiliary' is a string variable
	string auxiliary(grid_latitude) ;

In [33]: aux = p['auxiliary']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[41], line 1
----> 1 aux = p['auxiliary']

File ~/pyfive/pyfive/high_level.py:94, in Group.__getitem__(self, y)
     92     if additional_obj != '.':
     93         raise KeyError('%s is a dataset, not a group' % (obj_name))
---> 94     return Dataset(obj_name, DatasetID(dataobjs), self)
     96 try:
     97     # if true, this may well raise a NotImplementedError, if so, we need
     98     # to warn the user, who may be able to use other parts of the data.
     99     is_datatype = dataobjs.is_datatype

File ~/pyfive/pyfive/h5d.py:82, in DatasetID.__init__(self, dataobject, pseudo_chunking_size_MB)
     79 else:
     80     self.dtype = np.dtype(dataobject.dtype)
---> 82 self._meta = DatasetMeta(dataobject)
     84 self._index =  None
     85 match self.layout_class:

File ~/pyfive/pyfive/h5d.py:461, in DatasetMeta.__init__(self, dataobject)
    459 self.shuffle = dataobject.shuffle
    460 self.fletcher32 = dataobject.fletcher32
--> 461 self.fillvalue = dataobject.fillvalue
    462 self.attributes = dataobject.get_attributes()
    464 #horrible kludge for now, this isn't really the same sort of thing
    465 #https://github.com/NCAS-CMS/pyfive/issues/13#issuecomment-2557121461
    466 # this is used directly in the Dataset init method.

File ~/pyfive/pyfive/dataobjects.py:340, in DataObjects.fillvalue(self)
    338     payload = self.msg_data[offset:offset+size]
    339     print (repr(payload), self.dtype)
--> 340     fillvalue = np.frombuffer(payload, dtype=self.dtype, count=1)[0]
    341 else:
    342     fillvalue = 0

TypeError: Tuple must have size 2, but has size 3

Adding print(f"payload={payload!r}, dtype={self.dtype!r}") just before the line of ultimate failure (~/pyfive/pyfive/dataobjects.py:340) gives

payload=b'\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x13\x00\x00\x00', dtype=('VLEN_STRING', 0, 1)

so we have a 3-tuple for the dtype.
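
(For reference: numpy dtype specs only accept tuples of size 2, interpreted as (base, shape) or (name, format) pairs, so the 3-tuple fails the size check before the 'VLEN_STRING' string is even parsed. A minimal reproduction, independent of pyfive:)

>>> import numpy as np
>>> np.dtype(('<f8', 2))             # a 2-tuple is a valid (base, shape) spec
dtype(('<f8', (2,)))
>>> np.dtype(('VLEN_STRING', 0, 1))
TypeError: Tuple must have size 2, but has size 3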

For a "normal" numeric in the same file (such as p['x']), the same print statement gives

payload=b'\x00\x00\x00\x00\x00\x00\x9eG', dtype='<f8'

The file behaves fine when open with h5netcdf with the HDF5 backend.

@davidhassell
Collaborator Author

I've just noticed that elsewhere in the code we have things like if not isinstance(self.dtype, tuple):. Would that sort of construction be relevant here, I wonder?
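
Something like this in the fillvalue property, perhaps (a minimal sketch, assuming a tuple dtype always marks a VLEN type whose fill value can't be read inline):

if isinstance(self.dtype, tuple):
    # VLEN dtypes such as ('VLEN_STRING', padding, charset): the payload
    # holds a global heap ID rather than an inline value, so np.frombuffer
    # can't interpret it -- defer to proper heap handling
    fillvalue = None  # placeholder until VLEN fill values are supported
else:
    fillvalue = np.frombuffer(payload, dtype=self.dtype, count=1)[0]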

@bnlawrence
Collaborator

bnlawrence commented Jan 20, 2025

This is, I think, related to #16. Do we know how this was created?

(I think the self.dtype tuple cases are often references ...)

@davidhassell
Collaborator Author

Do we know how this was created?

Do you mean "how was the file created"? It was created with cfdm using netCDF4 under the hood.

bnlawrence pushed a commit that referenced this issue Jan 20, 2025
@bnlawrence
Collaborator

bnlawrence commented Jan 20, 2025

Can you give me a few lines for a unit test which just creates a file with the vlen string in it? (I am assuming it will look exactly like the example in #16, but I'd rather have a test that does both an nc version and an hdf5 version.)

@davidhassell
Collaborator Author

Not sure what a unit test to create a file should look like, but here's how to do it, and a demo that this file fails in the same way:

import netCDF4
import numpy as np

n = netCDF4.Dataset('vlen_string.nc', "w", format="NETCDF4")
n.Conventions = "CF-1.12"
n.createDimension("time", 4)
months = n.createVariable("months", str, ("time",))
months.long_name = "string: Four months"
months[:] = np.array(["January", "February", "March", "April"], dtype="S8")
n.close()

Using the file:

>>> import pyfive
>>> p = pyfive.File('vlen_string.nc')
>>> p['months']
---------------------------------------------------------------------------
<snip>

File ~/pyfive/pyfive/dataobjects.py:340, in fillvalue(self)
    338     payload = self.msg_data[offset:offset+size]
    339     fillvalue = np.frombuffer(payload, self.dtype, count=1)[0]
--> 340 else:
    341     fillvalue = 0
    342 return fillvalue

TypeError: Tuple must have size 2, but has size 3

@bnlawrence
Collaborator

bnlawrence commented Jan 20, 2025

Well, interestingly, I get a different error using a file_like object:

import h5py
import pyfive
import netCDF4 as nc
import io
import numpy as np

def make_file_hdf5(file_like, _vlen_string):
    with h5py.File(file_like, 'w') as hfile:
        dt = h5py.special_dtype(vlen=str)
        v = hfile.create_dataset("var_len_str", (1,), dtype=dt)
        v[0] = _vlen_string

def make_file_nc(file_like, m_array):
    n = nc.Dataset(file_like, "w", format="NETCDF4")
    n.createDimension("time", 4)
    months = n.createVariable("months", str, ("time",))
    months[:] = np.array(m_array, dtype="S8")
    n.close()

def test_vlen_string_hdf5():
    tfile = io.BytesIO()
    _vlen_string = "foo"
    make_file_hdf5(tfile, _vlen_string)
    with pyfive.File(tfile) as hfile:
        ds1 = hfile['var_len_str']
        assert _vlen_string == ds1[0]

def test_vlen_string_nc():
    tfile = io.BytesIO()
    m_array = ["January", "February", "March", "April"]
    make_file_nc(tfile, m_array)
    with pyfive.File(tfile) as hfile:
        ds1 = hfile['months']
        assert np.array_equal(m_array, ds1)

Gives:

NotImplementedError: datatype not implemented - VLEN_STRING

for the first one and

struct.error: unpack_from requires a buffer of at least 9 bytes for unpacking 1 bytes at offset 8 (actual buffer size is 0)

for the second (and that's just from trying to open the file).

@bnlawrence
Collaborator

Oh, interesting, h5py can't open it either.

@bnlawrence
Collaborator

Looks like netCDF4 is doing something odd with in-memory files. Yuck. That's an error to understand as well.

@bnlawrence
Collaborator

OK, now I have the same error as you, but we have three different errors in play now. Yuck. Will commit the test.

bnlawrence pushed a commit that referenced this issue Jan 20, 2025
@davidhassell
Collaborator Author

OK - you've fixed it, I guess. I was going to say:

Have you tried n = nc.Dataset(file_like, "w", format="NETCDF4", diskless=True), and maybe not closing n?

or

Just write it to disk and tidy up afterwards :)

@bnlawrence
Collaborator

Gross. netCDF4 is doing weird things with memory files. Not kosher. Not even a little bit. Looks like you can't do it. Unidata/netcdf4-python#295

@bnlawrence
Collaborator

(Doesn't affect this issue per se, but it's really untidy.)

@bnlawrence
Collaborator

OK, dealing with the h5py-created one: the problem starts with the dtype we return being "wrong", as shown in #16. I appear to have already flagged this with a #FIXME.
The real problem is in datatype_message.py, where we find this in determine_dtype:

...
elif datatype_class == DATATYPE_VARIABLE_LENGTH:
    vlen_type = self._determine_dtype_vlen(datatype_msg)
    if vlen_type[0] == 'VLEN_SEQUENCE':
        base_type = self.determine_dtype()
        vlen_type = ('VLEN_SEQUENCE', base_type)
    return vlen_type

What we actually have for vlen_type is ('VLEN_STRING', 0, 1). The base_type, which can be investigated using self.determine_dtype() even though this code path doesn't reach it, is '<u1' (uint8 in numpy language).

@bnlawrence
Collaborator

OK, this particular tuple arises here:

def _determine_dtype_vlen(datatype_msg):
    """ Return the dtype information for a variable length class. """
    vlen_type = datatype_msg['class_bit_field_0'] & 0x01
    if vlen_type != 1:
        return ('VLEN_SEQUENCE', 0, 0)
    padding_type = datatype_msg['class_bit_field_0'] >> 4  # bits 4-7
    character_set = datatype_msg['class_bit_field_1'] & 0x01
    return ('VLEN_STRING', padding_type, character_set)

where padding_type is

| Value | Description |
| --- | --- |
| 0 | Null terminate: a zero byte marks the end of the string and is guaranteed to be present after converting a long string to a short string. When converting a short string to a long string, the value is padded with additional null characters as necessary. |
| 1 | Null pad: null characters are added to the end of the value during conversion from a short string to a longer string. Conversion from a long string to a shorter string simply truncates the value. |
| 2 | Space pad: space characters are added to the end of the value during conversion from a short string to a longer string. Conversion from a long string to a shorter string simply truncates the value. This is the Fortran representation of the string. |

and character_set is

| Value | Description |
| --- | --- |
| 0 | ASCII character set encoding |
| 1 | UTF-8 character set encoding |

There is code that uses this in the _attr_value method of DataObjects in dataobjects.py; the issue here is that our value lives inside an array type.
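
For concreteness, unpacking the tuple from the failing file against those two tables (a quick sanity check, not pyfive code):

kind, padding_type, character_set = ('VLEN_STRING', 0, 1)
assert padding_type == 0    # null-terminated strings
assert character_set == 1   # UTF-8, consistent with a netCDF4-written file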

@davidhassell
Collaborator Author

netCDF4 and h5netcdf both return HDF5 VLEN "string" types as object-type numpy arrays:

>>> import h5netcdf
>>> h = h5netcdf.File('vlen_string.nc')
>>> h['months'][...]
array([b'January', b'February', b'March', b'April'], dtype=object)

You probably knew that :) but it might be noteworthy (I'm not sure how, though)

@bnlawrence
Collaborator

bnlawrence commented Jan 20, 2025

So we will need access to the relevant global heap.

This bit of code is from the attribute reading, but I think the same logic will be necessary for variable-length data. There are some issues here about efficiency, laziness, and thread safety!

def _vlen_size_and_data(self, buf, offset):
    """ Extract the length and data of a variable-length attr. """
    # offset should be incremented by 16 after calling this method
    vlen_size, = struct.unpack_from('<I', buf, offset=offset)
    # section IV.B
    # Data with a variable-length datatype is stored in the
    # global heap of the HDF5 file. Global heap identifiers are
    # stored in the data object storage.
    gheap_id = _unpack_struct_from(GLOBAL_HEAP_ID, buf, offset+4)
    gheap_address = gheap_id['collection_address']
    if gheap_address not in self._global_heaps:
        # load the global heap and cache the instance
        gheap = GlobalHeap(self.fh, gheap_address)
        self._global_heaps[gheap_address] = gheap
    gheap = self._global_heaps[gheap_address]
    vlen_data = gheap.objects[gheap_id['object_index']]
    return vlen_size, vlen_data

@bnlawrence
Collaborator

The two tests fail in different ways because the netCDF-derived one has a fill value which is variable-length; it too is likely on the heap, which isn't expected by the relevant property.

@bnlawrence
Collaborator

bnlawrence commented Jan 21, 2025

It might be that simply engineering this by reusing the code from attributes will work fine in single-threaded code, but I am not sure what to do about this wrt threading and laziness. If we assume that this has to be lazy, then we don't want to read the heap until we try to get at the data (but right now an attempt to get the metadata will try to access the missing-data value, which, if it exists, looks like it will be on a/the global heap). That would make any attempt to access this sort of variable unlazy, but work well for threading. If we somehow postponed the heap read (for laziness reasons), then we have a problem with all the threads trying to get at different parts of the heap - we know we can't do that with one shared open file pointer. A per-thread file pointer would be fine, but on S3 that would potentially mean multiple GETs of the same object.

@bnlawrence
Collaborator

So @davidhassell and I discussed this. Clearly we have to prioritise laziness over parallelism, and given that it looks like one has to read an entire heap for such a variable, the heap will of necessity fit into memory.

The existing reading code for variable length attributes does the following:

  1. It parses the attribute message (_parse_attribute_msg) from an attribute message at a particular offset. In doing so, it
  2. reads the attribute header, name, and datatype information. This datatype corresponds to what we will have for our data datatype.
  3. It then determines a dataspace from that, which I think is again something we will already have.
  4. It counts the number of items (count = np.prod(shape)),
  5. increments the offset, then grabs the "value" using ._attr_value(dtype, buffer, items, offset).
  6. That involves creating an empty buffer to put the values in (np.empty(count, dtype=object)), and a loop that gets the data (_, vlen_data = self._vlen_size_and_data(buf, offset)) before decoding it using the character set stored in the dtype.
  7. Inside _vlen_size_and_data, the thing that is read from storage is first the heap address, and then some heap magic gets the data. I am sure we can re-use the heap magic.

I think the thing we need to do is go from some combination of steps 4 and 5 ...
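
Something like this, perhaps (a rough sketch only: it reuses the existing _vlen_size_and_data, and assumes shape and the dtype tuple are already available from the dataobjects machinery; names are illustrative, not final):

import numpy as np

def _read_vlen_string_data(self, buf, offset, shape):
    """Sketch: read VLEN string values into an object array."""
    _, padding_type, character_set = self.dtype  # e.g. ('VLEN_STRING', 0, 1)
    count = int(np.prod(shape))
    value = np.empty(count, dtype=object)
    for i in range(count):
        _, vlen_data = self._vlen_size_and_data(buf, offset)
        # charset flag: 0 = ASCII, 1 = UTF-8 (ASCII is a subset of UTF-8)
        value[i] = vlen_data.decode('utf-8' if character_set else 'ascii')
        offset += 16  # 4-byte length + 12-byte global heap ID per element
    return value.reshape(shape)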

@davidhassell
Collaborator Author

Sounds good.

Minor, minor thought: from math import prod; count = prod(shape) is roughly 100x faster than np.prod for small tuples, and it always returns a Python int, unlike np.prod. It's only been in the standard library for a few years (Python 3.8+).
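
For illustration:

>>> from math import prod
>>> shape = (2, 3)
>>> prod(shape)   # a plain Python int, no numpy scalar in sight
6
>>> prod(())      # an empty (scalar) shape gives 1, same as np.prod
1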

@davidhassell
Collaborator Author

In case it's useful, here is the cfdm string/char data test file: string_char.nc.gz. It's horrible! But it should include every use case, I hope. I also include the netCDF4 code that made it.

netcdf string_char {
dimensions:
	dim1 = 1 ;
	time = 4 ;
	lat = 2 ;
	lon = 3 ;
	strlen8 = 8 ;
	strlen7 = 7 ;
	strlen5 = 5 ;
	strlen3 = 3 ;
variables:
	string s_months4(time) ;
		s_months4:long_name = "string: Four months" ;
	string s_months1(dim1) ;
		s_months1:long_name = "string: One month" ;
	string s_months0 ;
		s_months0:long_name = "string: One month (scalar)" ;
	string s_numbers(lat, lon) ;
		s_numbers:long_name = "string: Two dimensional" ;
	string s_months4m(time) ;
		s_months4m:long_name = "string: Four months (masked)" ;
	char c_months4(time, strlen8) ;
		c_months4:long_name = "char: Four months" ;
	char c_months1(dim1, strlen8) ;
		c_months1:long_name = "char: One month" ;
	char c_months0(strlen3) ;
		c_months0:long_name = "char: One month (scalar)" ;
	char c_numbers(lat, lon, strlen5) ;
		c_numbers:long_name = "char: Two dimensional" ;
	char c_months4m(time, strlen7) ;
		c_months4m:long_name = "char: Four months (masked)" ;

// global attributes:
		:Conventions = "CF-1.12" ;
		:comment = "A netCDF file with variables of string and char data types" ;
data:

 s_months4 = "January", "February", "March", "April" ;

 s_months1 = "December" ;

 s_months0 = "May" ;

 s_numbers =
  "one", "two", "three",
  "four", "five", "six" ;

 s_months4m = "January", _, "March", "April" ;

 c_months4 =
  "January",
  "February",
  "March",
  "April" ;

 c_months1 =
  "December" ;

 c_months0 = "May" ;

 c_numbers =
  "one",
  "two",
  "three",
  "four",
  "five",
  "six" ;

 c_months4m =
  "January",
  "",
  "March",
  "April" ;
}
    n = netCDF4.Dataset(filename, "w", format="NETCDF4")

    n.Conventions = "CF-1.12"
    n.comment = "A netCDF file with variables of string and char data types"

    n.createDimension("dim1", 1)
    n.createDimension("time", 4)
    n.createDimension("lat", 2)
    n.createDimension("lon", 3)
    n.createDimension("strlen8", 8)
    n.createDimension("strlen7", 7)
    n.createDimension("strlen5", 5)
    n.createDimension("strlen3", 3)

    months = np.array(["January", "February", "March", "April"], dtype="S8")

    months_m = np.ma.array(
        months, dtype="S7", mask=[0, 1, 0, 0], fill_value=b""
    )

    numbers = np.array(
        [["one", "two", "three"], ["four", "five", "six"]], dtype="S5"
    )

    s_months4 = n.createVariable("s_months4", str, ("time",))
    s_months4.long_name = "string: Four months"
    s_months4[:] = months

    s_months1 = n.createVariable("s_months1", str, ("dim1",))
    s_months1.long_name = "string: One month"
    s_months1[:] = np.array(["December"], dtype="S8")

    s_months0 = n.createVariable("s_months0", str, ())
    s_months0.long_name = "string: One month (scalar)"
    s_months0[:] = np.array(["May"], dtype="S3")

    s_numbers = n.createVariable("s_numbers", str, ("lat", "lon"))
    s_numbers.long_name = "string: Two dimensional"
    s_numbers[...] = numbers

    s_months4m = n.createVariable("s_months4m", str, ("time",))
    s_months4m.long_name = "string: Four months (masked)"
    array = months.copy()
    array[1] = ""
    s_months4m[...] = array

    c_months4 = n.createVariable("c_months4", "S1", ("time", "strlen8"))
    c_months4.long_name = "char: Four months"
    c_months4[:, :] = netCDF4.stringtochar(months)

    c_months1 = n.createVariable("c_months1", "S1", ("dim1", "strlen8"))
    c_months1.long_name = "char: One month"
    c_months1[:] = netCDF4.stringtochar(np.array(["December"], dtype="S8"))

    c_months0 = n.createVariable("c_months0", "S1", ("strlen3",))
    c_months0.long_name = "char: One month (scalar)"
    c_months0[:] = np.array(list("May"))

    c_numbers = n.createVariable("c_numbers", "S1", ("lat", "lon", "strlen5"))
    c_numbers.long_name = "char: Two dimensional"
    c_numbers[...] = netCDF4.stringtochar(numbers)

    c_months4m = n.createVariable("c_months4m", "S1", ("time", "strlen7"))
    c_months4m.long_name = "char: Four months (masked)"
    array = netCDF4.stringtochar(months_m)
    c_months4m[:, :] = array

    n.close()

@bnlawrence
Collaborator

Crikey. I might get there. Meanwhile, good news: the above algorithm works for reading the data; I now only need to work out how to get the fill value! Progress.

@bnlawrence
Collaborator

Status report: I have a working implementation (a20763f), but it fails the tests above because of the dtype produced (and what to do about that is not yet obvious). I have also discovered that the global heap is read for the fillvalue, so the moment we get that, we've read and cached the relevant global heap - the problem being that in my current version it's cached in the wrong place. So I have two issues to address:

  1. Optimising what we're up to with the global heap, and
  2. Deciding how and where to deal with dtypes.

@bnlawrence
Collaborator

bnlawrence commented Jan 22, 2025

Ah, well, with respect to the dtype produced, there are two issues, reporting the right value, and getting the right value.

It looks to me like h5py gets it wrong, reporting unicode strings as byte strings! Here, for example, is what is going on with just one of the tests:

For

months = np.array(["January", "February", "March", "April"], dtype="S8")
s_months4 = n.createVariable("s_months4", str, ("time",))
s_months4[:] = months
validation = {'s_months4': s_months4[:]}

written using netCDF4, we find the following after reading it:

Validation data ['January' 'February' 'March' 'April']
h5py [b'January' b'February' b'March' b'April']
pyfive ['January' 'February' 'March' 'April']

@davidhassell
Collaborator Author

I presume that all of the above example outputs are numpy "object" arrays?

Vanilla h5netcdf also returns byte strings (presumably because it doesn't change what h5py gives it?), but netCDF4 returns unicode, like pyfive is doing above.

@bnlawrence
Collaborator

Yes, numpy object arrays, and yes, I think h5py, and hence vanilla h5netcdf (if it is not fixing it, and I think it probably can't, because by the time it gets the variable it's lost the info), is wrong.

I still have a problem with a couple of the tests (assuming we will do it right, given HDF5 itself knows the difference between a byte string and a unicode string). Mind you, it only tells you it is unicode; it doesn't tell you what to use to decode, so there is a bit of an assumption going on in netCDF4 (and pyfive), I think.
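
Something like this, I suppose (a sketch of that assumption, not the committed code):

def decode_vlen_bytes(raw, character_set):
    # HDF5 only flags ASCII (0) versus UTF-8 (1); it never names a codec,
    # so UTF-8 has to be assumed for the unicode case
    return raw.decode('utf-8' if character_set else 'ascii')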

Current status (assuming we can ignore most instances of h5py behaving badly):

tests/test_vlen_str.py ['foo' 'foobar']
..--> Passing s_months4 (object,object)
--> Passing s_months1 (object,object)
---> Failing s_months0 (object,object)
Original data May
h5py object b'May'
pyfive ('VLEN_STRING', 0, 1) ['May']
---> Failing s_numbers (object,object)
Original data [['one' 'two' 'three']
 ['four' 'five' 'six']]
h5py object [[b'one' b'two' b'three']
 [b'four' b'five' b'six']]
pyfive ('VLEN_STRING', 0, 1) ['one' 'two' 'three' 'four' 'five' 'six']
--> Passing s_months4m (object,object)
--> Passing c_months4 (|S1,|S1)
--> Passing c_months1 (|S1,|S1)
--> Passing c_months0 (|S1,|S1)
--> Passing c_numbers (|S1,|S1)
--> Passing c_months4m (|S1,|S1)

@bnlawrence
Collaborator

And now:

tests/test_vlen_str.py ['foo' 'foobar']
..--> Passing s_months4 (object,object)
--> Passing s_months1 (object,object)
--> Passing s_months0 (object,object)
--> Passing s_numbers (object,object)
--> Passing s_months4m (object,object)
--> Passing c_months4 (|S1,|S1)
--> Passing c_months1 (|S1,|S1)
--> Passing c_months0 (|S1,|S1)
--> Passing c_numbers (|S1,|S1)
--> Passing c_months4m (|S1,|S1)

bnlawrence pushed a commit that referenced this issue Jan 22, 2025
…the dataset itself, which needs to be a new issue. Also the caching stuff needs to be a new issue.
@bnlawrence
Collaborator

Apart from the two new issues, and given the tests now pass, I'm closing this ticket and dealing with the associated pull request.

@bnlawrence
Collaborator

Bugger. People can and (sadly) do create chunked vlen datasets. They may even compress them. What that probably means is that they are chunking the addresses of the vlen data; the data itself is still on global heap(s). So we need to support getting those addresses from chunked data too.

(This discussion and subsequent ones suggest that as of 2012 there was no value in compressing such data, and I don't believe the situation has changed since.)
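
For what it's worth, if each stored element really is the usual 16-byte record (a 4-byte length plus a 12-byte global heap ID), decoding a decompressed chunk might look something like this (a sketch under that assumption; get_heap_object is a hypothetical helper wrapping the existing GlobalHeap cache):

import struct
import numpy as np

def decode_vlen_chunk(chunk_bytes, count, get_heap_object):
    """Sketch: turn one decompressed chunk of a VLEN dataset into values."""
    out = np.empty(count, dtype=object)
    for i in range(count):
        offset = i * 16
        # skip the 4-byte length, then read the 8-byte collection address
        # and 4-byte object index that make up the global heap ID
        address, index = struct.unpack_from('<QI', chunk_bytes, offset + 4)
        out[i] = get_heap_object(address, index)
    return out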

@bnlawrence bnlawrence reopened this Jan 23, 2025
This was referenced Jan 24, 2025
@davidhassell davidhassell linked a pull request Jan 28, 2025 that will close this issue