Problem reading VLEN string variable in branch "h5netcdf" #29
Comments
I've just noticed that elsewhere in the code we have things like |
This is, I think, related to #16. Do we know how this was created? (I think the self.dtype tuple cases are often references ...) |
Do you mean "how was the file created"? It was created with |
Can you give me a few lines for a unit test which just creates a file with the vlen string in it? (I am assuming it will look exactly like the example in #16, but I'd rather have a test that does both an nc version and an HDF5 version) |
Not sure what a unit test to create a file should look like, but here's how to do it, and a demo that this file fails in the same way:

import netCDF4
import numpy as np

n = netCDF4.Dataset('vlen_string.nc', "w", format="NETCDF4")
n.Conventions = "CF-1.12"
n.createDimension("time", 4)
months = n.createVariable("months", str, ("time",))
months.long_name = "string: Four months"
months[:] = np.array(["January", "February", "March", "April"], dtype="S8")
n.close()

Using the file:

>>> import pyfive
>>> p = pyfive.File('vlen_string.nc')
>>> p['months']
---------------------------------------------------------------------------
<snip>
File ~/pyfive/pyfive/dataobjects.py:340, in fillvalue(self)
338 payload = self.msg_data[offset:offset+size]
339 fillvalue = np.frombuffer(payload, self.dtype, count=1)[0]
--> 340 else:
341 fillvalue = 0
342 return fillvalue
TypeError: Tuple must have size 2, but has size 3 |
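For what it's worth, the immediate TypeError happens because np.frombuffer is handed pyfive's 3-tuple dtype descriptor rather than a real numpy dtype. A minimal sketch of a guard, assuming that diagnosis (the function name and integration point are hypothetical, not pyfive's actual fix):

import numpy as np

def safe_fillvalue(payload, dtype):
    # Sketch only: pyfive describes variable-length types with tuples
    # like ('VLEN_STRING', padding_type, character_set), which
    # np.frombuffer cannot consume.
    if isinstance(dtype, tuple):
        # A vlen fill value is really a global-heap reference, so it
        # cannot be decoded here; defer it (or default it) instead.
        return None
    return np.frombuffer(payload, dtype, count=1)[0]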
Well, interestingly, I get a different error using a file_like object:

import h5py
import pyfive
import netCDF4 as nc
import io
import numpy as np
def make_file_hdf5(file_like, _vlen_string):
    with h5py.File(file_like, 'w') as hfile:
        dt = h5py.special_dtype(vlen=str)
        v = hfile.create_dataset("var_len_str", (1,), dtype=dt)
        v[0] = _vlen_string

def make_file_nc(file_like, m_array):
    n = nc.Dataset(file_like, "w", format="NETCDF4")
    n.createDimension("time", 4)
    months = n.createVariable("months", str, ("time",))
    months[:] = np.array(m_array, dtype="S8")
    n.close()

def test_vlen_string_hdf5():
    tfile = io.BytesIO()
    _vlen_string = "foo"
    make_file_hdf5(tfile, _vlen_string)
    with pyfive.File(tfile) as hfile:
        ds1 = hfile['var_len_str']
        assert _vlen_string == ds1[0]

def test_vlen_string_nc():
    tfile = io.BytesIO()
    m_array = ["January", "February", "March", "April"]
    make_file_nc(tfile, m_array)
    with pyfive.File(tfile) as hfile:
        ds1 = hfile['months']
        assert np.array_equal(m_array, ds1)

Gives:
for the first one and
for the second (and that's just from trying to open the file). |
Oh, interesting, h5py can't open it either. |
Looks like netCDF4 is doing something odd with in-memory files. Yuck. That's an error to understand as well. |
Ok, now I have the same error as you, but we have three different errors in play now. Yuck. Will commit the test. |
OK - you've fixed it, I guess. I was going to say: Have you tried or Just write it to disk and tidy up afterwards :) |
Gross. netCDF4 is doing weird things with memory files. Not kosher. Not even a little bit. Looks like you can't do it. Unidata/netcdf4-python#295 |
(Doesn't affect this issue per se, but it's really untidy.) |
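For completeness, the write-to-disk-and-tidy-up workaround is easy to keep self-contained; a sketch (make_nc_bytes is a hypothetical helper, not part of the tests above):

import io
import os
import tempfile

import netCDF4 as nc
import numpy as np

def make_nc_bytes(m_array):
    # Write with netCDF4 to a real temporary file, read the bytes back
    # into a BytesIO that pyfive can open, and remove the temp file.
    fd, path = tempfile.mkstemp(suffix=".nc")
    os.close(fd)
    try:
        n = nc.Dataset(path, "w", format="NETCDF4")
        n.createDimension("time", len(m_array))
        months = n.createVariable("months", str, ("time",))
        months[:] = np.array(m_array, dtype="S8")
        n.close()
        with open(path, "rb") as f:
            return io.BytesIO(f.read())
    finally:
        os.remove(path)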
OK, dealing with the ...
elif datatype_class == DATATYPE_VARIABLE_LENGTH:
    vlen_type = self._determine_dtype_vlen(datatype_msg)
    if vlen_type[0] == 'VLEN_SEQUENCE':
        base_type = self.determine_dtype()
        vlen_type = ('VLEN_SEQUENCE', base_type)
    return vlen_type

What we actually have for |
OK, this particular tuple arises here:

def _determine_dtype_vlen(datatype_msg):
    """ Return the dtype information for a variable length class. """
    vlen_type = datatype_msg['class_bit_field_0'] & 0x01
    if vlen_type != 1:
        return ('VLEN_SEQUENCE', 0, 0)
    padding_type = datatype_msg['class_bit_field_0'] >> 4  # bits 4-7
    character_set = datatype_msg['class_bit_field_1'] & 0x01
    return ('VLEN_STRING', padding_type, character_set)

where padding type is
and character set is
There is code that uses this in the |
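For reference, the HDF5 format spec (datatype class 9, variable-length) defines those class bit fields, so the tuple's integers can be decoded independently of pyfive; a sketch (describe_vlen is hypothetical):

PADDING = {0: 'null-terminate', 1: 'null-pad', 2: 'space-pad'}
CHARSET = {0: 'ASCII', 1: 'UTF-8'}

def describe_vlen(class_bit_field_0, class_bit_field_1):
    # bits 0-3: vlen type (0 = sequence, 1 = string)
    if class_bit_field_0 & 0x0F != 1:
        return 'VLEN_SEQUENCE'
    padding = (class_bit_field_0 >> 4) & 0x0F   # bits 4-7: padding type
    charset = class_bit_field_1 & 0x0F          # bits 8-11: character set
    return ('VLEN_STRING', PADDING.get(padding, '?'), CHARSET.get(charset, '?'))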
>>> import h5netcdf
>>> h = h5netcdf.File('vlen_string.nc')
>>> h['months'][...]
array([b'January', b'February', b'March', b'April'], dtype=object)

You probably knew that :) but it might be noteworthy (I'm not sure how, though) |
So we will need access to the relevant global heap. This bit of code is from the attribute reading, but I think the same bit will be necessary for variable length data. There are some issues here about efficiency, laziness, and thread-safety!

def _vlen_size_and_data(self, buf, offset):
    """ Extract the length and data of a variable length attr. """
    # offset should be incremented by 16 after calling this method
    vlen_size, = struct.unpack_from('<I', buf, offset=offset)
    # section IV.B
    # Data with a variable-length datatype is stored in the
    # global heap of the HDF5 file. Global heap identifiers are
    # stored in the data object storage.
    gheap_id = _unpack_struct_from(GLOBAL_HEAP_ID, buf, offset+4)
    gheap_address = gheap_id['collection_address']
    if gheap_address not in self._global_heaps:
        # load the global heap and cache the instance
        gheap = GlobalHeap(self.fh, gheap_address)
        self._global_heaps[gheap_address] = gheap
    gheap = self._global_heaps[gheap_address]
    vlen_data = gheap.objects[gheap_id['object_index']]
    return vlen_size, vlen_data |
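Lifted over to dataset reading, that machinery might look like the following sketch, assuming pyfive's GlobalHeap class is in scope and that the raw buffer holds the elements contiguously:

import struct

def read_vlen_strings(fh, buf, count, global_heaps):
    # Sketch: resolve `count` vlen elements from a raw dataset buffer.
    # Each element is 16 bytes: a 4-byte length followed by a 12-byte
    # global heap ID (8-byte collection address + 4-byte object index).
    out = []
    for i in range(count):
        offset = i * 16
        size, = struct.unpack_from('<I', buf, offset)
        address, index = struct.unpack_from('<QI', buf, offset + 4)
        if address not in global_heaps:
            # load once and cache, as the attribute code above does
            global_heaps[address] = GlobalHeap(fh, address)
        out.append(global_heaps[address].objects[index][:size])
    return out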
The two tests fail in different ways because the netCDF-derived one has a fill value which is variable length, and it too is likely on the heap, which the relevant property doesn't expect. |
It might be that simply engineering this by reusing the code from attributes will work fine in single-threaded code, but I am not sure what to do about this with respect to threading and laziness. If we assume that this has to be lazy, then we don't want to be reading the heap until we try to get at the data (but right now an attempt to get the metadata will try to access the missing data value, which, if it exists, looks like it will be on a/the global heap). That would make any attempt to access these sorts of variables unlazy, but would work well for threading. If we somehow postponed the heap read (for laziness reasons), then we have a problem with all the threads trying to get at different parts of the heap - we know we can't do that with an existing open file pointer. A per-thread file pointer would be fine, but on S3 that would potentially mean multiple GETs of the same object. |
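One possible middle ground, sketched here only to make the trade-off concrete: keep the heap reads lazy but guard the per-file cache with a lock, so threads share one file pointer without interleaving seeks (wiring pyfive's GlobalHeap in like this is an assumption):

import threading

class HeapCache:
    # Sketch: lazy, lock-guarded global-heap cache shared by threads.
    # Heaps are read only on first touch (laziness) and the lock
    # serialises use of the single file pointer (thread safety); on S3
    # each heap would then be fetched at most once.
    def __init__(self, fh):
        self._fh = fh
        self._heaps = {}
        self._lock = threading.Lock()

    def get(self, address):
        with self._lock:
            if address not in self._heaps:
                self._heaps[address] = GlobalHeap(self._fh, address)
            return self._heaps[address]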
So @davidhassell and I discussed this. Clearly we have to prioritise laziness over parallelism: given that it looks like one has to read an entire heap for such a variable, the heap will of necessity fit into memory. The existing reading code for variable length attributes does the following:
I think the thing we need to do is go from some combination of steps 4 and 5 ... |
Sounds good. Minor, minor thought: |
In case it's useful, here is the cfdm string/char data test file: string_char.nc.gz. It's horrible! but should include every use case, I hope. Also, I include the code that creates it:

import netCDF4
import numpy as np

VN = "1.11"  # placeholder here for cfdm's CF version constant
filename = "string_char.nc"

n = netCDF4.Dataset(filename, "w", format="NETCDF4")
n.Conventions = f"CF-{VN}"
n.comment = "A netCDF file with variables of string and char data types"
n.createDimension("dim1", 1)
n.createDimension("time", 4)
n.createDimension("lat", 2)
n.createDimension("lon", 3)
n.createDimension("strlen8", 8)
n.createDimension("strlen7", 7)
n.createDimension("strlen5", 5)
n.createDimension("strlen3", 3)
months = np.array(["January", "February", "March", "April"], dtype="S8")
months_m = np.ma.array(
    months, dtype="S7", mask=[0, 1, 0, 0], fill_value=b""
)
numbers = np.array(
    [["one", "two", "three"], ["four", "five", "six"]], dtype="S5"
)
s_months4 = n.createVariable("s_months4", str, ("time",))
s_months4.long_name = "string: Four months"
s_months4[:] = months
s_months1 = n.createVariable("s_months1", str, ("dim1",))
s_months1.long_name = "string: One month"
s_months1[:] = np.array(["December"], dtype="S8")
s_months0 = n.createVariable("s_months0", str, ())
s_months0.long_name = "string: One month (scalar)"
s_months0[:] = np.array(["May"], dtype="S3")
s_numbers = n.createVariable("s_numbers", str, ("lat", "lon"))
s_numbers.long_name = "string: Two dimensional"
s_numbers[...] = numbers
s_months4m = n.createVariable("s_months4m", str, ("time",))
s_months4m.long_name = "string: Four months (masked)"
array = months.copy()
array[1] = ""
s_months4m[...] = array
c_months4 = n.createVariable("c_months4", "S1", ("time", "strlen8"))
c_months4.long_name = "char: Four months"
c_months4[:, :] = netCDF4.stringtochar(months)
c_months1 = n.createVariable("c_months1", "S1", ("dim1", "strlen8"))
c_months1.long_name = "char: One month"
c_months1[:] = netCDF4.stringtochar(np.array(["December"], dtype="S8"))
c_months0 = n.createVariable("c_months0", "S1", ("strlen3",))
c_months0.long_name = "char: One month (scalar)"
c_months0[:] = np.array(list("May"))
c_numbers = n.createVariable("c_numbers", "S1", ("lat", "lon", "strlen5"))
c_numbers.long_name = "char: Two dimensional"
np.empty((2, 3, 5), dtype="S1")
c_numbers[...] = netCDF4.stringtochar(numbers)
c_months4m = n.createVariable("c_months4m", "S1", ("time", "strlen7"))
c_months4m.long_name = "char: Four months (masked)"
array = netCDF4.stringtochar(months_m)
c_months4m[:, :] = array
n.close() |
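(A quick way to exercise that file, assuming the h5py-style indexing pyfive uses earlier in this thread:)

import pyfive

p = pyfive.File('string_char.nc')
for name in ('s_months4', 'c_months4'):
    data = p[name][...]
    print(name, data.dtype, data)
# The s_* variables are vlen strings (object arrays); the c_* ones are
# fixed-width 'S1' char arrays with a trailing string-length dimension.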
Crikey. I might get there. Meanwhile, good news: the above algorithm works for reading the data, and I now only need to work out how to get the fill value! Progress. |
Status report: I have a working implementation (a20763f), but it fails on the tests above because of the dtype produced (what to do about that is not yet obvious). I have also discovered that the global heap is read for the fillvalue, so the moment we get that, we've read and cached the relevant global heap; the problem being that in my current version it's cached in the wrong place. So I have two issues to address:
|
Ah, well, with respect to the dtype produced, there are two issues: reporting the right value, and getting the right value. It looks to me like h5py gets it wrong, reporting unicode strings as byte strings! Here, for example, is what is going on with just one of the tests. For

months = np.array(["January", "February", "March", "April"], dtype="S8")
s_months4 = n.createVariable("s_months4", str, ("time",))
s_months4[:] = months
validation = {'s_months4': s_months4[:]}

written using netCDF4, we find the following after reading it:
|
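If pyfive wants to report these correctly, the character-set flag already carried in the 3-tuple is enough to pick the Python type; a minimal sketch (vlen_element is hypothetical):

def vlen_element(raw_bytes, character_set):
    # The HDF5 character-set flag is 0 for ASCII and 1 for UTF-8, so a
    # reader can return str for unicode strings and bytes otherwise -
    # h5py returns bytes either way, which is the behaviour in question.
    if character_set == 1:
        return raw_bytes.decode('utf-8')
    return raw_bytes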
I presume that all of the above example outputs are numpy "object" arrays? Vanilla |
Yes, numpy arrays, and yes, I think h5py, and hence vanilla h5netcdf, is wrong (if h5netcdf is not fixing it, and I think it probably can't, because by the time it gets the variable the info is lost). I still have a problem with a couple of the tests (assuming we will do it right, given HDF5 itself knows the difference between a byte string and a unicode string); mind you, it only tells you it is unicode, it doesn't tell you what to use to decode, so there is a bit of an assumption going on in netcdf (and pyfive), I think. Current status (assuming we can ignore most instances of h5py behaving badly):
|
And now:
|
…the dataset itself, which needs to be a new issue. Also the caching stuff needs to be a new issue.
Apart from the two new issues, and given the tests now pass, I'm closing this ticket and dealing with the associated pull request. |
Bugger. People can and (sadly) do create chunked vlen datasets. They may even compress them. Sadly, what that probably means is that they are chunking the addresses of the vlen data; the data itself is still on global heap(s). So we need to support getting those addresses from chunked data too. (This discussion and subsequent ones suggest that as of 2012 there was no value in compressing such data, and I don't believe the situation has changed since.) |
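If that's right, the chunked case is mostly plumbing: each decompressed chunk payload is just an array of the same 16-byte records, so heap resolution can layer on top of whatever chunk reading already exists. A sketch with hypothetical helpers (chunk_buffers yielding decompressed payloads, heap_cache as sketched earlier):

import struct

def read_chunked_vlen(chunk_buffers, heap_cache):
    # Sketch: chunks hold 16-byte (4-byte size + 12-byte global heap ID)
    # records, not the string bytes themselves.
    out = []
    for buf in chunk_buffers:
        for off in range(0, len(buf), 16):
            size, = struct.unpack_from('<I', buf, off)
            address, index = struct.unpack_from('<QI', buf, off + 4)
            out.append(heap_cache.get(address).objects[index][:size])
    return out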
Hi @bnlawrence, possible incorrect treatment of string variables and/or attributes? As this looks like it might be in HDF5 "message" territory (?), I thought I'd hold off looking further, but am happy to dig if you want.
I am using the "h5netcdf" branch.

I have a file with a VLEN string variable called auxiliary: test_file.nc.gz, for which I get:

A print statement of print(f"payload={payload!r}, dtype={self.dtype!r}") just before the line of ultimate failure (~/pyfive/pyfive/dataobjects.py:340) gives

so we have a 3-tuple for the dtype. For a "normal" numeric in the same file (such as p['x']), the same print statement gives

The file behaves fine when opened with h5netcdf with the HDF5 backend.
with the HDF5 backend.The text was updated successfully, but these errors were encountered: