Handle NetCDF variable length strings (and other VLen types) #6340
Conversation
Would be better to force all Variable Length types to be lazy, but can't ascertain this information from the CFAuxiliaryCoordinateVariable instance.
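For context, here is a minimal sketch, at the raw netCDF4-python level rather than via Iris's CF wrapper classes, of how the variable-length nature of a variable could in principle be probed; this is only an illustration, not the code used in this PR.

```python
import netCDF4


def is_vlen(var: netCDF4.Variable) -> bool:
    """Return True if ``var`` holds a variable-length (VLen) type.

    VLen string variables report the Python ``str`` type as their dtype;
    other user-defined VLen types are exposed as ``netCDF4.VLType`` instances.
    """
    return var.dtype is str or isinstance(var.datatype, netCDF4.VLType)
```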
variable length "str" case.
storage is netCDF (which for this module is true, but for mock testing is not)
Codecov Report

```
@@            Coverage Diff             @@
##             main    #6340      +/-   ##
==========================================
- Coverage   89.80%   89.80%   -0.01%
==========================================
  Files          90       90
  Lines       23576    23589      +13
  Branches     4398     4402       +4
==========================================
+ Hits        21172    21183      +11
  Misses       1662     1662
- Partials      742      744       +2
```
Good stuff, solves the problem nicely I think 👍
Apart from the outstanding test failure, which I have proposed a fix for (see comment).
Also a couple of small comments and docs to adjust, as noted.
BTW I think it's really important that people can get what they want out of these cases by taking control of the variable chunking, which currently doesn't work since it hits the 'itemsize' line and crashes. I think it's rather cool that we have the drop on xarray here: we can specify chunking on a per-variable basis, but they don't.
Use 'safe-access' version of netCDF4.VLType
Yes - to a degree. Users can now pass down custom chunk parameters for variable length arrays, but as the "ragged array size" is not defined by an actual dimension we still have no way of passing this down to Dask. We can hint at the mean ragged array length to aid the decision for lazy loading, but that's about the extent of it. Your chunks will always be bigger in memory than you expect for variable length arrays, as the chunking only works over the explicitly known dimensions. This does at least allow us to load variable length string data now in a way that is consistent with how other variable length data was being handled.
I totally agree with that -- the chunks of course don't have known sizes.
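As a rough illustration of the behaviour discussed above, the sketch below shows how a chunking hint might be supplied for a variable-length variable. The filename, variable name, and the exact keyword usage of the `_vl_hint` name are assumptions based on this discussion, not a confirmed API.

```python
import iris
from iris.fileformats.netcdf.loader import CHUNK_CONTROL

# Hypothetical file and variable names, for illustration only.
fname = "ragged_observations.nc"

# CHUNK_CONTROL already supports per-variable, per-dimension chunk sizes.
# Following the discussion above, the special "_vl_hint" name is assumed to
# hint the mean length of the ragged (variable-length) part of the data,
# which has no real dimension in the file; treat this as a sketch rather
# than the final API.
with CHUNK_CONTROL.set("observations", station=1000, _vl_hint=40):
    cubes = iris.load(fname)
```

Even with such a hint, as noted above, the actual in-memory chunk size can exceed the estimate, since the ragged lengths are only known once the data has been read.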
New sample data is now being picked up from iris-sample-data and doctests are now succeeding. This one is ready for a final review now.
Some earlier comments still to address (above, all in the docs). Otherwise it is updated and the sample data is now functioning.
The docs should already be up-to-date.
However, the doctest GHA is failing due to the RTD environment having an older version of iris-sample-data (2.5.1).
Oh, perhaps it could be...
OK - have updated these and doctests are succeeding now.
⏱️ Performance Benchmark Report: 548bc76 (full benchmark results in the GHA run)
@ukmo-ccbunney testing all good now, I'm really keen to get this in!
⏱️ Performance Benchmark Report: 07dfb1b (full benchmark results in the GHA run)
⏱️ Performance Benchmark Report: 96c9d0c (full benchmark results in the GHA run)
⏱️ Performance Benchmark Report: b2409cd (full benchmark results in the GHA run)
Just one tiny further suggestion ...
⏱️ Performance Benchmark Report: 16c97c7 (full benchmark results in the GHA run)
⏱️ Performance Benchmark Report: edf8e41 (full benchmark results in the GHA run)
🚀 Pull Request
Description
NetCDF provides a "variable length" data type (`VLen`) that can be used to store arrays where one dimension size is unknown/variable. This variable length datatype is used when creating a new variable with the `str` type (crucially, this is different from an `NC_CHAR` type).

Iris fails to load `VLen` `str` types as it cannot determine the itemsize of a `str` type when trying to calculate whether the array is big enough to lazy load. Other `VLen` types load fine (such as floats, chars, ints, etc.), but the calculation of the total array size will be incorrect as the size of the variable length array dimension is not known until the data has been read from disk.

This PR makes the following changes:
- For the `str` type, a default itemsize of 4 bytes is used (sufficient for storing a single Unicode character).
- A default fill value for `str` types is not a thing in netCDF4, so I have made it the same as the 'S1' type (a null byte).
- The expected length of the variable length dimension can be hinted via the `CHUNK_CONTROL` context manager using a special `_vl_hint` dimension name.

Fixes: #6149
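For illustration, here is a minimal sketch of the kind of file this PR is about. The file, dimension, and variable names are invented, and the closing comment simply restates the 4-byte itemsize estimate described above.

```python
import netCDF4
import numpy as np

# Build a small file containing a variable-length ("VLen") string variable
# and a ragged VLen integer variable. All names are illustrative only.
with netCDF4.Dataset("vlen_example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("station", 3)

    # Passing the Python ``str`` type creates a VLen string variable
    # (not an NC_CHAR array): each element can hold a string of any length.
    names = ds.createVariable("station_name", str, ("station",))
    names[0] = "Lerwick"
    names[1] = "Camborne"
    names[2] = "Watnall"

    # A user-defined VLen type holds ragged numeric data: each element is a
    # 1-D array whose length is only known once the data is read from disk.
    vl_int = ds.createVLType(np.int32, "ragged_int")
    samples = ds.createVariable("samples", vl_int, ("station",))
    samples[0] = np.array([1, 2, 3], dtype=np.int32)
    samples[1] = np.array([4], dtype=np.int32)
    samples[2] = np.array([5, 6], dtype=np.int32)

# With the change described above, the in-memory size of the string variable
# is estimated as (number of elements) * 4 bytes when deciding whether to
# load lazily, e.g. 3 stations -> an estimated 12 bytes here.
```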