-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sz using ctypes #422
Add sz using ctypes #422
Conversation
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After these two changes, this works for me.
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
(how to install in complex CI?)
Codecov Report
@@ Coverage Diff @@
## main #422 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 55 57 +2
Lines 2121 2221 +100
==========================================
+ Hits 2121 2221 +100
|
I was looking through the libaec source code regarding the options. Only two options seem to be used practically. https://gitlab.dkrz.de/k202009/libaec/-/blob/master/src/sz_compat.c#L53-54
#define SZ_MSB_OPTION_MASK 16
#define SZ_NN_OPTION_MASK 32 Thus the functional value classes for
Preprocessing is described below.
Besides the szip compatability, we may also want to consider an interface to AEC directly. |
Personally, I my only use case here is to allow SZIP to be read via kerchunk. Do you suppose anyone would really use this to compress data? I would defer AEC compression and more detailed options until someone asks for them (but that someone might be you, @mkitti ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While some of this may apply to HDF4 or HDF more generally, we primarily based this on the HDF5 implementation directly. I think we should more explicitly refer to "HDF5", capitalized as an abbreviation where relevant.
I also added some constants so we can check pixels_per_block
and pixels_per_scanline
on construction.
Another potential feature might be the ability to pass a ctypes._SimpleCData
subtype as bits_per_pixel
. The bits could then be calculated as bits_per_pixel = ctypes.sizeof(bits_per_pixel)*8
.
Thanks Martin! 🙏 Is there a reason this is using |
A couple of reasons
|
I am happy to accept all those suggestions. Is there a way to do it without clicking through them all? |
At least from my experience on the deployment side, have found Maybe a bigger question worth asking here given some of the concerns noted above, is whether this needs to be in Numcodecs or whether it could be a 3rd party library that plugins into Numcodecs (using the entry point mechanism)? |
kerchunk can happily host it, since it already has some codecs. I don't mind. That does imply that reading the NASA data via kerchunk is probably the only use it'll get. |
The best I know is to use the "add suggestion to batch" button. Sorry. |
SZIP/AEC is a CCSDS recommended standard for loseless since 1997 and most recently published in the August 2020 Blue Book. It is also one of two builtin HDF5 compression filters. The main blocker for wide adoption was that SZIP was under a non-free license. With the release of a BSD licensed libaec from DKRZ (German Climate Computing Center) I think we might see increased adoption of this, especially since it is now being distributed via conda-forge. After conda-forge/hdf5-feedstock#179, I think it's worthwhile to consider first class support for SZIP / AEC in numcodecs. A key impediment to the adoption of compression other than ZLIB-based deflate is the lack of universal availability of the compression codec. We learned a lot about SZIP from this pull request despite not having much documentation to work with. Between this and the C demo earlier, #420 (comment), a Cython version might not be very difficult if that is preferred. |
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
To the extent that adding wouldn't constitute a breaking change later, no?
Agreed. I could only see that under "Files" and not in this view. I applied them, @martindurant. |
|
The direct AEC interface should be a distinct codec here. I do not know what advantage it may have over the SZIP interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make flake happy.
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Can we agree, then, that this codec is useful enough to include here as-is? I would argue that the ctypes implementation is actually simpler and better than cython for this case where are simply calling library functions. |
Usefulness is not really the discussion. Instead think we are discussing where it lives In particular this becomes a conversation about expectations around ownership, maintenance, and what is considered required/optional This already comes up with the various stores, their associated maintenance costs, and who owns them ( zarr-developers/zarr-python#414 ) ( zarr-developers/zarr-python#764 ) ( zarr-developers/zarr-python#1274 ). The general consensus there has been we want to separate those out somehow Think we have similar concern regarding compressors, which we have already collected a lot of. Some of which are domain specific Given we already have an entrypoint system, we are in a much better place for making decisions about what batteries included means In particular I'd push for this to be external. Think we should also do an audit of what is included here and decide whether we split it out.
While these are 2 valid options, by no means are they the only ones. For example, we could add a We can probably think of other choices.
This is a secondary discussion from my perspective. It ties back into maintenance and ownership, but once those are settled then this may be moot. |
If I were to name a distinct package for this code, I would name it numcodecs-aec. It would be a very small package. |
Whilst entrypoints is convenient for finding codecs installed in an environment, it is no help at all in dealing with missing codecs - we don't know how to instruct the user on what they need to install. Thus, there is a lot to be gained by including as many codecs as reasonable in the main repo or other standard collections (thinking of imagecodecs). Also, if we were to change the API, for instance to pass optional context information when en/decoding, having codecs live elsewhere pretty much ensures breaking workflows, at least temporarily. Intake has the same problem, leading to https://intake.readthedocs.io/en/latest/plugin-directory.html and manual maintenance of links, which is also painful. |
Closing in favour of the implementation in imagecodecs https://github.com/cgohlke/imagecodecs/blob/a74e4f305a9c8c317da17beb7a0698543c27b4ae/imagecodecs/numcodecs.py#L1408-L1450 |
Fixes #420
@mkitti , this hard crashes as is, but it should be an OK template to work from.
TODO: