Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sz using ctypes #422

Closed
wants to merge 16 commits into from
Closed

Conversation

martindurant
Copy link
Member

Fixes #420

@mkitti , this hard crashes as is, but it should be an OK template to work from.

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
martindurant and others added 2 commits March 6, 2023 16:11
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Copy link
Contributor

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After these two changes, this works for me.

martindurant and others added 4 commits March 7, 2023 09:00
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
(how to install in complex CI?)
@codecov
Copy link

codecov bot commented Mar 7, 2023

Codecov Report

Merging #422 (2b5f092) into main (67ede4c) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main      #422    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           55        57     +2     
  Lines         2121      2221   +100     
==========================================
+ Hits          2121      2221   +100     
Impacted Files Coverage Δ
numcodecs/__init__.py 100.00% <100.00%> (ø)
numcodecs/sz.py 100.00% <100.00%> (ø)
numcodecs/tests/test_sz.py 100.00% <100.00%> (ø)

@martindurant martindurant marked this pull request as ready for review March 7, 2023 19:23
@mkitti
Copy link
Contributor

mkitti commented Mar 7, 2023

I was looking through the libaec source code regarding the options. Only two options seem to be used practically.

https://gitlab.dkrz.de/k202009/libaec/-/blob/master/src/sz_compat.c#L53-54

    co[SZ_MSB_OPTION_MASK] = AEC_DATA_MSB;
    co[SZ_NN_OPTION_MASK] = AEC_DATA_PREPROCESS;
#define SZ_MSB_OPTION_MASK 16
#define SZ_NN_OPTION_MASK 32

Thus the functional value classes for mask_options are

option_mask endian preprocess
0 little endian EC
16 big endian EC
32 little endian NN (do preprocessing)
48 big endian NN (do preprocessing)

Preprocessing is described below.

AEC_DATA_PREPROCESS: preprocessing input will improve compression efficiency if data samples are correlated. It will only cost performance for no gain in efficiency if the data is already uncorrelated.

Besides the szip compatability, we may also want to consider an interface to AEC directly.

@martindurant
Copy link
Member Author

Personally, I my only use case here is to allow SZIP to be read via kerchunk. Do you suppose anyone would really use this to compress data? I would defer AEC compression and more detailed options until someone asks for them (but that someone might be you, @mkitti ).

Copy link
Contributor

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While some of this may apply to HDF4 or HDF more generally, we primarily based this on the HDF5 implementation directly. I think we should more explicitly refer to "HDF5", capitalized as an abbreviation where relevant.

I also added some constants so we can check pixels_per_block and pixels_per_scanline on construction.

Another potential feature might be the ability to pass a ctypes._SimpleCData subtype as bits_per_pixel. The bits could then be calculated as bits_per_pixel = ctypes.sizeof(bits_per_pixel)*8.

@jakirkham
Copy link
Member

Thanks Martin! 🙏

Is there a reason this is using ctypes as opposed to Cython?

@martindurant
Copy link
Member Author

Is there a reason this is using ctypes as opposed to Cython?

A couple of reasons

  • no extra compile stub in our overgrown setup file
  • only one source file instead of a .pyx and .py wrapper as recommended in my last PR
  • no need to depend on libaec at build time

@martindurant
Copy link
Member Author

I am happy to accept all those suggestions. Is there a way to do it without clicking through them all?

@jakirkham
Copy link
Member

At least from my experience on the deployment side, have found ctypes more challenging to work with since it makes decisions about where to look for shared libraries, which are not always right and sometimes needs some nudging. Relying on the system library loader doesn't tend to have these issues.

Maybe a bigger question worth asking here given some of the concerns noted above, is whether this needs to be in Numcodecs or whether it could be a 3rd party library that plugins into Numcodecs (using the entry point mechanism)?

@martindurant
Copy link
Member Author

whether this needs to be in Numcodecs or whether it could be a 3rd party library

kerchunk can happily host it, since it already has some codecs. I don't mind. That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

@mkitti
Copy link
Contributor

mkitti commented Mar 7, 2023

I am happy to accept all those suggestions. Is there a way to do it without clicking through them all?

The best I know is to use the "add suggestion to batch" button. Sorry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request

@mkitti
Copy link
Contributor

mkitti commented Mar 8, 2023

That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

SZIP/AEC is a CCSDS recommended standard for loseless since 1997 and most recently published in the August 2020 Blue Book. It is also one of two builtin HDF5 compression filters. The main blocker for wide adoption was that SZIP was under a non-free license.

With the release of a BSD licensed libaec from DKRZ (German Climate Computing Center) I think we might see increased adoption of this, especially since it is now being distributed via conda-forge. After conda-forge/hdf5-feedstock#179, I think it's worthwhile to consider first class support for SZIP / AEC in numcodecs. A key impediment to the adoption of compression other than ZLIB-based deflate is the lack of universal availability of the compression codec.

We learned a lot about SZIP from this pull request despite not having much documentation to work with. Between this and the C demo earlier, #420 (comment), a Cython version might not be very difficult if that is preferred.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
@joshmoore
Copy link
Member

I would defer AEC compression and more detailed options until someone asks for them (but that someone might be you, @mkitti ).

To the extent that adding wouldn't constitute a breaking change later, no?

The best I know is to use the "add suggestion to batch" button. Sorry.

Agreed. I could only see that under "Files" and not in this view. I applied them, @martindurant.

@mkitti
Copy link
Contributor

mkitti commented Mar 8, 2023

We I made flake, unhappy. Let me see if I can fix that.

@mkitti
Copy link
Contributor

mkitti commented Mar 8, 2023

To the extent that adding wouldn't constitute a breaking change later, no?

The direct AEC interface should be a distinct codec here. I do not know what advantage it may have over the SZIP interface.

Copy link
Contributor

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make flake happy.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
@martindurant
Copy link
Member Author

Can we agree, then, that this codec is useful enough to include here as-is?
A following aec codec could be considered in a separate PR.

I would argue that the ctypes implementation is actually simpler and better than cython for this case where are simply calling library functions.

@jakirkham
Copy link
Member

Can we agree, then, that this codec is useful enough to include here as-is?

Usefulness is not really the discussion. Instead think we are discussing where it lives

In particular this becomes a conversation about expectations around ownership, maintenance, and what is considered required/optional

This already comes up with the various stores, their associated maintenance costs, and who owns them ( zarr-developers/zarr-python#414 ) ( zarr-developers/zarr-python#764 ) ( zarr-developers/zarr-python#1274 ). The general consensus there has been we want to separate those out somehow

Think we have similar concern regarding compressors, which we have already collected a lot of. Some of which are domain specific

Given we already have an entrypoint system, we are in a much better place for making decisions about what batteries included means

In particular I'd push for this to be external. Think we should also do an audit of what is included here and decide whether we split it out.

whether this needs to be in Numcodecs or whether it could be a 3rd party library
kerchunk can happily host it, since it already has some codecs. I don't mind. That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

While these are 2 valid options, by no means are they the only ones.

For example, we could add a numcodecs-szip repo in this org and give relevant folks permissions. It could also live outside the org if preferred (have a slight preference for within the org, but not a strong one).

We can probably think of other choices.

I would argue that the ctypes implementation is actually simpler and better than cython for this case where are simply calling library functions.

This is a secondary discussion from my perspective. It ties back into maintenance and ownership, but once those are settled then this may be moot.

@mkitti
Copy link
Contributor

mkitti commented Mar 9, 2023

If I were to name a distinct package for this code, I would name it numcodecs-aec. It would be a very small package.

@martindurant
Copy link
Member Author

Whilst entrypoints is convenient for finding codecs installed in an environment, it is no help at all in dealing with missing codecs - we don't know how to instruct the user on what they need to install. Thus, there is a lot to be gained by including as many codecs as reasonable in the main repo or other standard collections (thinking of imagecodecs). Also, if we were to change the API, for instance to pass optional context information when en/decoding, having codecs live elsewhere pretty much ensures breaking workflows, at least temporarily.

Intake has the same problem, leading to https://intake.readthedocs.io/en/latest/plugin-directory.html and manual maintenance of links, which is also painful.

@martindurant
Copy link
Member Author

Closing in favour of the implementation in imagecodecs https://github.com/cgohlke/imagecodecs/blob/a74e4f305a9c8c317da17beb7a0698543c27b4ae/imagecodecs/numcodecs.py#L1408-L1450

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

libaec / szip support
4 participants