Add sz using ctypes #422

martindurant · 2023-03-06T17:52:22Z

Fixes #420

@mkitti , this hard crashes as is, but it should be an OK template to work from.

TODO:

Unit tests and/or doctests in docstrings
Tests pass locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Docs build locally
GitHub Actions CI passes
Test coverage to 100% (Codecov passes)

numcodecs/sz.py

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

numcodecs/sz.py

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

mkitti

After these two changes, this works for me.

numcodecs/sz.py

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

(how to install in complex CI?)

codecov · 2023-03-07T15:14:52Z

Codecov Report

Merging #422 (2b5f092) into main (67ede4c) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main      #422    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           55        57     +2     
  Lines         2121      2221   +100     
==========================================
+ Hits          2121      2221   +100

Impacted Files	Coverage Δ
numcodecs/__init__.py	`100.00% <100.00%> (ø)`
numcodecs/sz.py	`100.00% <100.00%> (ø)`
numcodecs/tests/test_sz.py	`100.00% <100.00%> (ø)`

mkitti · 2023-03-07T19:24:58Z

I was looking through the libaec source code regarding the options. Only two options seem to be used practically.

https://gitlab.dkrz.de/k202009/libaec/-/blob/master/src/sz_compat.c#L53-54

    co[SZ_MSB_OPTION_MASK] = AEC_DATA_MSB;
    co[SZ_NN_OPTION_MASK] = AEC_DATA_PREPROCESS;

#define SZ_MSB_OPTION_MASK 16
#define SZ_NN_OPTION_MASK 32

Thus the functional value classes for mask_options are

option_mask	endian	preprocess
0	little endian	EC
16	big endian	EC
32	little endian	NN (do preprocessing)
48	big endian	NN (do preprocessing)

Preprocessing is described below.

AEC_DATA_PREPROCESS: preprocessing input will improve compression efficiency if data samples are correlated. It will only cost performance for no gain in efficiency if the data is already uncorrelated.

Besides the szip compatability, we may also want to consider an interface to AEC directly.

martindurant · 2023-03-07T19:54:37Z

Personally, I my only use case here is to allow SZIP to be read via kerchunk. Do you suppose anyone would really use this to compress data? I would defer AEC compression and more detailed options until someone asks for them (but that someone might be you, @mkitti ).

mkitti

While some of this may apply to HDF4 or HDF more generally, we primarily based this on the HDF5 implementation directly. I think we should more explicitly refer to "HDF5", capitalized as an abbreviation where relevant.

I also added some constants so we can check pixels_per_block and pixels_per_scanline on construction.

Another potential feature might be the ability to pass a ctypes._SimpleCData subtype as bits_per_pixel. The bits could then be calculated as bits_per_pixel = ctypes.sizeof(bits_per_pixel)*8.

docs/szip.rst

numcodecs/__init__.py

numcodecs/sz.py

numcodecs/tests/test_sz.py

jakirkham · 2023-03-07T20:11:09Z

Thanks Martin! 🙏

Is there a reason this is using ctypes as opposed to Cython?

martindurant · 2023-03-07T20:23:54Z

Is there a reason this is using ctypes as opposed to Cython?

A couple of reasons

no extra compile stub in our overgrown setup file
only one source file instead of a .pyx and .py wrapper as recommended in my last PR
no need to depend on libaec at build time

martindurant · 2023-03-07T20:28:04Z

I am happy to accept all those suggestions. Is there a way to do it without clicking through them all?

jakirkham · 2023-03-07T20:33:01Z

At least from my experience on the deployment side, have found ctypes more challenging to work with since it makes decisions about where to look for shared libraries, which are not always right and sometimes needs some nudging. Relying on the system library loader doesn't tend to have these issues.

Maybe a bigger question worth asking here given some of the concerns noted above, is whether this needs to be in Numcodecs or whether it could be a 3rd party library that plugins into Numcodecs (using the entry point mechanism)?

martindurant · 2023-03-07T20:38:08Z

whether this needs to be in Numcodecs or whether it could be a 3rd party library

kerchunk can happily host it, since it already has some codecs. I don't mind. That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

mkitti · 2023-03-07T20:38:47Z

I am happy to accept all those suggestions. Is there a way to do it without clicking through them all?

The best I know is to use the "add suggestion to batch" button. Sorry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request

mkitti · 2023-03-08T00:43:17Z

That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

SZIP/AEC is a CCSDS recommended standard for loseless since 1997 and most recently published in the August 2020 Blue Book. It is also one of two builtin HDF5 compression filters. The main blocker for wide adoption was that SZIP was under a non-free license.

With the release of a BSD licensed libaec from DKRZ (German Climate Computing Center) I think we might see increased adoption of this, especially since it is now being distributed via conda-forge. After conda-forge/hdf5-feedstock#179, I think it's worthwhile to consider first class support for SZIP / AEC in numcodecs. A key impediment to the adoption of compression other than ZLIB-based deflate is the lack of universal availability of the compression codec.

We learned a lot about SZIP from this pull request despite not having much documentation to work with. Between this and the C demo earlier, #420 (comment), a Cython version might not be very difficult if that is preferred.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

joshmoore · 2023-03-08T07:34:42Z

I would defer AEC compression and more detailed options until someone asks for them (but that someone might be you, @mkitti ).

To the extent that adding wouldn't constitute a breaking change later, no?

The best I know is to use the "add suggestion to batch" button. Sorry.

Agreed. I could only see that under "Files" and not in this view. I applied them, @martindurant.

mkitti · 2023-03-08T07:49:44Z

We I made flake, unhappy. Let me see if I can fix that.

mkitti · 2023-03-08T08:02:24Z

To the extent that adding wouldn't constitute a breaking change later, no?

The direct AEC interface should be a distinct codec here. I do not know what advantage it may have over the SZIP interface.

mkitti

Make flake happy.

numcodecs/sz.py

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

martindurant · 2023-03-08T15:18:02Z

Can we agree, then, that this codec is useful enough to include here as-is?
A following aec codec could be considered in a separate PR.

I would argue that the ctypes implementation is actually simpler and better than cython for this case where are simply calling library functions.

jakirkham · 2023-03-08T22:47:20Z

Can we agree, then, that this codec is useful enough to include here as-is?

Usefulness is not really the discussion. Instead think we are discussing where it lives

In particular this becomes a conversation about expectations around ownership, maintenance, and what is considered required/optional

This already comes up with the various stores, their associated maintenance costs, and who owns them ( zarr-developers/zarr-python#414 ) ( zarr-developers/zarr-python#764 ) ( zarr-developers/zarr-python#1274 ). The general consensus there has been we want to separate those out somehow

Think we have similar concern regarding compressors, which we have already collected a lot of. Some of which are domain specific

Given we already have an entrypoint system, we are in a much better place for making decisions about what batteries included means

In particular I'd push for this to be external. Think we should also do an audit of what is included here and decide whether we split it out.

whether this needs to be in Numcodecs or whether it could be a 3rd party library
kerchunk can happily host it, since it already has some codecs. I don't mind. That does imply that reading the NASA data via kerchunk is probably the only use it'll get.

While these are 2 valid options, by no means are they the only ones.

For example, we could add a numcodecs-szip repo in this org and give relevant folks permissions. It could also live outside the org if preferred (have a slight preference for within the org, but not a strong one).

We can probably think of other choices.

I would argue that the ctypes implementation is actually simpler and better than cython for this case where are simply calling library functions.

This is a secondary discussion from my perspective. It ties back into maintenance and ownership, but once those are settled then this may be moot.

mkitti · 2023-03-09T01:06:29Z

If I were to name a distinct package for this code, I would name it numcodecs-aec. It would be a very small package.

martindurant · 2023-03-09T14:31:57Z

Whilst entrypoints is convenient for finding codecs installed in an environment, it is no help at all in dealing with missing codecs - we don't know how to instruct the user on what they need to install. Thus, there is a lot to be gained by including as many codecs as reasonable in the main repo or other standard collections (thinking of imagecodecs). Also, if we were to change the API, for instance to pass optional context information when en/decoding, having codecs live elsewhere pretty much ensures breaking workflows, at least temporarily.

Intake has the same problem, leading to https://intake.readthedocs.io/en/latest/plugin-directory.html and manual maintenance of links, which is also painful.

martindurant · 2023-03-17T13:15:17Z

Closing in favour of the implementation in imagecodecs https://github.com/cgohlke/imagecodecs/blob/a74e4f305a9c8c317da17beb7a0698543c27b4ae/imagecodecs/numcodecs.py#L1408-L1450

Add sz using ctypes

bb9a2da

github-actions bot added the needs release notes label Mar 6, 2023

mkitti reviewed Mar 6, 2023

View reviewed changes

numcodecs/sz.py Outdated Show resolved Hide resolved

Update numcodecs/sz.py

64ec7f6

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

mkitti reviewed Mar 6, 2023

View reviewed changes

numcodecs/sz.py Outdated Show resolved Hide resolved

martindurant and others added 2 commits March 6, 2023 16:11

Update numcodecs/sz.py

b71611b

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

one type

834b28f

mkitti suggested changes Mar 7, 2023

View reviewed changes

numcodecs/sz.py Outdated Show resolved Hide resolved

numcodecs/sz.py Outdated Show resolved Hide resolved

mkitti reviewed Mar 7, 2023

View reviewed changes

numcodecs/sz.py Outdated Show resolved Hide resolved

martindurant and others added 4 commits March 7, 2023 09:00

Update numcodecs/sz.py

24ae913

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

Update numcodecs/sz.py

868f907

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

Update numcodecs/sz.py

7d1bbbc

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

Add docs and tests

0e09b41

github-actions bot removed the needs release notes label Mar 7, 2023

Skip without libaec

a06fcbf

(how to install in complex CI?)

martindurant added 5 commits March 7, 2023 10:27

lint and add ilbaec to linux ci

12d40d0

better skip

0dcb23d

typo

00d301b

skip reason

4aa7af0

coverage

90fea47

martindurant marked this pull request as ready for review March 7, 2023 19:23

mkitti suggested changes Mar 7, 2023

View reviewed changes

mkitti mentioned this pull request Mar 8, 2023

SZIP/libaec support on Windows conda-forge/hdf5-feedstock#197

Closed

1 task

Apply suggestions from code review

3220437

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

mkitti approved these changes Mar 8, 2023

View reviewed changes

numcodecs/sz.py Outdated Show resolved Hide resolved

numcodecs/sz.py Show resolved Hide resolved

Mark makes flake happy

2b5f092

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>

martindurant closed this May 12, 2023

mkitti mentioned this pull request Jul 12, 2023

Add Jenkin's lookup3 as a 32-bit checksum for HDF5 #445

Closed

martindurant mentioned this pull request Aug 16, 2023

Support HDF5 compression filter plugins fsspec/kerchunk#351

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add sz using ctypes #422

Add sz using ctypes #422

martindurant commented Mar 6, 2023

mkitti left a comment

codecov bot commented Mar 7, 2023 •

edited

Loading

mkitti commented Mar 7, 2023

martindurant commented Mar 7, 2023

mkitti left a comment

jakirkham commented Mar 7, 2023

martindurant commented Mar 7, 2023

martindurant commented Mar 7, 2023

jakirkham commented Mar 7, 2023

martindurant commented Mar 7, 2023

mkitti commented Mar 7, 2023

mkitti commented Mar 8, 2023

joshmoore commented Mar 8, 2023

mkitti commented Mar 8, 2023 •

edited

Loading

mkitti commented Mar 8, 2023

mkitti left a comment

martindurant commented Mar 8, 2023

jakirkham commented Mar 8, 2023

mkitti commented Mar 9, 2023 •

edited

Loading

martindurant commented Mar 9, 2023

martindurant commented Mar 17, 2023

Add sz using ctypes #422

Add sz using ctypes #422

Conversation

martindurant commented Mar 6, 2023

mkitti left a comment

Choose a reason for hiding this comment

codecov bot commented Mar 7, 2023 • edited Loading

Codecov Report

mkitti commented Mar 7, 2023

martindurant commented Mar 7, 2023

mkitti left a comment

Choose a reason for hiding this comment

jakirkham commented Mar 7, 2023

martindurant commented Mar 7, 2023

martindurant commented Mar 7, 2023

jakirkham commented Mar 7, 2023

martindurant commented Mar 7, 2023

mkitti commented Mar 7, 2023

mkitti commented Mar 8, 2023

joshmoore commented Mar 8, 2023

mkitti commented Mar 8, 2023 • edited Loading

mkitti commented Mar 8, 2023

mkitti left a comment

Choose a reason for hiding this comment

martindurant commented Mar 8, 2023

jakirkham commented Mar 8, 2023

mkitti commented Mar 9, 2023 • edited Loading

martindurant commented Mar 9, 2023

martindurant commented Mar 17, 2023

codecov bot commented Mar 7, 2023 •

edited

Loading

mkitti commented Mar 8, 2023 •

edited

Loading

mkitti commented Mar 9, 2023 •

edited

Loading