higher zstd compression level resulting in larger compressed data #3793
This can happen, though it is not the expectation. All these algorithms make bets, even the strongest ones. We already know that these bets can be weaponized to make them fail. It is more likely to happen in the presence of synthetic data, which tends to be unrepresentative of real-world scenarios. But of course, from time to time, real scenarios can be affected too.

The solution is far from trivial. It generally requires access to the source data, in order to understand its behavior and how it's analyzed by the algorithm. Then, sometimes, a heuristic can be tweaked to better handle the corner case. Other times, it can't, or the fix comes at the expense of another corner case.

Long term, we could imagine an even stronger (and slower) compression parser that would make fewer bets about the future, because it would have access to more information to evaluate trade-offs. Sadly, it hasn't happened yet, because it's a very time-consuming task, and we don't have real production workloads which would benefit from this effort. Also, the gains are expected to be small, i.e. < 1% for general cases, excluding pathological corner cases. So this effort tends to be postponed repeatedly. It's still in the cards, and I still hope it will happen someday.

For the time being, our best option is to have a look at the source data, if it's shareable, and try to learn from it.
This works great for 32-bit arrays, notably the synthetic ones with extreme regularity. Unfortunately, it's not universal, and in some cases it's a loss. Crucially, on average, it's a loss on silesia. The most negatively impacted file is x-ray. It deserves an investigation before suggesting it as an evolution.
Current situation regarding this specific scenario:
Compression ratio could be even better at higher compression levels, but at least performance doesn't degrade anymore.
I'll provide some context here: I've seen this on other occasions too, like Elasticsearch's index files. The original data I referenced comes from TSDB timestamps, so they increment by 30 (seconds) for every sample, with occasional larger increments like 3600 (1 hour) and/or 86400 (1 day). The original code used a hand-written delta compression (with varint and other tricks) but no zstd. I tried to replace the hand-written algorithm with zstd to see if I could make the code more general and easier to maintain, while still providing comparable compression benefits.

The unexpected part is not (only) the compression ratio per se, but also that a higher compression level results in a noticeably larger size. I thought that a higher level would internally try all the heuristics of all lower levels.
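For illustration, here is a minimal sketch of the kind of hand-written delta + varint scheme described above (hypothetical names, not the actual project code): each timestamp is stored as a delta from the previous one, and each delta is written as an LEB128-style varint, which is why mostly constant increments of 30 take very little space.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one unsigned value as an LEB128-style varint; returns bytes written. */
static size_t varint_put(uint8_t* out, uint64_t v)
{
    size_t n = 0;
    while (v >= 0x80) { out[n++] = (uint8_t)(v | 0x80); v >>= 7; }
    out[n++] = (uint8_t)v;
    return n;
}

/* Delta-encode a non-decreasing timestamp series into `out`; returns bytes
 * written. `out` must be large enough (worst case 10 bytes per value). */
size_t delta_varint_encode(const uint64_t* ts, size_t count, uint8_t* out)
{
    size_t pos = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        pos += varint_put(out + pos, ts[i] - prev);  /* store the delta */
        prev = ts[i];
    }
    return pos;
}
```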
Yes, but I believe the situation was not as bad for arrays of 64-bit values. Therefore, comparatively, arrays of 32-bit values receive a more substantial boost.
Not without breaking the format.
Using zstd 1.5.5, the latest version as of writing:
Prepare an int array (each int occupies 4 bytes, little endian):
[0, 30, 60, 90, ...]
65536 ints, i.e. 65536*4 bytes. Then compress it using various compression levels (simple compression, no dict).
As seen from the output, a higher compression level (18) starts resulting in larger compressed data than lower levels.
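A minimal sketch of how such a measurement can be reproduced with the zstd one-shot C API (the list of levels below is illustrative; the exact sizes reported in the original issue are not reproduced here):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    /* Build the input described above: 65536 little-endian 32-bit ints
       [0, 30, 60, 90, ...]. */
    const size_t count = 65536;
    const size_t srcSize = count * 4;
    uint8_t* src = malloc(srcSize);
    if (src == NULL) return 1;
    for (size_t i = 0; i < count; i++) {
        uint32_t v = (uint32_t)(i * 30);
        src[4*i + 0] = (uint8_t)(v);
        src[4*i + 1] = (uint8_t)(v >> 8);
        src[4*i + 2] = (uint8_t)(v >> 16);
        src[4*i + 3] = (uint8_t)(v >> 24);
    }

    const size_t dstCapacity = ZSTD_compressBound(srcSize);
    void* dst = malloc(dstCapacity);
    if (dst == NULL) { free(src); return 1; }

    /* Simple one-shot compression, no dictionary, at a range of levels. */
    const int levels[] = { 1, 3, 9, 15, 17, 18, 19, 22 };
    for (size_t i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
        size_t cSize = ZSTD_compress(dst, dstCapacity, src, srcSize, levels[i]);
        if (ZSTD_isError(cSize)) {
            fprintf(stderr, "level %d failed: %s\n", levels[i], ZSTD_getErrorName(cSize));
            continue;
        }
        printf("level %2d: %zu -> %zu bytes\n", levels[i], srcSize, cSize);
    }

    free(dst);
    free(src);
    return 0;
}
```

Building with `cc repro.c -lzstd` and running it prints the compressed size per level, so a regression between neighboring levels on this input shows up directly in the output.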
A search for "compression level size" in the issues returns no relevant information on the first page, nor relevant results on Google :( Sorry if this has already been brought up. There's also a related question that I'm putting into the same issue (forgive me :)