Replication of bug #3517 #4015
If I read it correctly, the case in #3517 requires:

So that's a pretty hard combination of circumstances to generate, even on purpose.

Looking at your trace: it still doesn't explain why the data is wrong. It could happen if a dictionary was used to compress this data, in which case this long offset is searching for a match in the dictionary.

If the software is the cause of this corruption, it should be reproducible, meaning that the same input with the same compression settings should result in the same corruption event. This can be difficult to reproduce, though, if you no longer have the original data, or if you don't know what the compression settings were. It would also give a simple mitigation strategy: just revert to an older (presumed working) version.

At this point, what would help is some watermarking that traces the origin system that produced the data, when and how. If all corruption events come from the same system, for example, that's a pretty strong indication. Unfortunately, that's easier said than done. If such a watermark was not in place at the time of the corruption event, there is very little decompression can do to investigate or fix the problem after the fact.

That being said, you also mention that checksum was enabled, so it can now be used as a "validator".
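For reference, enabling the frame checksum so it can act as a validator is a one-parameter change with the advanced API (a minimal sketch, assuming v1.4.0+ single-shot compression; the function name and buffer handling are illustrative):

```c
#include <stddef.h>
#include <zstd.h>

/* Sketch: request the 32-bit (XXH64-based) frame checksum at compression time.
 * ZSTD_decompress() then verifies it automatically and fails with a
 * checksum_wrong error if the decoded content does not match. */
static size_t compress_with_checksum(ZSTD_CCtx* cctx, int level,
                                     void* dst, size_t dstCap,
                                     const void* src, size_t srcSize)
{
    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_and_parameters);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1);
    return ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
}
```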
This is super suspicious. Like Yann said, unless a dictionary was used, this should be basically impossible for the compressor to generate. This corruption could not be caused by the issue fixed in #3517.

I'd start by investigating bitflips. When we investigate issues like this internally, the root cause almost always ends up being bad hardware of some sort. What is the order of magnitude of the number of compressions you are doing? Thousands, millions, billions, or trillions? Are all the corrupt blobs compressed by the same host? Are you sending the data over the network unencrypted, where bitflips could happen?

I'd first start by looking for bitflips in the frame where the first sequence is corrupted. The first block is 6672 bytes, so the bits for the first offset will be stored right near the end of the block, since they are read in reverse. So I would basically start flipping bits from the end of the block, for say 20 bytes to be safe, and see if any bitflips cause the checksum to succeed.

Beyond that, I'll have to think a bit about how we can go about debugging this. Maybe revert to v1.2.0 temporarily to see if the issue goes away?
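For anyone wanting to try the bit-flip experiment, a rough sketch of a single-bitflip search over the tail of the compressed blob might look like this (the scan window, buffer sizes, and function names are assumptions, not code from this issue):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

/* Flip each bit in the last `span` bytes of the compressed blob, one at a
 * time, and check whether decompression then succeeds (the frame checksum,
 * if present, is verified by ZSTD_decompress itself). */
static void try_single_bitflips(const void* comp, size_t compSize,
                                size_t dstCapacity, size_t span)
{
    unsigned char* scratch = malloc(compSize);
    void* dst = malloc(dstCapacity);
    if (!scratch || !dst) { free(scratch); free(dst); return; }

    size_t const start = compSize > span ? compSize - span : 0;
    for (size_t pos = start; pos < compSize; pos++) {
        for (int bit = 0; bit < 8; bit++) {
            memcpy(scratch, comp, compSize);
            scratch[pos] ^= (unsigned char)(1u << bit);
            size_t const r = ZSTD_decompress(dst, dstCapacity, scratch, compSize);
            if (!ZSTD_isError(r))
                printf("decompression succeeds with byte %zu bit %d flipped\n",
                       pos, bit);
        }
    }
    free(scratch);
    free(dst);
}
```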
If you think that the issue will eventually reproduce, you could add decompression verification directly after compression. Then, when it fails, save both the original data and the compressed data for debugging. Then you can see if the issue reproduces, both on the same host and on other hosts. This will rule out faulty hardware.

If it is a deterministic issue that reproduces on another host, and you have the original input data, then I am 100% confident that we can work together to find & fix the issue, even though you can't share the data. At that point, just bisecting the issue to a commit would likely be enough to find the issue.
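A minimal sketch of the verify-after-compress idea, assuming simple one-shot compression (names and the scratch-buffer handling are illustrative; a zero return is the cue to dump both the original and compressed blobs):

```c
#include <string.h>
#include <zstd.h>

/* Compress, then immediately decompress into a scratch buffer and compare
 * with the original. Returns the compressed size on success, 0 on any
 * failure or mismatch (caller can then save `src` and `dst` for offline
 * debugging). */
static size_t compress_and_verify(void* dst, size_t dstCap,
                                  const void* src, size_t srcSize,
                                  void* scratch, size_t scratchCap, int level)
{
    size_t const cSize = ZSTD_compress(dst, dstCap, src, srcSize, level);
    if (ZSTD_isError(cSize)) return 0;

    size_t const dSize = ZSTD_decompress(scratch, scratchCap, dst, cSize);
    if (ZSTD_isError(dSize) || dSize != srcSize
        || memcmp(scratch, src, srcSize) != 0)
        return 0;   /* corruption produced at compression time: save blobs */

    return cSize;
}
```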
Thank you @Cyan4973 and @terrelln for your responses. We use zstd to compress blocks of data up to 256KiB in size. At this point, there are billions of compressed blocks in the world. Since we updated, there are probably hundreds of millions of blocks compressed with the new version.

We have seen two sets of occurrences of the error (both with the new version): two independent hardware configurations in separate parts of the world that have no relationship to each other. Each hit the problem twice while running a unique workload. So, workload A created two hits on different days for user A, and workload B created two hits on different days for user B.

We have CRC across the "decompressed" data. We will start working on a bit-flip sandbox to see if we can get the data to recover. We also have neighboring compressed blocks that are similar in nature that do decompress. We will also work on adding a decompression-verification routine that can be enabled dynamically. We also attempted to decompress the data with the older version, with no success.

If we continue to see occurrences, we may have to revert out of caution. At this point, it's most likely a flaw in the newer zstd version we are now running or a flaw in our code that leverages the new version. It could perhaps still be hardware, given both user A and B use the same brand/configuration of hardware components.
If trying out bitflips doesn't work, I highly suggest running a decompression directly after the compression. Then log the original blob if it fails. If you have that blob, and can reproduce the issue & bisect, we will be able to fix the issue. It also has the additional property of validating that the blob is not corrupt. In the meantime, I will do a bit of digging to see if I can think of anything that could cause this issue. I don't really expect to find anything without being able to reproduce it though. I have a few questions to narrow my search:
Are you using this code in a multi-threaded environment?
It's multi-threaded, but it's always the same thread which calls the routine. What we posted above is also a streamlined version of what's actually being called. The context is saved per instance or handle, and the handle is only accessed by a single thread at any given time.
Is the system you're compressing on 32-bit or 64-bit?
Compressed on 32-bit; attempting decompression on both 32- and 64-bit. 32-bit, big-endian.
@AV-Coding, one note:
Hardware corruption can happen with any pair of compression / decompression setups. Good machines can go bad over time as the hardware ages. Corruption can happen at any point from the compression machine, to various NICs and routers in between the hosts, to issues on the decompression machine.

In addition to @terrelln's suggestion (verify decompression immediately after compression), another way to rule out hardware issues is to add a checksum of the compressed data. If you discover a compressed blob where the compressed checksum matches, but the decompressed checksum fails, then you can rule out a large class of hardware errors.
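A sketch of the compressed-data checksum idea, using the XXH64 hash bundled with zstd (any strong checksum would do; the header path and how the digest is stored alongside the blob are assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include "xxhash.h"   /* bundled with zstd under lib/common/ */

/* Digest computed over the compressed bytes and stored alongside each blob.
 * If this digest still matches at read time while the CRC of the
 * decompressed data fails, the corruption happened at (or before)
 * compression time rather than in transit or at rest. */
static uint64_t compressed_blob_digest(const void* comp, size_t compSize)
{
    return (uint64_t)XXH64(comp, compSize, 0 /* seed */);
}
```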
We agree. We already completed that second suggestion a couple of years ago by adding an extra CRC over the compressed data. Not all users have that newer level, and unfortunately, neither of the cases that have corruption were running this newer level when the corruption was discovered. But one of the two users has recently upgraded, and we too believe it would help us narrow down when the corruption occurred if we get another hit.

The main reason we feel it likely isn't hardware is that each user hit the problem twice for a particular workload that was run at different times. In one case, it was the exact same step in a sequence of millions of data-creation steps where the compressed record became corrupted. This implies there is something about the data in that step that causes the issue. But it still may be how our software reacts to that particular job and step.

We ran the bit-flip test where we flipped the last 32 bits of the compressed block, with no success. Perhaps the length being used to find the last byte of the sequence chain is invalid? Is there an eye-catcher or pattern we could look at for the next block to see if it starts properly?
I looked over the thread again, and I'm still unclear on which version produced the offending blobs: was it v1.2.0 or v1.5.2? I do understand that you have experienced the decoder-side issue with both v1.5.2 and v1.5.6; I'm just clarifying what version the encoder was on. Apologies if this is answered somewhere in the thread and I simply missed it.
@embg, the version where we ran into the offending blobs was 1.5.2. After being unable to decompress the data with v1.5.2, we attempted to decompress with v1.2.0 and v1.5.6, but with no success.
We have an important update on the problem we are hitting.
This zero rework helped with user A's two occurrences, allowing both of them to be successfully decompressed. We also attempted to compress the uncompressed block using 1.5.2 and 1.5.6; both compress and decompress successfully using the zstd tool with the same compression level.

One question: the 32-bit application performing the compression uses a static libzstd.a library which is not multi-threaded, yet the application itself is multi-threaded.
Those are also my thoughts on reading your investigation results.
Even when the static library is not multi-threaded, it can still be used in a multi-threaded environment: the only restriction is that it's not possible to trigger multi-threaded compression of a single source, which is generally not a problem. What really matters is that a given context is never used by more than one thread at the same time. Given that there is only one compression event at a time ever possible in the above design, it certainly reduces risks.
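To illustrate the constraint, here is a sketch of the per-handle design described above, assuming each application handle owns its own context and the application serializes access to a handle (names are hypothetical):

```c
#include <stddef.h>
#include <zstd.h>

/* One compression context per application handle. A non-multithreaded
 * libzstd build is fine in a multi-threaded process as long as a given
 * ZSTD_CCtx is never used by two threads at the same time; giving each
 * handle its own context and serializing access per handle satisfies that. */
typedef struct {
    ZSTD_CCtx* cctx;   /* owned by this handle, reused across blocks */
    int        level;  /* compression level used for this handle */
} comp_handle;

static int handle_init(comp_handle* h, int level)
{
    h->cctx  = ZSTD_createCCtx();
    h->level = level;
    return h->cctx != NULL;
}

static size_t handle_compress(comp_handle* h,
                              void* dst, size_t dstCap,
                              const void* src, size_t srcSize)
{
    /* caller guarantees exclusive access to `h` for the duration of the call */
    return ZSTD_compressCCtx(h->cctx, dst, dstCap, src, srcSize, h->level);
}
```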
Here's an update of where we are: we are still unable to find the root cause of this issue, but we have made progress.

After further analysis, we have determined that the invalid compressed block with invalid offsets is in fact referencing matches from the previous, unrelated block that had used the same context. The invalid offset in the bad block does seem to have a pattern when compared to the previous block. For example, if the invalid offset is 12,288, the data it is attempting to reference is at absolute offset X+12288 within the previous block, where X is some multiple of 4KiB.

In our use case, we have a large contiguous buffer broken up into 4K indexed segments. Data up to 256KiB arrives into this buffer prior to compression, using one or more of the 4KiB segments. The location is dependent on where the previous buffer left off, rounded up to the next 4KiB boundary, plus 32 bytes of metadata not included in the compression payload. So, though back-to-back blocks are not exactly end-to-end, the end of the previous block and the start of the current block can be anywhere from 32 bytes to 4KiB apart. Or, if the end of the buffer is hit, the next block will arrive at the beginning, with an address lower than the previous block.

Each block is requested to be compressed using the same context, and this always occurs serially on the same thread. This can go on for millions of blocks. As of today, we do not call any explicit context reset functions between compression requests. Should we be calling such a function to reset the context?

Additional items we have noticed are that the decompression of the bad block is attempting to use a SplitLiteral, while if we compress the same block in a testbed, it does not use a SplitLiteral. Overall, it would appear we have some sort of race where the context is too sticky. We see that some days it works fine, while the next day the exact same blocks (content) hit the issue, making us believe it might have something to do with the location of the blocks within our large 4KiB-segmented buffer.
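For illustration, an explicit reset between independent blocks would look roughly like the sketch below (whether it is actually needed is exactly the question above); this assumes the advanced API, and keeps the configured parameters while dropping session history:

```c
#include <stddef.h>
#include <zstd.h>

/* Sketch: drop all session history (window, repcodes) before compressing the
 * next independent block, while keeping the parameters set on the context.
 * Note that one-shot calls such as ZSTD_compress2()/ZSTD_compressCCtx()
 * already begin a fresh session on each invocation. */
static size_t compress_independent_block(ZSTD_CCtx* cctx,
                                         void* dst, size_t dstCap,
                                         const void* src, size_t srcSize)
{
    size_t const err = ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
    if (ZSTD_isError(err)) return err;
    return ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
}
```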
What you are trying to achieve is unclear to me. Are you cutting some "large" input into independent blocks of 4 KB, compressed individually?
No, blocks up to 256KiB exist within the buffer on 4KiB boundaries. Each block will use one or more 4KiB segments as needed, and the entire block (up to 256KiB) is sent to the compression API as a single request. The details above are attempting to explain how it's a shared buffer, and blocks (up to 256KiB) are located in this shared buffer end to end on 4KiB address boundaries.
So, you are compressing independent inputs of up to 256KiB, each a multiple of 4KiB in size?
Yes, though the length isn't always divisible by 4KiB. We just round up to the next 4KiB boundary for the next block. So, the distance from the end of the previous block to the start of the next block is < 4KiB.
This part is much less clear to me. It seems you employ the word "block" to mean "independent inputs"?
Each block is an independent input of contiguous data up to 256KiB in size. Multiple blocks reside in a contiguous address/memory space on 4KiB boundaries. They are each independently compressed, sequentially. The context of the bad block is sometimes sticky and uses offsets into the previous block. Though each block (up to 256KiB in size) is requested to be compressed in a single request, the previous blocks likely still exist in valid memory space just ahead of the current block. It simply depends on timing and whether that portion of the buffer has been reused by the time we ask zstd to compress the next one.
When compression is unsuccessful, the resulting state of the context is undefined, and the only safe operations left on it are resetting or freeing it.

So, indeed, invoking a one-shot compression on a reused context starts a new compression session, with no memory of previous inputs. Therefore, it's weird that such a call could end up referencing data from a previous, unrelated block. Let's also be clear that this is a scenario for which the library is extensively tested, so if there is a bug here, it requires an unusual combination of conditions to trigger it.

Unfortunately, it's hard to tell more without access to a reproducible test case. As soon as a scenario can reliably reproduce the problem, we'll be able to analyze it and create a fix for it. One problem is that the bug is observed at decompression time, which can be much later than compression time, thus obscuring the conditions required to trigger it. One possibility would be to confirm each compression job by starting a validation decompression stage right after it, thus detecting the problem at the moment it's created.

But if the bug tends to be "random", meaning it mostly works but sometimes fails unpredictably at a very low rate, and no scenario can reliably fail or succeed, then this symptom can also be compatible with a hardware failure. This is rare, but not unheard of, and essentially guaranteed to happen once the fleet size becomes "large enough" (we are regularly confronted with hardware failure scenarios in our working environment, just due to its scale).
What compression level are you using? From previous code snippets, it seems like level 1, but just want to verify.
@AV-Coding does this mean you are able to reproduce the issue? If so, could you share the exact code you are using to reproduce? Given that, we may be able to reproduce without needing the data. If you can reproduce, you should try logging on this line: `lib/compress/zstd_compress.c`, line 4471 (commit 97291fc).
If this happens immediately before the bad block is produced, that will give us an indication of what is happening. It perhaps matches the symptom that it only happens rarely, given that overflow correction only happens every few GB of data processed. If there were a bug there, it would also match the symptom that the corrupted block has more literals, because the match finding might not work as expected.
But I don't think you should actually hit this case with your usage pattern; see `lib/compress/zstd_compress.c`, lines 1994 to 1997 (commit 97291fc).
I'm not convinced that this is causing the issue, but it is the only thing I can think of where the symptoms match.
If you can reproduce it, could you run your program under TSAN, ASAN, and MSAN? That will help rule out unrelated issues, like two cctxs being used by different threads simultaneously.
We have been hitting the issue since we updated to 1.5.2 from 1.2.0. We have not yet upgraded to a newer version for production-level code, given the process is not simple for us and we still don't know whether a newer version helps or not.

We reuse the same context over and over for multiple GBs of data. In a case where we hit the problem multiple times for the same user on different days, the cumulative amount of data used for the context would be the same or nearly the same at the point where the bad compression event occurs. But putting this much data through a context occurs often. Would the potential problem require a combination of things? Perhaps a particular pattern seen when the overflow event occurs? This would make more sense, given that the depth (cumulative amount of data through the context for multiple independent compression requests) and the data pattern for the bad block are the same for a user who has hit the issue multiple times.
Hello, we will be saving off the context before and after compression. We are wondering which fields and sub-fields are most important for us to include. We plan to add a log for the
Hello, attached below are logs of the context state before and after compression. Also, we would like to ask if it's expected that compression went down both of these paths:
I'm not sure if this is meaningful. Also: the "Overlapping" part suggests that the new input is overwriting a memory region which was so far part of the context history. It's not visible in the traces, though...
Just a reminder: the corrupted block is referencing a dictionary that is ahead of the start of the block. So, if the non-contig case is detected, should it also assume a new dictionary and flush any memory of a previous dictionary, including the overlap?

Each buf is a completely independent buffer, up to 256KiB in size, and shares the context from the previous buffer.
The surprising part is that the traces mention Yet, the trace also mentions That's rather unexpected. At this point, I don't know yet if this is a trace error, or a consequence of a problem that happened just before the trace. Debugging is unfortunately much harder without an ability to reproduce locally...
Attached below is a .txt file that is originally a C file (GitHub doesn't allow .c files to be attached). This will help explain what all the values from the log mean. We called the
Hi @Cyan4973, has anyone from the zstd team found any additional information as to what happened?
@AV-Coding a similar issue was reported in issue #4292, which was fixed by PR #4129. Given the data they shared, we were able to pretty strongly confirm that their issue was caused by that bug. Now that I understand the bug clearly, it seems likely that it was also causing your issue. Were you able to confirm that PR #4129 fixes your issue?
I'm going to close this issue, assuming that it is fixed by PR #4129, but please file a new issue if this is not the case.
Describe the bug
Decompression failed with an error of -20. We have run into the issue a total of four times: two unrelated cases, each hitting the issue twice for the same workload. In all four occurrences, the decompression fails due to an invalid sequence.offset. The offset is anywhere from a few bytes to a few KiB ahead of the virtualStart of the sequence. For one set, it's the very first sequence frame that is invalid.
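For context, the -20 is the signed view of the size_t error value returned by the decompression call; below is a small sketch of mapping it back to a symbolic name (code 20 is ZSTD_error_corruption_detected in zstd_errors.h; the header's install location may vary):

```c
#include <stdio.h>
#include <zstd.h>
#include <zstd_errors.h>   /* ZSTD_getErrorCode(), ZSTD_ErrorCode */

/* Decompression returns a size_t; on failure the value is (size_t)-code,
 * which reads as a small negative number when viewed as signed (-20 here).
 * These calls turn it back into a symbolic code and a readable name. */
static void report_zstd_error(size_t result)
{
    if (ZSTD_isError(result)) {
        ZSTD_ErrorCode const code = ZSTD_getErrorCode(result);  /* 20 => corruption detected */
        fprintf(stderr, "zstd error %d: %s\n", (int)code, ZSTD_getErrorName(result));
    }
}
```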
We have been using zstd compression for a number of years and hadn't seen this issue until we recently upgraded from v1.2.0 to v1.5.2. Looking through known issues, the only one that might be related is #3517, which is fixed in 1.5.5. We have attempted to decompress the sequence with the most recent version, including the pending 1.5.6, with the same result. The thought is that the compression itself produced an invalid sequence. We are unable to recreate the original pattern prior to compression, so we don't know if any recent fixes have improved the outcome.
We compile under AIX (big-endian). We have attempted decompression under Linux/x86 with the same outcome.
In one set, the two invalid offsets of the two corrupted sequences are divisible by 4K. One is 4KiB while the 2nd is 12KiB.
In the other set, the invalid offset is exactly the same for two different compressed blocks created by a common workload. Offset=239.
Working with the offset=239 cases (first sequence frame): the literals are copied, and then the match-copy can only come from the copied literals, since the output of the entire decompressed sequence starts with the copied literals. We adjusted the offset to be within this literal-copied region, which allows the entire block to decompress, but our CRC check then fails. The decompressed data after this sequence does appear valid, which suggests the first sequence may be the only corrupted part.
Due to the sensitivity of the data, we are unable to share the compressed block, but we can share debug logs.
Expected behavior
The expected behavior would have been for the block of data to decompress successfully, or for the compression that produced the block to have created a valid sequence.
Logs:
Here is a short snippet of the logs we have for this issue:
Additional context
If we did hit bug #3517, would this be the expected result? What would you expect the outcome of hitting it to be? Does the compression fail, or does it succeed, resulting in a corrupted compressed buffer (for example, an invalid offset)? If an invalid compressed block results, is it possible to remedy the issue by manipulating the compressed block?