Replication of bug #3517 #4015

AV-Coding · 2024-04-02T19:48:35Z

Describe the bug
Decompression failed with an error of -20. We have run into the issue a total of four times. Two unrelated cases, each hitting the issue twice for the same workload. In all four occurrences, the decompression fails due to an invalid sequence.offset. The offset is anywhere from a few bytes to KB ahead of the virtualStart of the sequence. For one set, it's the very first sequence frame that is invalid.

We have been using zSTD compression for a number of years and haven't seen this issue until we recently upgraded from v1.2.0 to v1.5.2. Looking through known issues, the only one that might be related is #3517 which is fixed in 1.5.5. We have attempted to decompress the sequence with the most recent level including the pending 1.5.6 with the same result. The thought is the compression itself produced an invalid sequence. We are unable to recreate the original pattern prior to compression, so we don't know if any recent fixes have improved the outcome.

We compile under AIX/Bigendian. We have attempted decompression under Linux/x86 with the same outcome.
In one set, the two invalid offsets of the two corrupted sequences are divisible by 4K. One is 4KiB while the 2nd is 12KiB.
In the other set, the invalid offset is exactly the same for two different compressed blocks created by a common workload. Offset=239.

Working with the offset=239 cases (first sequence frame), the literal's are copied and then the match-copy can only come from the copied literals since the output of the entire decompressed sequence starts with the copied literals. We adjusted the offset to be within this literal-copied region, which allows the entire block to decompress, but our CRC check then fails. The decompressed data after this sequence does appear valid, which would seem the first sequence may be the only corrupted part.

Due to the sensitivity of the data, we are unable to share the compressed block, but we can share debug logs.

Expected behavior
The expected output would have been that the block of data was able to be decompressed or the compression itself that produced the block created a valid sequence.

Logs:
Here is a short snippet of the logs we have for this issue:

decompress/zstd_decompress.c: ZSTD_createDCtx
ZSTD_createDCtx() was successful
decompress/zstd_decompress.c: ZSTD_decompressMultiFrame
decompress/zstd_decompress.c: reading magic number FD2FB528
decompress/zstd_decompress.c: ZSTD_decompressFrame (srcSize:52957)
decompress/zstd_decompress.c: ZSTD_getFrameHeader_advanced: minInputSize = 5, srcSize = 9
decompress/zstd_decompress_block.c: ZSTD_decompressBlock_internal (size : 6672)
decompress/zstd_decompress_block.c: ZSTD_decodeLiteralsBlock
decompress/zstd_decompress_block.c: ZSTD_decodeLiteralsBlock : cSize=3778, nbLiterals=4546
decompress/zstd_decompress_block.c: ZSTD_decodeSeqHeaders
decompress/zstd_decompress_block.c: ZSTD_decompressSequences
decompress/zstd_decompress_block.c: ZSTD_decompressSequences_body: nbSeq = 1271
decompress/zstd_decompress_block.c: ZSTD_initFseState : val=139 using 8 bits
decompress/zstd_decompress_block.c: ZSTD_initFseState : val=61 using 8 bits
decompress/zstd_decompress_block.c: ZSTD_initFseState : val=187 using 8 bits
decompress/zstd_decompress_block.c: seq: litL=24, matchL=5, offset=239
decompress/zstd_decompress_block.c:1017: ERROR!: check (sequence.offset > (size_t)(oLitEnd - virtualStart)) failed, returning ERROR(corruption_detected):
ZSTD_decompress() failed. rc=-20 (Data corruption detected).

Additional context
If we did hit bug #3517, would this be the expected result? What would you expect is the outcome of hitting it? Does the compression fail, or does it succeed resulting in a corrupted compressed buffer? For example, an invalid offset. If an invalid compressed block results, is it possible to remedy the issue through manipulating the compressed block?

The text was updated successfully, but these errors were encountered:

Cyan4973 · 2024-04-02T20:21:17Z

If I read it correctly, the case in #3517 requires :

Block Splitter is activated (i.e. high compression only)
It confuses specifically literals_length == 65536 with literals_lenth == 0 .
It uses a repcode in the same sequence, and because it incorrectly believes that literals_lenth == 0, it picks a wrong index.
It splits at this specific position

So that's a pretty hard combination of circumstances to generate, even on purpose.

Looking at your trace :
decompress/zstd_decompress_block.c: seq: litL=24, matchL=5, offset=239
the source issue seems different, because litL=24, so that's incompatible with the above scenario.

It still doesn't explain why the data is wrong.
Starting the first sequence with offset=239 looks obviously incorrect, though it has nothing to do with repcode.

It could happen if a dictionary was used to compress this data, in which case this long offset is searching for a match into the dictionary.
I presume no dictionary is expected to have been used to compress this data, in which case the next logical explanation is corruption.
And at this point, all bets are off.

If the software is the cause of this corruption, it should be reproducible, meaning that the same input with the same compression settings should result in the same corruption event. This can be difficult to reproduce though if you don't have the original data anymore, or if you don't know what were the compression settings. It would also give a simple mitigation strategy: just revert to older (presumed working) version v1.2.0, and see if the corruption events stop (assuming this is observable, which can be difficult for very rare events).
Unfortunately, for very rare corruption events, we also can't rule out hardware issues. It's a measurable thing when you have a large enough fleet of servers and data set to look after. It can happen at network, storage and cpu levels, so there's a broad range of options. That's also why checksum is so useful.

At this point, what would help is some watermarking, that would trace the origin system that produced the data, when and how. If all corruption events come from the same system, for example, it's a pretty strong indication. Unfortunately, that's easier said than done. If such watermark was not in place at the time of the corruption event, there is now very little decompression can do to investigate or fix the problem after the fact.

That being said, you also mention that checksum was enabled, so it can now be used as a "validator".
You also mention that, for the specific case reproduced above, you believe that most of the data is correct, it's only the offset of the first sequence which would be incorrect.
In which case, it seems one could "just" try all possible valid values of "offset", and see which one regenerates the "right" checksum (this requires a capability to manually apply the wanted sequences, which is not trivial).
There's no guarantee of success though, because the basic premise is that only this first offset value is wrong, but who knows.

terrelln · 2024-04-03T16:33:58Z

decompress/zstd_decompress_block.c: seq: litL=24, matchL=5, offset=239

This is super suspicious. Like Yann said, as long as a dictionary was used, this should be basically impossible for the compressor to generate. This corruption could not be caused by the issue fixed in #3517. I'd start by investigating bitflips. When we investigate issues like this internally, the root cause almost always ends up being bad hardware of some sort.

What is the order of magnitude number of compressions you are doing? Thousands, Millions, billions, or trillions? Are all the corrupt blobs compressed by the same host? Are you sending the data over the network unencrypted, where bitflips could happen?

I'd first start by looking for bitflips in the frame where the first sequence is corrupted. The first block is 6672 bytes, so the bits for the first offset will be stored right near the end of the block, since they are read in reverse. So I would basically start flipping bits from the end of the block, for say 20 bytes to be safe. And see if any bitflips cause the checksum to succeed.

Beyond that, I'll have to think a bit about how we can go about debugging this. Maybe revert to v1.2.0 temporarily to see if the issue goes away?

terrelln · 2024-04-03T16:46:40Z

If you think that the issue will eventually reproduce, you could add decompression verification directly after compression. Then when it fails, save both the original data, and the compressed data for debugging.

Then you can see if the issue reproduces, both on the same host, and on other hosts. This will rule out faulty hardware. If it is a deterministic issue that reproduces on another host, and you have the original input data, then I am 100% confident that we can work together to find & fix the issue, even though you can't share the data. At that point, just bisecting the issue to a commit would likely be enough to find the issue.

AV-Coding · 2024-04-03T17:25:22Z

Thank you @Cyan4973 and @terrelln for your response.

We use ZSTD to compress blocks of data up to 256KiB in size. At this point, there are billions of compressed blocks in the world. Since we updated, there are probably 100's of millions of blocks with the new version. We have seen two sets of occurrences of the error (both with the new version). Two independent hardware configurations in separate parts of the world that have no relationship to each other. Each hit the problem twice while running a unique workload. So, workload A created two hits on different days for user A and workload B created two hits on different days for user B.

We have CRC across the "decompressed" data. We will start working on a bit flip sandbox to see if we can get the data to recover. We also have neighboring compressed blocks that are similar in nature that do decompress. We will also work on adding a decompression-verification routine that can be enabled dynamically.

We also attempted to decompress the data with the older version with no success. If we continue to see occurrences, we may have to revert out of caution.

At this point, it's most likely a flaw in the newer ZSTD version we are now running or a flaw in our code that leverages the new version. It could perhaps still be hardware given both user A and B use the same brand/configuration of hardware components.

terrelln · 2024-04-03T17:36:55Z

If trying out bitflips doesn't work, I highly suggest running a decompression directly after the compression. Then log the original blob if it fails. If you have that blob, and can reproduce the issue & bisect, we will be able to fix the issue. It also has the additional property of validating that the blob is not corrupt.

In the meantime, I will do a bit of digging to see if I can think of anything that could cause this issue. I don't really expect to find anything without being able to reproduce it though.

I have a few questions to narrow my search:

Can you share exactly how you call zstd compression, including all the parameters you set, and all the APIs you use?
Can you print out these variables during the decompression? I will also put up a PR to add them to our debuglogs, for the future.

AV-Coding · 2024-04-03T18:20:01Z

    static ZSTD_CCtx *comp_context = NULL;

    if (comp_context == NULL) {
      // allocate a context to help speed up back to back compressions under the same thread
      comp_context = ZSTD_createCCtx();
      if (comp_context == NULL) {
        err("Failure to allocation ZSTD context\n");
        exit(-1);
      }
    }

    comp_param = 1;
    comp_size = ZSTD_compressCCtx(comp_context, out_buf_p, out_buf_rest, in_buf_p, in_len, comp_param);

zstd/lib//decompress/zstd_decompress_block.c:763: LLtype=2, OFtype=2, MLtype=2

sample.txt

terrelln · 2024-04-03T18:23:23Z

static ZSTD_CCtx *comp_context = NULL;

Are you using this code in a multi-threaded environment? ZSTD_CCtx are not thread-safe, and it looks like this would be using the same context in two different threads.

AV-Coding · 2024-04-03T18:33:34Z

It's multi-threaded, but it's always the same thread which calls the routine. What we posted above is also a streamlined version of what's actually being called. The context is saved per instance or handle and the handle is only accessed by a single thread at any given time.

terrelln · 2024-04-03T18:44:22Z

Is the system your compressing on 32-bit or 64-bit?

AV-Coding · 2024-04-03T18:49:26Z

Compressed on 32 bit, attempting decompression on both 32 and 64. 32bit big endian

embg · 2024-04-04T02:47:29Z

@AV-Coding, one note:

It could perhaps still be hardware given both user A and B use the same brand/configuration of hardware components.

Hardware corruption can happen with any pair of compression / decompression setups. Good machines can go bad over time as the hardware ages. Corruption can happen at any point from the compression machine, to various NICs and routers in between the hosts, to issues on the decompression machine.

In addition to @terrelln's suggestion (verify decompression immediately after compression), another way to rule out hardware issues is to add a checksum of the compressed data. If you discover a compressed blob where the compressed checksum matches, but the decompressed checksum fails, then you can rule out a large class of hardware errors.

AV-Coding · 2024-04-04T17:53:11Z

We agree. We have already completed that 2nd suggestion a couple years ago by adding an extra CRC over the compressed data. Not all users have that newer level. Unfortunately, neither of the cases that have corruption were running this newer level when the corruption was discovered. But, one of the two users has recently upgraded and we too believe it would help us narrow down when the corruption occurred if we get another hit.

The main reason we feel it likely isn't hardware is because each user hit the problem twice for a particular workload that was run at different times. In one case, it was the exact same step in a sequence of millions of data-creation steps where the compressed record became corrupted. This implies there is something about the data in that step that causes the issue. But, it still may be how our software reacts to that particular job and step.

We ran the bit flip test where we we flipped the last 32 bits of the compressed block with no success.
We also ran a test where we replaced the last three bytes with every value from 0x000000 to 0xFFFFFF. The decompression attempts made further progress for thousands of scenarios, getting as far as sequence number 220, but it still broke down. I'm assuming this means the corruption spans more than three bytes.

Perhaps the length being used to find the last byte of the sequence chain is invalid? Is there an eye catcher or pattern we could look at for the next block to see if it starts properly?

embg · 2024-04-07T04:45:08Z

I looked over the thread again, and I'm still unclear on which version produced the offending blobs: was it v1.2.0 or v1.5.2? I do understand that you have experienced the decoder-side issue with both v1.5.2 and v1.5.6, just clarifying what version the encoder was on.

Apologies if this is answered somewhere in the thread, and I simply missed it.

AV-Coding · 2024-04-08T18:34:54Z

@embg, the version where we ran into the offending blobs was at 1.5.2. After being unable to decompress the data with v1.5.2, we attempted to decompress with v1.2.0 and v1.5.6, but with no success

AV-Coding · 2024-04-08T22:43:00Z

We have an important update of the problem we are hitting.
Similar blocks for the same workload have a large amount of zero data, so we were thinking perhaps these invalid offsets are referencing a zero match.
As shown in this log, we replaced invalid offset match data with zeros nine times for the compressed 256KiB block.
Not only does it decompress correctly, but the uncompressed CRC matches.

zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 24) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 45) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 239 (limit of 70) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 954 (limit of 83) with 33 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 3663 (limit of 1769) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1775) with 4 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1782) with 5 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1799) with 9 zeros
zstd_new/zstd/lib//decompress/zstd_decompress_block.c:1325: Replaced bad offset 68470 (limit of 1809) with 5 zeros

This zero rework helped with user A's two occurrences allowing both of them to be successfully decompressed.
But, it did not work for user B. We attempted to replace the user B's match source with values from 0x00-0xFF with no luck.
The mystery match reference for user B's cases must be a pattern that isn't single byte repeating.

We also attempted to compress the uncompressed block using 1.5.2 and 1.5.6. Both compress and decompress successfully using the zstd tool with the same compression level.
They also compress significantly more. For example, the bad compressed block was about 33KiB in size while the test compressed block is about 10KiB.
We are wondering if the 33KiB block has an excess set of literals.
We are thinking it may be some form of context corruption or invalid context reset, like it's using a dictionary from a previous block.

One question. The 32bit application performing the compression uses a static libzstd.a library which is not multi-threaded yet the application itself is multi-threaded.
But, as mentioned before, the thread which does call the compression function is always the same thread. Though other threads exist, they do not call the ZSTD functions.
Do you think this may be a problem?

Cyan4973 · 2024-04-09T04:36:23Z

We are thinking it may be some form of context corruption or invalid context reset, like it's using a dictionary from a previous block.

It's also my thoughts on reading your investigation results.

The 32bit application performing the compression uses a static libzstd.a library which is not multi-threaded yet the application itself is multi-threaded.
But, as mentioned before, the thread which does call the compression function is always the same thread. Though other threads exist, they do not call the ZSTD functions.

Even when the static library is not multi-threaded, it can still be used in multi-threaded environment: the only restriction is that it's not possible to trigger multi-threaded compression of a single source, which is generally not a problem.

What really matters is that a given ZSTD_CCtx* compression context is only used to compress one session at a time.

Given that there is only one compression event at a time ever possible in the above design, it certainly reduces risks.
It's still possible to mis-manage the sharing of context between 2 consecutive sessions (assuming a compression context is long-lived across multiple sessions). But the problem would be the same for v1.2.0 or v1.5.2, and it's strange that the issue only starts appearing after the update to v1.5.2. Assuming it's not just bad coincidental timing.

AV-Coding · 2024-04-29T18:59:06Z

Here's an update of where we are:

We are still unable to find root cause of this issue, but we have made progress. After further analysis, we have determined that the invalid compressed block with invalid offsets is in fact referencing matches from the previous unrelated block that had used the same context. The invalid offset in the bad block does seem to have a pattern when compared to the previous block. For example, if the invalid offset is 12,288, the data it is attempting to reference is at absolute offset X+12288 within the previous block where X is some multiple of 4KiB.

In our use case, we have a large contiguous buffer broken up into 4K indexed segments. Data up to 256KiB arrives into this buffer prior to compression using one or more of the 4KiB segments. The location is dependent on where the previous buffer left off, rounded up to the next 4KiB boundary plus 32 bytes of meta-data not included in the compression payload. So, though back to back blocks are not exactly end-to-end, the ending of the previous block and start of the current block can be anywhere from 32 bytes to 4KiB of each other. Or, if the end of buffer is hit, the next block will arrive at the beginning with an address lower than the previous block. Each block is requested to be compressed using the same context and always occurs serially using the same thread. This can go on for millions of blocks.

As of today, we do not call any explicit context reset functions between compression requests. Should we be calling such a function to reset the context?

Additional items we have noticed is that the decompression of the bad block is attempting to use a SplitLiteral, while if we compress the same block in a testbed, it does not use a SplitLiteral.
We also see references to DEBUGLOG(6, "invalidating dictionary for current block (distance > windowSize)"); within the good testbed compression. There is a recent change in this area under Fix #3102.
if (blockEndIdx > loadedDictEnd + maxDist [ ADDED || loadedDictEnd != window->dictLimit])
Perhaps it may be related?

Overall, it would appear we have some sort of race where the context is too sticky. We see that some days it works fine while the next day the exact same blocks (content) hit the issue, making us believe it might have something to do with the location of the blocks within our large 4KiB segmented buffer.

Cyan4973 · 2024-04-29T19:05:29Z

What you are trying to achieve is unclear to me.

Are you cutting some "large" input into independent blocks of 4 KB, compressed individually ?

AV-Coding · 2024-04-29T19:16:47Z

No, blocks up to 256KiB exist within the buffer on 4KiB boundaries. Each block will use one or more 4KiB blocks as needed and the entire block (up to 256KiB) is sent to the compression API as a single request.

The details above are attempting to explain how it's a shared buffer and blocks (up to 256KiB) are located in this shared buffer end to end on 4KiB address boundaries.

Cyan4973 · 2024-04-29T19:21:01Z

So, you are compressing independent inputs of up to 256 KB, and sizes are always multiple of 4 KB ?
And each inputs is presented as a single continuous buffer ?
And the buffer itself is shared across (single-threaded) compression sessions ?

AV-Coding · 2024-04-29T19:39:38Z

Yes, though the length isn't always divisible by 4KiB. We just round up to the next 4KiB boundary for the next block. So, the end of the previous block to the start of the next block is < 4KiB.

Cyan4973 · 2024-04-29T19:43:24Z

This part is much less clear to me.

It seems you employ the word "block" to mean "independent inputs" ?

AV-Coding · 2024-04-29T20:05:03Z

Each block is an independent input of contiguous data up to 256KiB in size. Multiple blocks reside in a contiguous address/memory space on 4KiB boundaries. They are each independently compressed sequentially. The context of the bad block is sometimes sticky and uses offsets into the previous block. Though each block (up to 256KiB in size) is requested to be compressed in a single request, the previous blocks likely still exist in valid memory space just ahead of the current block. It simply depends on timing and whether that portion of the buffer has been reused by the time we ask ZSTD to compress the next one.
[ block 1 ] [pad] [block 2] [pad] [block 3] [pad] [block 4] .... and so on.
ZSTD_compressCCTx( contextA. out_buf, out_buf_len, &block1, length_block1);
// move out_buf contents elsewhere
// Update the buffer state to state the block 1 space can be reused. (4K to 256KiB of space).
ZSTD_compressCCTx( contextA, out_buf, out_buf_len, &block2, length_block2):
// move out_buf contents elsewhere
// Update the buffer state to state the block 1 space can be reused. (4K to 256KiB of space).
ZSTD_compressCCTx( contextA, out_buf, out_buf_len, &block3, length_block3); <----- bad compressed block that compresses half as well as it should and references data via invalid offsets from block2.

Cyan4973 · 2024-04-29T21:02:15Z

ZSTD_compressCCtx() is designed to be a fully complete operation,
meaning it will reset the compression state even if it was used in a previously unfinished operation,
and it will properly close the operation when successful.

When compression is unsuccessful, the resulting state of contextA is less clear, so it should be reset.
But just invoking ZSTD_compressCCtx() should be enough to clear the state and render it valid for the new compression job, because it starts with a reset operation.

So, indeed, invoking ZSTD_compressCCtx() repetitively on a set of input / output buffers, even overlapping ones, should be successful. It actually doesn't matter if these buffers are separate, or contiguous, or the same, or partially overlapping. All the guarantees still apply.

Therefore, it's weird that a call to ZSTD_compressCCtx() would reference data from a previous input completed using the same contextA state. If that's the case, there's something seriously wrong there.

Let's also be clear that this is a scenario for which ZSTD_compressCCtx() is already heavily tested, in multiple CI environments, and so far it has been working fine. So it would be surprising if some blatant lingering bug would still be present there.

Unfortunately, it's hard to tell more without access to a reproducible test case. As soon as a scenario can reliably reproduce the problem, we'll be able to analyze it and create a fix for it.

One problem is that the bug is observed at decompression time, which can be much later than compression time, thus obscuring the conditions required to trigger it. One possibility would be to confirm each compression job by starting a validation decompression stage right after it, thus detecting the problem at the moment it's created.
Also, since the bug seems related to a sequence of compressions, it would be useful to log all compression events, as a way to rebuilt the history and re-create the same sequence of compressions when the bug is detected.
Thanks to these traces and immediate corruption detection, it should be possible to recreate the exact same scenario, and observe if it fails reliably or not.

But if the bug tends to be "random", meaning it mostly works, but sometimes it fails unpredictably at a very low rate, and no scenario can reliably fail or succeed, then this symptom can also be compatible with a hardware failure. This is rare, but not unheard of, and essentially guaranteed to happen once the fleet size becomes "large enough" (we are regularly confronted to hardware failure scenarios in our working environment, just due to its scale).

terrelln · 2024-05-07T14:47:06Z

What compression level are you using? From previous code snippets, it seems like level 1, but just want to verify.

We are still unable to find root cause of this issue, but we have made progress. After further analysis, we have determined that the invalid compressed block with invalid offsets is in fact referencing matches from the previous unrelated block that had used the same context. The invalid offset in the bad block does seem to have a pattern when compared to the previous block. For example, if the invalid offset is 12,288, the data it is attempting to reference is at absolute offset X+12288 within the previous block where X is some multiple of 4KiB.

@AV-Coding does this mean you are able to reproduce the issue? If so, could you share the exact code you are using to reproduce? Given that, we may be able to reproduce without needing the data.

If you can reproduce, you should try logging on this line:

zstd/lib/compress/zstd_compress.c

Line 4471 in 97291fc

    
           U32 const correction = ZSTD_window_correctOverflow(&ms->window, cycleLog, maxDist, ip);

If this happens immediately before the bad block is produced, that will give us an indication of what is happening. It perhaps matches the symptoms that it only happens rarely, given that overflow correction only happens every few GB of data processed. If there were a bug there, it would also match the symptom of the corrupted block has more literals, because the match finding might not work as expected.

terrelln · 2024-05-07T15:13:57Z

But, I don't think you should actually hit this case if you are using v1.5.6, because when we get close to needing to overflow correct, we opt to just reset the indices instead, because this branch will be called first. So you could also add a log line inside this branch, and see if it happens right before a corruption.

zstd/lib/compress/zstd_compress.c

Lines 1994 to 1997 in 97291fc

    
           if (forceResetIndex == ZSTDirp_reset) { 
        
               ZSTD_window_init(&ms->window); 
        
               ZSTD_cwksp_mark_tables_dirty(ws); 
        
           }

I'm not convinced that this is causing the issue, but it is the only thing I can think of where the symptoms match.

terrelln · 2024-05-07T16:01:15Z

If you can reproduce it, could you run your program under TSAN, ASAN, and MSAN? That will help rule out unrelated issues like 2 cctx's being used by different threads simultaneously.

AV-Coding · 2024-05-07T17:32:48Z

We have been hitting the issue once we updated to 1.5.2 from 1.2.0. We have not yet upgraded to a newer version for production level code given the process is not simple for us and we still don't know if a newer version helps or not. We reuse the same context over and over for multiple GB's of data. In a case where we hit the problem multiple times for the same user on different days, the cumulative amount of data used for the context would be the same or nearly the same at the point where the bad compression event occurs. But, putting this much data through a context occurs often. Would the potential problem require a combination of things? Perhaps a particular pattern seen when the overflow event occurs? This would make more sense given the depth (cumulative amount of data through the context for multiple independent compression requests) and data pattern for the bad block are the same for a user who has hit the issue multiple times.

AV-Coding · 2024-05-09T20:53:15Z

Hello, we will be saving off the context before and after compression. We are wondering what fields and sub-fields are most important that we should include? We plan to add a log for the correction value pasted above.

AV-Coding · 2024-07-17T21:46:23Z

Hello, attached below are logs of the context state before and after compression. pre_comp_ctxt_and_comp.log has the logged context state and the end of the compression logs. post_comp_ctxt.log just logs the context state after we performed compression. These logs are the result of a block of data that was incorrectly compressed.
post_comp_ctxt.log
pre_comp_ctxt_and_comp.log

Also, we would like to ask if it's expected that compression went down both of these paths:

ZSTD_compress_insertDictionary (dictSize=0)
ZSTD_compressContinue_internal, stage: 1, srcSize: 262144
ZSTD_writeFrameHeader : dictIDFlag : 1 ; dictID : 0 ; dictIDSizeCode : 0
ZSTD_window_update
Non contiguous blocks, new segment starts at 2147352578 <-----------------------------
Overlapping extDict and input : new lowLimit = 2147610626 <---------------------------
ZSTD_compressContinue_internal (blockSize=131072)
ZSTD_compress_frameChunk (blockSize=131072)
ZSTD_checkDictValidity: blockEndIdx=2147483650, maxDist=262144, loadedDictEnd=0

Cyan4973 · 2024-07-17T23:25:19Z

I'm not sure if this is meaningful,
but I note that these index values are crossing the 2 GB quantity (2147483648).

Also :

ZSTD_window_update
Non contiguous blocks
Overlapping extDict and input

the "Overlapping" part suggests that the new input is overwriting a memory region which was so far part of the context history.
This can happen in rolling buffer designs, and that's why it's tested.
That being said, since the cctx has its own internal buffer, it should not matter,
unless some advanced parameter, such as ZSTD_c_stableInBuffer, was in use.

It's not visible in the traces though...

AV-Coding · 2024-07-18T01:50:36Z

Just a reminder, the corrupted block is referencing a dictionary that is ahead of the start of the block. So, if the non-contig is detected, should it also assume a new dictionary and flush any memory of a previous dictionary, including overlap.

[buf0...][non-referenced-pad][buf1......][nrp][buf2..][nrp][buf3.....][nrp]...[bufN]   eventually loop around and start at the beginning again.

If the timing is correct, buf1 can be where buf0 is if buf0 is processed completely before buf1 arrives.
So the wrap occurs either because the buffer is empty or we hit the end and the previous blocks have been completely handled.

Each buf is a completely independent buffer up to 256KiB in size and shares the context from the previous buffer.

Cyan4973 · 2024-07-18T02:05:41Z

The surprising part is that the traces mention ZSTD_compressCCtx(),
therefore I expect each block to be completely independent.

Yet, the trace also mentions Overlapping extDict and input,
as if the compression was starting from "some history",
like an external dictionary.

That's rather unexpected. At this point, I don't know yet if this is a trace error, or a consequence of a problem that happened just before the trace.

Debugging is unfortunately much harder without an ability to reproduce locally...

AV-Coding · 2024-07-19T22:20:58Z

Attached below is a txt file that is originally a C file(github doesn't allow .c files to be dropped). This will help explain what all the values mean from the log.

We called the ZSTD_CCtx_saveContext function just prior to starting the compression thinking there may be some lingering information in the context that is throwing off the next compression request.

zstd_xyz_ctxt.txt

AV-Coding · 2024-07-25T19:46:25Z

Hi @Cyan4973 , has anyone from zstd team found any additional information as to what happened?

terrelln · 2025-02-12T23:15:44Z

@AV-Coding a similar issue was reported in Issue #4292, which was fixed by PR #4129. Given the data they shared, we were able to pretty strongly confirm that their issue was caused by that bug. Now that I understand the bug clearly, it seems likely that it was also causing your issue.

Were you able to confirm that PR #4129 fixes your issue?

terrelln · 2025-02-18T16:08:08Z

I'm going to close this issue, assuming that it is fixed by PR #4129, but please file a new issue if this is not the case.

terrelln mentioned this issue Feb 11, 2025

Data corruption on ARM32 with dictionary #4292

Closed

terrelln closed this as completed Feb 18, 2025

Replication of bug #3517 #4015

Replication of bug #3517 #4015

Comments

AV-Coding commented Apr 2, 2024

Cyan4973 commented Apr 2, 2024 • edited Loading

terrelln commented Apr 3, 2024

terrelln commented Apr 3, 2024

AV-Coding commented Apr 3, 2024

terrelln commented Apr 3, 2024

AV-Coding commented Apr 3, 2024 • edited Loading

terrelln commented Apr 3, 2024

AV-Coding commented Apr 3, 2024

terrelln commented Apr 3, 2024

AV-Coding commented Apr 3, 2024

embg commented Apr 4, 2024 • edited Loading

AV-Coding commented Apr 4, 2024

embg commented Apr 7, 2024

AV-Coding commented Apr 8, 2024

AV-Coding commented Apr 8, 2024

Cyan4973 commented Apr 9, 2024

AV-Coding commented Apr 29, 2024

Cyan4973 commented Apr 29, 2024

AV-Coding commented Apr 29, 2024

Cyan4973 commented Apr 29, 2024

AV-Coding commented Apr 29, 2024

Cyan4973 commented Apr 29, 2024

AV-Coding commented Apr 29, 2024

Cyan4973 commented Apr 29, 2024

terrelln commented May 7, 2024 • edited Loading

terrelln commented May 7, 2024 • edited Loading

terrelln commented May 7, 2024

AV-Coding commented May 7, 2024

AV-Coding commented May 9, 2024

AV-Coding commented Jul 17, 2024 • edited Loading

Cyan4973 commented Jul 17, 2024

AV-Coding commented Jul 18, 2024 • edited Loading

Cyan4973 commented Jul 18, 2024 • edited Loading

AV-Coding commented Jul 19, 2024 • edited Loading

AV-Coding commented Jul 25, 2024

terrelln commented Feb 12, 2025

terrelln commented Feb 18, 2025

Cyan4973 commented Apr 2, 2024 •

edited

Loading

AV-Coding commented Apr 3, 2024 •

edited

Loading

embg commented Apr 4, 2024 •

edited

Loading

terrelln commented May 7, 2024 •

edited

Loading

terrelln commented May 7, 2024 •

edited

Loading

AV-Coding commented Jul 17, 2024 •

edited

Loading

AV-Coding commented Jul 18, 2024 •

edited

Loading

Cyan4973 commented Jul 18, 2024 •

edited

Loading

AV-Coding commented Jul 19, 2024 •

edited

Loading