Data corruption on ARM32 with dictionary #4292
Comments
Difficult to say. If this is neither of those, i.e. it happens rarely but on multiple systems, and there is no reproduction case, i.e. the data detected as a failure doesn't fail on retry, then it's a very difficult case to investigate. |
I would also add that level 22 is probably not a great idea for 32-bit, as this mode can consume considerable resources, making it difficult for the rest of the system to continue operating. |
It has happened about 10 times so far, always on a different device, out of a fleet of 3000 devices. As compression is limited to about 3KB/s, it doesn't exhaust resources. My hope is that the way the pattern repeats, and the particular spots where it got corrupted, could point to some likely bug. |
Is the context used once, or reused for many compressions?
That would be great! |
Please find the dict here: https://raw.githubusercontent.com/nunojpg/zstd/06dbfd5d1edc4721c292e6e5f24d11b63549decb/240611.bin I use the context multiple times. For example in this case the context was 7 days old and had done about 1 million operations before. |
Thanks @nunojpg! I will try to reproduce this issue on my Raspberry Pi |
@terrelln just to confirm: this only happens very rarely. Normally zstd will compress this exact data correctly. |
This seems like it could be related to #4015, which was also rarely happening on 32-bit machines. In that case we didn't have access to the data, so it was hard to debug. But I'm hopeful that with this example, we may be able to make some progress. |
Great. I have a few more cases available if you need them. |
If you add 197 to every offset that points into the dictionary, it decompresses correctly. Now to figure out why.
diff --git a/lib/decompress/zstd_decompress_block.c b/lib/decompress/zstd_decompress_block.c
index ca5044376..c2121b370 100644
--- a/lib/decompress/zstd_decompress_block.c
+++ b/lib/decompress/zstd_decompress_block.c
@@ -928,6 +928,8 @@ size_t ZSTD_execSequenceEnd(BYTE* op,
/* copy Match */
if (sequence.offset > (size_t)(oLitEnd - prefixStart)) {
+ sequence.offset += 197;
+ match -= 197;
/* offset beyond prefix */
RETURN_ERROR_IF(sequence.offset > (size_t)(oLitEnd - virtualStart), corruption_detected, "");
match = dictEnd - (prefixStart - match);
@@ -977,6 +979,8 @@ size_t ZSTD_execSequenceEndSplitLitBuffer(BYTE* op,
/* copy Match */
if (sequence.offset > (size_t)(oLitEnd - prefixStart)) {
+ sequence.offset += 197;
+ match -= 197;
/* offset beyond prefix */
RETURN_ERROR_IF(sequence.offset > (size_t)(oLitEnd - virtualStart), corruption_detected, "");
match = dictEnd - (prefixStart - match);
@@ -1022,11 +1026,7 @@ size_t ZSTD_execSequence(BYTE* op,
* - Match end is within WILDCOPY_OVERLIMIT of oend
* - 32-bit mode and the match length overflows
*/
- if (UNLIKELY(
- iLitEnd > litLimit ||
- oMatchEnd > oend_w ||
- (MEM_32bits() && (size_t)(oend - op) < sequenceLength + WILDCOPY_OVERLENGTH)))
- return ZSTD_execSequenceEnd(op, oend, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
+ return ZSTD_execSequenceEnd(op, oend, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
/* Assumptions (everything else goes into ZSTD_execSequenceEnd()) */
assert(op <= oLitEnd /* No overflow */);
@@ -1115,11 +1115,7 @@ size_t ZSTD_execSequenceSplitLitBuffer(BYTE* op,
* - Match end is within WILDCOPY_OVERLIMIT of oend
* - 32-bit mode and the match length overflows
*/
- if (UNLIKELY(
- iLitEnd > litLimit ||
- oMatchEnd > oend_w ||
- (MEM_32bits() && (size_t)(oend - op) < sequenceLength + WILDCOPY_OVERLENGTH)))
- return ZSTD_execSequenceEndSplitLitBuffer(op, oend, oend_w, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
+ return ZSTD_execSequenceEndSplitLitBuffer(op, oend, oend_w, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
/* Assumptions (everything else goes into ZSTD_execSequenceEnd()) */
assert(op <= oLitEnd /* No overflow */); |
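To make the role of the constant concrete, here is a minimal sketch of how an offset that reaches past the already decoded output is mapped into the dictionary, following the `match = dictEnd - (prefixStart - match)` line in the diff above. The type `match_pos` and the function `locate_match` are hypothetical names for illustration and are not part of zstd; the sketch also assumes the dictionary sits directly before the decoded output in the logical window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration type, not a zstd structure. */
typedef struct {
    int in_dict;   /* 1 if the match starts inside the dictionary */
    size_t index;  /* byte index inside the dictionary or the output */
} match_pos;

/* Hypothetical helper: `produced` is the number of bytes already decoded
 * (oLitEnd - prefixStart in the diff), `dict_size` the dictionary length. */
static match_pos locate_match(size_t offset, size_t produced, size_t dict_size)
{
    match_pos pos;
    /* An offset past the start of the dictionary is a corrupt stream. */
    assert(offset >= 1 && offset <= produced + dict_size);
    if (offset > produced) {
        /* Offset beyond the prefix: count back from the dictionary end. */
        pos.in_dict = 1;
        pos.index = dict_size - (offset - produced);
    } else {
        /* Ordinary match inside the already decoded output. */
        pos.in_dict = 0;
        pos.index = produced - offset;
    }
    return pos;
}
```

With 100 bytes decoded and a 1000-byte dictionary, an offset of 300 lands at dictionary index 800, while 300 + 197 lands at index 603. An encoder that writes every dictionary offset too small by a fixed constant therefore makes the decoder copy from a position that is consistently that many bytes too late in the dictionary, which matches the symptom of mostly correct output with wrong bytes at the dictionary matches.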
@nunojpg if you could share another example or two that would be great! I likely have enough to go on here, but more datapoints would be useful to help narrow things down quicker. |
original |
original |
I don't have any more at the moment, we are harvesting them at about 3 per week. |
Two should be enough! Thanks @nunojpg! I feel confident we'll be able to figure this out quickly with these examples. |
This second example decompresses correctly if you add |
Theory:
Evidence:
Not sure how this could happen yet though. |
Sorry if this is dumb: since this doesn't happen with a clean context, does it point to some uninitialized variable that, by chance, gets set to a very unlikely value by the previous compression? |
Or a fuzzer for the context. |
@nunojpg I don't think there are uninitialized variables. I suspect some of the code that resets the context may be buggy in some very specific scenarios, that have something to do with either the amount of data processed, or the location of the buffers in memory. Could you describe how you're using these compression contexts in a bit more detail? It would be useful in trying to understand what kind of conditions could trigger this. A few specific questions:
|
Sure.
Always the same. Init at the beginning; from that point on I only call ZSTD_compress2 and ZSTD_isError.
Minimum about 10 bytes, maximum about 1480 bytes.
Input: address in the heap, always the same |
There are two ZSTD_CCtx in the same process. The other ZSTD_CCtx compresses blocks of about 5MB and runs in a thread, so there could be concurrency issues if there were static variables in the library. |
Zstd does not use any static variables. |
I suspect that PR #4129 fixes this issue. But, now that I have a concrete target, I'm going to attempt to repro the issue, so I can prove that it is the culprit. |
Oh, I forgot #4129! Good memory, @terrelln! It's indeed a very good candidate for this issue description. I presume @nunojpg's application is using the |
@nunojpg I've reproduced exactly the same symptoms locally on
Reproduce
This command will reproduce a corrupted file in
The corrupted file shows exactly the same symptom, where offsets into the dictionary are shifted by some fixed constant. In this case every offset into the dictionary is 200 lower than it should be.
Issue
This code gets triggered when the input buffer overlaps with the previous input buffer. When the indices are right on the 2GB boundary, the comparison on line 1278 can be wrong. If Then we end up in a case where After this, we mis-compute the offsets into the dictionary, and they are shifted by
zstd/lib/compress/zstd_compress_internal.h Lines 1275 to 1281 in 794ea1b
Testing
This code is tricky to test because it requires processing at least 2GB of input. It's also very sensitive to the physical location in memory where the buffers are. This logic is required due to the buffer-less streaming API. Once we're able to delete this API, I'd suggest drastically simplifying this code. I suggest we add a fuzzer for |
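As a rough illustration of how a comparison can go wrong right at the 2GB boundary on a 32-bit target, the sketch below clamps an index to a limit, once with a signed 32-bit comparison and once with an unsigned one. The expression on line 1278 is not quoted in this thread, so the signed comparison is an assumption, and `as_signed32`, `clamp_signed` and `clamp_unsigned` are hypothetical names, not zstd functions.

```c
#include <stdint.h>

/* Reinterpret a 32-bit value as two's-complement signed, i.e. the value a
 * 32-bit signed index type would hold. Done in 64-bit arithmetic so the
 * sketch behaves the same on every platform. */
static int64_t as_signed32(uint32_t v)
{
    return (v >= 0x80000000u) ? (int64_t)v - 4294967296LL : (int64_t)v;
}

/* Intended result: min(high_idx, dict_limit), compared as signed 32-bit. */
static uint32_t clamp_signed(uint32_t high_idx, uint32_t dict_limit)
{
    return (as_signed32(high_idx) > as_signed32(dict_limit)) ? dict_limit
                                                             : high_idx;
}

/* Same clamp with an unsigned comparison. */
static uint32_t clamp_unsigned(uint32_t high_idx, uint32_t dict_limit)
{
    return (high_idx > dict_limit) ? dict_limit : high_idx;
}
```

With `dict_limit = 0x7FFFFF00` and `high_idx = 0x800000C8`, the signed version treats the index as negative, skips the clamp and returns `0x800000C8`, while the unsigned version returns `0x7FFFFF00`. Far from the boundary both versions agree, which fits a bug that only shows up after a context has processed a large amount of data.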
Great work, would you like to close this issue? |
@nunojpg Thanks for reporting this with the detailed examples, it made it possible to debug it! I'll go ahead and close this issue, but if you find that it is still occurring after PR #4129, please let us know! It is very likely that is the bug causing the issue, but it is impossible for us to know for sure. |
I get corrupt output on compression about once per 5 billion operations (5e9).
I am using zstd-1.5.6 on ARM32.
I init the ctx with:
I compress with ZSTD_compress2.
On ARM64 machines I have no issues. But on ARM32 I very rarely get a corrupt output.
There are interesting patterns in the corruption, so I am sharing one case. Most of the decompressed data is correct, but it becomes invalid at some repeating patterns.
This is all done locally, but of course I can't assume that there are no bit flips in hardware. This happens on both a Beaglebone and a Raspberry Pi A.
Testing any patch takes me about 6 months, since the error rate is so low and I can only distribute the patch to a limited set of devices.
I can provide the dictionary.
I want to get an idea of whether this is a ZSTD bug or a memory bit flip.
original ff75451d6e26370b4181027866a13515e6575b8d2011d0a4301a004f16818d5fb5a5a53038894f0d993c05838db53ea31f995e1624bd58670386024cd0a4158de141b20263c3aab692e140f98d7fa9a99f7078873b15991c618220a80dc0b71300cf687c8a8db91facce6872194483586a1c878d4021abeed3ef9df05b583a3c8a2048a7a8b30d004d67868d838aa495d773b687b9580024830205b612e60538828d008d74b08125159944128d5d3e06a87c19830246f3ad368505b72f5c858d694bc09904c092e50c99572a8220fc48a4b815007a2d818da3f9a892084813250a997a3d858d00000000003d06e116678b5d4294a83b49858dec10ad2b09e40984a5582452835d0e0249835dc2a2a3d333498e02d950a63690e156568e8df2b4a7a8add93a5017584a5f818d10b84a00020021f8160c8520003515003d02810223daad918fe12a3e86020e3893e166728302df36a33f96e117418b8db218a4c6f6bd7f974f585f4b1c8b02c338a4b995e12bd38d4c63a0a90450836115997d7d825df391a0172b858daac5ac8850788f1609990f5c875dc232ad742e85201320040069688b8de3cfa7900c58274421990b108902083410e6bd6e5a898dee27a355dd5946c7615855c02012981400017d858de400a6924c5002569599545d84201ce3a5131b004918948deec8ad9134582aed1099075f86025819a53696e13e13828d0c085c0110b93aea6b24818dc635a5b9fcc877805f5817081b900243cfa2b490e115428f5d2a23a05213855dc9acad6b788c8d0dbc5a00062013f85f25818d49f9ab903cb8170a15995c6a818d1e085c0158b80fea1f679302103095e160b88d018504700a3312990d275a838d811aa11dd5e913f78b583b76818d1d9504789f301499567d8a8d8d12ab07bacc33839b58527c898dbf38ad020d6f2db057585e72828d13e9f561c4072250171b908d1eb84a00020023f847538102293f91e14c61828d1b9574a8257b0c996e71b4021f1c8cc13626848d48d6a583efbe70506d583557848dbff5a5f000e03c87cd587253855de772a57744828d13090c200a21149945038a8d1583a48a4c500c351599371781026840ac3f9be16300878d2b978c50b03d1499700243858d11b84a00020033f8672e820221139be132f320093f0b00191c810219b995e10c76828d597bc0840c1898320d9910478202ee85acb810200472845d45bcac1e4c83022d9899e14229d9021e108fe1740d8702d641a41897e12659812018b809000f9d5d2d476e86021f1c8cc16e1d875d03239902a5ccad1b8ac16d1f8602351b8ac18c7e06868d16382a5dcd167d583671828d2a8e04d8160a0d9978628a02f882a89894e165148902b6d9abb391e1377e8f200b18170
0302a828d2ab84900020023f8335f858d1d00000000af19e15627858d5907aa000000000206e1587161848d26c10e6a1387af584be0020ab38de10d4b8502129894e15dcc8dc62aa2885f0160983eea086d8102b294a6188dc160d08d32cf2c130b8787583a3e845d606ac03319830204158de1034e13818d47e2a0ce6a1530a00f593f5d8d8d8c11849b0410875512991d4c825d0251258d022a3893e14c258120103015006074878d14de0f4d0a6783587617865d1f71e38d03a13c7802521599817c1081808482a1306ff6cd80b9581897e12651898d191630e74c90af58151a83021b3c8ee1711f918d2db84a00060023f8239f021b3c8ee13d0881023d1099e12c5d828d06b84a00020023f8114a8802402ba73898e1107900818d1024633d5503ab586a0e8f0220b996e15b33838d3d0524476400c9586064835d323d31810229f1a23693e1067a868d0d10739d4b602b596337888d40200882395364214a5d8302341897e105614b835d691ca41f7f83022d9899e1645e8702f877aab0120104378402369894e12653828d33c9a25739008759
compressed 28b52ffd0008bd28006b4da8122c012b0129014e3c3d181b57545053cdab4d53afd1148cceaaac69f595b112052b4e98008004bf5856d6a37d54624c4175318b28c70811354cd2cc61361df38cf581ed59b1fdc0a3b3964e404aa525cea9f4b8364d3dba69c327b949c1c21d4f0c19e1f39f0d536f2709ac15c0651eb39fddc071c6fd676cd579345a500c2bfa6845095655330be189ab539a47a1fade9b7cb3cf4f7d596ce5000586b56ab699764f350cbe092249209a7d7aae6baa7d71c20416189149a99f8d7e102147089933505efd82a19ba958f6e8185021e3e91109e91a98a6c498a13d012eb7bd7cef923a302f040a14867e08540f47182940dfaa2ea27a782f8950ae6a5a64aecbb05151fb32239c38bdd9caf4b55f6c7118139373cde72a5fd3963089502ed89f9971e035ba940bace6512548381a60aafc4a03daea0af1a7671386f54a33d102018c64a2788f5858f6bbe1108177248ccfff967aa290940e2280b9bc90a3fd28442479379aea3bd97a4486b7d5c73999202731df4bb2d2c9d4ce1c92e5f39b35ad570cccb042c58d34744a98650d4f1aa394b7471b00110d4c55db2ef84a5f3b04b4b18227a6d1013eb7c91044c9739be2d443f20893a557d36b237f0b024f2f6a5c6123040ff9cd9abaa4606af0023ea78dfdc8b2ea98cbb1c4d487471752a75d4a14f00cb2de8680155907d9889c103cfb8206191c6c8bf02a7b5bf267a8727ab69d485737bdc129603450df8aa11a696ab4938f55610c1caf551890c2a37b562c1b4b482b17167e6f9b797ca3b07ce3f2397a7a447a92108356135fb6d6fad7a4a10c140e20f175e113a3567a483e8d38da5135efc9f578149b54d30138fedca6d77b5c47c81e9d53d68305c6a3797c99273c5df99680193920f5687c9149018c2abb1bd70392f24499b48ae816f8b5242aa2f7228f2feb1df542151c693c62159fbe725034aaa665ee6820baac5800b3fa6b44d2d4301ad21aae7ed23e9c4cb19d5f704ec6e931f92511254155adedebc1b88da952b023d5e353fe54cd5c312a06a84a3befdde06da70a7408d5f2d5f66b8728bfd9f38c4172a4274b357f6199963c61c1250cf5a84c99a99681e6aa79bfdbb439569cecf1eae95b8fca27155074d57c05e97b8db2632e441266af4714a9fadd191270169b82b9f7230d2db6d771a899784c47314ff8f46733a37e37ce23400433a6044734145f64aa00e869f8d9c19b137d912ebedcc6d5a3f169a79a8bb55d56c6342305b49dd0418a2fefa975b3cd1fbc64f534c94cc40d225f16c3e9014979543d23dd5494cd1480a455b51a9ca10c8997ba34f3569bb5c7e4b7a86a4ed39a81134b2ae36028bf467a23f7b06d7de0db6a7d7ac3b5c7e391881176aab950d35
09a190eeecac6dae4cd44673d243dc15546130c3d5384ace9cd66ebf3a9e0b607a5d8440f5475ad95e6d3e36f9baa39de14a2a65146afc722934bd5dce2ed9a6dae1f611516cf06eb9e873850c00b5f96cf87f99630ffdeb06a20d3a588d7404f262dbdf2d7cd73245d271874bdba8f6dea0d6fc1f90d578f47314fd5df8055d3ca5c301f854cbdbd2bcf5c45efd39ec82796612f06062c33b812e9e1bd77d5cc423403035bdad949a9abd525293e0d0cf950361afe5b02d24535bfc3790ef00ba4a154d15bda1efe054995c756d754a76030a1111f3cc80e78f16ed35f3a0fc208a4c838635593c1c66a689a46c191533265ff7dbd638da1c694de3af538e2d1fbfab1c4d1232d88959a00002ec5b8c5958ace841482253cf0ebdfdf36080dc058ab09a204220aa6edc015fedb89e31ee2458fa1ce357cdf76e779a352
decompressed ff75451d6e26370b4181027866a13515e6575b8d2011d0a4301a004f16818d5fb5a5a53038894f0d993c05838db53ea31f995e1624bd58670386024cd0a4158de141b20263c3aab692e140f98d7fa9a99f7078873b15991c618220a80dc0b71300cf687c8a8db91facce6872194483586a1c878d4021abeed3ef9df05b583a3c8a2048a7a8b30d004d67868d838aa495d773b687b9580024830205b612e60538828d008d74b08125159944128d5d3e06a87c19830246f3ad368505b72f5c858d694bc09904c092e50c99572a8220fc48a4b815007a2d818da3f9a892084813250a997a3d858d00000000003d06e116678b5d4294a83b49858dec10ad2b09e40984a5582452835d0e58ff5b60c2a2a3d333498e02d950a63690e156568e8df2b4a7a8add93a5017584a5fa3080bd68d0e0aa3085c160c8520003515003d02810223daad918fe12a3e86020e3893e166728302df36a33f96e117418b8db218a4c6f6bd7f974f585f4b1c8b02c338a4b995e12bd38d4c63a0a90450836115997d7d825df391a0172b858daac5ac8850788f1609990f5c875dc232ad742e85201320040069688b4369928d900c58274421990b108902083410e6bd6e5a898dee27a355dd5946c7615855c02012981400017d858de400a6924c5002569599545d84201ce3a5131b004918948deec8ad9134582aed1099075f86025819a53696e13e1352838c32384910b93aea6b24818dc635a5b9fcc877805f5817081b900243cfa2b490e115428f5d2a23a05213855dc9acad6b788c8d0d5a828d7410c06b5f25818d49f9ab903cb8170a15995848ea4d1e2f6c838d5b0fea1f679302103095e160b88d018504700a3312990d275a31130423a11dd5e913f78b583b76818d1d9504789f301499567d8a8d8d12ab07bacc33839b58527c898dbf38ad020d6f2db057585e72828d13e9f561c4072250171b90580b04818d0a8b8d8e59538102293f91e14c61828d1b9574a8257b0c996e71b4021f1c8cc136cb311354d6a583efbe70506d583557848dbff5a5f000e03c87cd587253855de772a50d99097a13090c200a21149945038a8d1583a48a4c500c351599371781026840ac3f9be16300878d2b978c50b03d149970028dc3b1a6083c0158584aea672e820221139be132f320093f0b00191c810219b9953eea162a81597bc0840c1898320d9910478202ee85acb810200472845d45bcac1e4c83022d9899e14229d9021e108fe1740d8702d641a41897e12659812018b809000f9d5d2d476e86021f1c8cc16e1d875d03239902a5ccad1b8ac16d1f8602351b8ac18c7e06868d16382a5dcd167d583671828d2a8e04d8160a0d9978628a02f882a89894e165148902b6d9abb391e1377e8f200b1
81700302a828d2aacb84900020021f8c3858d1daa083cd16419e15627858d5907fa35a3085c0906e1587161848d26c10e6a1387af584be0020ab38de10d4b8502129894e15dcc8dc62a000000000c1ae12d086d8102b294a6188dc160d08d32cf2c130b8787583a3e845d606ac03319830204158de103686e858d47e2a0ce6a1530a00f593f5d8d8d8c11849b0410875512991d4c825d0251258d022a3893e14c258120103015006074878d14de0f4d0a6783587617865d1f71e38d03a13c7802521599817c1081808482a1306ff6cd80b9581897e12651898d191630e74c90af58151a83021b3c8ee1711f918d48a2acb84a000200239f021b3c8ee13d0881023d1099e12c5dea7660d48d0ab83c0158884a8802402ba73898e1107900818d1024633d5503ab586a0e8f0220b996e15b33838d3d052447644704681564835d323d31810229f1a23693e1067a868d0d10739d4b602b596337888d40200882395364214a5d8302341897e105614b835d691ca41f7f83022d9899e1645e8702f877aab0120104378402369894e12653828d33c9a25739008759