
Data corruption on ARM32 with dictionary #4292

Closed
nunojpg opened this issue Feb 10, 2025 · 28 comments

@nunojpg

nunojpg commented Feb 10, 2025

I get corrupt compression output roughly once every 5 billion operations (5e9).

I am using zstd-1.5.6 on ARM32.

I init the ctx with:

ZSTD_CCtx_setParameter(m_ctx, ZSTD_c_compressionLevel, 22);
ZSTD_CCtx_setParameter(m_ctx, ZSTD_c_contentSizeFlag, 0);
ZSTD_CCtx_setParameter(m_ctx, ZSTD_c_checksumFlag, 0);
ZSTD_CCtx_setParameter(m_ctx, ZSTD_c_dictIDFlag, 0);
ZSTD_CCtx_loadDictionary(m_ctx, dictionary.data(), dictionary.size());

I compress with ZSTD_compress2.

On ARM64 machines I have no issues. But on ARM32 I very rarely get a corrupt output.

There are interesting patterns in the corruption, so I am sharing one case. Most of the decompressed data is correct, but it becomes invalid in some repeating patterns.

This is all done locally, but of course I can't assume that there are no bit flips in hardware. This happens on both a BeagleBone and a Raspberry Pi A.

Testing any patch takes me about 6 months, since the error rate is so low and I can only distribute the patch to a limited set of devices.

I can provide the dictionary.

I want to get an idea of whether this is a zstd bug or a memory bit flip.

original ff75451d6e26370b4181027866a13515e6575b8d2011d0a4301a004f16818d5fb5a5a53038894f0d993c05838db53ea31f995e1624bd58670386024cd0a4158de141b20263c3aab692e140f98d7fa9a99f7078873b15991c618220a80dc0b71300cf687c8a8db91facce6872194483586a1c878d4021abeed3ef9df05b583a3c8a2048a7a8b30d004d67868d838aa495d773b687b9580024830205b612e60538828d008d74b08125159944128d5d3e06a87c19830246f3ad368505b72f5c858d694bc09904c092e50c99572a8220fc48a4b815007a2d818da3f9a892084813250a997a3d858d00000000003d06e116678b5d4294a83b49858dec10ad2b09e40984a5582452835d0e0249835dc2a2a3d333498e02d950a63690e156568e8df2b4a7a8add93a5017584a5f818d10b84a00020021f8160c8520003515003d02810223daad918fe12a3e86020e3893e166728302df36a33f96e117418b8db218a4c6f6bd7f974f585f4b1c8b02c338a4b995e12bd38d4c63a0a90450836115997d7d825df391a0172b858daac5ac8850788f1609990f5c875dc232ad742e85201320040069688b8de3cfa7900c58274421990b108902083410e6bd6e5a898dee27a355dd5946c7615855c02012981400017d858de400a6924c5002569599545d84201ce3a5131b004918948deec8ad9134582aed1099075f86025819a53696e13e13828d0c085c0110b93aea6b24818dc635a5b9fcc877805f5817081b900243cfa2b490e115428f5d2a23a05213855dc9acad6b788c8d0dbc5a00062013f85f25818d49f9ab903cb8170a15995c6a818d1e085c0158b80fea1f679302103095e160b88d018504700a3312990d275a838d811aa11dd5e913f78b583b76818d1d9504789f301499567d8a8d8d12ab07bacc33839b58527c898dbf38ad020d6f2db057585e72828d13e9f561c4072250171b908d1eb84a00020023f847538102293f91e14c61828d1b9574a8257b0c996e71b4021f1c8cc13626848d48d6a583efbe70506d583557848dbff5a5f000e03c87cd587253855de772a57744828d13090c200a21149945038a8d1583a48a4c500c351599371781026840ac3f9be16300878d2b978c50b03d1499700243858d11b84a00020033f8672e820221139be132f320093f0b00191c810219b995e10c76828d597bc0840c1898320d9910478202ee85acb810200472845d45bcac1e4c83022d9899e14229d9021e108fe1740d8702d641a41897e12659812018b809000f9d5d2d476e86021f1c8cc16e1d875d03239902a5ccad1b8ac16d1f8602351b8ac18c7e06868d16382a5dcd167d583671828d2a8e04d8160a0d9978628a02f882a89894e165148902b6d9abb391e1377e8f200b18170
0302a828d2ab84900020023f8335f858d1d00000000af19e15627858d5907aa000000000206e1587161848d26c10e6a1387af584be0020ab38de10d4b8502129894e15dcc8dc62aa2885f0160983eea086d8102b294a6188dc160d08d32cf2c130b8787583a3e845d606ac03319830204158de1034e13818d47e2a0ce6a1530a00f593f5d8d8d8c11849b0410875512991d4c825d0251258d022a3893e14c258120103015006074878d14de0f4d0a6783587617865d1f71e38d03a13c7802521599817c1081808482a1306ff6cd80b9581897e12651898d191630e74c90af58151a83021b3c8ee1711f918d2db84a00060023f8239f021b3c8ee13d0881023d1099e12c5d828d06b84a00020023f8114a8802402ba73898e1107900818d1024633d5503ab586a0e8f0220b996e15b33838d3d0524476400c9586064835d323d31810229f1a23693e1067a868d0d10739d4b602b596337888d40200882395364214a5d8302341897e105614b835d691ca41f7f83022d9899e1645e8702f877aab0120104378402369894e12653828d33c9a25739008759
compressed 28b52ffd0008bd28006b4da8122c012b0129014e3c3d181b57545053cdab4d53afd1148cceaaac69f595b112052b4e98008004bf5856d6a37d54624c4175318b28c70811354cd2cc61361df38cf581ed59b1fdc0a3b3964e404aa525cea9f4b8364d3dba69c327b949c1c21d4f0c19e1f39f0d536f2709ac15c0651eb39fddc071c6fd676cd579345a500c2bfa6845095655330be189ab539a47a1fade9b7cb3cf4f7d596ce5000586b56ab699764f350cbe092249209a7d7aae6baa7d71c20416189149a99f8d7e102147089933505efd82a19ba958f6e8185021e3e91109e91a98a6c498a13d012eb7bd7cef923a302f040a14867e08540f47182940dfaa2ea27a782f8950ae6a5a64aecbb05151fb32239c38bdd9caf4b55f6c7118139373cde72a5fd3963089502ed89f9971e035ba940bace6512548381a60aafc4a03daea0af1a7671386f54a33d102018c64a2788f5858f6bbe1108177248ccfff967aa290940e2280b9bc90a3fd28442479379aea3bd97a4486b7d5c73999202731df4bb2d2c9d4ce1c92e5f39b35ad570cccb042c58d34744a98650d4f1aa394b7471b00110d4c55db2ef84a5f3b04b4b18227a6d1013eb7c91044c9739be2d443f20893a557d36b237f0b024f2f6a5c6123040ff9cd9abaa4606af0023ea78dfdc8b2ea98cbb1c4d487471752a75d4a14f00cb2de8680155907d9889c103cfb8206191c6c8bf02a7b5bf267a8727ab69d485737bdc129603450df8aa11a696ab4938f55610c1caf551890c2a37b562c1b4b482b17167e6f9b797ca3b07ce3f2397a7a447a92108356135fb6d6fad7a4a10c140e20f175e113a3567a483e8d38da5135efc9f578149b54d30138fedca6d77b5c47c81e9d53d68305c6a3797c99273c5df99680193920f5687c9149018c2abb1bd70392f24499b48ae816f8b5242aa2f7228f2feb1df542151c693c62159fbe725034aaa665ee6820baac5800b3fa6b44d2d4301ad21aae7ed23e9c4cb19d5f704ec6e931f92511254155adedebc1b88da952b023d5e353fe54cd5c312a06a84a3befdde06da70a7408d5f2d5f66b8728bfd9f38c4172a4274b357f6199963c61c1250cf5a84c99a99681e6aa79bfdbb439569cecf1eae95b8fca27155074d57c05e97b8db2632e441266af4714a9fadd191270169b82b9f7230d2db6d771a899784c47314ff8f46733a37e37ce23400433a6044734145f64aa00e869f8d9c19b137d912ebedcc6d5a3f169a79a8bb55d56c6342305b49dd0418a2fefa975b3cd1fbc64f534c94cc40d225f16c3e9014979543d23dd5494cd1480a455b51a9ca10c8997ba34f3569bb5c7e4b7a86a4ed39a81134b2ae36028bf467a23f7b06d7de0db6a7d7ac3b5c7e391881176aab950d35
09a190eeecac6dae4cd44673d243dc15546130c3d5384ace9cd66ebf3a9e0b607a5d8440f5475ad95e6d3e36f9baa39de14a2a65146afc722934bd5dce2ed9a6dae1f611516cf06eb9e873850c00b5f96cf87f99630ffdeb06a20d3a588d7404f262dbdf2d7cd73245d271874bdba8f6dea0d6fc1f90d578f47314fd5df8055d3ca5c301f854cbdbd2bcf5c45efd39ec82796612f06062c33b812e9e1bd77d5cc423403035bdad949a9abd525293e0d0cf950361afe5b02d24535bfc3790ef00ba4a154d15bda1efe054995c756d754a76030a1111f3cc80e78f16ed35f3a0fc208a4c838635593c1c66a689a46c191533265ff7dbd638da1c694de3af538e2d1fbfab1c4d1232d88959a00002ec5b8c5958ace841482253cf0ebdfdf36080dc058ab09a204220aa6edc015fedb89e31ee2458fa1ce357cdf76e779a352
decompressed ff75451d6e26370b4181027866a13515e6575b8d2011d0a4301a004f16818d5fb5a5a53038894f0d993c05838db53ea31f995e1624bd58670386024cd0a4158de141b20263c3aab692e140f98d7fa9a99f7078873b15991c618220a80dc0b71300cf687c8a8db91facce6872194483586a1c878d4021abeed3ef9df05b583a3c8a2048a7a8b30d004d67868d838aa495d773b687b9580024830205b612e60538828d008d74b08125159944128d5d3e06a87c19830246f3ad368505b72f5c858d694bc09904c092e50c99572a8220fc48a4b815007a2d818da3f9a892084813250a997a3d858d00000000003d06e116678b5d4294a83b49858dec10ad2b09e40984a5582452835d0e58ff5b60c2a2a3d333498e02d950a63690e156568e8df2b4a7a8add93a5017584a5fa3080bd68d0e0aa3085c160c8520003515003d02810223daad918fe12a3e86020e3893e166728302df36a33f96e117418b8db218a4c6f6bd7f974f585f4b1c8b02c338a4b995e12bd38d4c63a0a90450836115997d7d825df391a0172b858daac5ac8850788f1609990f5c875dc232ad742e85201320040069688b4369928d900c58274421990b108902083410e6bd6e5a898dee27a355dd5946c7615855c02012981400017d858de400a6924c5002569599545d84201ce3a5131b004918948deec8ad9134582aed1099075f86025819a53696e13e1352838c32384910b93aea6b24818dc635a5b9fcc877805f5817081b900243cfa2b490e115428f5d2a23a05213855dc9acad6b788c8d0d5a828d7410c06b5f25818d49f9ab903cb8170a15995848ea4d1e2f6c838d5b0fea1f679302103095e160b88d018504700a3312990d275a31130423a11dd5e913f78b583b76818d1d9504789f301499567d8a8d8d12ab07bacc33839b58527c898dbf38ad020d6f2db057585e72828d13e9f561c4072250171b90580b04818d0a8b8d8e59538102293f91e14c61828d1b9574a8257b0c996e71b4021f1c8cc136cb311354d6a583efbe70506d583557848dbff5a5f000e03c87cd587253855de772a50d99097a13090c200a21149945038a8d1583a48a4c500c351599371781026840ac3f9be16300878d2b978c50b03d149970028dc3b1a6083c0158584aea672e820221139be132f320093f0b00191c810219b9953eea162a81597bc0840c1898320d9910478202ee85acb810200472845d45bcac1e4c83022d9899e14229d9021e108fe1740d8702d641a41897e12659812018b809000f9d5d2d476e86021f1c8cc16e1d875d03239902a5ccad1b8ac16d1f8602351b8ac18c7e06868d16382a5dcd167d583671828d2a8e04d8160a0d9978628a02f882a89894e165148902b6d9abb391e1377e8f200b1
81700302a828d2aacb84900020021f8c3858d1daa083cd16419e15627858d5907fa35a3085c0906e1587161848d26c10e6a1387af584be0020ab38de10d4b8502129894e15dcc8dc62a000000000c1ae12d086d8102b294a6188dc160d08d32cf2c130b8787583a3e845d606ac03319830204158de103686e858d47e2a0ce6a1530a00f593f5d8d8d8c11849b0410875512991d4c825d0251258d022a3893e14c258120103015006074878d14de0f4d0a6783587617865d1f71e38d03a13c7802521599817c1081808482a1306ff6cd80b9581897e12651898d191630e74c90af58151a83021b3c8ee1711f918d48a2acb84a000200239f021b3c8ee13d0881023d1099e12c5dea7660d48d0ab83c0158884a8802402ba73898e1107900818d1024633d5503ab586a0e8f0220b996e15b33838d3d052447644704681564835d323d31810229f1a23693e1067a868d0d10739d4b602b596337888d40200882395364214a5d8302341897e105614b835d691ca41f7f83022d9899e1645e8702f877aab0120104378402369894e12653828d33c9a25739008759

@Cyan4973
Contributor

Difficult to say.
If the errors only happen on the same machine, it increases the likelihood that it is a hardware issue.
On the other hand, if the same sample fails the same way on multiple systems, it's most likely a library issue.

If this is neither of those, i.e. it happens rarely but on multiple systems, and there is no reproduction case, i.e. the data flagged as a failure doesn't fail on retry, then it's a very difficult case to investigate.
It might be the library, in combination with some system pressure. But these kinds of bugs are extremely difficult to analyze, since the issue is not guaranteed to reproduce, and the very presence of debugging traces or tools can alter the symptoms.

@Cyan4973
Contributor

I would also add that level 22 is probably not a great idea for 32-bit, as this mode can consume considerable resources, making it difficult for the rest of the system to continue operating.
I would limit the level to something like 19, which is much safer on this front.
Level 19 has most of the same properties as level 22, and therefore the compression ratio should be close enough, especially for small data. It's mostly on large data that there is a difference, in which case level 19 will simply clamp its resource consumption to a much more reasonable level.

@nunojpg
Author

nunojpg commented Feb 10, 2025

It has happened about 10 times so far, always on a different device, out of a fleet of 3000 devices.

As compression is limited to about 3KB/s, it doesn't exhaust resources.

My hope would be that the way the pattern repeats, and the particular spots where it got corrupted, could point to some likely bug.

@terrelln
Contributor

Is the context used once, or reused for many compressions?

> I can provide the dictionary.

That would be great!

nunojpg added a commit to nunojpg/zstd that referenced this issue Feb 10, 2025
@nunojpg
Author

nunojpg commented Feb 10, 2025

Please find the dict here: https://raw.githubusercontent.com/nunojpg/zstd/06dbfd5d1edc4721c292e6e5f24d11b63549decb/240611.bin

I use the context multiple times. For example in this case the context was 7 days old and had done about 1 million operations before.

@terrelln
Contributor

Thanks @nunojpg! I will try to reproduce this issue on my Raspberry Pi.

@nunojpg
Author

nunojpg commented Feb 11, 2025

@terrelln just to confirm: this only happens very rarely. Normally zstd will compress this exact data correctly.

@terrelln
Contributor

This seems like it could be related to #4015, which was also rarely happening on 32-bit machines.

In that case we didn't have access to the data, so it was hard to debug. But I'm hopeful that with this example, we may be able to make some progress.

@nunojpg
Author

nunojpg commented Feb 11, 2025

Great. I have a few more cases available in case you need them.

@terrelln
Contributor

If you add 197 to every offset that points into the dictionary, it decompresses correctly. Now to figure out why.

diff --git a/lib/decompress/zstd_decompress_block.c b/lib/decompress/zstd_decompress_block.c
index ca5044376..c2121b370 100644
--- a/lib/decompress/zstd_decompress_block.c
+++ b/lib/decompress/zstd_decompress_block.c
@@ -928,6 +928,8 @@ size_t ZSTD_execSequenceEnd(BYTE* op,
 
     /* copy Match */
     if (sequence.offset > (size_t)(oLitEnd - prefixStart)) {
+        sequence.offset += 197;
+        match -= 197;
         /* offset beyond prefix */
         RETURN_ERROR_IF(sequence.offset > (size_t)(oLitEnd - virtualStart), corruption_detected, "");
         match = dictEnd - (prefixStart - match);
@@ -977,6 +979,8 @@ size_t ZSTD_execSequenceEndSplitLitBuffer(BYTE* op,
 
     /* copy Match */
     if (sequence.offset > (size_t)(oLitEnd - prefixStart)) {
+	sequence.offset += 197;
+        match -= 197;
         /* offset beyond prefix */
         RETURN_ERROR_IF(sequence.offset > (size_t)(oLitEnd - virtualStart), corruption_detected, "");
         match = dictEnd - (prefixStart - match);
@@ -1022,11 +1026,7 @@ size_t ZSTD_execSequence(BYTE* op,
      *   - Match end is within WILDCOPY_OVERLIMIT of oend
      *   - 32-bit mode and the match length overflows
      */
-    if (UNLIKELY(
-        iLitEnd > litLimit ||
-        oMatchEnd > oend_w ||
-        (MEM_32bits() && (size_t)(oend - op) < sequenceLength + WILDCOPY_OVERLENGTH)))
-        return ZSTD_execSequenceEnd(op, oend, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
+    return ZSTD_execSequenceEnd(op, oend, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
 
     /* Assumptions (everything else goes into ZSTD_execSequenceEnd()) */
     assert(op <= oLitEnd /* No overflow */);
@@ -1115,11 +1115,7 @@ size_t ZSTD_execSequenceSplitLitBuffer(BYTE* op,
      *   - Match end is within WILDCOPY_OVERLIMIT of oend
      *   - 32-bit mode and the match length overflows
      */
-    if (UNLIKELY(
-            iLitEnd > litLimit ||
-            oMatchEnd > oend_w ||
-            (MEM_32bits() && (size_t)(oend - op) < sequenceLength + WILDCOPY_OVERLENGTH)))
-        return ZSTD_execSequenceEndSplitLitBuffer(op, oend, oend_w, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
+    return ZSTD_execSequenceEndSplitLitBuffer(op, oend, oend_w, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
 
     /* Assumptions (everything else goes into ZSTD_execSequenceEnd()) */
     assert(op <= oLitEnd /* No overflow */);

@terrelln
Contributor

@nunojpg if you could share another example or two that would be great! I likely have enough to go on here, but more datapoints would be useful to help narrow things down quicker.

@nunojpg
Author

nunojpg commented Feb 11, 2025

original
d367213e6147420b4181a08a1878f704a03f63b2ffb0170070508428644e50170300406f81200131150017029528011703004cea8d443e50081000588842ea78048120013115007b5b828d42cd769c04908109129919048d8083184bc1b81f4103c9581099e125191d81a04db14b00f43f34416adeb815001ae1a000a06dc3b8330d20b01700141c825d7c00443b07812801170300456f812001311500460e81024d08609994e1392988a0030000aa300048c8b01700770481a800a06dc3b8330d20a51c00b0767d92a803f70420410f34803b100001398e2801170300496f812001311500687d822003b017004321858d05673c3e16471107bf681b7c818d5d1e150000000020b50067d92800a51c00305485a0c7553c0034e030e3b99f101900203538842801170300456f812001311500236b832800a51c00624b948f049904b895db09993640888d089e041095df119953cba0650a780014e0334f9aa13015007054848d070e81f50697a5581e6481a80a0000a4300030cc9b1a0040092fa48d07b84900060023f86f0288a80bef04c03b735480ba0a00254381a80b0000a4300078c0ba0a0047438ca0010000aa300078c031150015318b2805a51d00436585a0010000aa300078c03115000e44848f061e150000000010b50079018ca801dc0ce03681b5ff170300016c1286a0205c15beff7d2d3b7aa81812000d4e8aa0010234e02f1d6aaa3115003c46882005b8150033f58f0c081000607844eb3d588720013115005762895d084c51822805a51d00486f812005b8150000065988a80dd20c60324b3580b41300284381a80d0000000000509db4130069658a2805a51d00446f812005b815006e73ab2805a51d00486f812005b815005c71962005b815006e5f8ea00800e47f352f1aa5b01700242422862805a51d00496f812005b815004f28a98dae653c9e04708af10999583381a808f804603fddf4ff821200284381a8080000aa300048c88212000a2987a04392780000aa300020ce9819002469888d00d49a112a07bf580a04818d009d08e8259a119990074f8a8d087c4900020023f85b0e945f0c433586a80ff704603a735c80ac0c005344c9a00a0000a4300030cc101900453684a00492780000fd800a0310b01700053a93a00200ecbf301d6adfb815005d7782a8102008db72000d20a71b002437958d9d2d50083f0160d834ea0006f1a0100000aa300048c8b01700614f8ca0050000a4300068c2b81500083089a01002d47f352d2a9fb01700600d84a0114d74e224214aadb80c002f2082a00cffe73f3009aa99911620002881a802e004e037471c80390800411f82a8020000a4300068c2390800
compressed
28b52ffd0000651a007331bd533ca97aa22c1335c5930e5fae9078dc3caa27729d2958a5f1a7c0869f7e94662dcbe0a069b67385d68c4d067d5d5ccbb3966dbab21982e7c7c81082ead6921dc6ad968a5f6dd395ade508d7e9133efef9e023916ca62e26c0dae99e0538787ba20c9836f576eec00013fefc605ab57c54b2b2705b9e8faa482c6f50894db32a0ed1f894d86e212727d24653d935481342dc237e8db43fb7c30d10fcc3748487db0182a95fd7d94c59021da72cd53b245394328e522e3541c713db4d8a9fc889fac78e80987783608a63e151e87c65a5696c6ca62b9b982d4146863ea6b8e5e29aa41e5a269f062ac4b8ba0fe3a936058043a3ad21b6ec03da136bf2759ae36195c1b8f034b69bd9d42b4405ab7b930754816cd32d9e653c8168eaedb482d7564c53af500a0c30413c8aaa544b442fed14e57b62c3d24f6129015e34b0651ac58e3a155929e5cc8f78642fa09e50a40a4e39bd3157c63395093a6e6a0a8369d54914f2522d15a7aea62401949cc8a76ea624011c8ef12dcdb8452839b1aee291d64cafd3acce114f22080b356da95c203c0d93984e494ee470311d73f27a426e32e2cf6379536ff8b2c317118fee349024ce5201b1cfdd11b9bd41e07e0d75f4b1164476deff6684dc729bbe604e75cc196aa65e212a38c7ab9a8aac5075ce6223f094651bc25383308fa22913190296e8b1f0f40573aa43c1dc94133c9d4ea7de98349f13bad07d8e70af549c98e6ac1e3d33355dd91401970be9a65e212a38c7cb9a8aac50751693a8d3edd261d14914f8162272c2b1f06f73fa8209d6127582de594e4fe437e07465d306d6349c623a15b15cef24f580fc6c78573537b19a7a85a8e01caf6a2ab242d5c94796a8a68469e597ee746513ac25ea04bdb39c5af668997a3b31c01055f464c274ab4597650ec372d3e36ffb0d9128994e1a5b5ff3a1bf593e0a366fdf1efe7048f9f2d9e0d42b4405e7776010a9759f6ef1b113950fc2351559a1ea445657a65e212a38c7839a8aac8cbde074cee3bce9ca66a98bb727cb5b48e017d1328f2a91ba6a739a99cb111b0c811302116a319809de8ff5edcbfe1bfc74b960d6d0dd6b8e54d7fa75903f2405c28ff7187ec8b883325858cac38e1d27e0772021953964c2b862c3f62275e95d850723f861248dce9e5fb31fc579820c36d914
recovered
d367213e6147420b4181a08a1878f704a03f63b2ffb0170070508428644e50170300406f81200131150017029528011703004cea8d443e50081000588842ea78048120013115007b5b828d42cd769c04908109129919048d8083184bc1b81f4103c9581099e125191d81a04db14b00f43f34416adeb815001ae1a000a06dc3b8330d20b01700141c825d7c00443b07812801170300456f812001311500460e81024d08609994e1392988a0030000aa300048c8b01700770481a800a06dc3b8330d20a51c00b0767d92a803f70420410f34803b100001398e2801170300496f812001311500687d822003b017004321858d05673c3e16471107bf681b7c818d5d1e150000000020b50067d92800a51c00305485a0c7553c0034e030e3b99f101900203538842801170300456f812001311500236b832800a51c00624b948f049904b895db09993640888d089e041095df119953cba0650a780014e0334f9aa13015007054848d070e81f50697a5581e6481a80a0000a4300030cc9b1a0040092fa48d07b84900060023f86f0288a80bef04c03b735480ba0a00254381a80b5806a158fa051cba0a0047438ca0010ee91199494d868df96815318b2805a51d00436585a0010ee91199494d868df9680e44848f061e1599095b848db50079018ca801dc0ce03681b5ff170300016c1286a0205c15beff7d2d3b7aa81812000d4e8aa0010234e02f1d6aaa3115003c46882005b8150033f58f0c081000607844eb3d588720013115005762895d084c51822805a51d00486f812005b8150000065988a80dd20c60324b3580b4138d7515a8000d0000000000509db4138d69658a2805a51d00446f812005b815006e73ab2805a51d00486f812005b815005c71962005b815006e5f8ea00800e47f352f1aa5b01700242422862805a51d00496f812005b815004f28a98dae653c9e04708af10999583381a808f804603fddf4ff82128d7515a800084c5fd7661854ea8212000a2987a0439278908d734ca200000019002469888d00d49a112a073a8c3f015858009d08e8259a119990074f8a8d087c8d9d25991b7f910e945f0c433586a80ff704603a735c80ac0c005344c9a00a60958d0351a746045895453684a0049278908dfda5083c0158b83a053a93a00200ecbf301d6adfb815005d7782a8102008db72000d20a71b002437958d9d2d50083f49000600330006f1a010bc4900062012f8681700614f8ca005918d1c083cc5668842ea083089a01002d47f352d2a9fb01700600d84a0114d74e224214aadb80c002f2082a00cffe73f3009aa99911620002881a802e004e037471c80390800411f82a802918d1c083cc566390800

@nunojpg
Author

nunojpg commented Feb 11, 2025

original
5f4e33643971120c418102092fa1b895e1711087800117ac264f8516801b58388361223f8f805e95a81d4263ebf32758bf84a13edb8daf88a3d729a21d44815876b520bbf5ac930c0015ce8d030650a02eea08990c2b888d2c3da50a0450947e0d991a25845d01ff1b718a8d80f0a300000000880ce15a788820cff2ac1f0c0005689a021a2ca63602411f62848de904c01fb92c0784b958404882003c49a5a001004e09818dbeaea9040410a4410d996308828dd491ac070470987c0d99553a89024924a31997e1b80a65820206910200060d825d053e5083000aa001001962825d1439a24e18870292e8ab1b85a1133f848d9779a50804d0938e0d994218830202bf84a11d5e8b8ddd4bac03685816e91499731e578a8dcc97a9090410888c15991217848d5834a48d8a8b3e00bf586b57840204938cc11c40835d095448818d1676a3083c0150b80fea6d448402c62bacb58ac13c4b898d27d3a80604d8287021996e588a0208360241d40f46868d13085c01585848ea51278702083602411f12868db36539b84900020021f8672282800f5b07eca5b729581b85a17943838dd372aab84a00060023f8064a8202069102000447878d0eb9a3885e095d8840ea6875825dd45bab27360b845dee42a21840835df5b8a37b448a027cd0a19602415f4382201bb203003335815d033c18855d211bac1449898d12000000001b19e10f29870206910200073b478202005fab1b8381662a8c5dd708a84010818da73ea4598fac4717c9586330858d07085ccd64082dea605f815f1d4e22858d0f5b07eca5b729585f7d820202bf84a17f138d8d055f6b010587a5586116338e02ecb7a61a87a10d1b8102149b85a126678902149b85a15035958d20ee55f5b893cd587a1c818d03085c2d5ca83cea6e68818d88a8a54433280280af587f20858df02ba20104d0a24215994f43890208360241285837858d152068e376704d23021a832008360228106a865d1f7f0b81027bc0ab9986c10e088e8d1aed28947ae753585656818dfd4ca4b84a00060023f83f1f888d0db84a00020023f838d702083602414c7f638220083602283b5f815d066839838d328cad100c4896db0999407f9620806faa900e0076bd8d01d5f2bdaf871b5870458f8d142018c737131023267a838d1836a0f48af851879b5809069f8d0048ca14a782af5801493d838d8a74a5dee80a3003dd58081385200c3813006bb6201f1b0300270283201f1b0300495f82201f1b0300495f82201f1b03007e42845d020a57848d0cb84900020021f84a4bd88d190284108ed109997b3f8e8d3227a1fda0328187af58703d825d254564858058e6a3fe637acc82af58b895e10566818d0f83383800de1499644581201f1b03003345968d
fd8ea018ef495222bf58
compressed
28b52ffd0000751b001331bbfd93262c4b84afe5220b5eb28a28af82fb5d0ca662b1f9c4e772d61e919d5db3bcc086b7164670d00fd170553769e5a6bcaea9ce872c59b8cc19fa580716b7e72a8b17ddd23a9b9154d3ce521cca6b59e66ace54c445060c23c334459c05a3537012201860f6217a735396896e0258531fe28aed956a30524413cc1b706ab65dcf5895d75dd49a32488880ea65fc4d4dda29f00ec19286501f6b81245676bb4271c89b93e8ae69f9da52f9cd906b022d092e167c5e56d6c57e6c80d401a1c52fe22634d23f73c660352f3194494b63d8ea80cd2d553ab0d7a391ba0e7beaeb7ba54484de54113da31e0d36ec9a87551d222acc136b9fd369f879d9d0713c769e401602d8471a8dd9733897e8ca6e3577d15eaa5b4ac4f8a0ba48a4d2d631d5566a8635abf72986e5c2358fa6f18259cbd7e07084dffa8d952b158c64124e267cbe853a814f90a6c823979b5d32d7a6d5f2d5d2887ae704c5b3b3ba2a5c66d2d05f2691b4dc88682ae2c2055c9e5a8baa4a5e86125611a90647749ef015211542dab9e42827f30c82b4b25f1a566d3dc2d3c4abaf6eadeb94e025daa5efc4c0fc60e4ece4b04c3317f7c2a75164d4396056271044558ff02a62d7f067f27444efeb891c399ebdc54403915345cf01d533d55d9093ce718e35864d3343bdd0507cbcab67b30cfd73ea9488e84914757f5e41f95ec5f5f6c69ac5f0fd6e072a9d4e24510de6f2fc59ad2e3e86de38a6e0a43f3fa55ca25d32813856d55f2f1149ddb4b21b0bc9d43c76b2a695b5d769b5f15a2dc79174f6986419619355652b917d2d2165de3da2a21182af8d22b1bd6f142c08f039d66b4ab5c4cc629a96b1a0bae779c4c75366c6dd813fbbc944558f4af8b4543bd059a75919152434005e075ee8911d3df4887a2ea5d4b4f34dcbd772801cdf24984d3e7edd93fd781020cbf155c433e09d661f559c200098f3afe7543022762618b1d6f416385d9e3f7b4e091ebc49c19c88543b647ca269338bbbcba10afa0b47c50af550bdd5eb11d9791984ade6515914bd243822afb5273cc756f20d3f14424de63a02dc59598a80b14b9c5deaafebae62625e909724356506a62796cc2e717c48c8f2ee2e3e0dbb608fbb144f445069fd1ed3ce8d97f30e061d8b7d0af6cbee30730cd47936b542f991cdb2ac369e72a5007cbc8148c13dcb2a48ca4cbdc91638fa509a172c1d1174ff7f3cf1f50a6eaf0ec2fd0c
recovered
5f4e33643971120c418102092fa1b895e1711087800117ac264f8516801b58388361223f8f805e95a81d4263ebf32758bf84a13edb8daf88a3d729a21d44815876b520bbf5ac930c0015ce8d030650a02eea08990c2b888d2c3da50a0450947e0d991a25845d01ff1b718a8d80f0e98040083c010ce15a788820cff2ac1f0c0005689a021a2ca63602411f62848de904c01fb92c0784b958404882003c49a5a001004e09818dbeaea9040410a4410d998e8d4e16d491ac070470987c0d99553a89024924a31997e1b80a65820206910200060d825d053e5083000aa001001962825d1439a24e18870292e8ab1b85a1133f848d9779a50804d0938e858d2bb8490202bf84a11d5e8bf87d1e8403685816e91499731e57384eea672e090410888c15991217848d5834a48d8a8b3e00bf586b57840204938cc11c40835d095448818d2d000000001650b80fea6d448402c62bacb58ac13c4b898d27d3a80604d8287021996e588a0208360241d40f460882253e78818d7527a751278702083602411f12868db365085c0158b838ea22672282800f5b07eca5b729581b85a179436c8a95a1cd2eee0149460210064a8202069102000447878d0eb9a3885e095d06acf92075825dd45bab27360b845dee42a21840835df5b8a37b448a027cd0a19602415f4382201bb203003335815d033c18855d211bac1449898d124a00060023f86d0f29870206910200073b478202005fab1b8381662a8c5dd708a84010818da73ea4598fac4717c95863306c128c8d32cd648f8ff01e5f815f1d4e22858d0f5b07eca5b729585f7d820202bf84a17f138d8d055f6b010587a5586116338e02ecb7a61a87a10d1b8102149b85a126678902149b85a15035958d20ee55f5b893cd587a1c6813aa083c2dc14ea0446e68818d88a8a54433280280af587f20858df02ba20104d0a24215994f43890208360241285837858d152068adaa4d0418021a832008360228106a865d1f7f0b81027bc0ab9986c10e088e8d1aed28947ae753585656818dfd4ca40d11a1885f01608817ea450d085c0158b83aea3cd702083602414c7f638220083602283b5f815d066839838d328cad100c4896db0999407f9620806faa900e0076bd8d01d5f2bdaf871b5870458f8d14b95104bf581023267a83858d02adf48af851879b5809069f8d0048ca14a700020023493d838d8a74a5dee80a898d3775081385200c3813006bb6201f1b0300270283201f1b0300495f82201f1b0300495f82201f1b03007e42845d020a57848d4102a0b84a0002aa4a4bd88d190284108ed10999235c70823227a1fda0328187af58703d825d254564858058e6a3fe637acc82af58b895e10566818d0f83383800de1499644581201f1b03003345968d
fd8ea018ef495222bf58

@nunojpg
Author

nunojpg commented Feb 11, 2025

I don't have any more at the moment; we are harvesting them at about 3 per week.

@terrelln
Contributor

Two should be enough! Thanks @nunojpg!

I feel confident we'll be able to figure this out quickly with these examples.

@terrelln
Contributor

This second example decompresses correctly if you add 60 to every offset.

@terrelln
Contributor

Theory:

  • Dictionary is getting attached
  • The lowLimit / dictLimit is getting set too high
  • This makes the dictionary offsets all wrong by the same amount

Evidence:

  • Everything is offset by the same amount in the same file
  • In the original file, the "correction" is +197.
  • If I recompress the original file, I get these sequences:
seq: litL=29, matchL=5, offset=171592 
seq: litL=103, matchL=4, offset=112667 
seq: litL=88, matchL=6, offset=5266 
  • However in the corrupted file we see:
seq: litL=231, matchL=4, offset=1 
  • This suggests that for whatever reason, the compressor may have considered these indices out of bounds, so it couldn't match until it surpassed index 197.

Not sure how this could happen yet though.
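
The "same shift on every dictionary offset" observation can be pictured with a toy match copier. This is a simplified model, not zstd's decoder: `toy_copy_match` and its `dictShift` parameter are hypothetical, and it ignores literals, repeat offsets and window limits. A match whose offset reaches back past the start of the output is resolved against the end of the dictionary, so an encoder that emits dictionary offsets that are too small by a constant produces wrong bytes only for those matches, and adding the constant back restores them.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (not zstd code): copy `len` bytes into out[pos..] from `offset`
 * bytes back in the virtual buffer "dict followed by out".
 * `dictShift` is a hypothetical correction added only to offsets that reach
 * into the dictionary. The caller must make sure out can hold pos + len bytes.
 * Returns 0 on success, -1 if the offset is out of range. */
int toy_copy_match(unsigned char *out, size_t pos,
                   const unsigned char *dict, size_t dictSize,
                   size_t offset, size_t len, size_t dictShift)
{
    size_t i;
    assert(out != NULL && dict != NULL);
    if (offset == 0) return -1;
    if (offset > pos) offset += dictShift;   /* match starts in the dictionary */
    if (offset > pos + dictSize) return -1;  /* would read before the dictionary */
    for (i = 0; i < len; i++) {
        size_t const cur = pos + i;
        if (offset > cur) {
            /* source byte is still inside the dictionary */
            out[cur] = dict[dictSize - (offset - cur)];
        } else {
            /* source byte is in the already-produced output */
            out[cur] = out[cur - offset];
        }
    }
    return 0;
}
```

With a made-up 10-byte dictionary `ABCDEFGHIJ` and two bytes already produced, an offset of 9 selects `DEF`. If the encoder emitted 7 instead, the copier reads `FGH`, and a `dictShift` of 2 recovers `DEF`. Offsets that stay inside the output are unaffected, which would be consistent with most of the decompressed data being correct.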

@nunojpg
Author

nunojpg commented Feb 12, 2025

Sorry if this is a dumb question: since this doesn't happen with a clean context, does it point to some uninitialized variable that by chance gets set to a very unlikely value by the previous compression?
But I am pretty sure that the compression following the corrupt one resumes good operation, so whatever happened gets restored. That doesn't really match my assumption, unless it is "the last operation of the previous compression" that sets a variable that is not correctly reset when a new compression begins.
Would a run with sanitizers on ARM32 help?

@nunojpg
Author

nunojpg commented Feb 12, 2025

Or a fuzzer for the context.

@terrelln
Contributor

@nunojpg I don't think there are uninitialized variables. I suspect some of the code that resets the context may be buggy in some very specific scenarios that have something to do with either the amount of data processed or the location of the buffers in memory.

Could you describe how you're using these compression contexts in a bit more detail? It would be useful in trying to understand what kind of conditions could trigger this. A few specific questions:

  • Are you always using the same dictionary? Or does the dictionary change?
  • What size range do you expect your input data to be? Always around 1KB? Or could it sometimes be much larger?
  • Are you always compressing from the same buffer, or do you reallocate a new buffer every time?

@nunojpg
Author

nunojpg commented Feb 12, 2025

Sure.

  • Are you always using the same dictionary? Or does the dictionary change?

Always the same. Init at the beginning; from that point on I only call ZSTD_compress2 and ZSTD_isError.

  • What size range do you expect your input data to be? Always around 1KB? Or could it sometimes be much larger?

Minimum about 10 bytes, maximum about 1480 bytes.

  • Are you always compressing from the same buffer, or do you reallocate a new buffer every time?

Input: address in the heap, always the same
Output: address in the stack, can vary by up to 100 bytes (first I place a header, then I compress in place after the header).

@nunojpg
Author

nunojpg commented Feb 12, 2025

There are two ZSTD_CCtx in the same process. The other ZSTD_CCtx compresses blocks of about 5MB and runs in a thread, so there could be concurrency issues if there were static variables in the library.
I link dynamically.

@terrelln
Contributor

Zstd does not use any static variables.

@terrelln
Contributor

I suspect that PR #4129 fixes this issue. But, now that I have a concrete target, I'm going to attempt to repro the issue, so I can prove that it is the culprit.

@Cyan4973
Contributor

Oh, I forgot #4129! Well remembered, @terrelln! It's indeed a very good candidate for this issue description.

I presume @nunojpg's application is using the v1.5.6 release, which doesn't include the #4129 fix.
In that case, a possible test would be to use the dev branch instead and observe the difference. I'm not sure that's really practical in your environment, though. Maybe just in a test environment, but then it's unclear whether it can generate enough volume to statistically trigger the issue, since it's pretty rare.

@terrelln
Contributor

@nunojpg I've reproduced exactly the same symptoms locally on v1.5.6. These issues are fixed by PR #4129. You can test on the latest dev branch, or we will be making a release soon, which includes this fix.

Reproduce

These commands will reproduce a corrupted compressed file and report the corruption in 32-bit builds. It is extremely messy, since I haven't bothered to clean anything up, but it gets the job done.

git clone https://github.com/terrelln/zstd
cd zstd
git checkout 2025-02-12-corruption-repro
make -j zstd
./zstd

The corrupted file shows exactly the same symptom, where offsets into the dictionary are shifted by some fixed constant. In this case every offset into the dictionary is 200 lower than it should be.

Issue

This code gets triggered when the input buffer overlaps with the previous input buffer. When the indices are right on the 2GB boundary, the comparison on line 1278 can be wrong.

If highInputIdx > 2GB and window->dictLimit < 2GB, then the comparison will return false on 32-bit mode, when it should return true.

Then we end up in a case where window->lowLimit > window->dictLimit, which breaks a fundamental assumption inside of Zstandard.

After this, we mis-compute the offsets into the dictionary, and they are shifted by window->lowLimit - window->dictLimit.

if ( (ip+srcSize > window->dictBase + window->lowLimit)
   & (ip < window->dictBase + window->dictLimit)) {
    ptrdiff_t const highInputIdx = (ip + srcSize) - window->dictBase;
    U32 const lowLimitMax = (highInputIdx > (ptrdiff_t)window->dictLimit) ? window->dictLimit : (U32)highInputIdx;
    window->lowLimit = lowLimitMax;
    DEBUGLOG(5, "Overlapping extDict and input : new lowLimit = %u", window->lowLimit);
}
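
The failing comparison can be modelled in isolation. The sketch below is not zstd code: `low_limit_signed` and `low_limit_unsigned` are hypothetical helpers, `int32_t` stands in for a 32-bit `ptrdiff_t`, and the index values in the note that follows are made up so that the resulting shift is 197. The unsigned variant only illustrates that the comparison behaves as intended when done on 32-bit unsigned values; it is not claimed to be what PR #4129 does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the lowLimit update as a 32-bit build would evaluate it.
 * int32_t stands in for ptrdiff_t. Converting a uint32_t above INT32_MAX to
 * int32_t is implementation-defined before C23 and wraps to a negative number
 * on the usual two's-complement targets. */
uint32_t low_limit_signed(uint32_t highInputIdx, uint32_t dictLimit)
{
    int32_t const idx = (int32_t)highInputIdx;
    assert(highInputIdx > 0); /* the outer overlap check implies a positive index */
    return (idx > (int32_t)dictLimit) ? dictLimit : (uint32_t)idx;
}

/* Same update with the comparison done on unsigned 32-bit values
 * (illustration only, an assumption rather than the actual fix). */
uint32_t low_limit_unsigned(uint32_t highInputIdx, uint32_t dictLimit)
{
    assert(highInputIdx > 0);
    return (highInputIdx > dictLimit) ? dictLimit : highInputIdx;
}
```

With `dictLimit = 0x7FFFFF9C` and `highInputIdx = 0x80000061`, the signed model returns `0x80000061`, leaving `lowLimit` above `dictLimit` by 197, while the unsigned model clamps to `dictLimit`. For indices below 2GB the two models agree.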

Testing

This code is tricky to test because it requires processing at least 2GB of input. It's also very sensitive to the physical location in memory where the buffers are. This logic is required due to the buffer-less streaming API. Once we're able to delete this API, I'd suggest drastically simplifying this code.

I suggest we add a fuzzer for ZSTD_window that doesn't process any data, but uses the fuzzer to fuzz this API directly, and ensure that its invariants are always respected.
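
A rough picture of such a harness, assuming a libFuzzer-style entry point. Everything here is a stand-in: `toy_window`, `toy_update_low_limit` and the input layout are hypothetical, and a real target would drive zstd's own window-update code instead of this toy. The point is only the shape: the fuzzer supplies index pairs directly and the harness asserts `lowLimit <= dictLimit` after every update, without compressing any data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the window state; not zstd's ZSTD_window_t. */
typedef struct { uint32_t lowLimit; uint32_t dictLimit; } toy_window;

/* Toy update that clamps lowLimit to dictLimit using an unsigned comparison. */
void toy_update_low_limit(toy_window *w, uint32_t highInputIdx)
{
    w->lowLimit = (highInputIdx > w->dictLimit) ? w->dictLimit : highInputIdx;
}

/* libFuzzer entry point: every 8 input bytes become one
 * (dictLimit, highInputIdx) pair in host byte order. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    toy_window w = { 0, 0 };
    while (size >= 8) {
        uint32_t highInputIdx;
        memcpy(&w.dictLimit, data, 4);
        memcpy(&highInputIdx, data + 4, 4);
        toy_update_low_limit(&w, highInputIdx);
        assert(w.lowLimit <= w.dictLimit);  /* invariant under test */
        data += 8;
        size -= 8;
    }
    return 0;
}
```

With clang, a target of this shape is typically built with `-fsanitize=fuzzer`. Since the bug described above only shows up in 32-bit mode, a 32-bit build of the harness is presumably what would matter.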

@nunojpg
Author

nunojpg commented Feb 13, 2025

Great work, would you like to close this issue?

@terrelln
Contributor

@nunojpg Thanks for reporting this. The detailed examples made it possible to debug!

I'll go ahead and close this issue, but if you find that it is still occurring after PR #4129, please let us know! It is very likely that is the bug causing the issue, but it is impossible for us to know for sure.


3 participants