HTTP/2: Rare data corruption when stress testing against Kestrel #30550

Closed
eiriktsarpalis opened this issue Aug 9, 2019 · 15 comments

@eiriktsarpalis
Member

I've discovered a data corruption issue when running the HttpStress suite over long stretches of time. The occurrences are very rare: I've recorded 8 instances across stress runs adding up to over 60M requests.

The issue occurs in requests where the server echoes back random data sent by the client, either as headers or as content. The final response always differs from the expected value by a single character: in most cases the character has been corrupted to a different value, but I've also recorded a couple of instances where it is missing altogether (in which case the content length differs from the expected length).

Examples

For example, one operation was expecting

"Bo4JwKXM5KukFjgxQ7E8AV52yeoddJr7XHtKwxu9mVUpl8VbswO3XX9fi2LHWtjlTnCxqM0TD5Lf1FGq9Clguna07DbucdcsliMyvXlVwvqRBUeSlU7J82qfDBpck5qZSTtJtIubQFjzx8pdZ898VJkjrgkjh8kr5GziaKZSZVssIri5MC3cwi49iARCXBtHVsug872EZgY0QsQzyfNV58vNCqzQtMuObaMPx1xEAp2F93jqjREsnOCMxXUI8bLCl9OpV02wi8KG5fD878gR72bDO0SPXe6ZTpT0ftLjPIAf4vHIJLBzuu5jYBQVYcP31lmFCDSDnx"

but got

"Bo4JwKXM5KukFjgxQ7E8AV(2yeoddJr7XHtKwxu9mVUpl8VbswO3XX9fi2LHWtjlTnCxqM0TD5Lf1FGq9Clguna07DbucdcsliMyvXlVwvqRBUeSlU7J82qfDBpck5qZSTtJtIubQFjzx8pdZ898VJkjrgkjh8kr5GziaKZSZVssIri5MC3cwi49iARCXBtHVsug872EZgY0QsQzyfNV58vNCqzQtMuObaMPx1xEAp2F93jqjREsnOCMxXUI8bLCl9OpV02wi8KG5fD878gR72bDO0SPXe6ZTpT0ftLjPIAf4vHIJLBzuu5jYBQVYcP31lmFCDSDnx"

The strings are identical except for the character at position 22, where the returned value was '(' instead of '5'.
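For context, the stress suite detects this by comparing the echoed content character by character against what was sent and reporting the first diverging index. A minimal sketch of that kind of check (a hypothetical helper, not the actual HttpStress ValidateContent code) looks like this:

```csharp
using System;

static class ContentValidation
{
    // Returns the first index at which the two strings differ, or -1 if they match.
    // A length mismatch (a missing character) is reported at the index where the
    // shorter string ends.
    public static int FirstDivergence(string expected, string actual)
    {
        int min = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < min; i++)
        {
            if (expected[i] != actual[i])
                return i;
        }
        return expected.Length == actual.Length ? -1 : min;
    }

    public static void Validate(string expected, string actual)
    {
        int index = FirstDivergence(expected, actual);
        if (index >= 0)
        {
            throw new Exception($"Content mismatch. Diverging at index {index}.");
        }
    }
}
```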

The issue overwhelmingly impacted the POST Duplex Slow operation, which echoes content back, flushing characters one by one. However, today I recorded a single failure impacting the GET Headers operation, which echoes randomized headers:

Unexpected values for header Header-51. Expected
"%2c%3d%7e%25%10%10PuF%07%3cXO)%3cz%16%40fQ5%0dD(%22", "%5b%3aQx", "%185%1f%1d%3e%17", "rRV+" 
but got 
"%2c%3d%7e%25%10%10PuF%07%3cXO)%3cz%16%40fQ5%0dD(%22", "%5b%3aQx", "%185%1f%1d%3e%|7", "rRV+"

In this case the second-to-last character of the third value has been corrupted. If this is caused by the same bug, it suggests that the bug is not tied to DATA frame granularity, since header values are carried in HEADERS frames.
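For reference, the POST Duplex Slow operation mentioned above echoes the request content back one character at a time, flushing after each write so each character can go out as a tiny DATA frame. A minimal sketch of an endpoint with that shape (illustrative only, not the actual HttpStress server code) would look roughly like this:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

static class SlowEcho
{
    // Illustrative only: echoes the request body back one byte at a time,
    // flushing after each byte so the response is emitted as many small frames.
    public static async Task EchoSlowAsync(HttpContext context)
    {
        var buffer = new byte[1];
        while (await context.Request.Body.ReadAsync(buffer, 0, 1) == 1)
        {
            await context.Response.Body.WriteAsync(buffer, 0, 1);
            await context.Response.Body.FlushAsync();
        }
    }
}
```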

Here's a list of all corrupted characters, in case somebody can deduce a pattern:

Index                Expected   Actual
22                   5          (
231                  j          b
0                    3          l
243                  o          i
408                  q          U+263c
70                   H          (missing)
327                  6          (missing)
N/A (header value)   1          |

More details

  • The issue manifests itself without any cancellation happening.
  • Only reproduced when targeting Kestrel; no occurrences when using http.sys.
  • Data corruption impacts both request bytes and response bytes.
  • The issue manifests itself in POST duplex, GET, and PUT operations.

cc @geoffkizer

@eiriktsarpalis eiriktsarpalis self-assigned this Aug 9, 2019
@geoffkizer
Contributor

Given that this doesn't happen with http.sys, what leads us to believe this is a client issue?

@stephentoub
Member

Can we enable WinHttpHandler in the stress test and see if it repros against Kestrel?

@eiriktsarpalis
Member Author

Given that this doesn't happen with http.sys, what leads us to believe this is a client issue?

We don't; I just added the issue here for visibility until we can validate that it's not the client.

@jkotalik
Contributor

Can you gather more information about whether the corruption is occurring when Kestrel reads the request or when it writes the response? We may also need some Wireshark traces for this one to confirm what is going on.

@eiriktsarpalis
Member Author

Can you gather more information about whether the corruption is occurring when Kestrel reads the request or when it writes the response? We may also need some Wireshark traces for this one to confirm what is going on.

Check dotnet/corefx#40360. Seems that it happens both ways.

@jkotalik
Contributor

So we found one instance of data corruption and it's definitely occurring on the server (Kestrel). Our current working hypothesis is that there is an off-by-one error when leasing memory out in our ConcurrentPipeWriter. What we see happening is that a request with a large response is getting its first byte corrupted with a value that was previously written. There is a test running which sends 1-byte DATA frames from the client to the server and vice versa. Because these tests run in parallel, we believe one byte is corrupting the memory of the other response before we write it to the StreamPipeWriter.

I'm bashing my head against this code trying to figure out how this can happen. I may try writing a test that has two streams writing to the response at the same time, making sure their responses are in order. However, considering this repros only once every 3 hours when running the stress suite, I doubt that will really help. My current plan is to custom-build Kestrel and add more logging to the ConcurrentPipeWriter.
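To illustrate the class of bug the hypothesis describes (this is not the actual ConcurrentPipeWriter code), advancing a leased PipeWriter buffer by one byte more than was written would publish whatever stale byte the pooled memory happened to contain from a previous write:

```csharp
using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class OffByOneIllustration
{
    // Illustrative only: copies 'payload' into the leased buffer but advances by
    // one byte too many. The extra byte is whatever was left behind in the pooled
    // memory by an earlier write, which matches the "corrupted with a value that
    // was previously written" symptom described above.
    public static async Task WriteWithOffByOneAsync(PipeWriter writer, byte[] payload)
    {
        Memory<byte> memory = writer.GetMemory(payload.Length);
        payload.CopyTo(memory);

        writer.Advance(payload.Length + 1); // BUG: should be writer.Advance(payload.Length)
        await writer.FlushAsync();
    }
}
```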
cc @halter73 @Tratcher

@eiriktsarpalis
Member Author

@jkotalik @geoffkizer Given we have sufficient evidence, would it make sense if we moved this issue to aspnet?

@jkotalik
Contributor

I don't think we should move it; we were still seeing checksum issues when sending data from the client to the server. I'll file an issue in the AspNetCore repo.

@eiriktsarpalis
Member Author

Some updates regarding the issue: I was able to capture frames for a number of data corruptions that happen when sending data from the client to the server. In all cases the data sent over the wire is correct, indicating that Kestrel is corrupting data on read as well.

For instance, I reproduced the following error on OSX:

Error from iteration 3834619 (10: POST ExpectContinue) in task 1 with 15337470 successes / 5 fails:
System.Exception: Content mismatch. Expected:
"i8KVXTATAya0lqaHDqAyFCPfbbDWldIDOiQdF0220PBykNQaCd96uehVjftwGdiWgMb2Px2Do8yp0QuDAcK7UCJk0vta65lMheIOmNAFbwIlgsUIkgmdJa2T3r2U4S7iNc8M8WRmR4WXuY47QapQJ726boLuLgB8paO3k8AEEzCF4UZUnu4TdgOFa1kKBWRltNiKA0NHQbCKvn16Eq7QchV8B8TiOp3hYRLgmFKmXm31U4Jmzd5KP18Csa81XetaYUmURLKzVLtqdTd7pgikkiIXI9oB6JFssTT0IpKeI5p38IYOw0Cqv9ppCFBIX8sC50xAzdpzL4RetEP0bPF67Pwiden2e27ACcy7KNPIBCUriAwxqAg7tm5g3FXLpmLJxHSGGHgE0wuuu40ShqCTLzGQXH35p3SOrBuRighYUS7wrAlyE4dzZrxx01iyDbOwUW2jAZk18eytsKYMV4qlPSp0ExQgyiKt4ASL82ofKcqvs44gTLS8kcX1h2UD933QJ5oisCESGBRmct7AskDcGB8RcQe6XbWGZQbnFMoH6zsZBkPnuKKOqQA7wN5RX8uefACBHKuVvwVGJZGs1rBDG7Q9GUqrHMjy4Et1YDVaPpSuutoM7DQgxhNYj3unU2DZCXB4gEJnLbnReWFMfTk4w21NqLHUIh5MbthMijgstcKLuvtDoY80I9cAWoEKxpUeVQqkosevmsBJhfSuR3gkZ54RHXqtVMu48gpT58OvtCEPyISrLJst8j4QLOTH2lQQx6B1pzpIelrNqeTy4yXBT1wPbP7gf7t7moSr7Q1MUZEMrhZhvUC3DEwwAmPQr8dXrja2CbrdxAhtmZXtpQlBuqtXrWgK0bVMwhV86XUJcY53X29bFWaQMRd1p3AguXdMeKx53PmIvSYLkq0pM55U05zU3TI3MoO5vjOC54XqjHUutHHz2LJpR8LZLS1a"
Actual:
"L8KVXTATAya0lqaHDqAyFCPfbbDWldIDOiQdF0220PBykNQaCd96uehVjftwGdiWgMb2Px2Do8yp0QuDAcK7UCJk0vta65lMheIOmNAFbwIlgsUIkgmdJa2T3r2U4S7iNc8M8WRmR4WXuY47QapQJ726boLuLgB8paO3k8AEEzCF4UZUnu4TdgOFa1kKBWRltNiKA0NHQbCKvn16Eq7QchV8B8TiOp3hYRLgmFKmXm31U4Jmzd5KP18Csa81XetaYUmURLKzVLtqdTd7pgikkiIXI9oB6JFssTT0IpKeI5p38IYOw0Cqv9ppCFBIX8sC50xAzdpzL4RetEP0bPF67Pwiden2e27ACcy7KNPIBCUriAwxqAg7tm5g3FXLpmLJxHSGGHgE0wuuu40ShqCTLzGQXH35p3SOrBuRighYUS7wrAlyE4dzZrxx01iyDbOwUW2jAZk18eytsKYMV4qlPSp0ExQgyiKt4ASL82ofKcqvs44gTLS8kcX1h2UD933QJ5oisCESGBRmct7AskDcGB8RcQe6XbWGZQbnFMoH6zsZBkPnuKKOqQA7wN5RX8uefACBHKuVvwVGJZGs1rBDG7Q9GUqrHMjy4Et1YDVaPpSuutoM7DQgxhNYj3unU2DZCXB4gEJnLbnReWFMfTk4w21NqLHUIh5MbthMijgstcKLuvtDoY80I9cAWoEKxpUeVQqkosevmsBJhfSuR3gkZ54RHXqtVMu48gpT58OvtCEPyISrLJst8j4QLOTH2lQQx6B1pzpIelrNqeTy4yXBT1wPbP7gf7t7moSr7Q1MUZEMrhZhvUC3DEwwAmPQr8dXrja2CbrdxAhtmZXtpQlBuqtXrWgK0bVMwhV86XUJcY53X29bFWaQMRd1p3AguXdMeKx53PmIvSYLkq0pM55U05zU3TI3MoO5vjOC54XqjHUutHHz2LJpR8LZLS1a"
 Diverging at index 0. server checksum mismatch
   at HttpStress.ClientOperations.ValidateContent(String expectedContent, String actualContent, String details) in /Users/admin/eirik-tests/corefx/src/System.Net.Http/tests/StressTests/HttpStress/ClientOperations.cs:line 499
   at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_10>d.MoveNext() in /Users/admin/eirik-tests/corefx/src/System.Net.Http/tests/StressTests/HttpStress/ClientOperations.cs:line 419
--- End of stack trace from previous location where exception was thrown ---
   at HttpStress.StressClient.<>c__DisplayClass12_0.<<StartCore>g__RunWorker|1>d.MoveNext() in /Users/admin/eirik-tests/corefx/src/System.Net.Http/tests/StressTests/HttpStress/StressClient.cs:line 129

Client side data corruption error

Here is the trace for that error: httpstress.zip (stream identifier: 22550965)

@davidfowl
Member

indicating that kestrel is corrupting data on read as well.

Read of what? Do you have data indicating that the data coming out of the request body is invalid? This checksum is on the client side.

@davidsh
Contributor

davidsh commented Sep 3, 2019

Read of what? Do you have data indicating that the data coming out of the request body is invalid? This checksum is on the client side.

The checksum is calculated on the server side and sent as headers in the HTTP response. The client reads that response body and its corresponding checksum (in the response headers). The checksum received shows that the data received from the server is corrupted. So, it's a server-side problem.
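A minimal sketch of that validation scheme (the header name and hash below are hypothetical stand-ins; the actual stress suite uses its own checksum and header): the client hashes the content it sent and compares it against the checksum the server reports back in a response header.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

static class ChecksumValidation
{
    // Illustrative only: "X-Content-Checksum" is a hypothetical header name, and
    // SHA-256 stands in for whatever checksum the server actually computes over
    // the request body it received.
    public static async Task ValidateEchoAsync(HttpClient client, string uri, string content)
    {
        using HttpResponseMessage response = await client.PostAsync(uri, new StringContent(content));

        using var sha = SHA256.Create();
        string expected = BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(content)));

        string reported = response.Headers.TryGetValues("X-Content-Checksum", out var values)
            ? values.FirstOrDefault()
            : null;

        // A mismatch means the body the server received differs from what the client sent.
        if (!string.Equals(expected, reported, StringComparison.OrdinalIgnoreCase))
        {
            throw new Exception($"server checksum mismatch: expected {expected}, got {reported}");
        }
    }
}
```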

@davidfowl
Member

I didn’t say it wasn’t a server issue. I’m specifically asking if the server received the wrong data in the request body. We’re aware of the response body corruption.

@eiriktsarpalis
Member Author

I’m specifically asking if the server received the wrong data in the request body.

In this case we're looking at request body corruption; however, the data as intercepted over the wire looks fine.

@jkotalik
Contributor

jkotalik commented Sep 3, 2019

In this case we're looking at request body corruption; however, the data as intercepted over the wire looks fine.

From our investigation, the issue looks like it is either in Pipelines or in interactions with the MemoryPool. I'm not surprised the corruption happens when reading the request as well.
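As an illustration of the MemoryPool interaction being suspected (not Kestrel's actual code), returning a pooled buffer while it is still referenced lets a later lease overwrite bytes that the earlier reader still observes:

```csharp
using System;
using System.Buffers;

static class PoolReuseIllustration
{
    // Illustrative only: if a pooled buffer is returned while a reader still holds
    // a reference to it, the next lease may hand out the same memory, and the
    // reader can observe bytes written for an unrelated stream.
    public static void Demonstrate()
    {
        MemoryPool<byte> pool = MemoryPool<byte>.Shared;

        IMemoryOwner<byte> first = pool.Rent(4096);
        Memory<byte> stillReferenced = first.Memory;
        stillReferenced.Span[0] = (byte)'5';

        first.Dispose(); // BUG: returned to the pool while 'stillReferenced' is still in use

        using IMemoryOwner<byte> second = pool.Rent(4096);
        second.Memory.Span[0] = (byte)'('; // may reuse the same underlying array

        // Depending on pool reuse, this can now print '(' instead of '5'.
        Console.WriteLine((char)stillReferenced.Span[0]);
    }
}
```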

@jkotalik
Contributor

jkotalik commented Sep 5, 2019

I think we can close this now. The issue is purely on the server in aspnetcore.

@jkotalik jkotalik closed this as completed Sep 5, 2019
@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020