HTTP/2: Rare data corruption when stress testing against Kestrel #30550
Comments
Given that this doesn't happen with http.sys, what leads us to believe this is a client issue?
Can we enable WinHttpHandler in the stress test and see if it repros against Kestrel?
We don't; we just added the issue here for visibility until we can validate it's not the client.
Can you gather more information about whether the corruption is occurring when Kestrel reads the request or when it writes the response? We may need some Wireshark traces for this one as well to confirm what is going on.
Check dotnet/corefx#40360. It seems that it happens both ways.
So we found one instance of data corruption and it's definitely occurring on the server (Kestrel). Our current working hypothesis is that there is an off-by-one error when leasing memory out in our ConcurrentPipeWriter. What we see happening is that a request with a large response is getting its first byte corrupted with a value that was previously written. There is a test running that sends 1-byte data frames from the client to the server and vice versa. Because these tests run in parallel, we believe one byte is corrupting the memory of the other response before we write it to the StreamPipeWriter. I'm bashing my head against this code trying to figure out how this can happen. I may try writing a test that has two streams writing to the response at the same time, making sure their responses are in order. However, considering this repros once every 3 hours when running a stress suite, I doubt that will really help. My current plan is to custom-build Kestrel and add more logging to the ConcurrentPipeWriter.
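A minimal sketch of the kind of concurrent two-stream test described above, assuming a local Kestrel server with a hypothetical `/echo` endpoint that writes the request body back to the response; the URL, payload sizes, and endpoint name are placeholders, not the actual stress-suite code:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ConcurrentStreamEchoTest
{
    static async Task Main()
    {
        // Request HTTP/2 so both requests can multiplex as streams on one connection.
        using var client = new HttpClient { DefaultRequestVersion = HttpVersion.Version20 };

        // Two distinct payloads so corruption leaking across streams is detectable.
        string payloadA = new string('a', 64 * 1024);
        string payloadB = new string('b', 64 * 1024);

        // Issue both requests concurrently so their response writes interleave on the server.
        Task<string> taskA = EchoAsync(client, payloadA);
        Task<string> taskB = EchoAsync(client, payloadB);
        string[] results = await Task.WhenAll(taskA, taskB);

        // Any single-byte difference is the kind of corruption under investigation.
        Console.WriteLine(results[0] == payloadA && results[1] == payloadB
            ? "OK"
            : "Corruption detected");
    }

    static async Task<string> EchoAsync(HttpClient client, string payload)
    {
        // "/echo" is a placeholder route for a server that echoes the request body.
        using var response = await client.PostAsync(
            "https://localhost:5001/echo",
            new StringContent(payload, Encoding.UTF8));
        return await response.Content.ReadAsStringAsync();
    }
}
```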
@jkotalik @geoffkizer Given that we have sufficient evidence, would it make sense to move this issue to aspnet?
I don't think we should move it; we were still seeing checksum issues when sending data from the client to the server. I'll file an issue in the AspNetCore repo.
Some updates regarding the issue. I was able to capture frames for a number of data corruptions that happen when sending data from the client to the server: in all cases the data sent over the wire is correct, indicating that Kestrel is corrupting data on read as well. For instance, I reproduced the following error on OSX:
Here is the trace for that error: httpstress.zip (stream identifier: 22550965)
Read of what? Do you have data indicating that the data coming out of the request body is invalid? This checksum is on the client side.
The checksum is calculated on the server side and sent in the HTTP response headers. The client reads the response body and its corresponding checksum (from the response headers). The checksum received shows that the data received from the server is corrupted. So, it's a server-side problem.
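For reference, that verification pattern might look roughly like the sketch below; the `X-Content-Checksum` header name and the SHA-256 algorithm are assumptions for illustration, not necessarily what the stress suite actually uses:

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class ChecksumVerification
{
    // Recompute a checksum over the body as received by the client and compare it
    // to the value the server placed in a response header.
    public static async Task<bool> VerifyAsync(HttpResponseMessage response)
    {
        // "X-Content-Checksum" is a placeholder header name.
        if (!response.Headers.TryGetValues("X-Content-Checksum", out var values))
            throw new InvalidOperationException("Checksum header missing");
        string expected = string.Concat(values);

        byte[] body = await response.Content.ReadAsByteArrayAsync();
        using var sha = SHA256.Create();
        string actual = BitConverter.ToString(sha.ComputeHash(body)).Replace("-", "");

        // A mismatch means the bytes the client received differ from what the server hashed.
        return string.Equals(expected, actual, StringComparison.OrdinalIgnoreCase);
    }
}
```

Because the server computes the checksum before sending, a mismatch localizes the corruption to somewhere between the server's hashing step and the client's read of the body.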
I didn't say it wasn't a server issue. I'm specifically asking whether the server received the wrong data in the request body. We're aware of the response body corruption.
In this case we're looking at request body corruption; however, the data as intercepted over the wire looks fine.
From our investigation, the issue appears to lie either in Pipelines or in its interaction with the MemoryPool. I'm not surprised the corruption happens when reading the request as well.
I think we can close this now. The issue is purely on the server in aspnetcore.
I've discovered a data corruption issue when running the HttpStress suite over long stretches of time. The occurrences are very rare (I've recorded 8 instances across stress runs adding up to over 60M requests).
The issue occurs in requests where the server echoes back random data sent by the client, either headers or content. The final response always differs from the expected one by a single character: in most cases it has been corrupted to a different value, but I've also recorded a couple of instances where it's missing altogether (so the content length differs from the expected length).
Examples
For example, one operation was expecting a particular string but got back a near-identical one: the two strings are identical with the exception of the character at position 22, where the returned value was `(` instead of `5`.
The issue overwhelmingly impacted the `POST Duplex Slow` operation, which echoes content, flushing characters one by one. However, today I recorded a single failure impacting the `GET Headers` operation, which echoes randomized headers: in this case the second-to-last character in the third value was corrupted. If it's caused by the same bug, this suggests that it might not be triggered by DATA frame granularity.
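The `POST Duplex Slow` behavior described above (echoing content while flushing character by character) can be approximated with a sketch like this, assuming an ASP.NET Core minimal-API style endpoint; the route and hosting model are placeholders rather than the stress server's actual implementation:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.CreateBuilder(args).Build();

app.MapPost("/duplexSlow", async (HttpContext context) =>
{
    var buffer = new byte[1];
    int read;
    // Echo the request body back one byte at a time, flushing after each byte so
    // each byte tends to go out in its own HTTP/2 DATA frame.
    while ((read = await context.Request.Body.ReadAsync(buffer, 0, 1)) > 0)
    {
        await context.Response.Body.WriteAsync(buffer, 0, read);
        await context.Response.Body.FlushAsync();
    }
});

app.Run();
```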
Here's a list of all the corrupted characters, in case somebody can deduce a pattern:
5
(
j
b
3
l
o
i
q
U+263c
H
6
1
More details
cc @geoffkizer