test: fix some flakes that might relate to #5104. #5172
Conversation
/retest
🔨 rebuilding
This is good.
Thanks for deflaking.
test/integration/http_integration.cc (outdated diff)
@@ -1428,10 +1429,6 @@ void HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests() {
  auto response = codec_client_->makeRequestWithBody(default_request_headers_, 1024);
  waitForNextUpstreamRequest();

  // Request 2.
HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was previously possible for both requests to be load balanced onto the first upstream connection. When the 200 for the first request was received, we would then disconnect the upstream, and the second request would come back as a 503. Not sure if we've lost some coverage with these changes; the alternative might be to configure retries?
Hmm yeah, TBH I don't remember the original intention of this test. If I had to guess, it's to make sure that the disconnect flow works correctly when we have 2 in-flight connections, so this is changing things a bit. Don't we only have a single upstream and a single worker? What's the failing interleaving here?
I think this is the sequence. Consider this for HTTP/2, where we have a single connection:
- Client sends requests (1) and (2) to Envoy.
- Envoy sends requests (1) and (2) to upstream.
- Upstream waits for a request, 200s request (1) and then disconnects. Request (2) might be queued up in the upstream.
- Envoy sends back to client the 200 for request (1). However, it sees the disconnect and 503s request (2).
This breaks the test. I think the usual story is that the upstream disconnects in step 3 before Envoy has managed to send request (2) to the upstream. While we only have one worker and one upstream thread, these are both independent; new streams are autonomously accepted by the FakeUpstream, without any interlock with the test/client thread.
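To make that interleaving concrete, here's a rough sketch in terms of the helpers visible in the hunk above; the upstream-side calls (encodeHeaders, close) and default_response_headers_ are assumptions about the fake-upstream API, not the literal test body:

```cpp
// Sketch of the failing HTTP/2 interleaving; upstream-side helper names assumed.
auto response1 = codec_client_->makeRequestWithBody(default_request_headers_, 1024); // request (1)
auto response2 = codec_client_->makeRequestWithBody(default_request_headers_, 512);  // request (2)
waitForNextUpstreamRequest(); // request (1) reaches the single upstream connection

// Upstream 200s request (1) and drops the connection. If request (2) was
// already multiplexed onto this same connection, Envoy answers it locally
// with a 503 instead of the 200 the test expects.
upstream_request_->encodeHeaders(default_response_headers_, true);
fake_upstream_connection_->close();

response1->waitForEndStream(); // 200
response2->waitForEndStream(); // flakes: sometimes 200 (re-dispatched), sometimes 503
```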
OK yeah that makes sense for HTTP/2. Is this only flaking for HTTP/2?
TBH I really don't recall all the history here. @alyssawilk do you? In a perfect world, I think we would actually explicitly test both scenarios (both the 200 happy case and the 503 unhappy case), but that might require splitting on HTTP/1 and HTTP/2 and probably using some stats-based waiting to synchronize correctly. @alyssawilk thoughts on what to do here?
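If we do go that route, the stats waiting would presumably look something like the test server helpers below; the stat names are illustrative of the approach rather than taken from the existing test:

```cpp
// Block until one request is active on the upstream and a second is pending in
// the pool, before the upstream responds/disconnects (stat names illustrative).
test_server_->waitForGaugeEq("cluster.cluster_0.upstream_rq_active", 1);
test_server_->waitForGaugeEq("cluster.cluster_0.upstream_rq_pending_active", 1);
```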
Looks like this test was added by @lizan in #2871 to regression test #2715, so it was to handle a specific case with HTTP/1.1 connection pooling. I think this change will result in that no longer being regression tested, so I'd be inclined to split out HTTP/1 and HTTP/2 if I hadn't half-convinced myself there's also a (less likely) race on the H1 path. I'll let Lizan weigh in.
Yeah, I think there is still a window in (2) for this to happen with keep-alive and HTTP/1.1: the upstream 200s, Envoy asynchronously forwards the next request to the upstream and it's in flight, then the disconnect happens.
I think we should resolve this one quickly; it's behind a lot of the test failures we're seeing on CI, AFAICT.
Doesn't that then defeat the purpose of the test, if it has to do with connection pooling and HTTP/1.1?
I'm not sure of the original intention of the test. @lizan?
I just spent a little more time looking at the original change and the test, and I think I understand the intention. If I understand correctly, the test could be done as follows (rough sketch after the list):
- Do test only for HTTP/1.1
- Set max upstream connections to 1
- Issue 2 requests on 2 connections
- Use stats to make sure 1 request is pending and 1 is active
- Respond and disconnect
- Make sure 2nd request goes through.
If this is too complicated to fix quickly and you want to deflake, it's probably fine to comment out this test and fix in a follow-up? Or maybe @lizan can fix?
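For concreteness, a hedged sketch of what that rewrite might look like; the circuit-breaker config modifier, the second codec client, and the response/upstream helpers are assumptions about how it would be wired up, not an exact implementation:

```cpp
// Sketch of the proposed HTTP/1.1-only test; helper and proto field names assumed.
config_helper_.addConfigModifier([](envoy::config::bootstrap::v2::Bootstrap& bootstrap) {
  // Cap the upstream cluster at a single connection so the second request must queue.
  auto* cluster = bootstrap.mutable_static_resources()->mutable_clusters(0);
  cluster->mutable_circuit_breakers()->add_thresholds()->mutable_max_connections()->set_value(1);
});
initialize();

// Two downstream HTTP/1.1 connections, one request each.
codec_client_ = makeHttpConnection(lookupPort("http"));
IntegrationCodecClientPtr codec_client2 = makeHttpConnection(lookupPort("http"));
auto response1 = codec_client_->makeRequestWithBody(default_request_headers_, 1024);
auto response2 = codec_client2->makeRequestWithBody(default_request_headers_, 512);

// With max_connections == 1, request 1 is active and request 2 is pending in
// the pool; verify with the stats waits sketched earlier before proceeding.
waitForNextUpstreamRequest();

// Respond to request 1 and disconnect. The pool should establish a fresh
// upstream connection and request 2 should still get a 200.
upstream_request_->encodeHeaders(default_response_headers_, true);
fake_upstream_connection_->close();
// (A real test would wait for the new upstream connection here; elided in this sketch.)
waitForNextUpstreamRequest();
upstream_request_->encodeHeaders(default_response_headers_, true);

response1->waitForEndStream();
response2->waitForEndStream();
EXPECT_TRUE(response1->complete());
EXPECT_TRUE(response2->complete());
```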
+1 - I'm fine with DISABLED_ and adding a TODO for later.
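For reference, disabling it is just gtest's DISABLED_ prefix plus a TODO; the fixture and wrapper names below are placeholders for whatever the real test is called:

```cpp
// gtest skips tests whose names start with DISABLED_ but still reports them as
// disabled, so the TODO stays visible. Fixture/test names are placeholders.
// TODO: re-enable once the test is split per protocol and synchronized via stats.
TEST_P(HttpIntegrationTest, DISABLED_UpstreamDisconnectWithTwoRequests) {
  testUpstreamDisconnectWithTwoRequests();
}
```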
I will fix this today; I'm in an all-day meeting, so it might be a bit delayed.
@venilnoronha FYI, it's probably interesting to leave this PR without retests, since it's intended to fix flakes; if they are still present, then that's an issue. Looks like the build issue was only with docs, which is surprising; I haven't seen that before.
@htuch is this the only possible flake, or are there other identified flakes too?
Forgot to hit "submit review". Important button, that :-(
Signed-off-by: Harvey Tuch <htuch@google.com>
Force-pushed from c78700e to 57aa489.
DCO is stuck due to the force push; I'm going to kill this PR and start another.
Need to be prepared to handle disconnection when resetting the server in
HttpIntegrationTest::testRouterHeaderOnlyRequestAndResponse.
HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was previously possible for both requests to be load balanced onto the first upstream connection. When the 200 for the first request was received, we would then disconnect the upstream, and the second request would come back as a 503. Not sure if we've lost some coverage with these changes; the alternative might be to configure retries?
Risk Level: Low
Testing: 1k runs on internal build farm, flakes disappeared.
Signed-off-by: Harvey Tuch <htuch@google.com>