
test: fix some flakes that might relate to #5104. #5172

Closed

Conversation

@htuch (Member) commented Nov 30, 2018

  • Need to be prepared to handle disconnection when resetting the server in
    HttpIntegrationTest::testRouterHeaderOnlyRequestAndResponse.

  • HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was previously possible
    for both requests to be load balanced and sent to the first upstream connection. When the 200
    for the first request was received, we would then disconnect the upstream, which would then 503
    the second request back to the client. Not sure if we've lost some coverage with these changes;
    the alternative might be to configure retries?

Risk Level: Low
Testing: 1k runs on internal build farm, flakes disappeared.

Signed-off-by: Harvey Tuch <htuch@google.com>

@venilnoronha (Member)

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: docs (failed build)

🐱

Caused by: a #5172 (comment) was created by @venilnoronha.


venilnoronha previously approved these changes Dec 1, 2018

@venilnoronha (Member) left a comment

This is good.

@mattklein123 mattklein123 self-assigned this Dec 1, 2018
@mattklein123 (Member) left a comment

Thanks for deflaking.

@@ -1428,10 +1429,6 @@ void HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests() {
auto response = codec_client_->makeRequestWithBody(default_request_headers_, 1024);
waitForNextUpstreamRequest();

// Request 2.
Member (quoting the PR description):

> HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was possible before for
> both requests to be load balanced and sent to the first upstream connection. When the 200
> for the first request is received, we would then disconnect the upstream. This would then 503
> back. Not sure if we've lost some coverage by these changes; the alternative might be to configure
> retries?

Hmm yeah, TBH I don't remember the original intention of this test. If I had to guess it's to make sure that the disconnect flow works correctly when we have 2 in flight connections, so this is changing things a bit. Don't we only have a single upstream and a single worker? What's the failing interleaving here?

@htuch (Member, Author) replied Dec 2, 2018:

I think this is the sequence. Consider this for HTTP/2, where we have a single connection:

  1. Client sends request (1) and (2) to Envoy.
  2. Envoy sends requests (1) and (2) to upstream.
  3. Upstream waits for a request, 200s request (1) and then disconnects. Request (2) might be queued up in the upstream.
  4. Envoy sends back to client the 200 for request (1). However, it sees the disconnect and 503s request (2).

This breaks the test. I think the usual story is that the upstream disconnects in step 3 before Envoy has managed to send request (2) to the upstream. While we only have one worker and one upstream thread, these are both independent. New streams are autonomously accepted by the FakeUpstream, without any interlock with the test/client thread.

Member:

OK yeah that makes sense for HTTP/2. Is this only flaking for HTTP/2?

TBH I really don't recall all the history here. @alyssawilk do you? In a perfect world, I think we would actually explicitly test both scenarios (both the 200 happy case and the 503 unhappy case), but that might require splitting on HTTP/1 and HTTP/2 and probably using some stats waiting to correctly synchronize. @alyssawilk thoughts on what to do here?

Contributor:

Looks like this test was added by @lizan in #2871 to regression test #2715, so it was to handle a specific case with HTTP/1.1 connection pooling. I think this change will result in that no longer being regression tested and so I'd be inclined to split out HTTP/1 and HTTP/2 if I hadn't half-convinced myself there's also a (less likely) race on the H1 path. I'll let Lizan weigh in.

Member Author:

Yeah, I think there is still a window in (2) for this to happen with keep-alive and HTTP/1.1. Upstream 200s, Envoy asynchronously forwards the next request to the upstream and it's in-flight, then the disconnect happens.

I think we should resolve this one quickly; this is behind a lot of the test failures we're seeing on CI AFAICT.

Member Author:

Doesn't that then defeat the purpose of the test, if it has to do with connection pooling and HTTP/1.1?

Member:

I'm not sure of the original intention of the test. @lizan?

Member:

I just spent a little more time looking at the original change and the test and I think I understand the intention. If I understand correctly, I think the test could be done as follows:

  1. Do test only for HTTP/1.1
  2. Set max upstream connections to 1
  3. Issue 2 requests on 2 connections
  4. Use stats to make sure 1 request is pending and 1 is active
  5. Respond and disconnect
  6. Make sure 2nd request goes through.

If this is too complicated to fix quickly and you want to deflake, it's probably fine to comment out this test and fix in a follow up? Or maybe @lizan can fix?

Contributor:

+1 - I'm fine with DISABLING_ and adding a TODO for later

Member Author:

I will fix this today; I'm in an all-day meeting, so it might be a bit delayed.

@htuch (Member, Author) commented Dec 2, 2018

@venilnoronha FYI, it's probably worth leaving this PR without retests, since it's intended to fix flakes; if they are still present, then that's an issue. Looks like the build issue was only with docs, which is surprising; I haven't seen that before.

@venilnoronha (Member)

@htuch is this the only possible flake, or are there other identified flakes too?

@alyssawilk (Contributor) left a comment

Submit review. Important button, that :-(


Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch (Member, Author) commented Dec 4, 2018

DCO is stuck due to a forced push; I'm going to kill this PR and start another.

4 participants