test: fix some flakes that might relate to #5104. #5172
Conversation
/retest
🔨 rebuilding
This is good.
Thanks for deflaking.
test/integration/http_integration.cc (outdated diff)
@@ -1428,10 +1429,6 @@ void HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests() {
  auto response = codec_client_->makeRequestWithBody(default_request_headers_, 1024);
  waitForNextUpstreamRequest();

  // Request 2.
HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was previously possible for both requests to be load balanced onto the first upstream connection. When the 200 for the first request was received, we would then disconnect the upstream, and the second request would come back as a 503. Not sure if we've lost some coverage with these changes; the alternative might be to configure retries?
Hmm yeah, TBH I don't remember the original intention of this test. If I had to guess, it's to make sure that the disconnect flow works correctly when we have 2 in-flight connections, so this is changing things a bit. Don't we only have a single upstream and a single worker? What's the failing interleaving here?
I think this is the sequence. Consider this for HTTP/2, where we have a single connection:
- Client sends requests (1) and (2) to Envoy.
- Envoy sends requests (1) and (2) to upstream.
- Upstream waits for a request, 200s request (1) and then disconnects. Request (2) might be queued up in the upstream.
- Envoy sends back to client the 200 for request (1). However, it sees the disconnect and 503s request (2).
This breaks the test. I think the usual story is that the upstream disconnects in step 3 before Envoy has managed to send request (2) to the upstream. While we only have one worker and one upstream thread, these are both independent; new streams are autonomously accepted by the FakeUpstream, without any interlock with the test/client thread.
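To make that interleaving concrete, here's a rough sketch in terms of the helpers visible in the hunk above; the upstream-side calls (encodeHeaders, close) and default_response_headers_ are assumptions about the fake-upstream API, not the literal test body:

```cpp
// Sketch of the failing HTTP/2 interleaving; upstream-side helper names assumed.
auto response1 = codec_client_->makeRequestWithBody(default_request_headers_, 1024); // request (1)
auto response2 = codec_client_->makeRequestWithBody(default_request_headers_, 512);  // request (2)
waitForNextUpstreamRequest(); // request (1) reaches the single upstream connection

// Upstream 200s request (1) and drops the connection. If request (2) was
// already multiplexed onto this same connection, Envoy answers it locally
// with a 503 instead of the 200 the test expects.
upstream_request_->encodeHeaders(default_response_headers_, true);
fake_upstream_connection_->close();

response1->waitForEndStream(); // 200
response2->waitForEndStream(); // flakes: sometimes 200 (re-dispatched), sometimes 503
```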
OK yeah that makes sense for HTTP/2. Is this only flaking for HTTP/2?
TBH I really don't recall all the history here. @alyssawilk do you? In a perfect world, I think we would actually explicitly test both scenarios (both the 200 happy case and the 503 unhappy case), but that might require splitting on HTTP/1 and HTTP/2 and probably using some stats-based waiting to synchronize correctly. @alyssawilk thoughts on what to do here?
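If we do go that route, the stats waiting would presumably look something like the test server helpers below; the stat names are illustrative of the approach rather than taken from the existing test:

```cpp
// Block until one request is active on the upstream and a second is pending in
// the pool, before the upstream responds/disconnects (stat names illustrative).
test_server_->waitForGaugeEq("cluster.cluster_0.upstream_rq_active", 1);
test_server_->waitForGaugeEq("cluster.cluster_0.upstream_rq_pending_active", 1);
```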
Looks like this test was added by @lizan in #2871 to regression test #2715, so it was to handle a specific case with HTTP/1.1 connection pooling. I think this change will result in that no longer being regression tested, so I'd be inclined to split out HTTP/1 and HTTP/2 if I hadn't half-convinced myself there's also a (less likely) race on the H1 path. I'll let Lizan weigh in.
Yeah, I think there is still a window in (2) for this to happen with keep-alive and HTTP/1.1: the upstream 200s, Envoy asynchronously forwards the next request to the upstream and it's in flight, then the disconnect happens.
I think we should resolve this one quickly; it's behind a lot of the test failures we're seeing on CI, AFAICT.
Doesn't that then defeat the purpose of the test, if it has to do with connection pooling and HTTP/1.1?
I'm not sure of the original intention of the test. @lizan?
I just spent a little more time looking at the original change and the test, and I think I understand the intention. If I understand correctly, the test could be done as follows (rough sketch after the list):
- Do test only for HTTP/1.1
- Set max upstream connections to 1
- Issue 2 requests on 2 connections
- Use stats to make sure 1 request is pending and 1 is active
- Respond and disconnect
- Make sure 2nd request goes through.
If this is too complicated to fix quickly and you want to deflake, it's probably fine to comment out this test and fix in a follow-up? Or maybe @lizan can fix?
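For concreteness, a hedged sketch of what that rewrite might look like; the circuit-breaker config modifier, the second codec client, and the response/upstream helpers are assumptions about how it would be wired up, not an exact implementation:

```cpp
// Sketch of the proposed HTTP/1.1-only test; helper and proto field names assumed.
config_helper_.addConfigModifier([](envoy::config::bootstrap::v2::Bootstrap& bootstrap) {
  // Cap the upstream cluster at a single connection so the second request must queue.
  auto* cluster = bootstrap.mutable_static_resources()->mutable_clusters(0);
  cluster->mutable_circuit_breakers()->add_thresholds()->mutable_max_connections()->set_value(1);
});
initialize();

// Two downstream HTTP/1.1 connections, one request each.
codec_client_ = makeHttpConnection(lookupPort("http"));
IntegrationCodecClientPtr codec_client2 = makeHttpConnection(lookupPort("http"));
auto response1 = codec_client_->makeRequestWithBody(default_request_headers_, 1024);
auto response2 = codec_client2->makeRequestWithBody(default_request_headers_, 512);

// With max_connections == 1, request 1 is active and request 2 is pending in
// the pool; verify with the stats waits sketched earlier before proceeding.
waitForNextUpstreamRequest();

// Respond to request 1 and disconnect. The pool should establish a fresh
// upstream connection and request 2 should still get a 200.
upstream_request_->encodeHeaders(default_response_headers_, true);
fake_upstream_connection_->close();
// (A real test would wait for the new upstream connection here; elided in this sketch.)
waitForNextUpstreamRequest();
upstream_request_->encodeHeaders(default_response_headers_, true);

response1->waitForEndStream();
response2->waitForEndStream();
EXPECT_TRUE(response1->complete());
EXPECT_TRUE(response2->complete());
```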
+1 - I'm fine with DISABLED_ and adding a TODO for later.
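For reference, disabling it is just gtest's DISABLED_ prefix plus a TODO; the fixture and wrapper names below are placeholders for whatever the real test is called:

```cpp
// gtest skips tests whose names start with DISABLED_ but still reports them as
// disabled, so the TODO stays visible. Fixture/test names are placeholders.
// TODO: re-enable once the test is split per protocol and synchronized via stats.
TEST_P(HttpIntegrationTest, DISABLED_UpstreamDisconnectWithTwoRequests) {
  testUpstreamDisconnectWithTwoRequests();
}
```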
I will fix this today; I'm in an all-day meeting, so it might be a bit delayed.
@venilnoronha FYI, it's probably interesting to leave this PR without retests, since it's intended to fix flakes; if they are still present, then that's an issue. Looks like the build issue was only with docs, which is surprising; I haven't seen that before.
@htuch is this the only possible flake, or are there other identified flakes too?
Forgot to hit "submit review". Important button, that :-(
Signed-off-by: Harvey Tuch <htuch@google.com>
Force-pushed from c78700e to 57aa489.
DCO is stuck due to the force push; I'm going to kill this PR and start another.
Need to be prepared to handle disconnection when resetting the server in
HttpIntegrationTest::testRouterHeaderOnlyRequestAndResponse.
HttpIntegrationTest::testUpstreamDisconnectWithTwoRequests was racy. It was previously possible for both requests to be load balanced onto the first upstream connection. When the 200 for the first request was received, we would then disconnect the upstream, and the second request would come back as a 503. Not sure if we've lost some coverage with these changes; the alternative might be to configure retries?
Risk Level: Low
Testing: 1k runs on internal build farm, flakes disappeared.
Signed-off-by: Harvey Tuch <htuch@google.com>