Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasm: onLog is not always called for network streams on upstream failure #13806

Closed
kyessenov opened this issue Oct 28, 2020 · 10 comments
Closed
Labels
area/wasm bug stale stalebot believes this issue/PR has not been touched recently

Comments

@kyessenov
Copy link
Contributor

Ref: istio/istio#24720
Validated with a flakey upstream that becomes unavailable at an interval.
Recording number of calls to onLog and onNewConnection, we see that onLog is not always called:
connection_count: 11030 log_count: 10662

/cc @PiotrSikora

@kyessenov kyessenov added bug triage Issue requires triage labels Oct 28, 2020
@kyessenov kyessenov changed the title wasm: onLog is not always called for networking stream on upstream failure wasm: onLog is not always called for network streams on upstream failure Oct 28, 2020
@kyessenov
Copy link
Contributor Author

Note: stream contexts are also leaking (they are not destroyed in null VM after the stream is gone).

@kyessenov
Copy link
Contributor Author

kyessenov commented Oct 28, 2020

We chatted briefly and believe that we should probably simplify the ABI to stop trying to figure out who closed the connection (upstream or downstream) and then wait for both. Instead, we should just emit closure event if either upstream or downstream closes.

@PiotrSikora
Copy link
Contributor

You can do this right now, there is nothing at the ABI level that is stopping you from performing actions on downstream or upstream connection close.

It looks that the guard to wait for both connections being closed was added here: envoyproxy/envoy-wasm#453, though I don't see anything wrong there.

Note that the "upstream connection close" event implementation in Envoy is a bit hacky right now, and it's triggered when downstream connection sees doWrite(..., end_stream=true). That's most likely the source of the issue.

cc @gargnupur

@yanavlasov yanavlasov added area/wasm and removed triage Issue requires triage labels Oct 29, 2020
@gargnupur
Copy link
Contributor

At the time of envoyproxy/envoy-wasm#453 the lifecycles were not very clear and hence the guard was added to make sure, we don't close preemptively. But looks like now, it's ok to close if downstream close event is reached..

@PiotrSikora
Copy link
Contributor

From what @kyessenov mentioned yesterday, I was under the impression that downstream close events are getting lost, and if that's the case, then relying on them might not solve anything.

@kyessenov
Copy link
Contributor Author

I think we might be lost in the terminology here. Upstream with respect to the downstream connection is the proxy, so upstream close really is downstream local close from the client perspective. From reading other network filters, either Local or Remote close is guaranteed for a connection (but not both). I suggest we follow this practice and initiate termination once any close event is received. I don't think we can reliably detect whether the connection to an upstream host is closed (it's not even always there for direct response, or could be tunnelled).

@PiotrSikora
Copy link
Contributor

The issue is that the upstream close event is not raised if the connection to upstream was never established (e.g. connection timeout, connection refused), but the context destruction is gated on that event, so it leaks in case of failures.

@PiotrSikora
Copy link
Contributor

See: #13939 and #13940.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 15, 2020
@PiotrSikora
Copy link
Contributor

Fixed / worked around in #13836.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/wasm bug stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

4 participants