
Round-robin LB should be aware of and skip bad servers #1600

Closed
zhangkun83 opened this issue Mar 25, 2016 · 17 comments

@zhangkun83 (Contributor):

The round-robin logic in GrpclbLoadBalancer should listen to transport life-cycle events (handleTransportReady() and handleTransportShutdown()) to learn what servers are not usable, so that it can skip them. The round-robin logic will also be shared with the simple round-robin LoadBalancer (which is yet to be implemented).
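A minimal sketch of the proposed behavior, with illustrative names only (this is not the real grpc-java API): the LB records readiness from the transport life-cycle callbacks and round-robins only over addresses whose transport is READY.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a round-robin picker that skips addresses whose
// transport is not READY. All names are illustrative.
class ReadyAwareRoundRobin {
  private final List<String> addresses;                 // all known server addresses
  private final Set<String> ready = new LinkedHashSet<>();
  private int next = 0;

  ReadyAwareRoundRobin(List<String> addresses) {
    this.addresses = new ArrayList<>(addresses);
  }

  // Would be called from handleTransportReady(addr)
  synchronized void onTransportReady(String addr) { ready.add(addr); }

  // Would be called from handleTransportShutdown(addr)
  synchronized void onTransportShutdown(String addr) { ready.remove(addr); }

  // Returns the next usable address, or null if no address is READY.
  synchronized String pick() {
    for (int i = 0; i < addresses.size(); i++) {
      String candidate = addresses.get(next % addresses.size());
      next++;
      if (ready.contains(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}
```

The same picking loop would also serve the simple round-robin LoadBalancer mentioned above; only the source of the readiness signal differs.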

@buchgr buchgr self-assigned this Mar 27, 2016
@ejona86 ejona86 added this to the 1.1 milestone Apr 8, 2016
@buchgr buchgr removed their assignment Apr 13, 2016
@zhangkun83 (Contributor, Author) commented Sep 15, 2016:

Since TransportSet today reconnects only if there is a pending request, the round-robin LB needs to make the TransportSets actively reconnect; otherwise a once-disconnected address would never have a chance to be used again. This was supposed to be done by TransportManager.updateRetainedTransports(), which requires TransportSet to be able to reconnect by itself.

@ejona86 came up with a better approach: do not explicitly call out "retained transports"; instead, let LoadBalancer listen to the "channel state" (#28) of TransportSets, and make TransportSet reconnect as desired when it is disconnected. With this approach, TransportSet can keep its lazy connection behavior, and the LoadBalancer API would be simpler.

There are some discrepancies between TransportSet's state and the channel state semantics. Basically, TransportSet relies solely on disconnection events to go IDLE and does not respect IDLE_TIMEOUT, while the channel state API relies on IDLE_TIMEOUT:

  1. TransportSet doesn't transition from READY/CONNECTING to IDLE after there has been no RPC activity for IDLE_TIMEOUT, while the channel state API does.
  2. TransportSet transitions from TRANSIENT_FAILURE to IDLE when it finds no pending RPC at the end of a reconnect back-off. The channel state semantics don't have such an edge; instead, the channel should go back to CONNECTING.

Those discrepancies probably don't matter to LoadBalancer: it just needs to know when TransportSet enters the IDLE state, because that is where TransportSet stops reconnecting. However, I am a little concerned about implementing the channel state API while violating its transition rules.
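To make the second discrepancy concrete, here is a tiny sketch (illustrative names, not real grpc-java code) of the state reached at the end of a reconnect back-off under today's TransportSet behavior versus the channel state semantics:

```java
// Hypothetical sketch of the state edge under discussion: at the end of
// reconnect back-off with no pending RPC, today's TransportSet goes to IDLE,
// while the channel state semantics keep the channel in CONNECTING.
class BackoffEndTransition {
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  // hasPendingRpc only matters for today's TransportSet behavior.
  static State afterBackoff(boolean legacyTransportSet, boolean hasPendingRpc) {
    if (legacyTransportSet && !hasPendingRpc) {
      return State.IDLE;        // the discrepancy: reconnecting stops here
    }
    return State.CONNECTING;    // channel state semantics: keep trying
  }
}
```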

@zhangkun83 (Contributor, Author) commented Sep 16, 2016:

@ejona86 and I brainstormed a few options. We found the following issues keep coming up:

Channel state callback leakage

Rejected solution: TransportManager has a getChannel(EAG) that returns a stateful subchannel which is not bound to a TransportSet. The advantage is those subchannels are shims and don't need lifecycle management. However, channel state callbacks may be held by ManagedChannelImpl and may never be cleared if a TransportSet for that EAG never gets created.

TransportSet leakage

This is actually #2276, but it's highly related. It is particularly a problem in the GRPCLB affinity case, where the local LoadBalancer doesn't know all the servers and thus can't give TransportManager the list of servers to be kept.

Rejected solution: we thought about relying on ChannelImpl's idle mode, with TransportSet shutting itself down when going idle, to clean up unused TransportSets. However,

  1. ChannelImpl may never go idle while holding unused TransportSets, hence the leak.
  2. Shutting down TransportSet when it's idle has its own problem (see the next issue).

Reconnect back-off state being cleared

Rejected solution: always shut down the TransportSet when it becomes idle. This solves the TransportSet leakage, but shutting down the TransportSet would clear the back-off state, so if the app keeps sending fail-fast RPCs to a going-away server, the following series of state transitions would end up hammering the server without any back-off:

READY --(go-away)--> IDLE --> SHUTDOWN --(new request, new TransportSet)--> CONNECTING --> READY
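A sketch of why this matters, assuming a typical exponential back-off (illustrative names and values): the accumulated delay lives in the TransportSet instance, so shutting it down and creating a new one resets the schedule to its initial value.

```java
// Hypothetical sketch of back-off state that is lost when the owning
// TransportSet is shut down and re-created.
class BackoffState {
  private long nextDelayMillis = 100;          // assumed initial back-off
  private static final long MAX_MILLIS = 10_000; // assumed cap

  // Returns the delay before the next connect attempt, then doubles it
  // for the attempt after that.
  long nextBackoffMillis() {
    long d = nextDelayMillis;
    nextDelayMillis = Math.min(nextDelayMillis * 2, MAX_MILLIS);
    return d;
  }
}
```

A long-lived instance keeps backing off further and further; a freshly created instance starts over at the initial delay, which is exactly the hammering scenario described above.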

@zhangkun83 (Contributor, Author) commented Sep 16, 2016:

The solution we finally landed with:

  1. TransportSet implements the Channel State API. A discrepancy with the state semantics needs to be fixed: at the end of back-off, go to CONNECTING instead of IDLE. This is equivalent to an infinite IDLE_TIMEOUT.
  2. Never shut down TransportSet when it goes IDLE.
  3. TransportManager: replace getTransport(EAG) with createSubChannel(EAG), which means two changes:
    1. For an EAG, LoadBalancer will get a SubChannel that extends ManagedChannel and has the channel state API on it, instead of a transport. SubChannel is in fact a facade of TransportSet.
    2. LoadBalancer is responsible for the life-cycle of the returned subchannels. createSubChannel() always creates a new subchannel, unlike getTransport(), which re-uses the existing TransportSet for the same EAG.
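The contrast in point 3 can be sketched as follows. This is an illustrative simplification, not the real grpc-java API: the old style caches one entry per address group, while the new style hands out a fresh, caller-owned subchannel on every call.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch contrasting getTransport(EAG), which re-uses a cached
// entry per address group, with createSubChannel(EAG), which always creates
// a fresh subchannel that the LoadBalancer must shut down itself.
class TransportManagerSketch {
  static class SubChannel {
    final String addressGroup;
    private boolean shutdown = false;
    SubChannel(String addressGroup) { this.addressGroup = addressGroup; }
    void shutdown() { shutdown = true; }     // LoadBalancer owns the lifecycle
    boolean isShutdown() { return shutdown; }
  }

  private final Map<String, SubChannel> cache = new HashMap<>();

  // Old style: re-uses the existing entry for the same address group.
  SubChannel getTransport(String eag) {
    return cache.computeIfAbsent(eag, SubChannel::new);
  }

  // New style: a new subchannel every time; the caller must shut it down.
  SubChannel createSubChannel(String eag) {
    return new SubChannel(eag);
  }
}
```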

How this addresses the issues

Channel state callback leakage

The callback is owned by TransportSet and will not leak as long as TransportSet doesn't leak.

TransportSet leakage

It becomes LoadBalancer's responsibility to shut down a SubChannel (and thus its TransportSet) that is no longer used.

Reconnect back-off state being cleared

Because we never shut down TransportSet when it is idle, the back-off state is kept (unless LoadBalancer pathologically shuts down and re-creates a SubChannel frequently, which is not our concern).

A once-bad server never gets used again by round-robin

RoundRobinLoadBalancer will keep a list of all SubChannels and listen to their states. When it sees a SubChannel enter TRANSIENT_FAILURE, the LB will mark its EAG as unhealthy and skip it. Once the channel state semantics discrepancy in TransportSet is fixed, TRANSIENT_FAILURE will always transition to CONNECTING, so TransportSet will always retry when a connection attempt fails.
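A sketch of the described picking logic, assuming a channel-state callback per subchannel (names are illustrative, not the actual RoundRobinLoadBalancer implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the LB tracks each subchannel's connectivity state
// and round-robins only over entries not in TRANSIENT_FAILURE.
class StateAwareRoundRobin {
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  private final Map<String, State> states = new LinkedHashMap<>();
  private int next = 0;

  // Would be driven by the channel state API of each SubChannel.
  void onStateChange(String eag, State s) { states.put(eag, s); }

  // Picks the next EAG that is not marked unhealthy, or null if all are.
  String pick() {
    String[] eags = states.keySet().toArray(new String[0]);
    for (int i = 0; i < eags.length; i++) {
      String eag = eags[next % eags.length];
      next++;
      if (states.get(eag) != State.TRANSIENT_FAILURE) {
        return eag;
      }
    }
    return null;
  }
}
```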

@zhangkun83 (Contributor, Author):

TransportManager: replace getTransport(EAG) with createSubChannel(EAG), which means two changes:

  1. For an EAG, LoadBalancer will get a SubChannel that extends ManagedChannel, which will have Channel state API on it, instead of a transport. SubChannel is in fact a facade of TransportSet.
  2. LoadBalancer is responsible for the life-cycle of the returned subchannels. createSubChannel() always creates a new subchannel, unlike getTransport() which will re-use existing TransportSets for the same EAG.

@lukaszx0 @jhump This API change may impact your work in a non-trivial way. I'd like to hear your opinions.

@jhump (Member) commented Sep 19, 2016:

Thanks for the heads up @zhangkun83.

I think this will actually be a better API for us. @lukaszx0 has done a lot of work integrating our name service with gRPC, but one remaining step is implementing heartbeats and better load balancing. The existing API is problematic -- particularly that TransportManager takes an `EquivalentAddressGroup` instead of a `ResolvedServerInfoGroup`.

We have cases where some traffic must transit through a virtual IP, for security reasons. Virtual IPs in our network generally map to hardware load balancers. This means that a client may only have a single address (e.g. host and port) but actually have multiple backends. With the previous API (only being able to target by actual address), there wasn't a clear way to distinguish multiple backends that transit through the same virtual IP address (for purposes of sending heartbeats and also for load balancing since we must distribute traffic across multiple connections that all go through the same address from the client's point of view).

So we would likely have had to write the same connection management logic in our load balancer anyway, and potentially be unable to use TransportManager. I think this change will actually help.

We have this stuff implemented in Go, but their API for picking a connection accepts an address object which includes metadata from the name service (so much more like ResolvedServerInfoGroup than EquivalentAddressGroup).

@lukaszx0 (Contributor):

@zhangkun83 thanks for the heads up. As Josh said, I think this might be a better API. However, I must admit that I still need to spend some more time reading the internals to fully understand the current design and its abstractions before I can judge it and/or propose alternative solutions.

@zhangkun83 (Contributor, Author):

@jhump happy to hear that! Would it be OK if I just changed the API in place, instead of deprecating the old methods and deleting them over two releases? Keeping the old API around is kind of painful.

@zhangkun83 (Contributor, Author):

Re: detecting a bad server, we have to live with the current channel-state API design, which doesn't provide a reliable signal of TRANSIENT_FAILURE. As a workaround, we will mark an address as bad if it has not been READY for some timeout.
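One way this workaround could look, with an explicit clock for testability (all names and the exact policy are assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: with no reliable TRANSIENT_FAILURE signal, treat an
// address as bad once it has not been READY for longer than a timeout.
// Time is passed in explicitly to keep the sketch deterministic.
class NotReadyTimeoutTracker {
  private final long timeoutMillis;
  private final Map<String, Long> lastReadyMillis = new HashMap<>();

  NotReadyTimeoutTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

  // New addresses get a grace period starting from when they are first seen.
  void startTracking(String addr, long nowMillis) {
    lastReadyMillis.putIfAbsent(addr, nowMillis);
  }

  // Called whenever the address's subchannel is observed READY.
  void onReady(String addr, long nowMillis) { lastReadyMillis.put(addr, nowMillis); }

  // Bad = unknown, or not READY within the last timeoutMillis.
  boolean isBad(String addr, long nowMillis) {
    Long last = lastReadyMillis.get(addr);
    if (last == null) return true;
    return nowMillis - last > timeoutMillis;
  }
}
```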

@lukaszx0 (Contributor):

Would that be OK if I just change the API in-place, instead of deprecating and deleting the old methods in two releases? Keeping the old API is kind of painful.

Yes. We're quite flexible and this shouldn't be a problem for us.

@lukaszx0 (Contributor) commented Sep 22, 2016:

For the record, while I don't know much about Kubernetes, it looks like people using it are also affected by this and the use case @jhump described -- the situation where there's an L4 load balancer in play.

@ejona86 (Member) commented Sep 22, 2016:

All FYI (nothing really impacting discussion):

There is a grpc-wide L4 LB issue at grpc/grpc#7957.

The specific case of having multiple transports to the same host was intended to be supported, but we had multiple ways of obtaining it. For instance, we could easily have an extra id on EquivalentAddressGroup that serves no purpose but to allow making instances not equal.

The C core auto-reuses transports across channels, so they have the same problem (but at an even higher level). They use a similar hack where they add useless configuration that avoids sharing but otherwise has no effect.

@zhangkun83 (Contributor, Author) commented Sep 28, 2016:

There is another issue that will result from LoadBalancer actively shutting down TransportSets -- a race between shutting down the transport (when the round-robin addresses change) and calling newStream() on the transport. If not handled, this race could cause spurious errors.

EDIT: a possible solution is to have createSubChannel() return a wrapper of TransportSet, in which shutdown() registers a timer that calls TransportSet's shutdown() after a grace period.
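A deterministic sketch of the grace-period idea. In a real implementation the deadline would likely be driven by a ScheduledExecutorService; here time is passed in explicitly, and all names are illustrative.

```java
// Hypothetical sketch: a wrapper whose shutdown() does not tear the
// TransportSet down immediately, but only after a grace period, so an
// in-flight newStream() racing with shutdown still finds a live transport.
class GracePeriodSubchannel {
  private final long graceMillis;
  private long shutdownDeadline = Long.MAX_VALUE;  // not shutting down yet

  GracePeriodSubchannel(long graceMillis) { this.graceMillis = graceMillis; }

  // Instead of shutting down immediately, arm a deadline; repeated calls
  // keep the earliest deadline.
  void shutdown(long nowMillis) {
    shutdownDeadline = Math.min(shutdownDeadline, nowMillis + graceMillis);
  }

  // Streams may still be created until the grace period elapses; after that
  // the underlying TransportSet's real shutdown() would run.
  boolean canCreateStream(long nowMillis) {
    return nowMillis < shutdownDeadline;
  }
}
```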

@zhangkun83 (Contributor, Author) commented Nov 17, 2016:

Design doc for the new API and the related Channel impl refactor:
https://github.com/grpc/grpc-java/files/14888135/gRPC.Java.LoadBalancer.v2.pdf

zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Nov 17, 2016
This is the first step of a major refactor for the LoadBalancer-related
part of Channel impl (grpc#1600).

What's changed:

- TransportSet no longer has delayed transport, thus will not buffer
  streams when a READY real transport is absent.
- When there isn't a READY real transport, obtainActiveTransport() will
  return null.
- Overhauled Callback interface, with onStateChange() replacing the
  adhoc transport event methods.
- Call out channelExecutor, which is a serializing executor that runs
  the Callback.
@jhspaybar (Contributor):

The doc appears private. I requested access.

@zhangkun83 (Contributor, Author):

@jhspaybar: Due to policy restrictions, the doc cannot be published and I have to grant access to individuals upon request.

zhangkun83 added a commit that referenced this issue Nov 23, 2016
This is the first step of a major refactor for the LoadBalancer-related part of Channel impl (#1600). It forks TransportSet into InternalSubchannel and makes changes on it.

What's changed:

- InternalSubchannel no longer has delayed transport, thus will not buffer
  streams when a READY real transport is absent.
- When there isn't a READY real transport, obtainActiveTransport() will
  return null.
- InternalSubchannel is no longer a ManagedChannel
- Overhauled Callback interface, with onStateChange() replacing the
  adhoc transport event methods.
- Call out channelExecutor, which is a serializing executor that runs
  the Callback.

The first commit is an unmodified copy of the files that are being forked. Please review the second commit, which makes the changes on the forked files.
zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Nov 23, 2016
This is the second step of a major refactor for the LoadBalancer-related
part of Channel impl (grpc#1600)
zhangkun83 added a commit that referenced this issue Dec 2, 2016
This is the second step of a major refactor for the LoadBalancer-related part of Channel impl (#1600)
@zhangkun83 zhangkun83 modified the milestones: 1.2, 1.1 Jan 12, 2017
@lukaszx0 (Contributor):

@zhangkun83 LBv2 has landed on master; I think we can close this?

@zhangkun83 (Contributor, Author):

Resolved by #2707

The lock bot locked this issue as resolved and limited conversation to collaborators on Sep 22, 2018.