
Round-robin LB should be aware of and skip bad servers #1600

Closed
zhangkun83 opened this issue Mar 25, 2016 · 17 comments

@zhangkun83 (Contributor):

The round-robin logic in GrpclbLoadBalancer should listen to transport life-cycle events (handleTransportReady() and handleTransportShutdown()) to learn what servers are not usable, so that it can skip them. The round-robin logic will also be shared with the simple round-robin LoadBalancer (which is yet to be implemented).
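A minimal sketch of the proposed behavior, with illustrative names only (this is not the real grpc-java API): the LB records readiness from the transport life-cycle callbacks and round-robins only over addresses whose transport is READY.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a round-robin picker that skips addresses whose
// transport is not READY. All names are illustrative.
class ReadyAwareRoundRobin {
  private final List<String> addresses;                 // all known server addresses
  private final Set<String> ready = new LinkedHashSet<>();
  private int next = 0;

  ReadyAwareRoundRobin(List<String> addresses) {
    this.addresses = new ArrayList<>(addresses);
  }

  // Would be called from handleTransportReady(addr)
  synchronized void onTransportReady(String addr) { ready.add(addr); }

  // Would be called from handleTransportShutdown(addr)
  synchronized void onTransportShutdown(String addr) { ready.remove(addr); }

  // Returns the next usable address, or null if no address is READY.
  synchronized String pick() {
    for (int i = 0; i < addresses.size(); i++) {
      String candidate = addresses.get(next % addresses.size());
      next++;
      if (ready.contains(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}
```

The same picking loop would also serve the simple round-robin LoadBalancer mentioned above; only the source of the readiness signal differs.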

@buchgr buchgr self-assigned this Mar 27, 2016
@ejona86 ejona86 added this to the 1.1 milestone Apr 8, 2016
@buchgr buchgr removed their assignment Apr 13, 2016
@zhangkun83 (Contributor, Author) commented Sep 15, 2016:

Since TransportSet today reconnects only if there is a pending request, the round-robin LB needs to make the TransportSets actively reconnect; otherwise a once-disconnected address would never have a chance to be used again. This was supposed to be done by TransportManager.updateRetainedTransports(), which requires TransportSet to be able to reconnect by itself.

@ejona86 came up with a better approach: do not explicitly call out "retained transports"; instead, let LoadBalancer listen to the "channel state" (#28) of TransportSets, and make TransportSet reconnect as desired when it is disconnected. With this approach, TransportSet can keep its lazy connection behavior, and the LoadBalancer API would be simpler.

There are some discrepancies between TransportSet's state and the channel state semantics. Basically, TransportSet relies solely on disconnection events to go IDLE and does not respect IDLE_TIMEOUT, while the channel state API relies on IDLE_TIMEOUT:

  1. TransportSet doesn't transition from READY/CONNECTING to IDLE after there has been no RPC activity for IDLE_TIMEOUT, while the channel state API does.
  2. TransportSet transitions from TRANSIENT_FAILURE to IDLE when it finds no pending RPC at the end of a reconnect back-off. The channel state semantics don't have such an edge; instead, the channel should go back to CONNECTING.

Those discrepancies probably don't matter to LoadBalancer: it just needs to know when TransportSet enters the IDLE state, because that is where TransportSet stops reconnecting. However, I am a little concerned about implementing the channel state API while violating its transition rules.
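To make the second discrepancy concrete, here is a tiny sketch (illustrative names, not real grpc-java code) of the state reached at the end of a reconnect back-off under today's TransportSet behavior versus the channel state semantics:

```java
// Hypothetical sketch of the state edge under discussion: at the end of
// reconnect back-off with no pending RPC, today's TransportSet goes to IDLE,
// while the channel state semantics keep the channel in CONNECTING.
class BackoffEndTransition {
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  // hasPendingRpc only matters for today's TransportSet behavior.
  static State afterBackoff(boolean legacyTransportSet, boolean hasPendingRpc) {
    if (legacyTransportSet && !hasPendingRpc) {
      return State.IDLE;        // the discrepancy: reconnecting stops here
    }
    return State.CONNECTING;    // channel state semantics: keep trying
  }
}
```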

@zhangkun83 (Contributor, Author) commented Sep 16, 2016:

@ejona86 and I brainstormed a few options. We found the following issues keep coming up:

Channel state callback leakage

Rejected solution: TransportManager has a getChannel(EAG) that returns a stateful subchannel which is not bound to a TransportSet. The advantage is those subchannels are shims and don't need lifecycle management. However, channel state callbacks may be held by ManagedChannelImpl and may never be cleared if a TransportSet for that EAG never gets created.

TransportSet leakage

This is actually #2276, but it's highly related. It is particularly a problem in the GRPCLB affinity case, where the local LoadBalancer doesn't know all the servers and thus can't give TransportManager the list of servers to be kept.

Rejected solution: we thought about relying on ChannelImpl's idle mode, with TransportSet shutting itself down when going idle, to clean up unused TransportSets. However,

  1. ChannelImpl may never go idle while holding unused TransportSets, hence the leak.
  2. Shutting down TransportSet when it's idle has its own problem (see the next issue).

Reconnect back-off state being cleared

Rejected solution: always shut down the TransportSet when it becomes idle. This solves the TransportSet leakage, but shutting down the TransportSet would clear the back-off state, so if the app keeps sending fail-fast RPCs to a going-away server, the following series of state transitions would end up hammering the server without any back-off:

READY --(go-away)--> IDLE --> SHUTDOWN --(new request, new TransportSet)--> CONNECTING --> READY
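A sketch of why this matters, assuming a typical exponential back-off (illustrative names and values): the accumulated delay lives in the TransportSet instance, so shutting it down and creating a new one resets the schedule to its initial value.

```java
// Hypothetical sketch of back-off state that is lost when the owning
// TransportSet is shut down and re-created.
class BackoffState {
  private long nextDelayMillis = 100;          // assumed initial back-off
  private static final long MAX_MILLIS = 10_000; // assumed cap

  // Returns the delay before the next connect attempt, then doubles it
  // for the attempt after that.
  long nextBackoffMillis() {
    long d = nextDelayMillis;
    nextDelayMillis = Math.min(nextDelayMillis * 2, MAX_MILLIS);
    return d;
  }
}
```

A long-lived instance keeps backing off further and further; a freshly created instance starts over at the initial delay, which is exactly the hammering scenario described above.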

@zhangkun83 (Contributor, Author) commented Sep 16, 2016:

The solution we finally landed with:

  1. TransportSet implements the Channel State API. A discrepancy with the state semantics needs to be fixed: at the end of back-off, go to CONNECTING instead of IDLE. This is equivalent to an infinite IDLE_TIMEOUT.
  2. Never shut down TransportSet when it goes IDLE.
  3. TransportManager: replace getTransport(EAG) with createSubChannel(EAG), which means two changes:
    1. For an EAG, LoadBalancer will get a SubChannel that extends ManagedChannel and has the channel state API on it, instead of a transport. SubChannel is in fact a facade of TransportSet.
    2. LoadBalancer is responsible for the life-cycle of the returned subchannels. createSubChannel() always creates a new subchannel, unlike getTransport(), which re-uses the existing TransportSet for the same EAG.
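The contrast in point 3 can be sketched as follows. This is an illustrative simplification, not the real grpc-java API: the old style caches one entry per address group, while the new style hands out a fresh, caller-owned subchannel on every call.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch contrasting getTransport(EAG), which re-uses a cached
// entry per address group, with createSubChannel(EAG), which always creates
// a fresh subchannel that the LoadBalancer must shut down itself.
class TransportManagerSketch {
  static class SubChannel {
    final String addressGroup;
    private boolean shutdown = false;
    SubChannel(String addressGroup) { this.addressGroup = addressGroup; }
    void shutdown() { shutdown = true; }     // LoadBalancer owns the lifecycle
    boolean isShutdown() { return shutdown; }
  }

  private final Map<String, SubChannel> cache = new HashMap<>();

  // Old style: re-uses the existing entry for the same address group.
  SubChannel getTransport(String eag) {
    return cache.computeIfAbsent(eag, SubChannel::new);
  }

  // New style: a new subchannel every time; the caller must shut it down.
  SubChannel createSubChannel(String eag) {
    return new SubChannel(eag);
  }
}
```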

How this addresses the issues

Channel state callback leakage

The callback is owned by TransportSet and will not leak as long as TransportSet doesn't leak.

TransportSet leakage

It becomes LoadBalancer's responsibility to shut down a SubChannel (and thus its TransportSet) that is no longer used.

Reconnect back-off state being cleared

Because we never shut down TransportSet when it is idle, the back-off state is kept (unless LoadBalancer pathologically shuts down and re-creates a SubChannel frequently, which is not our concern).

A once-bad server never gets used again by round-robin

RoundRobinLoadBalancer will keep a list of all SubChannels and listen to their states. When it sees a SubChannel enter TRANSIENT_FAILURE, the LB will mark its EAG as unhealthy and skip it. Once the channel state semantics discrepancy in TransportSet is fixed, TRANSIENT_FAILURE will always transition to CONNECTING, so TransportSet will always retry when a connection attempt fails.
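A sketch of the described picking logic, assuming a channel-state callback per subchannel (names are illustrative, not the actual RoundRobinLoadBalancer implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the LB tracks each subchannel's connectivity state
// and round-robins only over entries not in TRANSIENT_FAILURE.
class StateAwareRoundRobin {
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  private final Map<String, State> states = new LinkedHashMap<>();
  private int next = 0;

  // Would be driven by the channel state API of each SubChannel.
  void onStateChange(String eag, State s) { states.put(eag, s); }

  // Picks the next EAG that is not marked unhealthy, or null if all are.
  String pick() {
    String[] eags = states.keySet().toArray(new String[0]);
    for (int i = 0; i < eags.length; i++) {
      String eag = eags[next % eags.length];
      next++;
      if (states.get(eag) != State.TRANSIENT_FAILURE) {
        return eag;
      }
    }
    return null;
  }
}
```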

@zhangkun83 (Contributor, Author):

TransportManager: replace getTransport(EAG) with createSubChannel(EAG), which means two changes:

  1. For an EAG, LoadBalancer will get a SubChannel that extends ManagedChannel, which will have Channel state API on it, instead of a transport. SubChannel is in fact a facade of TransportSet.
  2. LoadBalancer is responsible for the life-cycle of the returned subchannels. createSubChannel() always creates a new subchannel, unlike getTransport() which will re-use existing TransportSets for the same EAG.

@lukaszx0 @jhump This API change may impact your work in a non-trivial way. I'd like to hear your opinions.

@jhump (Member) commented Sep 19, 2016:

Thanks for the heads up @zhangkun83.

I think this will actually be a better API for us. @lukaszx0 has done a lot of work integrating our name service with gRPC, but one remaining step is implementing heartbeats and better load balancing. The existing API is problematic -- particularly that TransportManager takes an `EquivalentAddressGroup` instead of a `ResolvedServerInfoGroup`.

We have cases where some traffic must transit through a virtual IP, for security reasons. Virtual IPs in our network generally map to hardware load balancers. This means that a client may only have a single address (e.g. host and port) but actually have multiple backends. With the previous API (only being able to target by actual address), there wasn't a clear way to distinguish multiple backends that transit through the same virtual IP address (for purposes of sending heartbeats and also for load balancing since we must distribute traffic across multiple connections that all go through the same address from the client's point of view).

So we would likely have had to write the same connection management logic in our load balancer anyway, and potentially be unable to use TransportManager. I think this change will actually help.

We have this stuff implemented in Go, but their API for picking a connection accepts an address object which includes metadata from the name service (so much more like ResolvedServerInfoGroup than EquivalentAddressGroup).

@lukaszx0 (Contributor):

@zhangkun83 thanks for the heads up. As Josh said, I think this might be a better API. However, I must admit that I still need to spend some more time reading the internals to fully understand the current design and its abstractions before I can judge it and/or propose alternative solutions.

@zhangkun83 (Contributor, Author):

@jhump happy to hear that! Would it be OK if I just changed the API in place, instead of deprecating the old methods and deleting them over two releases? Keeping the old API around is kind of painful.

@zhangkun83 (Contributor, Author):

Re: detecting a bad server, we have to live with the current channel-state API design, which doesn't provide a reliable signal of TRANSIENT_FAILURE. As a workaround, we will mark an address as bad if it has not been READY for some timeout.
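One way this workaround could look, with an explicit clock for testability (all names and the exact policy are assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: with no reliable TRANSIENT_FAILURE signal, treat an
// address as bad once it has not been READY for longer than a timeout.
// Time is passed in explicitly to keep the sketch deterministic.
class NotReadyTimeoutTracker {
  private final long timeoutMillis;
  private final Map<String, Long> lastReadyMillis = new HashMap<>();

  NotReadyTimeoutTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

  // New addresses get a grace period starting from when they are first seen.
  void startTracking(String addr, long nowMillis) {
    lastReadyMillis.putIfAbsent(addr, nowMillis);
  }

  // Called whenever the address's subchannel is observed READY.
  void onReady(String addr, long nowMillis) { lastReadyMillis.put(addr, nowMillis); }

  // Bad = unknown, or not READY within the last timeoutMillis.
  boolean isBad(String addr, long nowMillis) {
    Long last = lastReadyMillis.get(addr);
    if (last == null) return true;
    return nowMillis - last > timeoutMillis;
  }
}
```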

@lukaszx0 (Contributor):

Would that be OK if I just change the API in-place, instead of deprecating and deleting the old methods in two releases? Keeping the old API is kind of painful.

Yes. We're quite flexible and this shouldn't be a problem for us.

@lukaszx0 (Contributor) commented Sep 22, 2016:

For the record, while I don't know much about Kubernetes, it looks like people using it are also affected by this and the use case @jhump described -- the situation where there's an L4 load balancer in play.

@ejona86 (Member) commented Sep 22, 2016:

All FYI (nothing really impacting discussion):

There is a grpc-wide L4 LB issue at grpc/grpc#7957.

The specific case of having multiple transports to the same host was intended to be supported, but we had multiple ways of obtaining it. For instance, we could easily have an extra id on EquivalentAddressGroup that serves no purpose but to allow making instances not equal.

The C core auto-reuses transports across channels, so they have the same problem (but at an even higher level). They use a similar hack where they add useless configuration that avoids sharing but otherwise has no effect.

@zhangkun83 (Contributor, Author) commented Sep 28, 2016:

There is another issue that will result from LoadBalancer actively shutting down TransportSets -- a race between shutting down the transport (when the round-robin addresses change) and calling newStream() on the transport. If not handled, this race could cause spurious errors.

EDIT: a possible solution is to have createSubChannel() return a wrapper of TransportSet, in which shutdown() registers a timer that calls TransportSet's shutdown() after a grace period.
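A deterministic sketch of the grace-period idea. In a real implementation the deadline would likely be driven by a ScheduledExecutorService; here time is passed in explicitly, and all names are illustrative.

```java
// Hypothetical sketch: a wrapper whose shutdown() does not tear the
// TransportSet down immediately, but only after a grace period, so an
// in-flight newStream() racing with shutdown still finds a live transport.
class GracePeriodSubchannel {
  private final long graceMillis;
  private long shutdownDeadline = Long.MAX_VALUE;  // not shutting down yet

  GracePeriodSubchannel(long graceMillis) { this.graceMillis = graceMillis; }

  // Instead of shutting down immediately, arm a deadline; repeated calls
  // keep the earliest deadline.
  void shutdown(long nowMillis) {
    shutdownDeadline = Math.min(shutdownDeadline, nowMillis + graceMillis);
  }

  // Streams may still be created until the grace period elapses; after that
  // the underlying TransportSet's real shutdown() would run.
  boolean canCreateStream(long nowMillis) {
    return nowMillis < shutdownDeadline;
  }
}
```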

@zhangkun83 (Contributor, Author) commented Nov 17, 2016:

Design doc for the new API and the related Channel impl refactor:
https://github.com/grpc/grpc-java/files/14888135/gRPC.Java.LoadBalancer.v2.pdf

zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Nov 17, 2016
This is the first step of a major refactor for the LoadBalancer-related
part of Channel impl (grpc#1600).

What's changed:

- TransportSet no longer has delayed transport, thus will not buffer
  streams when a READY real transport is absent.
- When there isn't a READY real transport, obtainActiveTransport() will
  return null.
- Overhauled Callback interface, with onStateChange() replacing the
  adhoc transport event methods.
- Call out channelExecutor, which is a serializing executor that runs
  the Callback.
@jhspaybar (Contributor):

The doc appears private. I requested access.

@zhangkun83 (Contributor, Author):

@jhspaybar: Due to policy restrictions, the doc cannot be published and I have to grant access to individuals upon request.

zhangkun83 added a commit that referenced this issue Nov 23, 2016
This is the first step of a major refactor for the LoadBalancer-related part of Channel impl (#1600). It forks TransportSet into InternalSubchannel and makes changes on it.

What's changed:

- InternalSubchannel no longer has delayed transport, thus will not buffer
  streams when a READY real transport is absent.
- When there isn't a READY real transport, obtainActiveTransport() will
  return null.
- InternalSubchannel is no longer a ManagedChannel
- Overhauled Callback interface, with onStateChange() replacing the
  adhoc transport event methods.
- Call out channelExecutor, which is a serializing executor that runs
  the Callback.

The first commit is an unmodified copy of the files that are being forked. Please review the second commit, which makes the changes on the forked files.
zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Nov 23, 2016
This is the second step of a major refactor for the LoadBalancer-related
part of Channel impl (grpc#1600)
zhangkun83 added a commit that referenced this issue Dec 2, 2016
This is the second step of a major refactor for the LoadBalancer-related part of Channel impl (#1600)
@zhangkun83 zhangkun83 modified the milestones: 1.2, 1.1 Jan 12, 2017
@lukaszx0 (Contributor):

@zhangkun83 LBv2 has landed on master; I think we can close this?

@zhangkun83 (Contributor, Author):

Resolved by #2707

The lock bot locked this issue as resolved and limited conversation to collaborators on Sep 22, 2018.