Channel-state API #28

ejona86 · 2015-01-21T00:55:40Z

At the moment, TCP connections are created lazily on the first call on a Channel, and if the TCP connection goes down it isn't reconnected until a subsequent call. However, some users will want the TCP connection to be created and maintained for the lifetime of the Channel.

This "constant connection" behavior does not make as much sense when accessing a third party service, as the service may purposefully be disconnecting idle clients, but is very reasonable in low-latency, intra-datacenter communication.

We need an API to choose between those behaviors and to export failure information about the Channel. All of this is bundled together for the moment under the name "health-checking API," but we can split it apart as it makes sense.

They are tied together for the moment because certain operations like "wait until Channel is healthy" assume that the channel will actively try to connect.

Some notes from @louiscryan:

Do we want to canonicalize transport failure modes into an enum, or are we happy with a boolean indicating transient vs. durable? What failure modes will we have?

wire incompatibility, which can occur at any time; while it is in theory transient, you may not want your application to continue working

unreachable

internal implementation error

redirection: the addressed service has moved elsewhere

ejona86 · 2015-01-27T18:45:18Z

As part of this issue, we are free to pull in the current lifecycle API and redesign/tweak it.

ejona86 · 2015-07-08T17:46:49Z

https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md

ejona86 · 2015-10-01T22:06:26Z

I am aware of users who want this feature.

buchgr · 2016-04-13T20:22:26Z

I'll work on this next. Assuming the connectivity semantics and API document is still recent, @ejona86?

ejona86 · 2016-04-13T20:26:41Z

@buchgr, yes it is still recent. I do know that LoadBalancing will come into play and I don't think we have the states defined for it, so it would take some thought.

buchgr · 2016-04-14T12:47:11Z

@ejona86 thanks! So I guess I'll first learn about the load balancer and then check out the implementation in the C core to see if they took load balancing into account (yet); otherwise I'll come up with my own proposal that we can discuss, and update the document accordingly. I'll then implement this for Java. Sounds good?

ejona86 · 2016-04-14T15:53:15Z

@buchgr, sounds good, but I'm okay with both of those going in parallel. LB is still in progress, so I don't mind improving the state specification over time and calling them bug fixes.

juberti · 2016-05-27T20:02:55Z

Any update on the stats aspect of this? Right now, when we make a gRPC call, we don't get any feedback regarding the state of the underlying connection, making it hard to understand where delays are coming from.

lukaszx0 · 2016-05-31T17:01:40Z

@buchgr any updates? Are you still planning to work on this?

buchgr · 2016-06-01T08:38:26Z

@lukaszx0 nope ... it's free to be picked up by someone :-)

ejona86 · 2016-06-01T18:29:57Z

I will note that the changes for fail fast made this easier to implement. It's mainly API work now.

Backaway · 2016-08-22T02:43:42Z

@zhangkun83 any updates? When will this feature be released?

lukaszx0 · 2016-08-22T13:58:09Z

@zhangkun83 I was planning to find some time and work on this. I'll reach out to you and we can discuss.

zhangkun83 · 2016-09-16T23:38:55Z

@ejona86 and I talked about the channel state API as #1600 will depend on it. @ejona86 wanted to not have the state be passed to the callback. Instead, the callback would need to call getState() to get the current state:

class ManagedChannel {
  State getState();
  void notifyWhenStateChanged(State source, Runnable callback, boolean connect);
}

This has an issue with round-robin. The solution we decided on for #1600 depends on the RR LB being able to catch every transition into TRANSIENT_FAILURE. However, if an address always results in a connection timeout, its TransportSet would switch back and forth between CONNECTING and TRANSIENT_FAILURE. The initial back-off delays, which are the time spent in TRANSIENT_FAILURE, can be very short, while CONNECTING may be noticeably long. So the RR LB's callback may get CONNECTING from getState() even though it was invoked for the transition to TRANSIENT_FAILURE. The RR LB wouldn't know that, would still think the address is good, and would keep sending requests to it.

I think we need to pass the state to the callback.
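
The race described above can be reproduced with a minimal, single-threaded sketch. Everything here is hypothetical and only mimics the proposed semantics: `ToyState`, `ToyTransportSet`, `transitionTo` and `runQueuedCallbacks` are illustration names, not gRPC classes or methods. Callbacks are queued rather than run inline, on the assumption that a real channel would hand them to an executor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the connectivity states named in this thread.
enum ToyState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

// Toy, single-threaded model of a TransportSet. Callbacks are queued and run
// later, as they would be on an executor, so the state can move on first.
class ToyTransportSet {
  private ToyState state = ToyState.IDLE;
  private final List<Runnable> waiting = new ArrayList<>();
  private final Queue<Runnable> queued = new ArrayDeque<>();

  ToyState getState() {
    return state;
  }

  // One-shot: the callback is queued once the state differs from `source`.
  void notifyWhenStateChanged(ToyState source, Runnable callback) {
    if (state != source) {
      queued.add(callback);
    } else {
      waiting.add(callback);
    }
  }

  void transitionTo(ToyState next) {
    if (next == state) {
      return;
    }
    state = next;
    queued.addAll(waiting);
    waiting.clear();
  }

  // Simulates the executor finally getting around to the queued callbacks.
  void runQueuedCallbacks() {
    Runnable r;
    while ((r = queued.poll()) != null) {
      r.run();
    }
  }
}

class RaceDemo {
  // Returns the states the callback read via getState().
  static List<ToyState> observe() {
    ToyTransportSet ts = new ToyTransportSet();
    List<ToyState> seen = new ArrayList<>();
    ts.transitionTo(ToyState.CONNECTING);
    ts.notifyWhenStateChanged(ToyState.CONNECTING, () -> seen.add(ts.getState()));
    ts.transitionTo(ToyState.TRANSIENT_FAILURE); // connect timed out; callback queued
    ts.transitionTo(ToyState.CONNECTING);        // short back-off over; retrying
    ts.runQueuedCallbacks();                     // callback runs only now
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(observe()); // prints [CONNECTING]
  }
}
```

The callback was triggered by the move into TRANSIENT_FAILURE, yet the only state it can read is CONNECTING, which is why a listener built this way cannot count failures.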

zhangkun83 · 2016-09-17T00:20:47Z

I take it back. Passing the state to the callback is also flawed. Any state change between the callback being called and the callback calling getState() will be missed. That is an issue if the user is particularly interested in whether the channel has been in a given state recently. The RR LB is such a case.

Technically notifyWhenStateChanged(source) is notifyWhenStateIsNot(unexpectedState). When I consider the typical use cases -- notify when ready, and the RR LB -- I find notifyWhenStateIs(expectedState) more useful. I also find the new names convey the semantics better -- you don't have to educate users to pass getState()'s result as the source. So maybe:

class ManagedChannel {
  State getState();
  void notifyWhenStateIsNot(State unexpected, Runnable callback, boolean connect);
  void notifyWhenStateIs(State expected, Runnable callback, boolean connect);
}

or more powerful:

class ManagedChannel {
  State getState();
  void notifyWhenStateIsIn(EnumSet<State> expected, Runnable callback, boolean connect);
}
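
To make the "more powerful" variant concrete, here is a minimal sketch of how an edge-triggered notifyWhenStateIsIn could behave. It is an illustration under stated assumptions, not the proposed implementation: `SketchState`, `EdgeNotifyingChannel` and `transitionTo` are hypothetical names, the `connect` argument is omitted, callbacks run inline for brevity, and firing immediately when the current state is already in the set is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the channel states; not the real gRPC type.
enum SketchState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

class EdgeNotifyingChannel {
  private static final class Watch {
    final EnumSet<SketchState> expected;
    final Runnable callback;

    Watch(EnumSet<SketchState> expected, Runnable callback) {
      this.expected = expected;
      this.callback = callback;
    }
  }

  private SketchState state = SketchState.IDLE;
  private final List<Watch> watches = new ArrayList<>();

  SketchState getState() {
    return state;
  }

  // Assumption: fires at once if the current state is already in `expected`.
  void notifyWhenStateIsIn(EnumSet<SketchState> expected, Runnable callback) {
    if (expected.contains(state)) {
      callback.run();
    } else {
      watches.add(new Watch(EnumSet.copyOf(expected), callback));
    }
  }

  // Fires (and removes) every watch whose set contains the state being entered.
  void transitionTo(SketchState next) {
    state = next;
    List<Runnable> toRun = new ArrayList<>();
    for (Iterator<Watch> it = watches.iterator(); it.hasNext();) {
      Watch w = it.next();
      if (w.expected.contains(next)) {
        it.remove();
        toRun.add(w.callback);
      }
    }
    for (Runnable r : toRun) {
      r.run();
    }
  }
}

class EdgeDemo {
  // Counts how often the one-shot watch saw an edge into TRANSIENT_FAILURE.
  static int countFailureEdges() {
    EdgeNotifyingChannel ch = new EdgeNotifyingChannel();
    int[] failures = {0};
    ch.transitionTo(SketchState.CONNECTING);
    ch.notifyWhenStateIsIn(EnumSet.of(SketchState.TRANSIENT_FAILURE), () -> failures[0]++);
    ch.transitionTo(SketchState.TRANSIENT_FAILURE); // the edge the RR LB cares about
    ch.transitionTo(SketchState.CONNECTING);        // state has already moved on
    return failures[0];
  }
}
```

Here the watcher learns about the edge into TRANSIENT_FAILURE even though the channel is back in CONNECTING afterwards. If callbacks were dispatched on an executor instead of inline, getState() inside the callback could already return a different state than the one that triggered it.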

lukaszx0 · 2016-09-19T19:00:31Z

I skimmed through the gRPC connectivity semantics doc and read your proposals.

> Instead, the callback would need to call getState() to get the current state.

Why is this a requirement? Is it mainly to keep API consistency with other languages?

Also, while I think I like the unified notifyWhenStateIsIn better, why go with an API whose method takes a Runnable rather than observers (onStateChange)?

ejona86 · 2016-09-19T19:14:11Z

> Technically notifyWhenStateChanged(source) is notifyWhenStateIsNot(unexpectedState).

No, I don't think it is. The state can change and then change back to the previous state, so that some changes will appear spurious. I don't think we can use notifyWhenStateIsNot and notifyWhenStateIs because it implies the state when the callback is run, and we can't guarantee what the state will be.

> Why is this a requirement? Is it mainly to keep API consistency with other languages?

It is to handle a very strong race. Basically, at times the state may change very, very rapidly. The listener should not be notified of every state transition, because that could lead to a heavy queue forming. Instead, just providing the current state has no performance issues and calling getState() explicitly makes it more obvious there is a race involved.

ejona86 · 2016-09-19T19:18:52Z

Kun, the connect arg should be on getState(). Check out the C++ API (they actually have both a sync and async API). I don't quite know why they have deadline on the async API, but we may need a way to cancel/remove listeners...

class ManagedChannel {
  State getState(boolean tryToConnect);
  void notifyWhenStateChanged(State lastObserved, Runnable callback);
}
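
A typical use of this shape is "run something once the channel is READY". The sketch below shows the re-registration pattern against a toy channel; it is a hedged illustration, not gRPC code. `ConnState`, `SketchChannel`, `WaitForReady` and `transitionTo` are hypothetical names, and the behavior that getState(true) moves an IDLE channel to CONNECTING is an assumption of the toy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the channel states; not the real gRPC type.
enum ConnState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

// Toy, single-threaded channel exposing the two proposed methods.
class SketchChannel {
  private ConnState state = ConnState.IDLE;
  private List<Runnable> listeners = new ArrayList<>();

  // Assumption: asking to connect moves an IDLE channel to CONNECTING.
  ConnState getState(boolean tryToConnect) {
    if (tryToConnect && state == ConnState.IDLE) {
      transitionTo(ConnState.CONNECTING);
    }
    return state;
  }

  // One-shot: runs the callback once the state differs from lastObserved.
  void notifyWhenStateChanged(ConnState lastObserved, Runnable callback) {
    if (state != lastObserved) {
      callback.run();
    } else {
      listeners.add(callback);
    }
  }

  // Test hook standing in for the transport reporting a new state.
  void transitionTo(ConnState next) {
    if (next == state) {
      return;
    }
    state = next;
    List<Runnable> toRun = listeners;
    listeners = new ArrayList<>();
    for (Runnable r : toRun) {
      r.run();
    }
  }
}

class WaitForReady {
  // Re-registers after every change until getState() reports READY.
  static void whenReady(SketchChannel ch, Runnable onReady) {
    ConnState current = ch.getState(true);
    if (current == ConnState.READY) {
      onReady.run();
      return;
    }
    ch.notifyWhenStateChanged(current, () -> whenReady(ch, onReady));
  }
}
```

Because the callback always re-reads the state and passes what it read as lastObserved, a change that happens between the read and the registration is not lost: the registration fires immediately in that case.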

zhangkun83 · 2016-09-19T19:46:29Z

> No, I don't think it is. The state can change and then change back to the previous state, so that some changes will appear spurious. I don't think we can use notifyWhenStateIsNot and notifyWhenStateIs because it implies the state when the callback is run, and we can't guarantee what the state will be.

Maybe these methods could be named in a way that doesn't imply what the state is when the callback is run, but instead imply what state triggers the callback.

My issue with notifyWhenStateChanged(source) is that it only tells you whether there was an edge out of a state, but doesn't tell you whether there was an edge into a state, which is necessary for RR LB.

zhangkun83 · 2016-09-19T23:50:25Z

Had a conversation with @a11r and @ejona86. @a11r doesn't think we need an alternative API. We came up with a workaround for the RR LB. It can mark a connection as bad if it has not been ready for a timeout, probably the same as the maximum connection timeout. Then it won't rely on seeing TRANSIENT_FAILURE.
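
The workaround can be sketched as a small per-address tracker that never needs to observe TRANSIENT_FAILURE. This is an illustration only: `LbState`, `AddressHealth`, `onStateObserved` and `isBad` are hypothetical names, and the timeout value is left to the caller because the thread only says it would probably equal the maximum connection timeout. Time is passed in explicitly so the logic is deterministic.

```java
// Hypothetical stand-in for the channel states; not the real gRPC type.
enum LbState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

// Tracks how long one address has gone without being READY.
class AddressHealth {
  private final long timeoutNanos;
  private long notReadySinceNanos;
  private boolean ready;

  // The address starts out not ready; the grace period begins at creation.
  AddressHealth(long timeoutNanos, long nowNanos) {
    this.timeoutNanos = timeoutNanos;
    this.notReadySinceNanos = nowNanos;
    this.ready = false;
  }

  // Called with whatever state the LB happens to read; missed states are fine.
  void onStateObserved(LbState state, long nowNanos) {
    boolean nowReady = state == LbState.READY;
    if (ready && !nowReady) {
      // Just left READY: restart the not-ready clock.
      notReadySinceNanos = nowNanos;
    }
    ready = nowReady;
  }

  // Bad means: not READY for at least the timeout.
  boolean isBad(long nowNanos) {
    return !ready && nowNanos - notReadySinceNanos >= timeoutNanos;
  }
}
```

A round-robin picker built on this would skip addresses for which isBad(now) returns true. It only distinguishes READY from everything else, so flapping between CONNECTING and TRANSIENT_FAILURE does not reset the clock.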

lukaszx0 · 2016-09-21T14:11:43Z

I believe it has been resolved with #2286 and this issue can be closed.

zhangkun83 · 2016-09-21T14:41:39Z

I have filed #2292 to track the implementation in ManagedChannelImpl.

robeden · 2018-04-19T18:21:05Z

ManagedChannel.getState() and notifyWhenStateChanged(ConnectivityState, Runnable) are both still listed as experimental APIs referencing this issue. Is that correct seeing as this is closed and ManagedChannelImpl is also implemented?

ejona86 · 2018-04-19T19:28:21Z

@robeden, thanks for pointing that out. I've created #4359 to track and #4360 to fix the links in the code. Because there wasn't a tracking issue, this did slip through the cracks. It should be stabilized. It'll be discussed in our next API review in two weeks (we just finished an API review discussion, and we do it every other week).

ejona86 added the enhancement label Jan 21, 2015

ejona86 mentioned this issue Jan 26, 2015

Remove Service API from ServerImpl #33

Merged

This was referenced Apr 16, 2015

Remove Guava's Service from server transport #303

Merged

Revisit Lifecycle API #307

Closed

ejona86 changed the title from "Health-checking API" to "Channel-state API" Jul 8, 2015

ejona86 assigned nmittler Jul 8, 2015

ejona86 added this to the 0.8.0 milestone Jul 8, 2015

ejona86 modified the milestones: Beta (0.9.0), 0.8.0 Aug 14, 2015

ejona86 mentioned this issue Aug 14, 2015

Breaking out ClientCallFactory abstract class #680

Closed

ejona86 removed this from the Beta (0.9.0) milestone Aug 26, 2015

ejona86 added this to the 1.0 milestone Dec 1, 2015

ejona86 mentioned this issue Jan 8, 2016

Time for the first time a rpc service is used is way longer than that of the following usage. #1304

Closed

ejona86 modified the milestones: 0.11.0, 1.0 Jan 11, 2016

zhangkun83 modified the milestones: 1.0, 0.13.0 Jan 29, 2016

ejona86 unassigned nmittler Mar 16, 2016

buchgr self-assigned this Apr 13, 2016

ejona86 modified the milestones: 1.1, 1.0 Apr 19, 2016

buchgr removed their assignment May 13, 2016

zhangkun83 self-assigned this Jul 26, 2016

zhangkun83 assigned lukaszx0 Aug 22, 2016

ejona86 mentioned this issue Aug 26, 2016

Provide base implementation for load balancer #2199

Closed

lukaszx0 mentioned this issue Aug 27, 2016

Improve LoadBalancer API #2211

Closed

zhangkun83 mentioned this issue Sep 15, 2016

Round-robin LB should be aware of and skip bad servers #1600

Closed

zhangkun83 mentioned this issue Sep 20, 2016

core: channel connectivity state API and implementation by TransportSet #2286

Merged

zhangkun83 closed this as completed Sep 21, 2016

dcow mentioned this issue Nov 16, 2017

ManagedChannel getState/notifyChanged API docs #3762

Closed

thermosym mentioned this issue May 23, 2018

Netty ObjectCleanerThread keeps high CPU #4495

Closed

lock bot locked as resolved and limited conversation to collaborators Sep 28, 2018
