Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Channel-state API #28

Closed
ejona86 opened this issue Jan 21, 2015 · 24 comments
Closed

Channel-state API #28

ejona86 opened this issue Jan 21, 2015 · 24 comments
Assignees
Milestone

Comments

@ejona86
Copy link
Member

ejona86 commented Jan 21, 2015

At this moment, creating TCP connections are created lazily on first call of a Channel, and if the TCP connection goes down it isn't reconnected until a subsequent call. However, some users will want the TCP connection to be created and maintained during the lifetime of the Channel.

This "constant connection" behavior does not make as much sense when accessing a third party service, as the service may purposefully be disconnecting idle clients, but is very reasonable in low-latency, intra-datacenter communication.

We need an API to choose between those behaviors and to export failure information about the Channel. All of this is bundled together for the moment under the name "health-checking API," but we can split it apart as it makes sense.

They are tied together for the moment because certain operations like "wait until Channel is healthy" assume that the channel will actively try to connect.

Some notes from @louiscryan:

Do we want to canonicalize transport failure modes into an enum or are we
happy with a boolean indicating transient vs. durable. What failure modes
will we have

  • wire incompatability which can occur at any time and while is in theory
    transient you may not want your application to continue working
  • unreachable
  • internal implementation error
  • redirection. the addressed service has moved elsewhere
@ejona86
Copy link
Member Author

ejona86 commented Jan 27, 2015

As part of this issue, we are free to pull in the current lifecycle API and redesign/tweak it.

@ejona86 ejona86 changed the title Health-checking API Channel-state API Jul 8, 2015
@ejona86
Copy link
Member Author

ejona86 commented Jul 8, 2015

@ejona86 ejona86 added this to the 0.8.0 milestone Jul 8, 2015
@ejona86 ejona86 modified the milestones: Beta (0.9.0), 0.8.0 Aug 14, 2015
@ejona86 ejona86 removed this from the Beta (0.9.0) milestone Aug 26, 2015
@ejona86
Copy link
Member Author

ejona86 commented Oct 1, 2015

I am aware of users who are wanting this feature.

@buchgr
Copy link
Contributor

buchgr commented Apr 13, 2016

I ll work on this next. Assuming the connectivity and semantics api document is still recent @ejona86?

@buchgr buchgr self-assigned this Apr 13, 2016
@ejona86
Copy link
Member Author

ejona86 commented Apr 13, 2016

@buchgr, yes it is still recent. I do know that LoadBalancing will come into play and I don't think we have the states defined for it, so it would take some thought.

@buchgr
Copy link
Contributor

buchgr commented Apr 14, 2016

@ejona86 thanks! so I guess ll first learn about the load balancer and then check out the implementation in the C core, to see if they took load balancing into account (yet), else I ll come up with my own proposal, that we can discuss and update the document according. I ll then implement this for Java. Sounds good?

@ejona86
Copy link
Member Author

ejona86 commented Apr 14, 2016

@buchgr, sounds good, but I'm okay with both of those going in parallel. LB is still in progress, so I don't mind improving the state specification over time and calling them bug fixes.

@ejona86 ejona86 modified the milestones: 1.1, 1.0 Apr 19, 2016
@buchgr buchgr removed their assignment May 13, 2016
@juberti
Copy link

juberti commented May 27, 2016

Any update on the stats aspect of this? Right now, when we make a gRPC, we don't get any feedback regarding the state of the underlying connection, making it hard to understand where delays are coming from.

@lukaszx0
Copy link
Contributor

@buchgr any updates? Are you still planning to work on this?

@buchgr
Copy link
Contributor

buchgr commented Jun 1, 2016

@lukaszx0 nope ... it's free to be picked up by someone :-)

@ejona86
Copy link
Member Author

ejona86 commented Jun 1, 2016

I will note that the changes for fail fast made this easier to implement. It's mainly API work now.

@zhangkun83 zhangkun83 self-assigned this Jul 26, 2016
@Backaway
Copy link

@zhangkun83 any updates? When will this feature be released?

@lukaszx0
Copy link
Contributor

@zhangkun83 I was planning to find some time and work on this. I'll reach out to you and we can discuss.

@zhangkun83
Copy link
Contributor

zhangkun83 commented Sep 16, 2016

@ejona86 and I talked about the channel state API as #1600 will depend on it. @ejona86 wanted to not have the state be passed to the callback. Instead, the callback would need to call getState() to get the current state:

class ManagedChannel {
  State getState();
  void notifyWhenStateChanged(State source, Runnable callback, boolean connect);
}

This has an issue with round-robin. The solution we decided on #1600 depends on the RR LB being able to catch every transition into TRANSIENT_FAILURE. However, if an address always results in connection timeout, the TransportSet for it would switching back and forth between CONNECTING and TRANSIENT_FAILURE. Because the initial back-off delays can be very short, which is the time spent in TRANSIENT_FAILURE, while CONNECTING may be noticeably long, it's possible that the callback from RR LB may get CONNECTING from getState(), despite that the callback is actually called for the transition to TRANSIENT_FAILURE. RR LB wouldn't know that and still thinks the address is good and keeps sending requests on it.

I think we need to pass the state to the callback.

@zhangkun83
Copy link
Contributor

zhangkun83 commented Sep 17, 2016

I take it back. Passing the state to the callback is also flawed. Any state change between the callback being called and the callback calling getState() will be missed. It will be an issue if the user is particularly interested in whether the channel has been in a state recently. RR LB is a such case.

Technically notifyWhenStateChanged(source) is notifyWhenStateIsNot(unexpectedState). When I consider the typical use cases -- notify when ready, and the RR LB -- I find notifyWhenStateIs(expectedState) more useful. I also find the new names convey the semantics better -- you don't have to educate users to pass getStats()'s result as the source. So maybe:

class ManagedChannel {
  State getState();
  void notifyWhenStateIsNot(State unexpected, Runnable callback, boolean connect);
  void notifyWhenStateIs(State expected, Runnable callback, boolean connect);
}

or more powerful:

class ManagedChannel {
  State getState();
  void notifyWhenStateIsIn(EnumSet<State> expected, Runnable callback, boolean connect);
}

@lukaszx0
Copy link
Contributor

I skimmed through grpc connectivity semantics docs and read your proposals.

Instead, the callback would need to call getState() to get the current state.

Why is this a requirement? Is it mainly to keep API consistency with other languages?

Also, while I think I like unified notifyWhenStateIsIn better, why go with API that has the method handling Runnable and not observers (onStateChange)?

@ejona86
Copy link
Member Author

ejona86 commented Sep 19, 2016

Technically notifyWhenStateChanged(source) is notifyWhenStateIsNot(unexpectedState).

No, I don't think it is. The state can change and then change back to the previous state, so that some changes will appear spurious. I don't think we can use notifyWhenStateIsNot and notifyWhenStateIs because it implies the state when the callback is run, and we can't guarantee what the state will be.

Why is this a requirement? Is it mainly to keep API consistency with other languages?

It is to handle a very strong race. Basically, at times the state may change very, very rapidly. The listener should not be notified of every state transition, because that could lead to a heavy queue forming. Instead, just providing the current state has no performance issues and calling getState() explicitly makes it more obvious there is a race involved.

@ejona86
Copy link
Member Author

ejona86 commented Sep 19, 2016

Kun, the connect arg should be on getState(). Check out the C++ API (they actually have both a sync and async API). I don't quite know why they have deadline on the async API, but we may need a way to cancel/remove listeners...

class ManagedChannel {
  State getState(boolean tryToConnect);
  void notifyWhenStateChanged(State lastObserved, Runnable callback);
}

@zhangkun83
Copy link
Contributor

No, I don't think it is. The state can change and then change back to the previous state, so that some changes will appear spurious. I don't think we can use notifyWhenStateIsNot and notifyWhenStateIs because it implies the state when the callback is run, and we can't guarantee what the state will be.

Maybe these methods could be named in a way that doesn't imply what the state is when the callback is run, but instead imply what state triggers the callback.

My issue with notifyWhenStateChanged(source) is that it only tells you whether there was a edge out of a state, but doesn't tell you whether there was a edge into a state, which is necessary for RR LB.

@zhangkun83
Copy link
Contributor

Had a conversation with @a11r and @ejona86. @a11r doesn't think we need an alternative API. We came up with a workaround for the RR LB. It can mark a connection as bad if it has not been ready for a timeout, probably the same as the maximum connection timeout. Then it won't rely on seeing TRANSIENT_FAILURE.

@lukaszx0
Copy link
Contributor

lukaszx0 commented Sep 21, 2016

I believe it has been resolved with #2286 and this issue can be closed.

@zhangkun83
Copy link
Contributor

I have filed #2292 to track the implementation in ManagedChannelImpl.

@robeden
Copy link

robeden commented Apr 19, 2018

ManagedChannel.getState() and notifyWhenStateChanged(ConnectivityState, Runnable) are both still listed as experimental APIs referencing this issue. Is that correct seeing as this is closed and ManagedChannelImpl is also implemented?

@ejona86
Copy link
Member Author

ejona86 commented Apr 19, 2018

@robeden, thanks for pointing that out. I've created #4359 to track and #4360 to fix the links in the code. Because there wasn't a tracking issue, this did slip through the cracks. It should be stabilized. It'll be discussed in our next API review in two weeks (we just finished an API review discussion, and we do it every other week).

@lock lock bot locked as resolved and limited conversation to collaborators Sep 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants