disable GooglePubSubSender if pubsub cannot be reached #1030

mattnworb · 2016-12-01T20:54:49Z

Adds an isHealthy() method to the EventSender interface that is used to
filter out any unhealthy senders at startup.

This logic is done in a new class named EventSenderFactory, which
extracts the identical logic for constructing Lists of EventSenders from
the MasterService and the AgentService. To make this common class
possible, also extracted a class for the common configuration fields
between AgentConfig and MasterConfig. This new CommonConfiguration class
is incomplete - more fields can be moved to this class to avoid
duplication, but I left that for future commits.

I created EventSenderFactory in hopes of writing a test for the logic
that removes unhealthy senders from the List, but the nature of this
class makes such a test impractical: building the List of EventSenders
involves constructing new instances of
KafkaProvider/GooglePubSubProvider/etc and calling methods on the
objects those instances return, which is not really feasible to mock
out in a test.
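
As a rough illustration of the startup filtering described above, here is a minimal sketch. It is not the code from this PR: the `EventSender` shape is simplified, and `StubSender` and `HealthySenderFilter` are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for the EventSender interface (assumed shape, not the real one).
interface EventSender {
  void send(String topic, byte[] message);

  // The method this PR adds: reports whether the sender can reach its backend.
  boolean isHealthy();
}

// Hypothetical sender for this sketch; it only records the topics it was given.
final class StubSender implements EventSender {
  private final String name;
  private final boolean healthy;
  final List<String> sentTopics = new ArrayList<>();

  StubSender(String name, boolean healthy) {
    this.name = name;
    this.healthy = healthy;
  }

  @Override
  public void send(String topic, byte[] message) {
    sentTopics.add(topic);
  }

  @Override
  public boolean isHealthy() {
    return healthy;
  }

  @Override
  public String toString() {
    return name;
  }
}

// Hypothetical helper showing the filter step on its own.
final class HealthySenderFilter {
  // Returns only the senders that report themselves healthy, preserving order.
  static List<EventSender> keepHealthy(List<EventSender> candidates) {
    return candidates.stream()
        .filter(EventSender::isHealthy)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<EventSender> senders = keepHealthy(Arrays.asList(
        new StubSender("kafka", true),
        new StubSender("pubsub", false)));
    System.out.println("senders kept at startup: " + senders);
  }
}
```

Taking an already-built list keeps the filter step testable with stubs, even though constructing the real providers is hard to mock.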

mattnworb · 2016-12-01T20:56:28Z

@davidxia @rohansingh @lndbrg @zalenski

this aims to avoid issues like googleapis/google-cloud-java#1432

codecov-io · 2016-12-01T22:07:38Z

Current coverage is 51.25% (diff: 40.81%)

Merging #1030 into master will increase coverage by 0.04%

@@             master      #1030   diff @@
==========================================
  Files           274        276     +2   
  Lines         13132      13135     +3   
  Methods           0          0          
  Messages          0          0          
  Branches       1700       1700          
==========================================
+ Hits           6725       6733     +8   
+ Misses         5903       5899     -4   
+ Partials        504        503     -1

Powered by Codecov. Last update 8cd160e...f14c677

lndbrg · 2016-12-02T13:16:48Z

@mattnworb I believe it would be beneficial to have a cache of senders that expires after n minutes and retries the connection, in case the targeted event bus is down intermittently.

mattnworb · 2016-12-02T15:02:06Z

I think there are two approaches we could take to improve this so that an intermittent network problem at startup does not cause the pubsub sender to be disabled for as long as the agent is up:

1. Add a periodic out-of-band check (similar to getTopic here) that checks connectivity, and enables/disables the sender based on the result.
2. Remove the healthcheck-at-startup added here and modify the failure listener in the FutureCallback added to the future returned by pubsub.publishAsync(..) to set a flag that disables sending for some configured period of time. When this time elapses, subsequent publish attempts would be made again.

I am leaning towards 1, as 2 would still cause publishing attempts if the agent had no connectivity at all to the pubsub service, so the ThresholdBundler would still fill up with requests, just at a slower rate.
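
The first approach could look roughly like the sketch below. It rests on assumptions: `PeriodicallyCheckedSender` is a hypothetical class, the connectivity check is passed in as a plain `BooleanSupplier` rather than a real Pub/Sub call, and "publishing" is just a counter.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical sender wrapper: an out-of-band check toggles whether sends are attempted.
final class PeriodicallyCheckedSender {
  private final BooleanSupplier connectivityCheck;
  // Starts disabled, so nothing is published before the first successful check.
  private final AtomicBoolean enabled = new AtomicBoolean(false);
  private final AtomicInteger published = new AtomicInteger();

  PeriodicallyCheckedSender(BooleanSupplier connectivityCheck) {
    this.connectivityCheck = connectivityCheck;
  }

  // Runs one check and records the result; a check that throws counts as unhealthy.
  void runCheck() {
    boolean ok;
    try {
      ok = connectivityCheck.getAsBoolean();
    } catch (RuntimeException e) {
      ok = false;
    }
    enabled.set(ok);
  }

  // Schedules runCheck() immediately and then after every period; the caller shuts it down.
  ScheduledExecutorService start(long period, TimeUnit unit) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(this::runCheck, 0, period, unit);
    return executor;
  }

  // Drops the message while disabled instead of queueing it; returns whether it was accepted.
  boolean send(String topic, byte[] message) {
    if (!enabled.get()) {
      return false;
    }
    published.incrementAndGet(); // a real sender would publish here
    return true;
  }

  int publishedCount() {
    return published.get();
  }

  public static void main(String[] args) {
    AtomicBoolean reachable = new AtomicBoolean(false);
    PeriodicallyCheckedSender sender = new PeriodicallyCheckedSender(reachable::get);
    sender.runCheck();
    System.out.println("accepted while unreachable: " + sender.send("events", new byte[0]));
    reachable.set(true);
    sender.runCheck();
    System.out.println("accepted after recovery: " + sender.send("events", new byte[0]));
  }
}
```

Because the check runs outside the publish path, a sender disabled by a startup blip is re-enabled on the next successful check, and no publish requests pile up while the service is unreachable.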

lndbrg · 2016-12-02T15:39:33Z

1 sounds like a good idea.

davidxia · 2016-12-02T17:51:04Z

helios-services/src/main/java/com/spotify/helios/servicescommon/KafkaSender.java

-  public void send(final KafkaRecord kafkaRecord) {
+  @Override
+  public boolean isHealthy() {
+    return true;


Unrelated to PR, but should we always return true here? Is there value in checking we can talk to Kafka by using `KafkaProducer.waitOnMetadata()`, for example?

Since we haven't seen any issues or bad behavior from the Kafka sender when messages can't be published, I opted to leave that out here to avoid changing anything that is working OK today.

mattnworb force-pushed the pubsub-healthcheck branch from b6efcf2 to 82bd16c Compare December 1, 2016 20:59

mattnworb force-pushed the pubsub-healthcheck branch from 82bd16c to f14c677 Compare December 1, 2016 21:00

davidxia approved these changes Dec 2, 2016

mattnworb mentioned this pull request Dec 2, 2016

periodically check connectivity to Cloud Pubsub #1031

Merged

mattnworb closed this Dec 2, 2016

mattnworb deleted the pubsub-healthcheck branch January 12, 2017 20:00
