Try out random DNS resolved endpoints in case of a failure. #3559

gregtatcam · 2020-07-23T18:47:49Z

If the resolver response includes a single IP, then we can just use it normally: attempt to connect to that IP and retrieve the list. Store the IP address and the success/failure status in the object and produce a log entry.

If the resolver response includes multiple IPs we need to use more advanced logic. Resolve the hostname again and:

If the previous attempt was successful and if the previously used IP is still valid use that, following the "single IP" logic above.
If this isn't the first attempt and the previous attempt failed, remove the failed IP from the list of resolved addresses if it still exists. Select an IP at random from the remaining choices and follow the "single IP" logic above.

FIXES #3494

If resolver returns multiple IP and if last endpoint is in the list and status is success then use last endpoint otherwise randomly select another endpoint from the list. FIXES 3494

nbougalis

Left one micro-nit. 👍

nbougalis · 2020-08-01T07:25:26Z

src/test/app/DNS_test.cpp

+BEAST_DEFINE_TESTSUITE_MANUAL_PRIO(DNS, ripple_data, ripple, 20);
+
+}  // namespace test
+}  // namespace ripple


Please add a trailing newline here.

seelabs

👍 However, I left a refactoring of the onResolve function for consideration.

seelabs · 2020-08-04T20:09:34Z

src/ripple/app/misc/detail/WorkBase.h

+    resolver_type::iterator end;
+    std::vector<
+        std::reference_wrapper<resolver_type::iterator::value_type const>>
+        entries;


I can see why this is written this way. It's a straight forward way to choose an random endpoint that was not previously failing. However, this code creates a vector of entries to randomly choose from, even when we won't be choosing from the vector. This will allocate memory, which we try to avoid. Even when we will be making a random choice, I don't think we need to create a vector to do so. What do you think about this refactoring?

template <class Impl> void WorkBase<Impl>::onResolve(error_code const& ec, resolver_type::iterator it) { if (ec) return fail(ec); lastEndpoint_ = [&]() -> endpoint_type { // Use last endpoint if it is successfully connected and is in the list, // otherwise pick a random endpoint from the list. If there is only one // endpoint and it is the last endpoint then use the last endpoint. resolver_type::iterator end; auto const foundIt = std::find(it, end, lastEndpoint_); if (lastStatus_ && foundIt != end) return lastEndpoint_; switch (auto const n = std::distance(it, end); n) { case 0: return lastEndpoint_; case 1: return *it; default: { // need to choose a random element from the collection that does // not equal `lastEndpoint_`` if `lastEndpoint_` is part of the // collection. if (foundIt == end) return *std::next(it, rand_int(n - 1)); // lastEndpoint_ is part of the collection // Pick a random number from the n-1 valid choices, if we use // this as an index, note the last element will never be chosen // and the `lastEndpoint_` index may be chosen. So when the // `lastEndpoint_` index is chosen, that is treated as if the // last element was chosen. auto randIndex = (n > 2) ? rand_int(n - 2) : 0; auto const foundIndex = std::distance(it, foundIt); if (randIndex == foundIndex) randIndex = n - 1; return *std::next(it, randIndex); } } }(); socket_.async_connect( lastEndpoint_, strand_.wrap(std::bind( &Impl::onConnect, impl().shared_from_this(), std::placeholders::_1))); }

FWIW, when I saw the vector being populated and usually not used I had concerns similar to those expressed by @seelabs. I'd prefer an approach which either does not use the vector or only populates the vector if it will actually use the contents.

Thank you for the suggestion. I updated with a slightly different implementation but it's pretty much the same idea.

scottschurr

👍 I left a few comments, but I don't think any of them should block merging.

scottschurr · 2020-08-04T00:46:24Z

src/ripple/app/misc/detail/WorkBase.h

@@ -182,8 +195,30 @@ WorkBase<Impl>::onResolve(error_code const& ec, resolver_type::iterator it)
    if (ec)
        return fail(ec);

+    resolver_type::iterator end;
+    std::vector<
+        std::reference_wrapper<resolver_type::iterator::value_type const>>


FWIW, the Boost 1.71 documentation says that the ip::tcp::resolver::iterator is deprecated.

https://www.boost.org/doc/libs/1_71_0/doc/html/boost_asio/reference/ip__tcp/resolver.html

And the usage, at least from here, looks really weird. A default constructed resolver::iterator acts like an end() iterator? At least that's how it's being used here. Are there any alternatives?

scottschurr · 2020-08-04T17:25:07Z

src/ripple/app/misc/impl/ValidatorSite.cpp

@@ -315,9 +324,9 @@ ValidatorSite::onTimer(std::size_t siteIdx, error_code const& ec)
    if (ec)
    {
        // Restart the timer if any errors are encountered, unless the error
-        // is from the wait operating being aborted due to a shutdown request.
+        // is from the wait operation being aborted due to a shutdown request.


Thanks for noticing and fixing the comment.

scottschurr · 2020-08-04T17:57:04Z

src/test/app/DNS_test.cpp

+    {
+        parse();
+        // First endpoint is random. Next three
+        // hould resolve to the same endpoint. Run a few times


hould -> should?

Thank you. Fixed.

scottschurr · 2020-08-04T22:01:34Z

src/test/app/DNS_test.cpp

+        for (int i = 0; i < 4; ++i)
+            makeRequest(lastEndpoint_, false);
+        // Should have more than one but some endpoints can repeat since
+        // selected at random.


I ran the test repeatedly and I feel like I have to agree with your conclusion. However I don't understand why. If the endpoint selection is really random it seems like occasionally all of the random numbers would be the same. With 4 possible endpoints being selected randomly 5 times I think you get a probability of 4:1024 (or 1:256) that you'd get the same value all five times.

Still, I've run the test over 4096 times and it has not failed. Why is that? You should put that explanation in the comments or else someone else will be just as puzzled as me when they maintain the test. If it's an artifact of the pseudo random number generator, then the test is fragile.

I added comments. The gist of this is that on failure we are not going to select the last end point. So it's not really random - we exclude the failed endpoint from the list.

scottschurr · 2020-08-04T22:07:32Z

src/ripple/app/misc/detail/WorkBase.h

+    resolver_type::iterator end;
+    std::vector<
+        std::reference_wrapper<resolver_type::iterator::value_type const>>
+        entries;


FWIW, when I saw the vector being populated and usually not used I had concerns similar to those expressed by @seelabs. I'd prefer an approach which either does not use the vector or only populates the vector if it will actually use the contents.

Change async_resolve handler arg type to results_type. Change random endpoint selection algorithm - remove vector. Clarify expected test result outcome in the unit test.

seelabs · 2020-08-05T18:18:11Z

src/ripple/app/misc/detail/WorkBase.h

+        // previous or next endpoint.
+        auto r = rand_int(results.size() - 1);
+        return *std::next(
+            results.begin(), r != foundIndex ? r : (r > 0 ? --r : ++r));


I think this is a biased sample. The foundIndex's alternate element has two chances of being selected, while every other element has only one chance of being selected. Consider make the r not include the last element. If foundIndex is selected, then select the last element instead. This way every element has the same probability of being selected.

You are right. Changed.

seelabs

👍 Nice job on this!

scottschurr

👍 LGTM!

Add last request endpoint and request status to Site.

02d22df

If resolver returns multiple IP and if last endpoint is in the list and status is success then use last endpoint otherwise randomly select another endpoint from the list. FIXES 3494

carlhua assigned nbougalis, seelabs and scottschurr Jul 23, 2020

nbougalis approved these changes Aug 1, 2020

View reviewed changes

Add trailing new line

13c1fe2

seelabs approved these changes Aug 4, 2020

View reviewed changes

scottschurr approved these changes Aug 4, 2020

View reviewed changes

Change async_resolve from deprecated overload.

8218d36

Change async_resolve handler arg type to results_type. Change random endpoint selection algorithm - remove vector. Clarify expected test result outcome in the unit test.

gregtatcam requested review from nbougalis, scottschurr and seelabs August 5, 2020 18:03

seelabs reviewed Aug 5, 2020

View reviewed changes

Fix sample bias in random endpoint selection algorithm

b98dea9

seelabs approved these changes Aug 5, 2020

View reviewed changes

scottschurr approved these changes Aug 5, 2020

View reviewed changes

nbougalis approved these changes Aug 26, 2020

View reviewed changes

carlhua added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Aug 26, 2020

nbougalis mentioned this pull request Sep 1, 2020

Proposed 1.7.0-b1 #3599

Merged

nbougalis closed this in #3599 Sep 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try out random DNS resolved endpoints in case of a failure. #3559

Try out random DNS resolved endpoints in case of a failure. #3559

gregtatcam commented Jul 23, 2020 •

edited by intelliot

Loading

nbougalis left a comment

nbougalis Aug 1, 2020

gregtatcam Aug 3, 2020

seelabs left a comment

seelabs Aug 4, 2020 •

edited

Loading

scottschurr Aug 4, 2020

gregtatcam Aug 5, 2020

scottschurr left a comment

scottschurr Aug 4, 2020

gregtatcam Aug 5, 2020

scottschurr Aug 4, 2020

scottschurr Aug 4, 2020

gregtatcam Aug 5, 2020

scottschurr Aug 4, 2020

gregtatcam Aug 5, 2020

scottschurr Aug 4, 2020

seelabs Aug 5, 2020

gregtatcam Aug 5, 2020

seelabs left a comment

scottschurr left a comment

Try out random DNS resolved endpoints in case of a failure. #3559

Try out random DNS resolved endpoints in case of a failure. #3559

Conversation

gregtatcam commented Jul 23, 2020 • edited by intelliot Loading

nbougalis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

seelabs Aug 4, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scottschurr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

scottschurr left a comment

Choose a reason for hiding this comment

gregtatcam commented Jul 23, 2020 •

edited by intelliot

Loading

seelabs Aug 4, 2020 •

edited

Loading