Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try out random DNS resolved endpoints in case of a failure. #3559

Closed
wants to merge 4 commits into from

Conversation

gregtatcam
Copy link
Collaborator

@gregtatcam gregtatcam commented Jul 23, 2020

If the resolver response includes a single IP, then we can just use it normally: attempt to connect to that IP and retrieve the list. Store the IP address and the success/failure status in the object and produce a log entry.

If the resolver response includes multiple IPs we need to use more advanced logic. Resolve the hostname again and:

If the previous attempt was successful and if the previously used IP is still valid use that, following the "single IP" logic above.
If this isn't the first attempt and the previous attempt failed, remove the failed IP from the list of resolved addresses if it still exists. Select an IP at random from the remaining choices and follow the "single IP" logic above.

FIXES #3494

If resolver returns multiple IP and if last endpoint
is in the list and status is success then use last
endpoint otherwise randomly select another endpoint
from the list.
FIXES 3494
Copy link
Contributor

@nbougalis nbougalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one micro-nit. 👍

BEAST_DEFINE_TESTSUITE_MANUAL_PRIO(DNS, ripple_data, ripple, 20);

} // namespace test
} // namespace ripple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a trailing newline here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 However, I left a refactoring of the onResolve function for consideration.

resolver_type::iterator end;
std::vector<
std::reference_wrapper<resolver_type::iterator::value_type const>>
entries;
Copy link
Collaborator

@seelabs seelabs Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why this is written this way. It's a straight forward way to choose an random endpoint that was not previously failing. However, this code creates a vector of entries to randomly choose from, even when we won't be choosing from the vector. This will allocate memory, which we try to avoid. Even when we will be making a random choice, I don't think we need to create a vector to do so. What do you think about this refactoring?

template <class Impl>
void
WorkBase<Impl>::onResolve(error_code const& ec, resolver_type::iterator it)
{
    if (ec)
        return fail(ec);

    lastEndpoint_ = [&]() -> endpoint_type {
        // Use last endpoint if it is successfully connected and is in the list,
        // otherwise pick a random endpoint from the list. If there is only one
        // endpoint and it is the last endpoint then use the last endpoint.

        resolver_type::iterator end;
        auto const foundIt = std::find(it, end, lastEndpoint_);
        if (lastStatus_ && foundIt != end)
            return lastEndpoint_;

        switch (auto const n = std::distance(it, end); n)
        {
            case 0:
                return lastEndpoint_;
            case 1:
                return *it;
            default: {
                // need to choose a random element from the collection that does
                // not equal `lastEndpoint_`` if `lastEndpoint_` is part of the
                // collection.
                if (foundIt == end)
                    return *std::next(it, rand_int(n - 1));
                // lastEndpoint_ is part of the collection
                // Pick a random number from the n-1 valid choices, if we use
                // this as an index, note the last element will never be chosen
                // and the `lastEndpoint_` index may be chosen. So when the
                // `lastEndpoint_` index is chosen, that is treated as if the
                // last element was chosen.
                auto randIndex = (n > 2) ? rand_int(n - 2) : 0;
                auto const foundIndex = std::distance(it, foundIt);
                if (randIndex == foundIndex)
                    randIndex = n - 1;
                return *std::next(it, randIndex);
            }
        }
    }();

    socket_.async_connect(
        lastEndpoint_,
        strand_.wrap(std::bind(
            &Impl::onConnect,
            impl().shared_from_this(),
            std::placeholders::_1)));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, when I saw the vector being populated and usually not used I had concerns similar to those expressed by @seelabs. I'd prefer an approach which either does not use the vector or only populates the vector if it will actually use the contents.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. I updated with a slightly different implementation but it's pretty much the same idea.

Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I left a few comments, but I don't think any of them should block merging.

@@ -182,8 +195,30 @@ WorkBase<Impl>::onResolve(error_code const& ec, resolver_type::iterator it)
if (ec)
return fail(ec);

resolver_type::iterator end;
std::vector<
std::reference_wrapper<resolver_type::iterator::value_type const>>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the Boost 1.71 documentation says that the ip::tcp::resolver::iterator is deprecated.

https://www.boost.org/doc/libs/1_71_0/doc/html/boost_asio/reference/ip__tcp/resolver.html

And the usage, at least from here, looks really weird. A default constructed resolver::iterator acts like an end() iterator? At least that's how it's being used here. Are there any alternatives?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -315,9 +324,9 @@ ValidatorSite::onTimer(std::size_t siteIdx, error_code const& ec)
if (ec)
{
// Restart the timer if any errors are encountered, unless the error
// is from the wait operating being aborted due to a shutdown request.
// is from the wait operation being aborted due to a shutdown request.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing and fixing the comment.

{
parse();
// First endpoint is random. Next three
// hould resolve to the same endpoint. Run a few times
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hould -> should?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Fixed.

for (int i = 0; i < 4; ++i)
makeRequest(lastEndpoint_, false);
// Should have more than one but some endpoints can repeat since
// selected at random.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test repeatedly and I feel like I have to agree with your conclusion. However I don't understand why. If the endpoint selection is really random it seems like occasionally all of the random numbers would be the same. With 4 possible endpoints being selected randomly 5 times I think you get a probability of 4:1024 (or 1:256) that you'd get the same value all five times.

Still, I've run the test over 4096 times and it has not failed. Why is that? You should put that explanation in the comments or else someone else will be just as puzzled as me when they maintain the test. If it's an artifact of the pseudo random number generator, then the test is fragile.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments. The gist of this is that on failure we are not going to select the last end point. So it's not really random - we exclude the failed endpoint from the list.

resolver_type::iterator end;
std::vector<
std::reference_wrapper<resolver_type::iterator::value_type const>>
entries;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, when I saw the vector being populated and usually not used I had concerns similar to those expressed by @seelabs. I'd prefer an approach which either does not use the vector or only populates the vector if it will actually use the contents.

Change async_resolve handler arg type to results_type.
Change random endpoint selection algorithm - remove vector.
Clarify expected test result outcome in the unit test.
// previous or next endpoint.
auto r = rand_int(results.size() - 1);
return *std::next(
results.begin(), r != foundIndex ? r : (r > 0 ? --r : ++r));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a biased sample. The foundIndex's alternate element has two chances of being selected, while every other element has only one chance of being selected. Consider make the r not include the last element. If foundIndex is selected, then select the last element instead. This way every element has the same probability of being selected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Changed.

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Nice job on this!

Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM!

@carlhua carlhua added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Aug 26, 2020
@nbougalis nbougalis mentioned this pull request Sep 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DNS resolution for validator lists
5 participants