Don't commit entries from previous terms by counting replicas + noop entry upon leader election #336
Conversation
Codecov Report
@@ Coverage Diff @@
## master #336 +/- ##
==========================================
+ Coverage 84.12% 84.19% +0.07%
==========================================
Files 50 50
Lines 9416 9460 +44
Branches 2500 2515 +15
==========================================
+ Hits 7921 7965 +44
+ Misses 940 936 -4
- Partials 555 559 +4
Converted to draft to optimize the vote counting a bit first.
Added an optimization; I need to test it a bit more to feel comfortable releasing it for review. In short, raft will only try to commit log entry indices that are smaller than or equal to the largest `match_index` of another voter in the cluster (except in some corner cases). This fixes a case where the loop that probes for new commit indices would run needlessly, e.g. 1000 times when the leader's match index is 5000 and the next voter's match index is only 4000. In that case it only makes sense to start looking for a new commit index from 4000 and below.
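The bound described above can be sketched as follows. This is a minimal standalone illustration with a hypothetical name (`probe_start_index`), not the library's actual implementation: no entry beyond the highest `match_index` of another voter can possibly reach a majority, so the probe only needs to start there and work downwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t raft_index;

/* Hypothetical flattened view of a cluster: one match_index per voter,
 * where slot `self` is the leader's own (usually highest) index. */
static raft_index probe_start_index(const raft_index *match_index,
                                    size_t n_voters,
                                    size_t self)
{
    raft_index start = 0;
    size_t i;
    for (i = 0; i < n_voters; i++) {
        if (i == self) {
            continue; /* the leader's own index alone can't form a majority */
        }
        if (match_index[i] > start) {
            start = match_index[i];
        }
    }
    return start;
}
```

With the leader at 5000 and the next voter at 4000, the probe starts at 4000 instead of 5000, saving roughly a thousand loop iterations in the example above.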
Force-pushed from 5cc0bed to 4031e23
The overhaul of the …
src/replication.c
Outdated
if (index <= r->commit_index) {
    return;
/* In a single voter cluster where this node is part of the configuration,
 * is a voter all stored entries can be committed. */
I think `is a voter` can be dropped.
Thanks, will investigate if it can be dropped.
> Thanks, will investigate if it can be dropped.
Not sure it's clear: I mean that the words `is a voter` can be dropped from the comment (probably they are a leftover? The full comment doesn't quite sound grammatically correct to me).
Ah okay, yes the sentence indeed looks a bit strange.
tests also still pass when I remove `r->configuration.servers[server_index].role == RAFT_VOTER` from the check, I will investigate anyway to make sure my understanding is correct.
> tests also still pass when I remove `r->configuration.servers[server_index].role == RAFT_VOTER` from the check, I will investigate anyway to make sure my understanding is correct.
I was a bit suspicious about that condition too, but didn't mention it actually. I would not expect that cases where there is a single server and that server is not a voter are possible, because otherwise there would be no way to make progress and commit further entries. So that condition could be turned into an assertion. However, I'm just hand-waving, not sure.
The check actually says that there is 1 voter in the cluster, not that there is a single server. Checking for `r->configuration.servers[server_index].role == RAFT_VOTER` then means that this server was not demoted to standby or spare in the latest configuration and is busy committing the configuration change.
Ah yeah sorry, I misread the `configurationVoterCount` call, that's the number of voters indeed, not servers. Makes sense then.
Ok, I double-checked and I think the condition is fine. The test still succeeds after removing the `RAFT_VOTER` condition because the test is less strict then, and the `start_index` is chosen to be `last_stored` on the current leader, which will be the highest possible index that can be committed anyway.
Looks good to me at first sight. Just a few minor nits.
Force-pushed from 4031e23 to 315bafb
Force-pushed from 85c397c to e35e216
src/replication.c
Outdated
 * of matchIndex[i] >= N, and log[N].term == currentTerm:
 * set commitIndex = N (5.3, 5.4).
 */
start_index = replicationCalculateCommitIndex(r);
I just thought of a different approach here: sort the nodes' match indices from highest to lowest, then check for a majority at each of those indices only. Whatever the highest index we can commit is, it's going to be equal to the match index `i` of some node `n`, such that for `i` we have a majority but for `i + 1` we don't (because we lose `n`). That way the number of runs of the loop body below is bounded by the number of nodes, instead of the length of the log. What do you think?
Actually, it'll always be the median (rounded up) of that sorted array of match indices, right? So no looping at all.
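The median observation above can be sketched like this. A standalone toy (the name `majority_match_index` is invented, this is not raft's code): sort the match indices in descending order and take the element at 0-based position `n/2`, which is exactly the highest value that `n/2 + 1` voters (a majority) have reached. Note that per the Raft safety rules, this index may still only be *committed* if the entry's term is the current term.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t raft_index;

/* qsort comparator: descending order of raft_index values. */
static int cmp_desc(const void *a, const void *b)
{
    raft_index x = *(const raft_index *)a;
    raft_index y = *(const raft_index *)b;
    return x < y ? 1 : (x > y ? -1 : 0);
}

/* Highest index stored on a majority of voters, with no looping over
 * the log at all: the (n/2 + 1)-th highest match index. */
static raft_index majority_match_index(const raft_index *match_index,
                                       size_t n_voters)
{
    raft_index *sorted = malloc(n_voters * sizeof *sorted);
    raft_index result;
    assert(sorted != NULL);
    memcpy(sorted, match_index, n_voters * sizeof *sorted);
    qsort(sorted, n_voters, sizeof *sorted, cmp_desc);
    result = sorted[n_voters / 2];
    free(sorted);
    return result;
}
```

For match indices {5000, 4000, 3000} this yields 4000: two of the three voters (a majority) have stored entry 4000, and no higher index has majority support.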
Yeah, I think it's worth it to optimize this, let me think about it for a bit.
I actually think we can keep the current (what's on master) implementation of `replicationQuorum`, while adding the check to not commit entries by counting if the term is older than the current term, on the condition that we call it here

Line 754 in faadebb
replicationQuorum(r, r->last_stored);

with `last_index` instead of `r->last_stored`. `last_index` is equal to the `match_index` of the server from which we have just received an `AppendEntriesResult` RPC (due to

Line 722 in faadebb
if (!progressMaybeUpdate(r, i, last_index)) {

). This works because a new `commit_index` is always equal to some server's `match_index`, and all `match_index` updates trickle down to the `replicationQuorum` call. Unless this call

Line 746 in faadebb
rv = triggerActualPromotion(r);

fails, in which case we're `RAFT_UNAVAILABLE` anyway.
It's a much simpler solution than the one currently proposed in this PR (and it still passes the new tests). Let me know what you think.
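The simplified approach described above can be illustrated roughly like this. This is a standalone sketch with invented names (`quorum_try_commit`, `struct voter`), not the actual `replicationQuorum`: try to commit an index we just learned is some server's `match_index`, but refuse to do so by counting replicas when the entry is from an older term (Raft §5.4.2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t raft_index;
typedef uint64_t raft_term;

struct voter {
    raft_index match_index;
};

/* Returns 1 and advances *commit_index if `index` can be committed:
 * it must be past the current commit index, its entry must belong to
 * the current term, and a majority of voters must have stored it.
 * log_terms[i] holds the term of the entry at index i + 1. */
static int quorum_try_commit(const struct voter *voters, size_t n_voters,
                             raft_term current_term,
                             const raft_term *log_terms,
                             raft_index *commit_index,
                             raft_index index)
{
    size_t votes = 0;
    size_t i;
    if (index <= *commit_index) {
        return 0;
    }
    if (log_terms[index - 1] != current_term) {
        return 0; /* never commit old-term entries by counting replicas */
    }
    for (i = 0; i < n_voters; i++) {
        if (voters[i].match_index >= index) {
            votes++;
        }
    }
    if (votes > n_voters / 2) {
        *commit_index = index;
        return 1;
    }
    return 0;
}
```

Because every new commit index equals some server's `match_index`, calling this on each `match_index` update as it arrives never misses a committable index.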
That reasoning seems sound to me, and I like the simplification. I guess that here

Lines 520 to 521 in 0ddb329
/* Check if we can commit some new entries. */
replicationQuorum(r, r->last_stored);

we'd still use `r->last_stored`, right?
> That reasoning seems sound to me, and I like the simplification. I guess that here (Lines 520 to 521 in 0ddb329) we'd still use r->last_stored, right?

yes
The callback could contain logic that depends on the value of last_applied, which logically should have increased after applying the log entry, but before calling the callback.
Add test to ensure the commit_index is updated when a leader continually receives apply requests. The test fails without the fix in this commit. Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Adapt the tests to take into account the added barrier entries in the log.
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
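The `last_applied` ordering fixed by the commits above can be sketched as follows. A toy standalone illustration with invented names (`fire_applied`, `struct apply_req`), not raft's actual code: the state is advanced *before* the callback runs, so callback logic that reads `last_applied` sees the post-apply value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t raft_index;

struct apply_req;
typedef void (*apply_cb)(struct apply_req *req, raft_index last_applied);

struct apply_req {
    raft_index index; /* index of the entry this request applied */
    apply_cb cb;
};

struct server_state {
    raft_index last_applied;
};

/* Advance last_applied first, then invoke the callback, so any logic
 * inside the callback (e.g. a barrier scheduling another barrier)
 * observes the increased value. */
static void fire_applied(struct server_state *s, struct apply_req *req)
{
    s->last_applied = req->index; /* update first... */
    if (req->cb != NULL) {
        req->cb(req, s->last_applied); /* ...then notify */
    }
}

/* Test helper: records the last_applied value seen by a callback. */
static raft_index observed_last_applied;
static void record_cb(struct apply_req *req, raft_index last_applied)
{
    (void)req;
    observed_last_applied = last_applied;
}
```

With the old ordering (callback before the update), `record_cb` would observe a stale value, which is the dqlite barrier-within-a-barrier bug this PR fixes.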
Force-pushed from e35e216 to b4bdc8a
Implemented the simplification described here. Removed the test that checks that we don't commit entries from previous terms by counting votes, as it was really testing nothing anymore after this change (it was hacky anyway), and from the code it's clear we can't commit entries from an older term by counting votes.
It's surely simpler :)
An entry from a previous term will no longer be committed by counting votes (which violates the raft spec); instead, upon leader election a leader applies a noop barrier entry to its log. When committing the noop barrier entry, all previous entries in the log will be committed.
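The noop-on-election mechanism can be sketched as follows. A standalone toy with invented names (`append_election_barrier`, `struct leader`), not raft's implementation: the barrier carries the leader's own term, so it may safely be committed by counting replicas, and since the commit index covers the whole log prefix, committing it commits every earlier (old-term) entry with it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t raft_index;
typedef uint64_t raft_term;

enum entry_type { ENTRY_COMMAND, ENTRY_BARRIER };

struct entry {
    raft_term term;
    enum entry_type type;
};

#define LOG_CAP 64

struct leader {
    raft_term current_term;
    struct entry log[LOG_CAP]; /* log[i] holds the entry at index i + 1 */
    size_t n_entries;
};

/* On winning an election, append a no-op barrier entry stamped with the
 * new leader's term and return its index. */
static raft_index append_election_barrier(struct leader *l)
{
    struct entry noop;
    noop.term = l->current_term;
    noop.type = ENTRY_BARRIER;
    assert(l->n_entries < LOG_CAP);
    l->log[l->n_entries++] = noop;
    return (raft_index)l->n_entries;
}
```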
The noop entry upon leader election is needed because raft, on startup, doesn't know if its latest configuration loaded from disk is committed or not; therefore it must set `configuration_uncommitted_index` (to be added in a following PR) and will reject any configuration changes requested by the client. To clear `configuration_uncommitted_index`, the entry at that index must be committed. It's also needed when a follower becomes leader while still having a non-0 `configuration_uncommitted_index`.

This also includes a fix to update `last_applied` before firing an applied request's callback. The callback could contain logic dependent on `last_applied`, which should have logically increased after applying the entry, but before firing the callback. This fixes an issue in dqlite where a barrier could fire another barrier in its callback.

Fixes #220.
Note: this will break the dqlite tests, but not dqlite's functionality. I've updated raft's version so dqlite's configure step can test against it.