-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failsafes to prevent a consensus round from taking too long #5277
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5277 +/- ##
=======================================
Coverage 78.1% 78.1%
=======================================
Files 790 790
Lines 67908 67984 +76
Branches 8227 8231 +4
=======================================
+ Hits 53028 53106 +78
+ Misses 14880 14878 -2
🚀 New features to boost your workflow:
|
f07992d
to
76c27a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be cool to have a unit test for the last part of DisputedTx.h
; not sure how realistic that request is. Approved in any case.
6e513d9
to
26ab221
Compare
8dcbf91
to
a6d3cea
Compare
- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.
a6d3cea
to
197356b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current version fails to build on MacOS because Mac's version of libstdc++ is dropping the assignment operator for std::map
pairs. There are two viable fixes:
-
Add an assignment operator to
ConsensusParms
:
We could add a custom assignment operator forConsensusParms
that, instead of assigning theavalancheCutoffs
map directly (which triggers the error), manually copies its contents into the target map. -
Make
avalancheCutoffs
const and remove the assignment:
By declaring the map as const, you avoid any assignment after its construction. I would prefer this option, but it means that we must update the unit tests that does
peer->consensusParms = parms;
so that it no longer tries to perform such an assignment.
* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)
* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274) chore: Fix small typos in protocol files (5279)
* upstream/develop: Reduce duplicate peer traffic for ledger data (5126)
* upstream/develop: Revert "Reduce duplicate peer traffic for ledger data (5126)" (5300) docs: Revert peer port to 51235 (5299) Set version to 2.4.0-rc1 fix: Switch Permissioned Domain to Supported::yes (5287) docs: Clarifies default port of hosts (5290) Log proposals and validations (5291)
* upstream/develop: Add logging and improve counting of amendment votes from UNL (5173)
* upstream/develop: fix: Remove 'new parent hash' assert (5313)
785f2bb
to
979462f
Compare
* upstream/develop: Set version to 2.4.0-rc3 fix: Acquire previously failed transaction set from network as new proposal arrives (5318) Fix Replace `assert` with `XRPL_ASSERT` (5312)
* upstream/develop: Log detailed correlated consensus data together (5302)
* upstream/develop: chore: Move "assert" and "werr" flags from "actions/build" (5325)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two suggestions for now; I admire the unit test coverage of this but am not yet finished.
// See if enough time has passed to move on to the next. | ||
XRPL_ASSERT( | ||
nextCutoff.consensusTime >= currentCutoff.consensusTime, | ||
"ripple::DisputedTx::updateVote next state valid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ripple::DisputedTx::updateVote next state valid"); | |
"ripple::getNeededWeight : next state valid"); |
enum AvalancheState { init, mid, late, stuck }; | ||
struct AvalancheCutoff | ||
{ | ||
std::size_t const consensusTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::size_t const consensusTime; | |
int const consensusTime; |
@@ -1692,6 +1710,17 @@ Consensus<Adaptor>::haveConsensus( | |||
JLOG(j_.debug()) << "Checking for TX consensus: agree=" << agree | |||
<< ", disagree=" << disagree; | |||
|
|||
ConsensusParms const& parms = adaptor_.parms(); | |||
// Stalling is BAD | |||
bool stalled = haveCloseTimeConsensus_ && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this const
please
JLOG(j_.info()) << "No change (" << (ourVote_ ? "YES" : "NO") | ||
<< ") : weight " << weight << ", percent " | ||
<< percentTime; | ||
<< percentTime | ||
<< ", round(s) with this vote: " << currentVoteCounter_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is also old behaviour, but would it be useful to show here tx_.id()
, like we do below when our vote is changed ?
clog)) | ||
{ | ||
JLOG(j.debug()) << "normal consensus"; | ||
CLOG(clog) << "reached. "; | ||
JLOG(j.debug()) << "normal consensus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if stalled
was logged at a higher severity level, perhaps j.warn()
|
||
auto const& lcl = peer->lastClosedLedger; | ||
BEAST_EXPECT(lcl.id() == peer->prevLedgerID()); | ||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any value in BEAST_EXPECT
inside the commented out section ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good change. I like the unis tests coverage (but an open question re. commented out section), I like how the new Expired
state only kicks in when we are not stalled
on all transactions (nice decoupling on both states).
bool const ret = currentPercentage >= minConsensusPct; | ||
CLOG(clog) << "currentPercentage: " << currentPercentage | ||
<< (stalled ? ", stalled" : ""); | ||
bool const ret = stalled || currentPercentage >= minConsensusPct; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting. We only get stalled
when every disputed transaction is unequivocally either under 20% or over 80% consensus (repectively "nay" or "yay") so there is no way to manipulate transactions which do not cut the threshold into the ledger. Would be good to put explanation to such effect in a comment, though.
* upstream/develop: Set version to 2.4.0 Set version to 2.4.0-rc4 chore: Update XRPL Foundation Validator List URL (5326)
* upstream/develop: refactor: Remove unused and add missing includes (5293)
High Level Overview of Change
This PR, if merged, will introduce two fail safes into the consensus logic to prevent a consensus round from remaining in the establish phase indefinitely.
Context of Change
At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.
The current evidence indicates that two things happened.
This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
This decision algorithm has been in place for at least 8 years, and possibly since the first release of
rippled
. The odds of it happening were thought to be 0, but it turns out they're just very very small.Type of Change
This change is fully backward and forward compatible, and does not require an amendment.
Before / After
This is an outline of the changes in this PR.
NetworkOPsImp::processTrustedProposal
, if the proposal is from us, it don't process it. This should be impossible, but this will help confirm if not.checkConsensusReached
's other restrictions, such as minimum proposers, still apply)avMIN_ROUNDS
) "inner rounds" ofphaseEstablish
,avSTALLED_ROUNDS
) "inner rounds",ledgerMAX_CONSENSUS
) to 120s (ledgerABANDON_CONSENSUS
).phaseEstablish
), thenConsensusState::Expired
is treated asConsensusState::No
.ConsensusParms.h
has been rewritten as more of an explicit state machine. It's basically a map of states to their time cutoff, percentage cutoff, and next state, and agetNeededWeight
function that will evaluate whether to move to the next state.avMIN_ROUNDS
) "inner rounds" in each state. Close time consensus does not have this restriction (but it could).