Failsafes to prevent a consensus round from taking too long #5277

ximinez · 2025-02-05T04:14:24Z

High Level Overview of Change

This PR, if merged, will introduce two fail safes into the consensus logic to prevent a consensus round from remaining in the establish phase indefinitely.

Detects if the consensus process is "stalled". If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. Details below in "Before / After".
If we have been in the establish phase for more than 10x the previous consensus establish phase's time, then consensus is considered "expired", and we will leave the round, which sends a partial validation (indicating that the node is moving on without validating). There are restrictions intended to avoid prematurely exiting, or having an extended exit in extreme situations. Details below in "Before / After".
1. When enough nodes leave the round, any remaining nodes will see they've fallen behind, and move on, too, generally before hitting the timeout. Any validations or partial validations sent during this time will help the consensus process bring the nodes back together.

Context of Change

At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.

The current evidence indicates that two things happened.

Some disputed transactions had just enough "yes" votes that validators voting "yes" saw the approval as just over 95%, while those voting "no" saw the approval as just under 95%. Thus, every node thought that it was doing the right thing, and no nodes changed their vote. While this is annoying, normally consensus will move on because at least 80% of the UNL validators will be in agreement over which transaction set to use, and so consensus moves on with that set. However,
The disputed transactions with the close approval rates were distributed such that there were several clumps of validators voting yes for different transactions than other clumps of validators. This led to a situation where no transaction set had 80% approval.

This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.

This decision algorithm has been in place for at least 8 years, and possibly since the first release of rippled. The odds of it happening were thought to be 0, but it turns out they're just very very small.

Type of Change

Bug fix (non-breaking change which fixes an issue)

This change is fully backward and forward compatible, and does not require an amendment.

Before / After

This is an outline of the changes in this PR.

In NetworkOPsImp::processTrustedProposal, if the proposal is from us, it don't process it. This should be impossible, but this will help confirm if not.
Detects if the consensus process is "stalled". If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. (checkConsensusReached's other restrictions, such as minimum proposers, still apply)
1. "Stalled" is distinct from "stuck" used as a consensus percentage state. Naming things is hard.
2. "Stalled" is defined as:
  1. We have a close time consensus
  2. Each disputed transaction is individually stalled:
    1. It has been in the final "stuck" 95% requirement for at least 2 (avMIN_ROUNDS) "inner rounds" of phaseEstablish,
    2. and either all of the other trusted proposers or this validator, if proposing, have had the same vote(s) vote for at least 4 (avSTALLED_ROUNDS) "inner rounds",
    3. and at least 80% of the validators (including this one, if appropriate) agree about the vote (whether yes or no).
If we have been in the establish phase for more than 10x the previous consensus round time, then consensus is considered "expired", and we will leave the round, which sends a partial validation. There are two restrictions intended to avoid prematurely exiting, or having an extended exit in extreme situations.
1. The 10x time is clamped to be within a range of 15s (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS).
2. If consensus has not had an opportunity to walk through all percentage avalanche states (defined as not going through 8 "inner rounds" of phaseEstablish), then ConsensusState::Expired is treated as ConsensusState::No.
The close time avalanching defined in ConsensusParms.h has been rewritten as more of an explicit state machine. It's basically a map of states to their time cutoff, percentage cutoff, and next state, and a getNeededWeight function that will evaluate whether to move to the next state.
1. This function is used for both disputed transactions and close time consensus.
2. In addition to the the "previous round time percentage" limits, disputed transactions will also be required to spend at least 2 (avMIN_ROUNDS) "inner rounds" in each state. Close time consensus does not have this restriction (but it could).
3. This map is more easily modifiable than the previous individual variables, so if we decide to change parameters, it only requires changing the map.
Adds tests of the functionality that detects the "stalled" state.
Finally, it adds a simulation unit test that attempts to recreate the scenario that got us here in the first place. However, I'm not very familiar with the simulator, and we're still not 100% sure how we got into this state in the first place, so it's not very good.

codecov · 2025-02-05T04:39:22Z

Codecov Report

Attention: Patch coverage is 88.99083% with 12 lines in your changes missing coverage. Please review.

Project coverage is 78.1%. Comparing base (2406b28) to head (88cf631).

Files with missing lines	Patch %	Lines
src/xrpld/consensus/Consensus.h	81.1%	7 Missing ⚠️
src/xrpld/app/misc/NetworkOPs.cpp	66.7%	3 Missing ⚠️
src/xrpld/consensus/Consensus.cpp	92.3%	1 Missing ⚠️
src/xrpld/consensus/DisputedTx.h	96.8%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #5277   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          790     790           
  Lines        67908   67984   +76     
  Branches      8227    8231    +4     
=======================================
+ Hits         53028   53106   +78     
+ Misses       14880   14878    -2

Files with missing lines	Coverage Δ
src/xrpld/consensus/ConsensusParms.h	`100.0% <100.0%> (ø)`
src/xrpld/consensus/ConsensusTypes.h	`79.5% <ø> (ø)`
src/xrpld/consensus/Consensus.cpp	`75.2% <92.3%> (+1.3%)`	⬆️
src/xrpld/consensus/DisputedTx.h	`99.0% <96.8%> (+3.0%)`	⬆️
src/xrpld/app/misc/NetworkOPs.cpp	`69.1% <66.7%> (-<0.1%)`	⬇️
src/xrpld/consensus/Consensus.h	`88.3% <81.1%> (+1.5%)`	⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

src/xrpld/consensus/DisputedTx.h

src/xrpld/consensus/Consensus.cpp

Bronek

Would be cool to have a unit test for the last part of DisputedTx.h ; not sure how realistic that request is. Approved in any case.

- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.

vlntb

The current version fails to build on MacOS because Mac's version of libstdc++ is dropping the assignment operator for std::map pairs. There are two viable fixes:

Add an assignment operator to ConsensusParms:
We could add a custom assignment operator for ConsensusParms that, instead of assigning the avalancheCutoffs map directly (which triggers the error), manually copies its contents into the target map.
Make avalancheCutoffs const and remove the assignment:
By declaring the map as const, you avoid any assignment after its construction. I would prefer this option, but it means that we must update the unit tests that does

peer->consensusParms = parms;

so that it no longer tries to perform such an assignment.

src/xrpld/consensus/ConsensusParms.h

src/test/consensus/Consensus_test.cpp

* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)

* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274) chore: Fix small typos in protocol files (5279)

* upstream/develop: Reduce duplicate peer traffic for ledger data (5126)

* upstream/develop: Revert "Reduce duplicate peer traffic for ledger data (5126)" (5300) docs: Revert peer port to 51235 (5299) Set version to 2.4.0-rc1 fix: Switch Permissioned Domain to Supported::yes (5287) docs: Clarifies default port of hosts (5290) Log proposals and validations (5291)

* upstream/develop: Add logging and improve counting of amendment votes from UNL (5173)

* upstream/develop: fix: Remove 'new parent hash' assert (5313)

ximinez · 2025-02-26T23:49:31Z

@Bronek @vlntb Other than testDisjointNetwork, or any changes in the logic / rules, I think this is as done as it's going to get. Could you please re-review?

CC: @JoelKatz

* upstream/develop: Set version to 2.4.0-rc3 fix: Acquire previously failed transaction set from network as new proposal arrives (5318) Fix Replace `assert` with `XRPL_ASSERT` (5312)

* upstream/develop: Log detailed correlated consensus data together (5302)

* upstream/develop: chore: Move "assert" and "werr" flags from "actions/build" (5325)

Bronek

Two suggestions for now; I admire the unit test coverage of this but am not yet finished.

Bronek · 2025-03-03T15:47:21Z

src/xrpld/consensus/ConsensusParms.h

+        // See if enough time has passed to move on to the next.
+        XRPL_ASSERT(
+            nextCutoff.consensusTime >= currentCutoff.consensusTime,
+            "ripple::DisputedTx::updateVote next state valid");


Suggested change

"ripple::DisputedTx::updateVote next state valid");

"ripple::getNeededWeight : next state valid");

Bronek · 2025-03-03T15:59:29Z

src/xrpld/consensus/ConsensusParms.h

+    enum AvalancheState { init, mid, late, stuck };
+    struct AvalancheCutoff
+    {
+        std::size_t const consensusTime;


Suggested change

std::size_t const consensusTime;

int const consensusTime;

Bronek · 2025-03-04T15:23:39Z

src/xrpld/consensus/Consensus.h

@@ -1692,6 +1710,17 @@ Consensus<Adaptor>::haveConsensus(
    JLOG(j_.debug()) << "Checking for TX consensus: agree=" << agree
                     << ", disagree=" << disagree;

+    ConsensusParms const& parms = adaptor_.parms();
+    // Stalling is BAD
+    bool stalled = haveCloseTimeConsensus_ &&


Make this const please

Bronek · 2025-03-05T14:20:10Z

src/xrpld/consensus/DisputedTx.h

        JLOG(j_.info()) << "No change (" << (ourVote_ ? "YES" : "NO")
                        << ") : weight " << weight << ", percent "
-                        << percentTime;
+                        << percentTime
+                        << ", round(s) with this vote: " << currentVoteCounter_;


I know this is also old behaviour, but would it be useful to show here tx_.id(), like we do below when our vote is changed ?

Bronek · 2025-03-05T14:40:59Z

src/xrpld/consensus/Consensus.cpp

            clog))
    {
-        JLOG(j.debug()) << "normal consensus";
-        CLOG(clog) << "reached. ";
+        JLOG(j.debug()) << "normal consensus"


I would prefer if stalled was logged at a higher severity level, perhaps j.warn()

Bronek · 2025-03-05T15:16:18Z

src/test/consensus/Consensus_test.cpp

+
+                auto const& lcl = peer->lastClosedLedger;
+                BEAST_EXPECT(lcl.id() == peer->prevLedgerID());
+                /*


Is there any value in BEAST_EXPECT inside the commented out section ?

Bronek

This is good change. I like the unis tests coverage (but an open question re. commented out section), I like how the new Expired state only kicks in when we are not stalled on all transactions (nice decoupling on both states).

Bronek · 2025-03-05T15:22:54Z

src/xrpld/consensus/Consensus.cpp

-    bool const ret = currentPercentage >= minConsensusPct;
+    CLOG(clog) << "currentPercentage: " << currentPercentage
+               << (stalled ? ", stalled" : "");
+    bool const ret = stalled || currentPercentage >= minConsensusPct;


This is interesting. We only get stalled when every disputed transaction is unequivocally either under 20% or over 80% consensus (repectively "nay" or "yay") so there is no way to manipulate transactions which do not cut the threshold into the ledger. Would be good to put explanation to such effect in a comment, though.

* upstream/develop: Set version to 2.4.0 Set version to 2.4.0-rc4 chore: Update XRPL Foundation Validator List URL (5326)

* upstream/develop: refactor: Remove unused and add missing includes (5293)

ximinez changed the title ~~Drop out of consensus if the round takes too long~~ Failsafes to prevent a consensus round from taking too long Feb 5, 2025

ximinez requested review from Bronek, JoelKatz and vlntb February 5, 2025 19:01

Bronek reviewed Feb 5, 2025

View reviewed changes

src/xrpld/consensus/DisputedTx.h Outdated Show resolved Hide resolved

Bronek reviewed Feb 5, 2025

View reviewed changes

src/xrpld/consensus/Consensus.cpp Outdated Show resolved Hide resolved

ximinez force-pushed the ximinez/consensus branch from f07992d to 76c27a0 Compare February 5, 2025 22:05

ximinez marked this pull request as ready for review February 5, 2025 23:16

ximinez requested a review from Bronek February 6, 2025 00:02

Bronek approved these changes Feb 6, 2025

View reviewed changes

ximinez force-pushed the ximinez/consensus branch 4 times, most recently from 6e513d9 to 26ab221 Compare February 11, 2025 01:15

bthomee added this to the 2.4.0 (Q1 2025) milestone Feb 11, 2025

ximinez force-pushed the ximinez/consensus branch 2 times, most recently from 8dcbf91 to a6d3cea Compare February 12, 2025 04:12

ximinez added 2 commits February 11, 2025 23:13

Drop out of consensus if the round takes too long

60de826

ximinez force-pushed the ximinez/consensus branch from a6d3cea to 197356b Compare February 12, 2025 04:13

ximinez added 2 commits February 12, 2025 00:18

[WIP] Consensus tests

8ad2cb3

[WIP] Fix builds

7e2fa5c

vlntb requested changes Feb 12, 2025

View reviewed changes

vlntb reviewed Feb 12, 2025

View reviewed changes

src/xrpld/consensus/ConsensusParms.h Outdated Show resolved Hide resolved

vlntb reviewed Feb 12, 2025

View reviewed changes

src/test/consensus/Consensus_test.cpp Show resolved Hide resolved

ximinez added 4 commits February 12, 2025 11:35

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

196f6b6

* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)

Update levelization

5108e55

Make ConsensusParms const, remove unnecessary copy assignment

5889dc5

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

69c1b00

* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274) chore: Fix small typos in protocol files (5279)

ximinez added 3 commits February 14, 2025 18:47

[WIP] Checkpoint

a396385

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

9aa953b

* upstream/develop: Reduce duplicate peer traffic for ledger data (5126)

fixup! [WIP] Checkpoint

4a745d3

ximinez modified the milestones: 2.4.0 (Q1 2025), 2.5.0 (2025) Feb 15, 2025

Bronek self-requested a review February 15, 2025 11:29

ximinez added 6 commits February 20, 2025 13:17

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

688139f

* upstream/develop: Add logging and improve counting of amendment votes from UNL (5173)

Fix some bugs in stalled dispute logic, unit tests to demonstrate

56d78b8

Merge branch 'develop' into ximinez/consensus

71c1312

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

729821c

* upstream/develop: fix: Remove 'new parent hash' assert (5313)

Tweak the stalled condition, remove some unit test logs

979462f

ximinez force-pushed the ximinez/consensus branch from 785f2bb to 979462f Compare February 26, 2025 23:49

ximinez added 5 commits February 26, 2025 19:01

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

de040c9

* upstream/develop: Set version to 2.4.0-rc3 fix: Acquire previously failed transaction set from network as new proposal arrives (5318) Fix Replace `assert` with `XRPL_ASSERT` (5312)

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

ddd3436

* upstream/develop: Log detailed correlated consensus data together (5302)

Add to the consensus log ("clog") in some newly added places

e0b3b10

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

4d2f020

* upstream/develop: chore: Move "assert" and "werr" flags from "actions/build" (5325)

Fix build/merge errors

e196bcd

Bronek reviewed Mar 3, 2025

View reviewed changes

Bronek reviewed Mar 4, 2025

View reviewed changes

Bronek self-requested a review March 4, 2025 15:36

bthomee requested a review from vlntb March 4, 2025 21:20

Bronek reviewed Mar 5, 2025

View reviewed changes

Bronek approved these changes Mar 5, 2025

View reviewed changes

Bronek reviewed Mar 5, 2025

View reviewed changes

ximinez added 2 commits March 11, 2025 11:33

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

8a3341f

* upstream/develop: Set version to 2.4.0 Set version to 2.4.0-rc4 chore: Update XRPL Foundation Validator List URL (5326)

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

88cf631

* upstream/develop: refactor: Remove unused and add missing includes (5293)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failsafes to prevent a consensus round from taking too long #5277

Failsafes to prevent a consensus round from taking too long #5277

ximinez commented Feb 5, 2025 •

edited

Loading

codecov bot commented Feb 5, 2025 •

edited

Loading

Bronek left a comment

vlntb left a comment

ximinez commented Feb 26, 2025

Bronek left a comment

Bronek Mar 3, 2025 •

edited

Loading

Bronek Mar 3, 2025

Bronek Mar 4, 2025

Bronek Mar 5, 2025

Bronek Mar 5, 2025

Bronek Mar 5, 2025

Bronek left a comment

Bronek Mar 5, 2025

	"ripple::DisputedTx::updateVote next state valid");
	"ripple::getNeededWeight : next state valid");

Failsafes to prevent a consensus round from taking too long #5277

Are you sure you want to change the base?

Failsafes to prevent a consensus round from taking too long #5277

Conversation

ximinez commented Feb 5, 2025 • edited Loading

High Level Overview of Change

Context of Change

Type of Change

Before / After

codecov bot commented Feb 5, 2025 • edited Loading

Codecov Report

Bronek left a comment

Choose a reason for hiding this comment

vlntb left a comment

Choose a reason for hiding this comment

ximinez commented Feb 26, 2025

Bronek left a comment

Choose a reason for hiding this comment

Bronek Mar 3, 2025 • edited Loading

Choose a reason for hiding this comment

Bronek Mar 3, 2025

Choose a reason for hiding this comment

Bronek Mar 4, 2025

Choose a reason for hiding this comment

Bronek Mar 5, 2025

Choose a reason for hiding this comment

Bronek Mar 5, 2025

Choose a reason for hiding this comment

Bronek Mar 5, 2025

Choose a reason for hiding this comment

Bronek left a comment

Choose a reason for hiding this comment

Bronek Mar 5, 2025

Choose a reason for hiding this comment

ximinez commented Feb 5, 2025 •

edited

Loading

codecov bot commented Feb 5, 2025 •

edited

Loading

Bronek Mar 3, 2025 •

edited

Loading