From e8bcfb1d7baafc4f5ccf663282e0d3f85104a713 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 14 Jun 2024 11:56:36 +0100 Subject: [PATCH 01/45] Created CIP --- CIP-0123/README.md | 70 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 CIP-0123/README.md diff --git a/CIP-0123/README.md b/CIP-0123/README.md new file mode 100644 index 0000000000..01cff313c7 --- /dev/null +++ b/CIP-0123/README.md @@ -0,0 +1,70 @@ +--- +CIP: CIP-0123? +Title: Disaster Recovery Plan for Cardano +Category: Cardano +Status: Proposed +Authors: + - Kevin Hammond + - Sam Leathers + - Alex Moser + - Steve Wagendorp + - Rick McCracken + - Adam Dean +Implementors: [] +Discussions: + - https://github.com/cardano-foundation/CIPs/pull/? +Created: 2024-06-17 +License: CC-BY-4.0 +--- + + + +## Abstract + +It is necessary to consider how the Cardano network can be recovered in the event of a major failure +where the network does not recover itself. + + +## Motivation: why is this CIP necessary? + + + +## Specification + + +## Rationale: how does this CIP achieve its goals? + + +## Path to Active + +### Acceptance Criteria + + +### Implementation Plan + + + + +## References + +[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) + +## Copyright + + + + From 37a564ac423a08562615dbc001d08c23e1f6f8d7 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 17 Jun 2024 12:30:12 +0100 Subject: [PATCH 02/45] addtions --- CIP-0123/README.md | 153 +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 142 insertions(+), 11 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index 01cff313c7..a54e030ea6 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -1,7 +1,7 @@ --- CIP: CIP-0123? 
Title: Disaster Recovery Plan for Cardano -Category: Cardano +Category: Cardano Information Status: Proposed Authors: - Kevin Hammond @@ -32,8 +32,10 @@ License: CC-BY-4.0 ## Abstract -It is necessary to consider how the Cardano network can be recovered in the event of a major failure -where the network does not recover itself. +While the Cardano network has proved to be highly reliable, it is necessary to consider how the Cardano network can be recovered in the unlikely +event of a major failure where the network does not recover itself. This CIP considers three representative scenarios and explains +in outline how the chain could recover if each of these situations were to arise. + ## Motivation: why is this CIP necessary? @@ -43,25 +45,154 @@ where the network does not recover itself. ## Specification +### Scenario 1: Long-Lived Network Partition + +Ouroboros Praos is designed to cope with real-world networking +conditions, in which some nodes may temporarily be disconnected from +the network. In this case, the network will continue to make blocks, +perhaps at some lower chain density (reflecting the temporary loss of +stake to the network as a whole). As nodes rejoin the network, they +will then participate in normal block production once again. In this +way, the network remains resilient to changes in connectivity. + +If many nodes become disconnected, the network could divide into two +or more completely disconnected parts. Each part of the network could +then form its own chain, backed by the stake that is participating in +its own partition. Under normal conditions, Praos will also deal with +this situation. When the partitioned group of nodes reconnects, the +longest chain will dominate, and the shorter chain will be discarded. +The nodes on the shorter chain will automatically rollback to the +point where the fork occurred, and then rejoin the main chain. This +is perfectly normal. Such forks will typically last only a few +blocks. 
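The fork resolution described above can be sketched in a few lines. This is a simplified illustration and not the node's actual chain selection code: chains are modelled as plain lists of block identifiers, the function names are hypothetical, and `K` is the Praos rollback limit (2,160 blocks on mainnet at the time of writing).

```python
K = 2160  # Praos rollback limit in blocks (mainnet value at the time of writing)


def fork_depth(local, candidate):
    """Return how many local blocks must be rolled back to adopt `candidate`."""
    common = 0
    for ours, theirs in zip(local, candidate):
        if ours != theirs:
            break
        common += 1
    return len(local) - common


def select_chain(local, candidate, k=K):
    """Prefer the longer chain, but only if the rollback stays within k blocks."""
    if len(candidate) > len(local) and fork_depth(local, candidate) <= k:
        return candidate
    return local
```

A short fork is resolved automatically, while a fork deeper than the limit leaves the node on its own chain, which is the situation that needs operator intervention.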
+ +However, in an extreme situation, the partition may persist beyond the +Praos rollback limit of *k* blocks (currently 2,160). In this case, the nodes +will not be able to rollback to rejoin the main chain, since this +would violate the required Praos guarantees. + + +#### Remediations + +Disconnected nodes must be reconnected to the main chain by their operators. This can be done +by truncating the local block database to a point before the chain fork and then resycing +against the main network. This can be done by the `db-truncator` tool. + +Full node wallets can also be recovered in the same way, though this +may require technical skills that the end users do not possess. It +may be easier, if slower, for them to simply resynchronize their nodes +from genesis. This could take some time. An alternative might be to +restore using a Mithril or other signed snapshot. In this case, care +needs to be taken to achieve the correct balance of trust against +speed of recovery. + + + +#### Additional Effects on Cardano Users + +Although block producing nodes will rejoin the main network following the remediation +described above, the blocks that they have +minted while they were disconnected will not be included in the main +chain. This may have real world effects that will not be +automatically remedied when the nodes rejoin the main chain. For +example, transactions may have been processed that have significant +real world value, or assumptions may have been made about chains of +evidence/validity, or the timing of transactions. End users should be +aware of the possibility and include provisions in their contracts to +cover this eventuality. It may be necessary to resubmit some or all of the +transactions that were processed on the minority chain onto the main chain. +To avoid unexpected effects, this should be done by the end users, and not +by block producers acting on their behalf. 
+ +If they are not observant, stake pool operators, full node wallets and +other node users (e.g. explorers) could continue indefinitely on the minority +chain. Such users should take care to be aware of this situation and +take steps to rejoin the main chain as quickly as possible. +A reliable and trusted public warning system should be considered that can alert users +and advise them on how to rejoin the main chain. + + +#### Timing Considerations + +Partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano settings, this represents +a period of up to 12 hours during which automatic rollback. + + +### Scenario 2: Failure to Make Blocks for an Extended Period of Time + +Ouroboros Praos requires *at least* one block to be produced every *3k/f* slots. With the current Cardano mainnet +settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network +would be unable to make any further blocks. + +#### Mitigation + +It is recommended to monitor the chain for block production. If a low density period is observed, then block producers +should be notified, and efforts made to mint new blocks prior to the expiry of the *3k/f* window. If this is not possible +then the remediation procedures should be followed. + +#### Remediation + +Identify a small group of block producing nodes that will be used to recover the chain. This group should have +sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window. +It should be isolated from the rest of the network. +The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, +restarting them from the last good block on Cardano mainnet, playing forward the chain production +at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which +are allocated to the block producers. 
An Ouroboros Genesis snapshot can be created once the recovery +nodes have caught up to real time. The recovery nodes can then be restarted with normal settings, including +connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize +with the newly restored chain. + +#### Additional Effects on Cardano Users + +Unlike Scenario 1, no transactions will be submitted that need to be resubmitted on the chain. +Users will, however, experience an extended period during which the chain is unavailable. +Applications and contracts should be designed with this possibility in mind. +Full node wallets and other node users should recover quickly once the network is restarted +but there may be a period of instability while network connections are re-established +and the Ouroboros Genesis snapshot is distributed across all nodes. + +#### Timing Considerations + +The chain will tolerate a gap of up to *3k/f* blocks (36 hours with current Cardano settings). + +### Scenario 3: Bad Blocks + +In the event that a bad block was to be minted on-chain, then the chain + +#### Remediation + +### Mithril + + +## Recommended Actions + +1. Monitor the network for periods of low density and take early action if an extended period of low density is observed. +2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. +3. Set up emergency communication channels with stake pool operators and other community members. +4. Practice disaster recovery procedures on a regular basis. +5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. +6. + + ## Rationale: how does this CIP achieve its goals? 
-## Path to Active +## References -### Acceptance Criteria - +[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) -### Implementation Plan - +[DB Truncator Tool]() - +[DB Synthesizer Tool]() -## References +[Ouroboros Genesis]() + +[Mithril]() -[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) ## Copyright From ab0bf9325ae0aa5907d50beb8d90fcc81d6034d5 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 17 Jun 2024 12:39:29 +0100 Subject: [PATCH 03/45] minor edits --- CIP-0123/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index a54e030ea6..66698663bb 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -41,10 +41,14 @@ in outline how the chain could recover if each of these situations were to arise ## Motivation: why is this CIP necessary? +This CIP is needed to explain the processes and procedures that should be followed in the unlikely event +that the Cardano network encounters a situation where the built-in recovery mechanisms fail. ## Specification +While recovery will need to be tailored to the actual situation, three main scenarios can be identified. 
+ ### Scenario 1: Long-Lived Network Partition Ouroboros Praos is designed to cope with real-world networking From 49ec0bc03ae777f810028a0506316a8c37903d20 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 17 Jun 2024 16:35:13 +0100 Subject: [PATCH 04/45] added Scenarios 3 --- CIP-0123/README.md | 83 ++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 70 insertions(+), 13 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index 66698663bb..c74695376a 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -32,10 +32,13 @@ License: CC-BY-4.0 ## Abstract + While the Cardano network has proved to be highly reliable, it is necessary to consider how the Cardano network can be recovered in the unlikely event of a major failure where the network does not recover itself. This CIP considers three representative scenarios and explains in outline how the chain could recover if each of these situations were to arise. +The CIP should be considered to be a living document. It is based on an earlier IOHK technical report, supplemented by internal documentation. + ## Motivation: why is this CIP necessary? @@ -119,7 +122,7 @@ and advise them on how to rejoin the main chain. #### Timing Considerations Partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano settings, this represents -a period of up to 12 hours during which automatic rollback. +a period of up to 12 hours during which automatic rollback will occur. ### Scenario 2: Failure to Make Blocks for an Extended Period of Time @@ -145,7 +148,13 @@ at high speed (10x usual speed is recommended), while inserting new empty blocks are allocated to the block producers. An Ouroboros Genesis snapshot can be created once the recovery nodes have caught up to real time. The recovery nodes can then be restarted with normal settings, including connections to the network. 
Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize -with the newly restored chain. +with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. + +##### Rewards Donation + +In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should +donate any rewards that they receieve during recovery to the treasury. + #### Additional Effects on Cardano Users @@ -158,15 +167,53 @@ and the Ouroboros Genesis snapshot is distributed across all nodes. #### Timing Considerations -The chain will tolerate a gap of up to *3k/f* blocks (36 hours with current Cardano settings). +The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). + -### Scenario 3: Bad Blocks +### Scenario 3: Bad Blocks Minted on Chain -In the event that a bad block was to be minted on-chain, then the chain +In the event that a bad block was to be minted on-chain, then some or all validators might be unable to process the block. +They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the +point of the bad block. #### Remediation -### Mithril +Depending on the cause of the issue and its severity, alternative remediations might be possible. + +Scenario 3.1: if some existing node versions were able to process the block, but others were not, then +the chain would continue to grow at a lower chain density. SPOs would be encouraged to upgrade (or downgrade) +to a suitable node version. The chain density would then gradually recover to its normal level. + +Scenario 3.2: if no node version was able to process the block and a +gap of less than *3k/f* slots existed, then the chain could be rolled +back immediately before the bad block was created, and nodes +restarted. The chain would then grow as normal, with a small gap around the bad block. 
+In this case, care would need to be taken that the rogue +transaction was not accidentally reinserted into the chain. This might involve +clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that +rejected the bad block. + +Scenario 3.3: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could +accept the bad block. Nodes would then be able to incorporate the bad block as part of the chain, +minting new blocks as usual. +In this case, the bad block would persist on-chain indefinitely and future nodes +would need to also accept the bad block. This approach is best used when the rejected block has behaviour +that was unanticipated, but which is benign in nature. + +#### Timing Considerations + +If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano settings), +then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery +technique, consideration should be given as to whether the recovery can be successfully completed + +### Using Ouroboros Genesis Snapshots + +Ouroboros Genesis snapshots can be used to assist with recovery. TODO: expand this + + +### Using Mithril Snapshots + +Alternatively, Mithril snapshots can be used to assist with recovery. TODO: expand this ## Recommended Actions @@ -176,7 +223,16 @@ In the event that a bad block was to be minted on-chain, then the chain 3. Set up emergency communication channels with stake pool operators and other community members. 4. Practice disaster recovery procedures on a regular basis. 5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. -6. +6. 
Determine how to exploit Ouroboros Genesis snapshots as part of the disaster recovery process + +### Community Engagement + +One of the key requirements for successful disaster recovery will be proper engagement with the community. + +1. Identify block producers who can assist with disaster recovery +2. Discuss requirements with Intersect's Technical Working Groups and Security Council +3. Identify and establish the right communications channels with the community +4. Set up regular disaster recovery practice sessions ## Rationale: how does this CIP achieve its goals? @@ -185,21 +241,22 @@ In the event that a bad block was to be minted on-chain, then the chain It must also explain how the proposal affects the backward compatibility of existing solutions when applicable. If the proposal responds to a CPS, the 'Rationale' section should explain how it addresses the CPS, and answer any questions that the CPS poses for potential solutions. --> +TBC + ## References [Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) -[DB Truncator Tool]() +[DB Truncator Tool](TODO) -[DB Synthesizer Tool]() +[DB Synthesizer Tool](TODO) -[Ouroboros Genesis]() +[Ouroboros Genesis](TODO) -[Mithril]() +[Mithril](TODO) ## Copyright - - + This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). 
From 6d6391cca867d70d44cdf250881693a3704b0681 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 17 Jun 2024 16:47:13 +0100 Subject: [PATCH 05/45] untabify --- CIP-0123/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index c74695376a..c73dba6582 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -5,11 +5,11 @@ Category: Cardano Information Status: Proposed Authors: - Kevin Hammond - - Sam Leathers - - Alex Moser - - Steve Wagendorp - - Rick McCracken - - Adam Dean + - Sam Leathers + - Alex Moser + - Steve Wagendorp + - Rick McCracken + - Adam Dean Implementors: [] Discussions: - https://github.com/cardano-foundation/CIPs/pull/? From d7167f2a94deb416bb8ea47b671585d80a1332e4 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 18 Jun 2024 10:27:46 +0100 Subject: [PATCH 06/45] re-added Scenario 3.4 --- CIP-0123/README.md | 32 +++++++++++++++++++++++++------- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index c73dba6582..16ccbb27a5 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -35,9 +35,15 @@ License: CC-BY-4.0 While the Cardano network has proved to be highly reliable, it is necessary to consider how the Cardano network can be recovered in the unlikely event of a major failure where the network does not recover itself. This CIP considers three representative scenarios and explains -in outline how the chain could recover if each of these situations were to arise. +in outline how the chain could recover if each of these situations were to arise: Scenario 1 -- Long-Lived Network Partition; +Scenario 2 -- Failure to Make Blocks for an Extended Period of Time; Scenario 3: Bad Blocks Minted on Chain. 
Successful recovery depends +on the correct procedures being followed and good communication channels being established. It is recommended that these communication channels are +established now, and regular practice recoveries are undertaken, so that the community is familiar with the recovery procedures in the event that +the chain does need to be recovered. -The CIP should be considered to be a living document. It is based on an earlier IOHK technical report, supplemented by internal documentation. +This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by +internal documentation and discussions that have not been publicly released. +It should be considered to be a living document that is reviewed and revised on a regular basis. @@ -122,7 +128,8 @@ and advise them on how to rejoin the main chain. #### Timing Considerations Partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano settings, this represents -a period of up to 12 hours during which automatic rollback will occur. +a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the +procedure described above will be necessary to allow nodes to rejoin the main chain. ### Scenario 2: Failure to Make Blocks for an Extended Period of Time @@ -150,7 +157,7 @@ nodes have caught up to real time. The recovery nodes can then be restarted with connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. -##### Rewards Donation +##### Rewards Donation by Recovery Block Producers In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should donate any rewards that they receieve during recovery to the treasury. 
@@ -167,7 +174,9 @@ and the Ouroboros Genesis snapshot is distributed across all nodes. #### Timing Considerations -The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). +The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). +This period of low chain density may have other implications (TODO: describe these), for which +Ouroboros Genesis may provide a remedy (TODO: confirm and describe this). ### Scenario 3: Bad Blocks Minted on Chain @@ -198,13 +207,18 @@ accept the bad block. Nodes would then be able to incorporate the bad block as minting new blocks as usual. In this case, the bad block would persist on-chain indefinitely and future nodes would need to also accept the bad block. This approach is best used when the rejected block has behaviour -that was unanticipated, but which is benign in nature. +that was unanticipated, but which is benign in nature. This approach will leave no abnormal gaps in the chain. + +Scenario 3.4: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately +prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave +a series of gaps in the chain interspersed with empty blocks. #### Timing Considerations If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano settings), then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery -technique, consideration should be given as to whether the recovery can be successfully completed +technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots +have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed. 
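The choice between the Scenario 3 remediations can be summarised as a small decision function. This is only a sketch of the rules stated above, not an operational procedure: the function and parameter names are hypothetical, the split between Scenarios 3.2 and 3.3 is reduced to a single "benign" flag, and the default window of 129,600 slots assumes that the 36-hour *3k/f* period corresponds to one-second slots.

```python
from enum import Enum


class Remediation(Enum):
    UPGRADE_NODES = "3.1"         # some node versions can process the block
    ROLL_BACK = "3.2"             # roll back to just before the bad block
    HOT_FIX = "3.3"               # deploy a node that accepts the bad block
    ROLL_BACK_AND_REPLAY = "3.4"  # roll back, then recover as in Scenario 2


WINDOW_SLOTS = 129_600  # assumed: 3k/f, i.e. 36 hours at one slot per second


def choose_remediation(slots_since_bad_block,
                       estimated_recovery_slots,
                       some_versions_accept_block,
                       block_is_benign,
                       window_slots=WINDOW_SLOTS):
    """Pick a Scenario 3 remediation; fall back to 3.4 when time is short."""
    # If recovery cannot finish inside the 3k/f window, use Scenario 3.4.
    if slots_since_bad_block + estimated_recovery_slots >= window_slots:
        return Remediation.ROLL_BACK_AND_REPLAY
    if some_versions_accept_block:
        return Remediation.UPGRADE_NODES
    if block_is_benign:
        return Remediation.HOT_FIX
    return Remediation.ROLL_BACK
```

Checking the time budget first reflects the advice that, in case of doubt, the Scenario 3.4 procedure should be followed.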
### Using Ouroboros Genesis Snapshots @@ -247,6 +261,10 @@ TBC [Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) +[Cardano Incident Reports](https://updates.cardano.intersectmbo.org/tags/incident) + +[January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger) + [DB Truncator Tool](TODO) [DB Synthesizer Tool](TODO) From ab29e39f441979fc91a7aaee14b3c27b192502fc Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 18 Jun 2024 14:50:33 +0100 Subject: [PATCH 07/45] minor edits --- CIP-0123/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index 16ccbb27a5..d36ca63349 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -53,7 +53,7 @@ It should be considered to be a living document that is reviewed and revised on This CIP is needed to explain the processes and procedures that should be followed in the unlikely event that the Cardano network encounters a situation where the built-in recovery mechanisms fail. -## Specification +## Disaster Recovery Procedures While recovery will need to be tailored to the actual situation, three main scenarios can be identified. @@ -189,11 +189,11 @@ point of the bad block. Depending on the cause of the issue and its severity, alternative remediations might be possible. -Scenario 3.1: if some existing node versions were able to process the block, but others were not, then +**Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then the chain would continue to grow at a lower chain density. SPOs would be encouraged to upgrade (or downgrade) to a suitable node version. The chain density would then gradually recover to its normal level. 
-Scenario 3.2: if no node version was able to process the block and a +**Scenario 3.2**: if no node version was able to process the block and a gap of less than *3k/f* slots existed, then the chain could be rolled back immediately before the bad block was created, and nodes restarted. The chain would then grow as normal, with a small gap around the bad block. @@ -202,14 +202,14 @@ transaction was not accidentally reinserted into the chain. This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that rejected the bad block. -Scenario 3.3: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could +**Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could accept the bad block. Nodes would then be able to incorporate the bad block as part of the chain, minting new blocks as usual. In this case, the bad block would persist on-chain indefinitely and future nodes would need to also accept the bad block. This approach is best used when the rejected block has behaviour that was unanticipated, but which is benign in nature. This approach will leave no abnormal gaps in the chain. -Scenario 3.4: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately +**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave a series of gaps in the chain interspersed with empty blocks. 
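The 36-hour and 12-hour figures quoted in this document follow from the protocol parameters. The sketch below reproduces the arithmetic using the security parameter *k* = 2,160 given above, together with an active slot coefficient *f* = 0.05 and a one-second slot length, both of which are assumptions not stated in this document. It also estimates the expected number of blocks for a given fraction of active stake using the Praos leader-election probability `1 - (1 - f)^stake`. All function names are illustrative.

```python
K = 2160          # security parameter (rollback limit in blocks)
F = 0.05          # assumed active slot coefficient
SLOT_SECONDS = 1  # assumed slot length in seconds


def stability_window_slots(k=K, f=F):
    """Slots in the 3k/f window within which a block must be produced."""
    return round(3 * k / f)


def rollback_window_slots(k=K, f=F):
    """Expected number of slots needed to produce k blocks."""
    return round(k / f)


def slots_to_hours(slots, slot_seconds=SLOT_SECONDS):
    """Convert a slot count to hours."""
    return slots * slot_seconds / 3600


def expected_blocks(stake_fraction, slots, f=F):
    """Expected blocks minted over `slots` by a group with `stake_fraction`."""
    return slots * (1 - (1 - f) ** stake_fraction)
```

Under these assumptions, a group holding about 0.14% of active stake would expect a little over 9 blocks in a 36-hour window. Because slot leadership is random, this is an expectation rather than a guarantee, so a recovery group would want a margin above that level.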
From cc2e50aabf51f7ee422f694a50c894f50911ba49 Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 22:45:58 +0200 Subject: [PATCH 08/45] update abstract --- CIP-0123/README.md | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..33fe5cace8 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -33,18 +33,24 @@ License: CC-BY-4.0 ## Abstract -While the Cardano network has proved to be highly reliable, it is necessary to consider how the Cardano network can be recovered in the unlikely -event of a major failure where the network does not recover itself. This CIP considers three representative scenarios and explains -in outline how the chain could recover if each of these situations were to arise: Scenario 1 -- Long-Lived Network Partition; -Scenario 2 -- Failure to Make Blocks for an Extended Period of Time; Scenario 3: Bad Blocks Minted on Chain. Successful recovery depends -on the correct procedures being followed and good communication channels being established. It is recommended that these communication channels are -established now, and regular practice recoveries are undertaken, so that the community is familiar with the recovery procedures in the event that -the chain does need to be recovered. - -This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by -internal documentation and discussions that have not been publicly released. -It should be considered to be a living document that is reviewed and revised on a regular basis. +While the Cardano mainnet has proven to be highly resilient, it is necessary to proactively +consider the possible recovery mechanisms and procedures that may be required in the unlikely +event of a major failure where the network is unable to recover itself. 
+This CIP considers three representative scenarios and addresses specific considerations relevant +in each case: + +Scenario 1 - __Long-Lived Network Partition__ +Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ +Scenario 3 - __Bad Blocks Minted on Chain__ + +To ensure successful recovery in the event of a chain failure, it's crucial to establish effective +communication channels and exercise recovery procedures in advance to familiarize the community and +SPOs with the process. + +This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal +documentation and discussions that have not been publicly released. It should be considered to be a living +document that is reviewed and revised on a regular basis. ## Motivation: why is this CIP necessary? From db5d65a6bef940d562a8e7722571350ec3808520 Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 22:49:47 +0200 Subject: [PATCH 09/45] update motivation --- CIP-0123/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..9e13f433ae 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -50,8 +50,9 @@ It should be considered to be a living document that is reviewed and revised on ## Motivation: why is this CIP necessary? -This CIP is needed to explain the processes and procedures that should be followed in the unlikely event -that the Cardano network encounters a situation where the built-in recovery mechanisms fail. +This CIP is needed to familiarize stakeholders with the processes and procedures that should be +followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in +recovery mechanisms fail. 
## Disaster Recovery Procedures From 48df7a0773a4a3a99275441d45784649ebc59ef9 Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 22:54:41 +0200 Subject: [PATCH 10/45] update scenario_1 --- CIP-0123/README.md | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..872d8bb4c7 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -56,7 +56,7 @@ that the Cardano network encounters a situation where the built-in recovery mech ## Disaster Recovery Procedures -While recovery will need to be tailored to the actual situation, three main scenarios can be identified. +While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. ### Scenario 1: Long-Lived Network Partition @@ -87,19 +87,25 @@ would violate the required Praos guarantees. #### Remediations -Disconnected nodes must be reconnected to the main chain by their operators. This can be done -by truncating the local block database to a point before the chain fork and then resycing -against the main network. This can be done by the `db-truncator` tool. - -Full node wallets can also be recovered in the same way, though this -may require technical skills that the end users do not possess. It -may be easier, if slower, for them to simply resynchronize their nodes -from genesis. This could take some time. An alternative might be to -restore using a Mithril or other signed snapshot. In this case, care -needs to be taken to achieve the correct balance of trust against -speed of recovery. - - +Disconnected nodes must be reconnected to the main chain by their operators. This can be done +by truncating the local block database to a point before the chain fork and then resyncing +against the main network. This can be done using the `db-truncator` tool. 
+ +Full node wallets can also be recovered in the same way, though this may require technical +skills that the end users do not possess. It may be easier, if slower, for them to simply +resynchronize their nodes from genesis. + +Ouroboros Genesis provides additional resilience when recovering from long lived network partitions. +In Praos nodes resyncing from a point before the chain fork could still in some cases follow the +alternative chain (if it is the first one seen) and extra mechanisms may be needed to avoid this +possibility. In Praos, for example, this may require that all participants on the alternate chain +truncate the local block database prior to the partition being resolved. In Ouroboros Genesis +when resyncing from a point before the chain fork, the chain selection rules will ensure +selection of the correct path for the main chain assuming the partition has been resolved. + +Alternative methods to restore might include the use of Mithril or other signed snapshot. +In this case, care needs to be taken to achieve the correct balance of trust against speed +of recovery. #### Additional Effects on Cardano Users From 475b6148e6b3d946fcab031169ad3a8540cd4d00 Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 22:58:18 +0200 Subject: [PATCH 11/45] update scenario_2 --- CIP-0123/README.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..629b73327b 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -175,7 +175,16 @@ and the Ouroboros Genesis snapshot is distributed across all nodes. #### Timing Considerations The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). 
-This period of low chain density may have other implications (TODO: describe these), for which +A period of low chain density could have security implications that affect dynamic availability +and leave open the possibility for future long range attacks. This may be particularly +relevant should chain recovery be performed as described above (using less stake than is required +for an honest majority). To mitigate the presence of an extended period of low chain density we may +need to make use of the lightweight checkpointing mechanism in Ouroborus Genesis. Alternatively Mithril +could also be used to provide certified snapshots to SPOs as a means to verify the correct state of the ledger. + +The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks +for the types of users on the network that do not participate in consensus. + Ouroboros Genesis may provide a remedy (TODO: confirm and describe this). From edcb39cb13fb40eb0004f49fa32ee037f83191ac Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 23:10:54 +0200 Subject: [PATCH 12/45] update mithril --- CIP-0123/README.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..dcab09d5bf 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -227,7 +227,17 @@ Ouroboros Genesis snapshots can be used to assist with recovery. TODO: expand t ### Using Mithril Snapshots -Alternatively, Mithril snapshots can be used to assist with recovery. TODO: expand this +Mythril is a stake-based threshold multi-signatures scheme. One of the applications of this protocol in Cardano +is the ability to create certified snapshots of the Cardano blockchain. Mythril snapshots allow applications +to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. 
+ +SPOs on mainnet that participate in the Mythril network provide signed snapshots to a Mythril aggregator that +is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. +With this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain. + +Mythril may provide an alternative solution to genesis checkpoints as a way to verify the correct state of the ledger +provided that it gains sufficient adoption on Mainnet and that snapshots continue to be signed by an honest majority +following a chain recovery event. ## Recommended Actions From dcd03f9d849ab55f5126eb736f2688477a0b6947 Mon Sep 17 00:00:00 2001 From: swagendorp <15338420+swagendorp@users.noreply.github.com> Date: Wed, 24 Jul 2024 23:13:09 +0200 Subject: [PATCH 13/45] update rationale --- CIP-0123/README.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index d36ca63349..23cdd5c65c 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -255,7 +255,11 @@ One of the key requirements for successful disaster recovery will be proper enga It must also explain how the proposal affects the backward compatibility of existing solutions when applicable. If the proposal responds to a CPS, the 'Rationale' section should explain how it addresses the CPS, and answer any questions that the CPS poses for potential solutions. --> -TBC +This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate +potential network outages. As a living document, it will be regularly reviewed and updated to inform +stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions, +establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness +and validation of recovery actions in the event of an outage. 
## References @@ -265,13 +269,13 @@ TBC [January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger) -[DB Truncator Tool](TODO) +[DB Truncator Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) -[DB Synthesizer Tool](TODO) +[DB Synthesizer Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) -[Ouroboros Genesis](TODO) +[Ouroboros Genesis](https://iohk.io/en/research/library/papers/ouroboros-genesis-composable-proof-of-stake-blockchains-with-dynamic-availability/) -[Mithril](TODO) +[Mithril](https://github.com/input-output-hk/mithril) ## Copyright From a57e728b20bda08cd01188dce40c04662e95c84e Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 29 Jul 2024 12:37:00 +0100 Subject: [PATCH 14/45] Small improvements following PR merges --- CIP-0123/README.md | 78 +++++++++++++++++++++++++--------------------- 1 file changed, 42 insertions(+), 36 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index 5b817990f3..beca0d36fe 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -46,7 +46,7 @@ Scenario 3 - __Bad Blocks Minted on Chain__ To ensure successful recovery in the event of a chain failure, it's crucial to establish effective communication channels and exercise recovery procedures in advance to familiarize the community and -SPOs with the process. +stake pool operators (SPOs) with the process. This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal documentation and discussions that have not been publicly released. It should be considered to be a living @@ -58,7 +58,7 @@ document that is reviewed and revised on a regular basis. 
This CIP is needed to familiarize stakeholders with the processes and procedures that should be followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in -recovery mechanisms fail. +on-chain recovery mechanisms fail. ## Disaster Recovery Procedures @@ -87,7 +87,7 @@ is perfectly normal. Such forks will typically last only a few blocks. However, in an extreme situation, the partition may persist beyond the -Praos rollback limit of *k* blocks (currently 2,160). In this case, the nodes +Praos rollback limit of *k* blocks (currently 2,160 blocks). In this case, the nodes will not be able to rollback to rejoin the main chain, since this would violate the required Praos guarantees. @@ -100,19 +100,20 @@ against the main network. This can be done using the `db-truncator` tool. Full node wallets can also be recovered in the same way, though this may require technical skills that the end users do not possess. It may be easier, if slower, for them to simply -resynchronize their nodes from genesis. +resynchronize their nodes fromb the start of the chain (i.e. from the genesis block). Ouroboros Genesis provides additional resilience when recovering from long lived network partitions. In Praos nodes resyncing from a point before the chain fork could still in some cases follow the alternative chain (if it is the first one seen) and extra mechanisms may be needed to avoid this -possibility. In Praos, for example, this may require that all participants on the alternate chain +possibility. In Praos, for example, this may require that all participants on the alternative chain truncate the local block database prior to the partition being resolved. In Ouroboros Genesis when resyncing from a point before the chain fork, the chain selection rules will ensure selection of the correct path for the main chain assuming the partition has been resolved. -Alternative methods to restore might include the use of Mithril or other signed snapshot. 
-In this case, care needs to be taken to achieve the correct balance of trust against speed -of recovery. +Alternative methods to resynchronise the node to the main chain might +include the use of Mithril or other signed snapshots. These would +allow faster recovery. However, in this case, care needs to be taken +to achieve the correct balance of trust against speed of recovery. #### Additional Effects on Cardano Users @@ -127,10 +128,10 @@ evidence/validity, or the timing of transactions. End users should be aware of the possibility and include provisions in their contracts to cover this eventuality. It may be necessary to resubmit some or all of the transactions that were processed on the minority chain onto the main chain. -To avoid unexpected effects, this should be done by the end users, and not +To avoid unexpected effects, this should be done by the end users/applications, and not by block producers acting on their behalf. -If they are not observant, stake pool operators, full node wallets and +If they are not observant, stake pools, full node wallets and other node users (e.g. explorers) could continue indefinitely on the minority chain. Such users should take care to be aware of this situation and take steps to rejoin the main chain as quickly as possible. @@ -180,11 +181,12 @@ donate any rewards that they receieve during recovery to the treasury. Unlike Scenario 1, no transactions will be submitted that need to be resubmitted on the chain. Users will, however, experience an extended period during which the chain is unavailable. -Applications and contracts should be designed with this possibility in mind. +Cardano applications and contracts should be designed with this possibility in mind. Full node wallets and other node users should recover quickly once the network is restarted but there may be a period of instability while network connections are re-established and the Ouroboros Genesis snapshot is distributed across all nodes. 
+ #### Timing Considerations The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). @@ -193,12 +195,12 @@ and leave open the possibility for future long range attacks. This may be partic relevant should chain recovery be performed as described above (using less stake than is required for an honest majority). To mitigate the presence of an extended period of low chain density we may need to make use of the lightweight checkpointing mechanism in Ouroborus Genesis. Alternatively Mithril -could also be used to provide certified snapshots to SPOs as a means to verify the correct state of the ledger. +could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger. The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks for the types of users on the network that do not participate in consensus. -Ouroboros Genesis may provide a remedy (TODO: confirm and describe this). +Ouroboros Genesis may also provide a remedy (TODO: confirm and describe this). ### Scenario 3: Bad Blocks Minted on Chain @@ -212,28 +214,29 @@ point of the bad block. Depending on the cause of the issue and its severity, alternative remediations might be possible. **Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then -the chain would continue to grow at a lower chain density. SPOs would be encouraged to upgrade (or downgrade) -to a suitable node version. The chain density would then gradually recover to its normal level. +the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) +to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level. +Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain. 
**Scenario 3.2**: if no node version was able to process the block and a gap of less than *3k/f* slots existed, then the chain could be rolled back immediately before the bad block was created, and nodes -restarted. The chain would then grow as normal, with a small gap around the bad block. -In this case, care would need to be taken that the rogue -transaction was not accidentally reinserted into the chain. This might involve -clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that +restarted from this point. The chain would then grow as normal, with a small gap around the bad block. +In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain. +This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that rejected the bad block. **Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could -accept the bad block. Nodes would then be able to incorporate the bad block as part of the chain, -minting new blocks as usual. +accept the bad block, either as an exception, or as new acceptable behaviour. +Nodes would then be able to incorporate the bad block as part of the chain, +minting new blocks as usual, or following the chain. In this case, the bad block would persist on-chain indefinitely and future nodes -would need to also accept the bad block. This approach is best used when the rejected block has behaviour -that was unanticipated, but which is benign in nature. This approach will leave no abnormal gaps in the chain. +would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour +that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain. 
**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave -a series of gaps in the chain interspersed with empty blocks. +a series of gaps in the chain that are interspersed with empty blocks. #### Timing Considerations @@ -249,23 +252,26 @@ Ouroboros Genesis snapshots can be used to assist with recovery. TODO: expand t ### Using Mithril Snapshots -Mythril is a stake-based threshold multi-signatures scheme. One of the applications of this protocol in Cardano -is the ability to create certified snapshots of the Cardano blockchain. Mythril snapshots allow applications +Mithril is a stake-based threshold multi-signatures scheme. One of the applications of this protocol in Cardano +is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. -SPOs on mainnet that participate in the Mythril network provide signed snapshots to a Mythril aggregator that +SPOs on mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. -With this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain. +Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that +can potentially be used as a trusted source for recovery purposes. 
-Mythril may provide an alternative solution to genesis checkpoints as a way to verify the correct state of the ledger -provided that it gains sufficient adoption on Mainnet and that snapshots continue to be signed by an honest majority -following a chain recovery event. +Provided that it gains sufficient adoption on Mainnet and that +snapshots continue to be signed by an honest majority of stake pools +following a chain recovery event, Mithril may therefore provide an +alternative solution to Ouroboros Genesis checkpoints as a way to +verify the correct state of the ledger ## Recommended Actions -1. Monitor the network for periods of low density and take early action if an extended peroo. -2. Identify a collection of block producer nodes that has sufficient stake to mint least 9 blocks in any 36 hour window. +1. Monitor the network for periods of low density and take early action if an extended period is observed. +2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. 3. Set up emergency communication channels with stake pool operators and other community members. 4. Practice disaster recovery procedures on a regular basis. 5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. @@ -275,9 +281,9 @@ following a chain recovery event. One of the key requirements for successful disaster recovery will be proper engagement with the community. -1. Identify block producers who can assist with disaster recovery -2. Discuss requirements with Intersect's Technical Working Groups and Security Council -3. Identify and establish the right communications channels with the community +1. Identify stake pool operators (SPOs) who can assist with disaster recovery +2. Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council +3. 
Identify and establish the right communications channels with the community, including Intersect 4. Set up regular disaster recovery practice sessions From 350fef4c71815ba43701ee0f8edc6a9d91995cfd Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 30 Jul 2024 19:39:56 +0100 Subject: [PATCH 15/45] Update README.md Small text change --- CIP-0123/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index beca0d36fe..22870a28d3 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -275,7 +275,7 @@ verify the correct state of the ledger 3. Set up emergency communication channels with stake pool operators and other community members. 4. Practice disaster recovery procedures on a regular basis. 5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. -6. Determine how to exploit Ouroboros Genesis snapshots as part of the disaster recovery process +6. Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process ### Community Engagement From a50d969385b07d0695de393ecd907092f0302fcd Mon Sep 17 00:00:00 2001 From: Nicholas Clarke Date: Thu, 8 Aug 2024 16:33:46 +0200 Subject: [PATCH 16/45] Add section on Genesis checkpoints Expand on what the lightweight checkpoints introduced with Genesis are, and how they can assist with recovery from a disaster. --- CIP-0123/README.md | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/CIP-0123/README.md b/CIP-0123/README.md index 22870a28d3..ba4ae09b14 100644 --- a/CIP-0123/README.md +++ b/CIP-0123/README.md @@ -10,6 +10,7 @@ Authors: - Steve Wagendorp - Rick McCracken - Adam Dean + - Nicholas Clarke Implementors: [] Discussions: - https://github.com/cardano-foundation/CIPs/pull/? @@ -166,8 +167,7 @@ It should be isolated from the rest of the network. 
The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, restarting them from the last good block on Cardano mainnet, playing forward the chain production at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which -are allocated to the block producers. An Ouroboros Genesis snapshot can be created once the recovery -nodes have caught up to real time. The recovery nodes can then be restarted with normal settings, including +are allocated to the block producers. The recovery nodes can then be restarted with normal settings, including connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. @@ -247,7 +247,38 @@ have elapsed. In case of doubt, the procedure for Scenario 3.4 should be follow ### Using Ouroboros Genesis Snapshots -Ouroboros Genesis snapshots can be used to assist with recovery. TODO: expand this +Any of the above conditions may result in a period of lower chain density. The +updated consensus mechanism introduced in Ouroboros Genesis relies on making +chain density comparisons to assist a node when catching up with the network, +in order to reduce the reliance on having trusted peers when syncing. As +such, low-density periods pose a potential security risk for the future; they +are periods where a motivated adversary could perform a long-range attack by +building a higher density chain. + +In order to mitigate this, Genesis introduces the concepts of lightweight +checkpoints. A lightweight checkpoint is effectively a block point - a +combination of block number and hash - which can be distributed along with the +node. 
Unlike Mithril Snapshots (see below), Genesis lightweight snapshots are not assured by any committee - rather, they form part of the trusted codebase distributed with the node, or by other parties. + +When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point. + +Genesis snapshots play two potential roles in disaster recovery: + +1. In scenarios where the network is split, a lightweight snapshot could guide + a node from the abandoned partition in connecting to the main partition. In + general this should not be needed, however, since the main partition should win + out in any Genesis density comparisons. This usage also falls closer to + scenario 2, in that it relies on an external source imposing a chain selection, + which must then be trusted by all parties. +2. Following a disaster recovery procedure, a sufficient number of blocks + covering the low density period should be added to the list of lightweight + checkpoints. These would serve the purpose of preventing a subsequent + long-range attack. + +Note that, in this second scenario, concens about the legitimacy of the +checkpoint are much less salient. The checkpoint can be issued post disaster +recovery, at such a time where the points it contains are in the past, and are +both agreed upon and easy to verify for all honest parties. ### Using Mithril Snapshots From ad153fde576e743f6f2ca8b30c555ae5580850b4 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Thu, 29 Aug 2024 12:57:42 +0100 Subject: [PATCH 17/45] renamed ro CIP-911 --- CIP-0911/README.md | 353 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 353 insertions(+) create mode 100644 CIP-0911/README.md diff --git a/CIP-0911/README.md b/CIP-0911/README.md new file mode 100644 index 0000000000..a9945b46b1 --- /dev/null +++ b/CIP-0911/README.md @@ -0,0 +1,353 @@ +--- +CIP: CIP-0911? 
+Title: Disaster Recovery Plan for Cardano +Category: Cardano Information +Status: Proposed +Authors: + - Kevin Hammond + - Sam Leathers + - Alex Moser + - Steve Wagendorp + - Rick McCracken + - Adam Dean + - Nicholas Clarke +Implementors: [] +Discussions: + - https://github.com/cardano-foundation/CIPs/pull/? +Created: 2024-06-17 +License: CC-BY-4.0 +--- + + + +## Abstract + + +While the Cardano mainnet has proven to be highly resilient, it is necessary to proactively +consider the possible recovery mechanisms and procedures that may be required in the unlikely +event of a major failure where the network is unable to recover itself. + +This CIP considers three representative scenarios and addresses specific considerations relevant +in each case: + +Scenario 1 - __Long-Lived Network Partition__ +Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ +Scenario 3 - __Bad Blocks Minted on Chain__ + +To ensure successful recovery in the event of a chain failure, it's crucial to establish effective +communication channels and exercise recovery procedures in advance to familiarize the community and +stake pool operators (SPOs) with the process. + +This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal +documentation and discussions that have not been publicly released. It should be considered to be a living +document that is reviewed and revised on a regular basis. + + +## Motivation: why is this CIP necessary? + + +This CIP is needed to familiarize stakeholders with the processes and procedures that should be +followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in +on-chain recovery mechanisms fail. + +## Disaster Recovery Procedures + + +While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. 
+ +### Scenario 1: Long-Lived Network Partition + +Ouroboros Praos is designed to cope with real-world networking +conditions, in which some nodes may temporarily be disconnected from +the network. In this case, the network will continue to make blocks, +perhaps at some lower chain density (reflecting the temporary loss of +stake to the network as a whole). As nodes rejoin the network, they +will then participate in normal block production once again. In this +way, the network remains resilient to changes in connectivity. + +If many nodes become disconnected, the network could divide into two +or more completely disconnected parts. Each part of the network could +then form its own chain, backed by the stake that is participating in +its own partition. Under normal conditions, Praos will also deal with +this situation. When the partitioned group of nodes reconnects, the +longest chain will dominate, and the shorter chain will be discarded. +The nodes on the shorter chain will automatically rollback to the +point where the fork occurred, and then rejoin the main chain. This +is perfectly normal. Such forks will typically last only a few +blocks. + +However, in an extreme situation, the partition may persist beyond the +Praos rollback limit of *k* blocks (currently 2,160 blocks). In this case, the nodes +will not be able to rollback to rejoin the main chain, since this +would violate the required Praos guarantees. + + +#### Remediations + +Disconnected nodes must be reconnected to the main chain by their operators. This can be done +by truncating the local block database to a point before the chain fork and then resyncing +against the main network. This can be done using the `db-truncator` tool. + +Full node wallets can also be recovered in the same way, though this may require technical +skills that the end users do not possess. It may be easier, if slower, for them to simply +resynchronize their nodes from the start of the chain (i.e. from the genesis block). 
+ +Ouroboros Genesis provides additional resilience when recovering from long lived network partitions. +In Praos nodes resyncing from a point before the chain fork could still in some cases follow the +alternative chain (if it is the first one seen) and extra mechanisms may be needed to avoid this +possibility. In Praos, for example, this may require that all participants on the alternative chain +truncate the local block database prior to the partition being resolved. In Ouroboros Genesis +when resyncing from a point before the chain fork, the chain selection rules will ensure +selection of the correct path for the main chain assuming the partition has been resolved. + +Alternative methods to resynchronise the node to the main chain might +include the use of Mithril or other signed snapshots. These would +allow faster recovery. However, in this case, care needs to be taken +to achieve the correct balance of trust against speed of recovery. + +#### Additional Effects on Cardano Users + +Although block producing nodes will rejoin the main network following the remediation +described above, the blocks that they have +minted while they were disconnected will not be included in the main +chain. This may have real world effects that will not be +automatically remedied when the nodes rejoin the main chain. For +example, transactions may have been processed that have significant +real world value, or assumptions may have been made about chains of +evidence/validity, or the timing of transactions. End users should be +aware of the possibility and include provisions in their contracts to +cover this eventuality. It may be necessary to resubmit some or all of the +transactions that were processed on the minority chain onto the main chain. +To avoid unexpected effects, this should be done by the end users/applications, and not +by block producers acting on their behalf. + +If they are not observant, stake pools, full node wallets and +other node users (e.g. 
explorers) could continue indefinitely on the minority +chain. Such users should take care to be aware of this situation and +take steps to rejoin the main chain as quickly as possible. +A reliable and trusted public warning system should be considered that can alert users +and advise them on how to rejoin the main chain. + + +#### Timing Considerations + +Partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano settings, this represents +a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the +procedure described above will be necessary to allow nodes to rejoin the main chain. + + +### Scenario 2: Failure to Make Blocks for an Extended Period of Time + +Ouroboros Praos requires *at least* one block to be produced every *3k/f* slots. With the current Cardano mainnet +settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network +would be unable to make any further blocks. + +#### Mitigation + +It is recommended to monitor the chain for block production. If a low density period is observed, then block producers +should be notified, and efforts made to mint new blocks prior to the expiry of the *3k/f* window. If this is not possible +then the remediation procedures should be followed. + +#### Remediation + +Identify a small group of block producing nodes that will be used to recover the chain. This group should have +sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window. +It should be isolated from the rest of the network. +The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, +restarting them from the last good block on Cardano mainnet, playing forward the chain production +at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which +are allocated to the block producers. 
The recovery nodes can then be restarted with normal settings, including +connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize +with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. + +##### Rewards Donation by Recovery Block Producers + +In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should +donate any rewards that they receive during recovery to the treasury. + + +#### Additional Effects on Cardano Users + +Unlike Scenario 1, no transactions will be submitted that need to be resubmitted on the chain. +Users will, however, experience an extended period during which the chain is unavailable. +Cardano applications and contracts should be designed with this possibility in mind. +Full node wallets and other node users should recover quickly once the network is restarted +but there may be a period of instability while network connections are re-established +and the Ouroboros Genesis snapshot is distributed across all nodes. + + +#### Timing Considerations + +The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). +A period of low chain density could have security implications that affect dynamic availability +and leave open the possibility for future long range attacks. This may be particularly +relevant should chain recovery be performed as described above (using less stake than is required +for an honest majority). To mitigate the presence of an extended period of low chain density we may +need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively, Mithril +could be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger.
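
The two time windows quoted in this plan (12 hours for automatic rollback, 36 hours for the maximum block gap) can be checked with a short calculation. The following is a minimal sketch, not part of any node tooling. It assumes the mainnet values *k* = 2,160, an active slot coefficient *f* = 0.05 and a slot length of one second; only *k* is stated explicitly in this document, so the other two values should be confirmed against the current genesis configuration. The function names are illustrative.

```python
from fractions import Fraction

# Assumed mainnet parameters; only k is stated in this document.
K = 2160                 # security parameter k, in blocks
F = Fraction(1, 20)      # active slot coefficient f = 0.05 (assumption)
SLOT_SECONDS = 1         # slot length in seconds (assumption)


def max_gap_slots(k=K, f=F):
    """Longest gap between blocks that the chain tolerates: 3k/f slots."""
    return Fraction(3 * k) / f


def rollback_window_slots(k=K, f=F):
    """Expected number of slots needed to produce k blocks: k/f slots."""
    return Fraction(k) / f


def slots_to_hours(slots, slot_seconds=SLOT_SECONDS):
    """Convert a slot count to wall-clock hours."""
    return Fraction(slots) * slot_seconds / 3600


if __name__ == "__main__":
    print("max gap:", max_gap_slots(), "slots =",
          slots_to_hours(max_gap_slots()), "hours")
    print("rollback window:", rollback_window_slots(), "slots =",
          slots_to_hours(rollback_window_slots()), "hours")
```

Exact fractions are used instead of floats so that the results compare cleanly against the whole-hour figures in the text. If any parameter changes, both windows should be recomputed before deciding which recovery procedure applies.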
+ +The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks +for the types of users on the network that do not participate in consensus. + +Ouroboros Genesis may also provide a remedy (TODO: confirm and describe this). + + +### Scenario 3: Bad Blocks Minted on Chain + +In the event that a bad block was to be minted on-chain, then some or all validators might be unable to process the block. +They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the +point of the bad block. + +#### Remediation + +Depending on the cause of the issue and its severity, alternative remediations might be possible. + +**Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then +the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) +to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level. +Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain. + +**Scenario 3.2**: if no node version was able to process the block and a +gap of less than *3k/f* slots existed, then the chain could be rolled +back immediately before the bad block was created, and nodes +restarted from this point. The chain would then grow as normal, with a small gap around the bad block. +In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain. +This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that +rejected the bad block. + +**Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could +accept the bad block, either as an exception, or as new acceptable behaviour. 
+Nodes would then be able to incorporate the bad block as part of the chain, +minting new blocks as usual, or following the chain. +In this case, the bad block would persist on-chain indefinitely and future nodes +would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour +that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain. + +**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately +prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave +a series of gaps in the chain that are interspersed with empty blocks. + +#### Timing Considerations + +If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano settings), +then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery +technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots +have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed. + +### Using Ouroboros Genesis Snapshots + +Any of the above conditions may result in a period of lower chain density. The +updated consensus mechanism introduced in Ouroboros Genesis relies on making +chain density comparisons to assist a node when catching up with the network, +in order to reduce the reliance on having trusted peers when syncing. As +such, low-density periods pose a potential security risk for the future; they +are periods where a motivated adversary could perform a long-range attack by +building a higher density chain. + +In order to mitigate this, Genesis introduces the concepts of lightweight +checkpoints. 
A lightweight checkpoint is effectively a block point - a +combination of block number and hash - which can be distributed along with the +node. Unlike Mithril Snapshots (see below), Genesis lightweight checkpoints are not assured by any committee - rather, they form part of the trusted codebase distributed with the node, or by other parties. + +When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point. + +Genesis snapshots play two potential roles in disaster recovery: + +1. In scenarios where the network is split, a lightweight snapshot could guide + a node from the abandoned partition in connecting to the main partition. In + general this should not be needed, however, since the main partition should win + out in any Genesis density comparisons. This usage also falls closer to + scenario 2, in that it relies on an external source imposing a chain selection, + which must then be trusted by all parties. +2. Following a disaster recovery procedure, a sufficient number of blocks + covering the low density period should be added to the list of lightweight + checkpoints. These would serve the purpose of preventing a subsequent + long-range attack. + +Note that, in this second scenario, concerns about the legitimacy of the +checkpoint are much less salient. The checkpoint can be issued post disaster +recovery, at a time when the points it contains are in the past, and are +both agreed upon and easy to verify for all honest parties. + + +### Using Mithril Snapshots + +Mithril is a stake-based threshold multi-signature scheme. One of the applications of this protocol in Cardano +is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications +to obtain a verified copy of the current state of the blockchain without having to download and verify the full history.
+ +SPOs on mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that +is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. +Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that +can potentially be used as a trusted source for recovery purposes. + +Provided that it gains sufficient adoption on Mainnet and that +snapshots continue to be signed by an honest majority of stake pools +following a chain recovery event, Mithril may therefore provide an +alternative solution to Ouroboros Genesis checkpoints as a way to +verify the correct state of the ledger + + +## Recommended Actions + +1. Monitor the network for periods of low density and take early action if an extended period is observed. +2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. +3. Set up emergency communication channels with stake pool operators and other community members. +4. Practice disaster recovery procedures on a regular basis. +5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. +6. Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process + +### Community Engagement + +One of the key requirements for successful disaster recovery will be proper engagement with the community. + +1. Identify stake pool operators (SPOs) who can assist with disaster recovery +2. Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council +3. Identify and establish the right communications channels with the community, including Intersect +4. Set up regular disaster recovery practice sessions + + +## Rationale: how does this CIP achieve its goals? 
+ + +This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate +potential network outages. As a living document, it will be regularly reviewed and updated to inform +stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions, +establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness +and validation of recovery actions in the event of an outage. + +## References + +[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) + +[Cardano Incident Reports](https://updates.cardano.intersectmbo.org/tags/incident) + +[January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger) + +[DB Truncator Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) + +[DB Synthesizer Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) + +[Ouroboros Genesis](https://iohk.io/en/research/library/papers/ouroboros-genesis-composable-proof-of-stake-blockchains-with-dynamic-availability/) + +[Mithril](https://github.com/input-output-hk/mithril) + + +## Copyright + + + This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). 
From 8e71c949fd551f8e7260e9a601291bc15386eb3b Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 30 Aug 2024 09:55:28 +0100 Subject: [PATCH 18/45] updated authors --- CIP-0911/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index a9945b46b1..3eec998227 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -8,8 +8,7 @@ Authors: - Sam Leathers - Alex Moser - Steve Wagendorp - - Rick McCracken - - Adam Dean + - Andrew Westberg - Nicholas Clarke Implementors: [] Discussions: From 7fdc156bf7b984e71a4af0b771da75f3e423ab15 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 30 Aug 2024 09:56:30 +0100 Subject: [PATCH 19/45] updated authors --- CIP-0123/README.md | 353 --------------------------------------------- 1 file changed, 353 deletions(-) delete mode 100644 CIP-0123/README.md diff --git a/CIP-0123/README.md b/CIP-0123/README.md deleted file mode 100644 index ba4ae09b14..0000000000 --- a/CIP-0123/README.md +++ /dev/null @@ -1,353 +0,0 @@ ---- -CIP: CIP-0123? -Title: Disaster Recovery Plan for Cardano -Category: Cardano Information -Status: Proposed -Authors: - - Kevin Hammond - - Sam Leathers - - Alex Moser - - Steve Wagendorp - - Rick McCracken - - Adam Dean - - Nicholas Clarke -Implementors: [] -Discussions: - - https://github.com/cardano-foundation/CIPs/pull/? -Created: 2024-06-17 -License: CC-BY-4.0 ---- - - - -## Abstract - - -While the Cardano mainnet has proven to be highly resilient, it is necessary to proactively -consider the possible recovery mechanisms and procedures that may be required in the unlikely -event of a major failure where the network is unable to recover itself. 
- -This CIP considers three representative scenarios and addresses specific considerations relevant -in each case: - -Scenario 1 - __Long-Lived Network Partition__ -Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ -Scenario 3 - __Bad Blocks Minted on Chain__ - -To ensure successful recovery in the event of a chain failure, it's crucial to establish effective -communication channels and exercise recovery procedures in advance to familiarize the community and -stake pool operators (SPOs) with the process. - -This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal -documentation and discussions that have not been publicly released. It should be considered to be a living -document that is reviewed and revised on a regular basis. - - -## Motivation: why is this CIP necessary? - - -This CIP is needed to familiarize stakeholders with the processes and procedures that should be -followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in -on-chain recovery mechanisms fail. - -## Disaster Recovery Procedures - - -While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. - -### Scenario 1: Long-Lived Network Partition - -Ouroboros Praos is designed to cope with real-world networking -conditions, in which some nodes may temporarily be disconnected from -the network. In this case, the network will continue to make blocks, -perhaps at some lower chain density (reflecting the temporary loss of -stake to the network as a whole). As nodes rejoin the network, they -will then participate in normal block production once again. In this -way, the network remains resilient to changes in connectivity. - -If many nodes become disconnected, the network could divide into two -or more completely disconnected parts. 
Each part of the network could -then form its own chain, backed by the stake that is participating in -its own partition. Under normal conditions, Praos will also deal with -this situation. When the partitioned group of nodes reconnects, the -longest chain will dominate, and the shorter chain will be discarded. -The nodes on the shorter chain will automatically rollback to the -point where the fork occurred, and then rejoin the main chain. This -is perfectly normal. Such forks will typically last only a few -blocks. - -However, in an extreme situation, the partition may persist beyond the -Praos rollback limit of *k* blocks (currently 2,160 blocks). In this case, the nodes -will not be able to rollback to rejoin the main chain, since this -would violate the required Praos guarantees. - - -#### Remediations - -Disconnected nodes must be reconnected to the main chain by their operators. This can be done -by truncating the local block database to a point before the chain fork and then resyncing -against the main network. This can be done using the `db-truncator` tool. - -Full node wallets can also be recovered in the same way, though this may require technical -skills that the end users do not possess. It may be easier, if slower, for them to simply -resynchronize their nodes from the start of the chain (i.e. from the genesis block). - -Ouroboros Genesis provides additional resilience when recovering from long lived network partitions. -In Praos nodes resyncing from a point before the chain fork could still in some cases follow the -alternative chain (if it is the first one seen) and extra mechanisms may be needed to avoid this -possibility. In Praos, for example, this may require that all participants on the alternative chain -truncate the local block database prior to the partition being resolved.
In Ouroboros Genesis -when resyncing from a point before the chain fork, the chain selection rules will ensure -selection of the correct path for the main chain assuming the partition has been resolved. - -Alternative methods to resynchronise the node to the main chain might -include the use of Mithril or other signed snapshots. These would -allow faster recovery. However, in this case, care needs to be taken -to achieve the correct balance of trust against speed of recovery. - -#### Additional Effects on Cardano Users - -Although block producing nodes will rejoin the main network following the remediation -described above, the blocks that they have -minted while they were disconnected will not be included in the main -chain. This may have real world effects that will not be -automatically remedied when the nodes rejoin the main chain. For -example, transactions may have been processed that have significant -real world value, or assumptions may have been made about chains of -evidence/validity, or the timing of transactions. End users should be -aware of the possibility and include provisions in their contracts to -cover this eventuality. It may be necessary to resubmit some or all of the -transactions that were processed on the minority chain onto the main chain. -To avoid unexpected effects, this should be done by the end users/applications, and not -by block producers acting on their behalf. - -If they are not observant, stake pools, full node wallets and -other node users (e.g. explorers) could continue indefinitely on the minority -chain. Such users should take care to be aware of this situation and -take steps to rejoin the main chain as quickly as possible. -A reliable and trusted public warning system should be considered that can alert users -and advise them on how to rejoin the main chain. - - -#### Timing Considerations - -Partitions of less than 2,160 blocks will automatically rejoin the main chain. 
With current Cardano settings, this represents -a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the -procedure described above will be necessary to allow nodes to rejoin the main chain. - - -### Scenario 2: Failure to Make Blocks for an Extended Period of Time - -Ouroboros Praos requires *at least* one block to be produced every *3k/f* slots. With the current Cardano mainnet -settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network -would be unable to make any further blocks. - -#### Mitigation - -It is recommended to monitor the chain for block production. If a low density period is observed, then block producers -should be notified, and efforts made to mint new blocks prior to the expiry of the *3k/f* window. If this is not possible -then the remediation procedures should be followed. - -#### Remediation - -Identify a small group of block producing nodes that will be used to recover the chain. This group should have -sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window. -It should be isolated from the rest of the network. -The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, -restarting them from the last good block on Cardano mainnet, playing forward the chain production -at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which -are allocated to the block producers. The recovery nodes can then be restarted with normal settings, including -connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize -with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. 
- -##### Rewards Donation by Recovery Block Producers - -In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should -donate any rewards that they receive during recovery to the treasury. - - -#### Additional Effects on Cardano Users - -Unlike Scenario 1, no transactions will be submitted that need to be resubmitted on the chain. -Users will, however, experience an extended period during which the chain is unavailable. -Cardano applications and contracts should be designed with this possibility in mind. -Full node wallets and other node users should recover quickly once the network is restarted -but there may be a period of instability while network connections are re-established -and the Ouroboros Genesis snapshot is distributed across all nodes. - - -#### Timing Considerations - -The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). -A period of low chain density could have security implications that affect dynamic availability -and leave open the possibility for future long range attacks. This may be particularly -relevant should chain recovery be performed as described above (using less stake than is required -for an honest majority). To mitigate the presence of an extended period of low chain density we may -need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively Mithril -could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger. - -The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks -for the types of users on the network that do not participate in consensus. - -Ouroboros Genesis may also provide a remedy (TODO: confirm and describe this). - - -### Scenario 3: Bad Blocks Minted on Chain - -In the event that a bad block was to be minted on-chain, then some or all validators might be unable to process the block.
-They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the -point of the bad block. - -#### Remediation - -Depending on the cause of the issue and its severity, alternative remediations might be possible. - -**Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then -the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) -to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level. -Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain. - -**Scenario 3.2**: if no node version was able to process the block and a -gap of less than *3k/f* slots existed, then the chain could be rolled -back immediately before the bad block was created, and nodes -restarted from this point. The chain would then grow as normal, with a small gap around the bad block. -In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain. -This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that -rejected the bad block. - -**Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could -accept the bad block, either as an exception, or as new acceptable behaviour. -Nodes would then be able to incorporate the bad block as part of the chain, -minting new blocks as usual, or following the chain. -In this case, the bad block would persist on-chain indefinitely and future nodes -would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour -that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain. 
- -**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain immediately -prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave -a series of gaps in the chain that are interspersed with empty blocks. - -#### Timing Considerations - -If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano settings), -then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery -technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots -have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed. - -### Using Ouroboros Genesis Snapshots - -Any of the above conditions may result in a period of lower chain density. The -updated consensus mechanism introduced in Ouroboros Genesis relies on making -chain density comparisons to assist a node when catching up with the network, -in order to reduce the reliance on having trusted peers when syncing. As -such, low-density periods pose a potential security risk for the future; they -are periods where a motivated adversary could perform a long-range attack by -building a higher density chain. - -In order to mitigate this, Genesis introduces the concepts of lightweight -checkpoints. A lightweight checkpoint is effectively a block point - a -combination of block number and hash - which can be distributed along with the -node. Unlike Mithril Snapshots (see below), Genesis lightweight snapshots are not assured by any committee - rather, they form part of the trusted codebase distributed with the node, or by other parties. - -When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point. 
- -Genesis snapshots play two potential roles in disaster recovery: - -1. In scenarios where the network is split, a lightweight snapshot could guide - a node from the abandoned partition in connecting to the main partition. In - general this should not be needed, however, since the main partition should win - out in any Genesis density comparisons. This usage also falls closer to - scenario 2, in that it relies on an external source imposing a chain selection, - which must then be trusted by all parties. -2. Following a disaster recovery procedure, a sufficient number of blocks - covering the low density period should be added to the list of lightweight - checkpoints. These would serve the purpose of preventing a subsequent - long-range attack. - -Note that, in this second scenario, concerns about the legitimacy of the -checkpoint are much less salient. The checkpoint can be issued post disaster -recovery, at such a time where the points it contains are in the past, and are -both agreed upon and easy to verify for all honest parties. - - -### Using Mithril Snapshots - -Mithril is a stake-based threshold multi-signature scheme. One of the applications of this protocol in Cardano -is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications -to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. - -SPOs on mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that -is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. -Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that -can potentially be used as a trusted source for recovery purposes.
- -Provided that it gains sufficient adoption on Mainnet and that -snapshots continue to be signed by an honest majority of stake pools -following a chain recovery event, Mithril may therefore provide an -alternative solution to Ouroboros Genesis checkpoints as a way to -verify the correct state of the ledger - - -## Recommended Actions - -1. Monitor the network for periods of low density and take early action if an extended period is observed. -2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. -3. Set up emergency communication channels with stake pool operators and other community members. -4. Practice disaster recovery procedures on a regular basis. -5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. -6. Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process - -### Community Engagement - -One of the key requirements for successful disaster recovery will be proper engagement with the community. - -1. Identify stake pool operators (SPOs) who can assist with disaster recovery -2. Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council -3. Identify and establish the right communications channels with the community, including Intersect -4. Set up regular disaster recovery practice sessions - - -## Rationale: how does this CIP achieve its goals? - - -This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate -potential network outages. As a living document, it will be regularly reviewed and updated to inform -stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions, -establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness -and validation of recovery actions in the event of an outage. 
- -## References - -[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) - -[Cardano Incident Reports](https://updates.cardano.intersectmbo.org/tags/incident) - -[January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger) - -[DB Truncator Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) - -[DB Synthesizer Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) - -[Ouroboros Genesis](https://iohk.io/en/research/library/papers/ouroboros-genesis-composable-proof-of-stake-blockchains-with-dynamic-availability/) - -[Mithril](https://github.com/input-output-hk/mithril) - - -## Copyright - - - This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). From d5e6849cfa046eede66ca7ccc326190f8828b875 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 30 Aug 2024 10:10:20 +0100 Subject: [PATCH 20/45] added change log --- CIP-0911/README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 3eec998227..b708373b38 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -329,6 +329,12 @@ stakeholders and encourage more detailed contingency planning. The CIP aims to f establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness and validation of recovery actions in the event of an outage. 
+## Change Log + +| Version | Date | Description | +| -------- | -------- | ------- | +| 0.1 | 2024-08-30 | Initial submitted version | + ## References [Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) From a4fae458987b26b24447576cd249704c6688d46b Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:22:46 +0100 Subject: [PATCH 21/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index b708373b38..460aea7821 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -1,6 +1,6 @@ --- CIP: CIP-0911? -Title: Disaster Recovery Plan for Cardano +Title: Disaster Recovery Plan for Mainnet Category: Cardano Information Status: Proposed Authors: From de7fa2673159413943fa642799fb5dd610c50ecb Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:23:00 +0100 Subject: [PATCH 22/45] Update CIP-0911/README.md Co-authored-by: Ryan <44342099+Ryun1@users.noreply.github.com> --- CIP-0911/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 460aea7821..74eebe38a3 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -12,7 +12,7 @@ Authors: - Nicholas Clarke Implementors: [] Discussions: - - https://github.com/cardano-foundation/CIPs/pull/? 
+ - https://github.com/cardano-foundation/CIPs/pull/893 Created: 2024-06-17 License: CC-BY-4.0 --- From 5f184f9668332380aac5484d947f2003c41bcfd8 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:23:45 +0100 Subject: [PATCH 23/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 74eebe38a3..52e58e96ce 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -17,19 +17,6 @@ Created: 2024-06-17 License: CC-BY-4.0 --- - - ## Abstract From 97e066c99e20c58837fb97e5ba48d9f398b67a3e Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:24:14 +0100 Subject: [PATCH 24/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 52e58e96ce..f871b869b4 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -273,7 +273,7 @@ Mithril is a stake-based threshold multi-signatures scheme. One of the applicati is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. -SPOs on mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that +SPOs on Mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that can potentially be used as a trusted source for recovery purposes. 
From b8e00706b86b7549bd7c69704308bd6a834d5110 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:24:29 +0100 Subject: [PATCH 25/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index f871b869b4..3d7a277b7c 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -18,7 +18,6 @@ License: CC-BY-4.0 --- ## Abstract - While the Cardano mainnet has proven to be highly resilient, it is necessary to proactively consider the possible recovery mechanisms and procedures that may be required in the unlikely From f0cda3947d8f30b4786fea8b983cdb9f007beb77 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:24:52 +0100 Subject: [PATCH 26/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 3d7a277b7c..7ae9f87ec2 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -40,7 +40,6 @@ document that is reviewed and revised on a regular basis. ## Motivation: why is this CIP necessary? 
- This CIP is needed to familiarize stakeholders with the processes and procedures that should be followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in From 1a4de56c040bb3a8f24172814ab71f5653739357 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:25:07 +0100 Subject: [PATCH 27/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 7ae9f87ec2..39d4c9cf67 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -46,7 +46,6 @@ followed in the unlikely event that the Cardano mainnet encounters a situation w on-chain recovery mechanisms fail. ## Disaster Recovery Procedures - While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. From 8083f09ecba94ab258335cd163d7f410286a1e92 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:25:21 +0100 Subject: [PATCH 28/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 39d4c9cf67..4a2fd28d5f 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -302,10 +302,6 @@ One of the key requirements for successful disaster recovery will be proper enga ## Rationale: how does this CIP achieve its goals? - This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate potential network outages. 
As a living document, it will be regularly reviewed and updated to inform From 4bd432ed81bd9bc2a2dbb4fa4874a0db204a3ff0 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:25:32 +0100 Subject: [PATCH 29/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 4a2fd28d5f..61dc31d834 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -333,6 +333,5 @@ and validation of recovery actions in the event of an outage. ## Copyright - This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). From c32c4b1217edaaf2968db289bc48f39e14410433 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:26:12 +0100 Subject: [PATCH 30/45] Update CIP-0911/README.md Co-authored-by: Robert Phair --- CIP-0911/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index 61dc31d834..d4a39a62db 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -1,7 +1,7 @@ --- CIP: CIP-0911? Title: Disaster Recovery Plan for Mainnet -Category: Cardano Information +Category: Tools Status: Proposed Authors: - Kevin Hammond From 58e3e589b7d1ba8525a3e5f27d36a2bc8fbcb3f3 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Tue, 10 Sep 2024 12:15:22 +0100 Subject: [PATCH 31/45] edited to make it clear which procedures are generic, and which apply to mainnet; removed TODO --- CIP-0911/README.md | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-) diff --git a/CIP-0911/README.md b/CIP-0911/README.md index d4a39a62db..25f67b718a 100644 --- a/CIP-0911/README.md +++ b/CIP-0911/README.md @@ -1,6 +1,6 @@ --- CIP: CIP-0911? 
-Title: Disaster Recovery Plan for Mainnet +Title: Disaster Recovery Plan for Cardano networks (including mainnet) Category: Tools Status: Proposed Authors: @@ -38,12 +38,15 @@ This CIP is based on an earlier IOHK technical report that is referenced below, documentation and discussions that have not been publicly released. It should be considered to be a living document that is reviewed and revised on a regular basis. +Note that although the focus of disaster recovery is on Cardano mainnet, the recovery procedures are generic and apply to other Cardano +networks, including SanchoNet, Preview, PreProd or private networks. + ## Motivation: why is this CIP necessary? This CIP is needed to familiarize stakeholders with the processes and procedures that should be -followed in the unlikely event that the Cardano mainnet encounters a situation where the built-in -on-chain recovery mechanisms fail. +followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters +a situation where the built-in on-chain recovery mechanisms fail. ## Disaster Recovery Procedures @@ -71,8 +74,8 @@ is perfectly normal. Such forks will typically last only a few blocks. However, in an extreme situation, the partition may persist beyond the -Praos rollback limit of *k* blocks (currently 2,160 blocks). In this case, the nodes -will not be able to rollback to rejoin the main chain, since this +Praos rollback limit of *k* blocks (currently 2,160 blocks on mainnet). +In this case, the nodes will not be able to rollback to rejoin the main chain, since this would violate the required Praos guarantees. @@ -80,7 +83,7 @@ would violate the required Praos guarantees. Disconnected nodes must be reconnected to the main chain by their operators. This can be done by truncating the local block database to a point before the chain fork and then resyncing -against the main network. This can be done using the `db-truncator` tool. 
+against the main network, using the `db-truncator` tool, for example. Full node wallets can also be recovered in the same way, though this may require technical skills that the end users do not possess. It may be easier, if slower, for them to simply @@ -125,9 +128,10 @@ and advise them on how to rejoin the main chain. #### Timing Considerations -Partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano settings, this represents +On Cardano mainnet, partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano mainnet settings, this represents a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the -procedure described above will be necessary to allow nodes to rejoin the main chain. +procedure described above will be necessary to allow nodes to rejoin the main chain. Other Cardano networks may have different +timing characteristics. ### Scenario 2: Failure to Make Blocks for an Extended Period of Time @@ -144,11 +148,11 @@ then the remediation procedures should be followed. #### Remediation -Identify a small group of block producing nodes that will be used to recover the chain. This group should have +Identify a small group of block producing nodes that will be used to recover the chain. For Cardano mainnet, this group should have sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window. It should be isolated from the rest of the network. The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, -restarting them from the last good block on Cardano mainnet, playing forward the chain production +restarting them from the last good block on the Cardano network, playing forward the chain production at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which are allocated to the block producers. 
The recovery nodes can then be restarted with normal settings, including connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize @@ -172,7 +176,7 @@ and the Ouroboros Genesis snapshot is distributed across all nodes. #### Timing Considerations -The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano settings). +The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano mainnet settings). A period of low chain density could have security implications that affect dynamic availability and leave open the possibility for future long range attacks. This may be particularly relevant should chain recovery be performed as described above (using less stake than is required @@ -183,7 +187,7 @@ could also be used to provide certified snapshots to stake pools as a means to v The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks for the types of users on the network that do not participate in consensus. -Ouroboros Genesis may also provide a remedy (TODO: confirm and describe this). +As described below, Ouroboros Genesis snapshots may also be useful as part of the recovery process. ### Scenario 3: Bad Blocks Minted on Chain @@ -223,7 +227,7 @@ a series of gaps in the chain that are interspersed with empty blocks. #### Timing Considerations -If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano settings), +If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano mainnet settings), then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed. 
@@ -270,21 +274,21 @@ Mithril is a stake-based threshold multi-signatures scheme. One of the applicati is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. -SPOs on Mainnet that participate in the Mithril network provide signed snapshots to a Mithril aggregator that +SPOs that participate in the Mithril network provide signed snapshots to a Mithril aggregator that is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that can potentially be used as a trusted source for recovery purposes. -Provided that it gains sufficient adoption on Mainnet and that +Provided that it gains sufficient adoption on the Cardano network and that snapshots continue to be signed by an honest majority of stake pools following a chain recovery event, Mithril may therefore provide an alternative solution to Ouroboros Genesis checkpoints as a way to verify the correct state of the ledger -## Recommended Actions +## Recommended Actions for Cardano mainnet -1. Monitor the network for periods of low density and take early action if an extended period is observed. +1. Monitor Cardano mainnet for periods of low density and take early action if an extended period is observed. 2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. 3. Set up emergency communication channels with stake pool operators and other community members. 4. Practice disaster recovery procedures on a regular basis. 
From 4a8c67a67597f140be6fce7ef63298acc2295e56 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 18 Sep 2024 10:41:39 +0100 Subject: [PATCH 32/45] renamed directory --- {CIP-0911 => CIP-0135}/README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {CIP-0911 => CIP-0135}/README.md (100%) diff --git a/CIP-0911/README.md b/CIP-0135/README.md similarity index 100% rename from CIP-0911/README.md rename to CIP-0135/README.md From da29acb7fd2a16adedd8b4b9e2c82747b7121350 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 18 Sep 2024 10:59:56 +0100 Subject: [PATCH 33/45] added "path to active" plus small editing changes --- CIP-0135/README.md | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 25f67b718a..fac242a1ee 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -1,6 +1,6 @@ --- -CIP: CIP-0911? -Title: Disaster Recovery Plan for Cardano networks (including mainnet) +CIP: CIP-0135 +Title: Disaster Recovery Plan for Cardano networks Category: Tools Status: Proposed Authors: @@ -10,7 +10,7 @@ Authors: - Steve Wagendorp - Andrew Westberg - Nicholas Clarke -Implementors: [] +Implementors: N/A Discussions: - https://github.com/cardano-foundation/CIPs/pull/893 Created: 2024-06-17 @@ -19,7 +19,7 @@ License: CC-BY-4.0 ## Abstract -While the Cardano mainnet has proven to be highly resilient, it is necessary to proactively +While the Cardano mainnet and other networks have proven to be highly resilient, it is necessary to proactively consider the possible recovery mechanisms and procedures that may be required in the unlikely event of a major failure where the network is unable to recover itself. @@ -38,15 +38,17 @@ This CIP is based on an earlier IOHK technical report that is referenced below, documentation and discussions that have not been publicly released. 
It should be considered to be a living document that is reviewed and revised on a regular basis. -Note that although the focus of disaster recovery is on Cardano mainnet, the recovery procedures are generic and apply to other Cardano +Note that although the focus of disaster recovery is on Cardano mainnet, since this is the greatest risk +of loss of funds, the recovery procedures are generic and apply to other Cardano networks, including SanchoNet, Preview, PreProd or private networks. +Appropriate adjustments may need to be made to reflect differences in timing or other concerns. ## Motivation: why is this CIP necessary? This CIP is needed to familiarize stakeholders with the processes and procedures that should be followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters -a situation where the built-in on-chain recovery mechanisms fail. +a situation where the built-in on-chain recovery mechanisms fail. ## Disaster Recovery Procedures @@ -313,6 +315,22 @@ stakeholders and encourage more detailed contingency planning. The CIP aims to f establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness and validation of recovery actions in the event of an outage. +## Path to Active + +### Acceptance criteria + +- [x] The proposal has been reviewed by the community and sufficiently advertised on various channels. + - [x] Intersect Channels + - [x] Cardano Forum + - [x] Twitter + - [x] Reddit + +- [x] All major concerns or feedback have been addressed. 
+ +### Implementation Plan + +N/A + ## Change Log | Version | Date | Description | From ca4cfc6e044a1fce65d674ba5fb3e24ed0913f89 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 18 Sep 2024 11:06:28 +0100 Subject: [PATCH 34/45] tweaked path to active --- CIP-0135/README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index fac242a1ee..ab7a07f5ee 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -319,13 +319,14 @@ and validation of recovery actions in the event of an outage. ### Acceptance criteria -- [x] The proposal has been reviewed by the community and sufficiently advertised on various channels. - - [x] Intersect Channels - - [x] Cardano Forum - - [x] Twitter - - [x] Reddit +- [ ] The proposal has been reviewed by the community and sufficiently advertised on various channels. + - [ ] Intersect Technical Groups + - [ ] Intersect Discord Channels + - [ ] Cardano Forum + - [ ] Twitter + - [ ] Reddit -- [x] All major concerns or feedback have been addressed. +- [ ] All major concerns or feedback have been addressed. 
### Implementation Plan @@ -336,6 +337,8 @@ N/A | Version | Date | Description | | -------- | -------- | ------- | | 0.1 | 2024-08-30 | Initial submitted version | +| 0.2 | 2024-09-10 | Revised version to emphasize genericity of recovery techniques | +| 0.3 | 2024-09-18 | Revised version following CIP editors meeting | ## References From ce95ddff801c8bb131671ccce8a271d2bc06574c Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 18 Sep 2024 14:53:19 +0100 Subject: [PATCH 35/45] Restructured to single section "Specification" --- CIP-0135/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index ab7a07f5ee..c88ebe7601 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -50,7 +50,7 @@ This CIP is needed to familiarize stakeholders with the processes and procedures followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters a situation where the built-in on-chain recovery mechanisms fail. -## Disaster Recovery Procedures +## Specification While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. @@ -288,7 +288,7 @@ alternative solution to Ouroboros Genesis checkpoints as a way to verify the correct state of the ledger -## Recommended Actions for Cardano mainnet +### Recommended Actions for Cardano mainnet 1. Monitor Cardano mainnet for periods of low density and take early action if an extended period is observed. 2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. @@ -297,7 +297,7 @@ verify the correct state of the ledger 5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. 6. 
Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process -### Community Engagement +#### Community Engagement One of the key requirements for successful disaster recovery will be proper engagement with the community. From 6fbbcf6f1d1094df1eac875751f1f8544d42c483 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:04:17 +0100 Subject: [PATCH 36/45] Update CIP-0135/README.md Co-authored-by: Ryan <44342099+Ryun1@users.noreply.github.com> --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index c88ebe7601..005e29f7e0 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -1,5 +1,5 @@ --- -CIP: CIP-0135 +CIP: 135 Title: Disaster Recovery Plan for Cardano networks Category: Tools Status: Proposed From 34b2556320dbe5e81b02a0873ed8c120fb76bb50 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:04:35 +0100 Subject: [PATCH 37/45] Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 005e29f7e0..a1d68aad50 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -264,7 +264,7 @@ Genesis snapshots play two potential roles in disaster recovery: checkpoints. These would serve the purpose of preventing a subsequent long-range attack. -Note that, in this second scenario, concens about the legitimacy of the +Note that, in this second scenario, concerns about the legitimacy of the checkpoint are much less salient. The checkpoint can be issued post disaster recovery, at such a time where the points it contains are in the past, and are both agreed upon and easy to verify for all honest parties. 
From 55c4412c9ce61bc2127759df264f13a4d369acd9 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:04:50 +0100 Subject: [PATCH 38/45] Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index a1d68aad50..4ac5e5f78d 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -163,7 +163,7 @@ with the newly restored chain. This would leave one or more gaps in the chain, ##### Rewards Donation by Recovery Block Producers In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should -donate any rewards that they receieve during recovery to the treasury. +donate any rewards that they receive during recovery to the treasury. #### Additional Effects on Cardano Users From 36291d63d70f6071605460df7d6867f02a6b0093 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:05:42 +0100 Subject: [PATCH 39/45] Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 4ac5e5f78d..43fe0e83bd 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -89,7 +89,7 @@ against the main network, using the `db-truncator` tool, for example. Full node wallets can also be recovered in the same way, though this may require technical skills that the end users do not possess. It may be easier, if slower, for them to simply -resynchronize their nodes fromb the start of the chain (i.e. from the genesis block). +resynchronize their nodes from the start of the chain (i.e. from the genesis block). 
Ouroboros Genesis provides additional resilience when recovering from long lived network partitions. In Praos nodes resyncing from a point before the chain fork could still in some cases follow the From 9a3959fadf57dddfce0d4500dbd416be1344d742 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:06:36 +0100 Subject: [PATCH 40/45] Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 43fe0e83bd..00eff84997 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -183,7 +183,7 @@ A period of low chain density could have security implications that affect dynam and leave open the possibility for future long range attacks. This may be particularly relevant should chain recovery be performed as described above (using less stake than is required for an honest majority). To mitigate the presence of an extended period of low chain density we may -need to make use of the lightweight checkpointing mechanism in Ouroborus Genesis. Alternatively Mithril +need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively, Mithril could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger.
The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks From 0f9ca60cf6402cd6b891e80750017d691328b936 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:11:06 +0100 Subject: [PATCH 41/45] removed spaces --- CIP-0135/README.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 00eff84997..29339cef3f 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -24,11 +24,11 @@ consider the possible recovery mechanisms and procedures that may be required in event of a major failure where the network is unable to recover itself. This CIP considers three representative scenarios and addresses specific considerations relevant -in each case: +in each case: -Scenario 1 - __Long-Lived Network Partition__ -Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ -Scenario 3 - __Bad Blocks Minted on Chain__ +Scenario 1 - __Long-Lived Network Partition__ +Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ +Scenario 3 - __Bad Blocks Minted on Chain__ To ensure successful recovery in the event of a chain failure, it's crucial to establish effective communication channels and exercise recovery procedures in advance to familiarize the community and @@ -36,7 +36,7 @@ stake pool operators (SPOs) with the process. This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal documentation and discussions that have not been publicly released. It should be considered to be a living -document that is reviewed and revised on a regular basis. +document that is reviewed and revised on a regular basis. 
Note that although the focus of disaster recovery is on Cardano mainnet, since this is the greatest risk of loss of funds, the recovery procedures are generic and apply to other Cardano @@ -200,7 +200,7 @@ point of the bad block. #### Remediation -Depending on the cause of the issue and its severity, alternative remediations might be possible. +Depending on the cause of the issue and its severity, alternative remediations might be possible. **Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) @@ -237,7 +237,7 @@ have elapsed. In case of doubt, the procedure for Scenario 3.4 should be follow ### Using Ouroboros Genesis Snapshots Any of the above conditions may result in a period of lower chain density. The -updated consensus mechanism introduced in Ouroboros Genesis relies on making +updated consensus mechanism introduced in Ouroboros Genesis relies on making chain density comparisons to assist a node when catching up with the network, in order to reduce the reliance on having trusted peers when syncing. As such, low-density periods pose a potential security risk for the future; they @@ -261,7 +261,7 @@ Genesis snapshots play two potential roles in disaster recovery: which must then be trusted by all parties. 2. Following a disaster recovery procedure, a sufficient number of blocks covering the low density period should be added to the list of lightweight - checkpoints. These would serve the purpose of preventing a subsequent + checkpoints. These would serve the purpose of preventing a subsequent long-range attack. 
Note that, in this second scenario, concerns about the legitimacy of the @@ -334,11 +334,11 @@ N/A ## Change Log -| Version | Date | Description | +| Version | Date | Description | | -------- | -------- | ------- | -| 0.1 | 2024-08-30 | Initial submitted version | -| 0.2 | 2024-09-10 | Revised version to emphasize genericity of recovery techniques | -| 0.3 | 2024-09-18 | Revised version following CIP editors meeting | +| 0.1 | 2024-08-30 | Initial submitted version | +| 0.2 | 2024-09-10 | Revised version to emphasize genericity of recovery techniques | +| 0.3 | 2024-09-18 | Revised version following CIP editors meeting | ## References From 27481f76ae668be8e0f6020882b5b68749fb3e7c Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Fri, 20 Sep 2024 21:12:47 +0100 Subject: [PATCH 42/45] added some path to active checks --- CIP-0135/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 29339cef3f..f184f12683 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -320,9 +320,9 @@ and validation of recovery actions in the event of an outage. ### Acceptance criteria - [ ] The proposal has been reviewed by the community and sufficiently advertised on various channels. 
- - [ ] Intersect Technical Groups - - [ ] Intersect Discord Channels - - [ ] Cardano Forum + - [x] Intersect Technical Groups + - [x] Intersect Discord Channels + - [x] Cardano Forum - [ ] Twitter - [ ] Reddit From 7f24d3b7de471fc8eef090d4ef16259aa76930ae Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 25 Sep 2024 10:30:43 +0100 Subject: [PATCH 43/45] updated acceptance criteria --- CIP-0135/README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index f184f12683..0d391f5625 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -319,12 +319,10 @@ and validation of recovery actions in the event of an outage. ### Acceptance criteria -- [ ] The proposal has been reviewed by the community and sufficiently advertised on various channels. +- [x] The proposal has been reviewed by the community and sufficiently advertised on various channels. - [x] Intersect Technical Groups - [x] Intersect Discord Channels - [x] Cardano Forum - - [ ] Twitter - - [ ] Reddit - [ ] All major concerns or feedback have been addressed. From fb020af733649a8082d6c4d1167b698edebe419d Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Mon, 30 Sep 2024 21:34:51 +0100 Subject: [PATCH 44/45] updated acceptance criteria --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 0d391f5625..7264616d5d 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -324,7 +324,7 @@ and validation of recovery actions in the event of an outage. - [x] Intersect Discord Channels - [x] Cardano Forum -- [ ] All major concerns or feedback have been addressed. +- [x] All major concerns or feedback have been addressed. 
### Implementation Plan From 8c43db384b7b5ac5f44026069c1998af439ed343 Mon Sep 17 00:00:00 2001 From: Kevin Hammond <12563287+kevinhammond@users.noreply.github.com> Date: Wed, 16 Oct 2024 11:51:33 +0100 Subject: [PATCH 45/45] changed status to active --- CIP-0135/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CIP-0135/README.md b/CIP-0135/README.md index 7264616d5d..28c04c3936 100644 --- a/CIP-0135/README.md +++ b/CIP-0135/README.md @@ -2,7 +2,7 @@ CIP: 135 Title: Disaster Recovery Plan for Cardano networks Category: Tools -Status: Proposed +Status: Active Authors: - Kevin Hammond - Sam Leathers