# ADR #003: March 2022 Testnet Celestia Node

<hr style="border:3px solid gray"> </hr>

## Authors

@renaynay @Wondertan

## Changelog

* 2021-11-25: initial draft

<hr style="border:2px solid gray"> </hr>

## Legend

### Celestia DA Network

Refers to the data availability "halo" network created around the Core network.

### **Bridge Node**

A **bridge** node is a **full** node that is connected to a Celestia Core node via RPC. It is either given the remote
address of a running Core node or runs a Core node as an embedded process, but the critical difference is that,
instead of reconstructing blocks by downloading enough shares from the network, it receives headers and blocks directly
from its trusted Core node, validates the blocks, erasure codes them, and produces `ExtendedHeader`s to broadcast to the
Celestia DA network.

### **Full Node**

A **full** node is the same as a **light** node, but instead of performing `LightAvailability` (DASing to verify that
a header is legitimate), it performs `FullAvailability`, which downloads enough shares from the network to fully
reconstruct the block and store it, serving shares to the rest of the network.

### **Light Node**

A **light** node listens for `ExtendedHeader`s from the DA network and performs DAS on the received headers.
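
To make the distinction concrete, the two modes can be thought of as two implementations of a single availability
interface. The following Go sketch uses assumed type names (`Root`, `Availability`, the sample count) purely for
illustration and is not the actual celestia-node code:

```go
package availability

import "context"

// Root is a stand-in for the DataAvailabilityHeader a node samples against.
type Root struct {
	RowRoots    [][]byte
	ColumnRoots [][]byte
}

// Availability checks whether the block data behind a header is retrievable.
type Availability interface {
	SharesAvailable(ctx context.Context, root *Root) error
}

// LightAvailability gains probabilistic confidence that the data was
// published by sampling a small, fixed number of random shares.
type LightAvailability struct {
	Samples int
}

func (la *LightAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	// Pick la.Samples random coordinates in the extended square and verify
	// each returned share against the row/column roots (omitted).
	return nil
}

// FullAvailability downloads enough shares to reconstruct the whole block,
// stores it, and serves the shares back to the rest of the network.
type FullAvailability struct{}

func (fa *FullAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	// Download enough of the extended square to reconstruct the block,
	// store it, and serve shares to peers (omitted).
	return nil
}
```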

<hr style="border:2px solid gray"> </hr>

## Context

This ADR describes a design for the March 2022 Celestia Testnet that was decided at the Berlin 2021 offsite. Now that
we have a basic scaffolding and structure for a Celestia node, the focus of the next engineering sprint is to continue
refactoring and improving this structure to include more features (defined later in this document).

<hr style="border:2px solid gray"> </hr>

## Decision

## New Features

### [New node type definitions](https://github.com/celestiaorg/celestia-node/issues/250)
* Introduce a standalone **full** node and rename the current full node implementation to **bridge** node.
* Remove **dev** as a node type and make it a flag on every available node type.

### Introduce bad encoding fraud proofs
Bad encoding fraud proofs will be generated by **full** nodes inside `ShareService`, upon reconstructing a block
via the sampling process.

If fraud is detected, the **full** node will generate the proof and broadcast it to the `FraudSub` gossip network and
will subsequently halt all operations. If no fraud is detected, the **full** node will continue operations without
propagating any messages to the network. Since **full** nodes reconstruct every block, they do not have to listen to
`FraudSub` as they perform the necessary encoding checks on every block.

**Light** nodes, however, will listen to `FraudSub` for bad encoding fraud proofs. **Light** nodes will verify the
fraud proofs against the relevant header hash to ensure that each fraud proof is valid.
If the fraud proof is valid, the node should immediately halt all operations. If it is invalid, the node continues
operating as usual.

Eventually, we may choose to use the reputation tracking system provided by [gossipsub](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#peer-scoring) for nodes that broadcast invalid fraud
proofs to the network, but that is not a requirement for this iteration.
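
As a rough illustration of the light-node side of this protocol, the sketch below assumes hypothetical
`BadEncodingProof` and `Subscription` types; the real proof format and FraudSub wiring are not specified here:

```go
package fraud

import (
	"context"
	"errors"
)

// BadEncodingProof is a hypothetical stand-in for a bad-encoding fraud proof;
// the real proof would carry the shares and Merkle proofs needed to re-run
// the erasure-coding check for one row or column.
type BadEncodingProof struct {
	HeaderHash []byte
}

// Validate looks up the header matching p.HeaderHash and re-runs the
// erasure-coding check against its committed row/column roots (omitted).
func (p *BadEncodingProof) Validate(ctx context.Context) error {
	return errors.New("not implemented")
}

// Subscription abstracts a FraudSub subscription.
type Subscription interface {
	Next(ctx context.Context) (*BadEncodingProof, error)
}

// listenForFraud is the light-node side of the protocol: on a valid proof,
// halt all operations; on an invalid proof, keep operating as usual.
func listenForFraud(ctx context.Context, sub Subscription, halt func()) {
	for {
		proof, err := sub.Next(ctx)
		if err != nil {
			return // subscription closed or context cancelled
		}
		if proof.Validate(ctx) == nil {
			halt() // fraud confirmed against the relevant header
			return
		}
		// Invalid proof: ignore it and continue listening.
	}
}
```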

### [Introduce an RPC structure and some basic APIs](https://github.com/celestiaorg/celestia-node/issues/169)
Implement scaffolding for RPC on all node types, such that a user can access the following methods:

`HeaderAPI`

* `Header(_height_)` -> ExtendedHeader{}
* `Header(_hash_)` -> ExtendedHeader{}

`NodeAPI`

* `P2PInfo()` -> returns a blob of p2p info (can be broken into several subcommands, such as `net_info`)
* `Config()` -> returns the node's config
* `NodeType()` -> returns the node's type (e.g. **full** | **bridge** | **light** )
* `RPCInfo()` -> RPC port, version, available APIs, etc.

`UserAPI`

* `AccountBalance(_acct_)` -> returns balance for given account
* `SubmitTx(_txdata_)` -> submits a transaction to the network

*Note: it is likely more methods will be added, but the above listed are the essential ones for this iteration.*
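
As a sketch, the methods above could be grouped into Go interfaces along the following lines. Since Go has no method
overloading, the two `Header` lookups are split into `HeaderByHeight` and `HeaderByHash`; all names and signatures here
are illustrative assumptions, not the final API:

```go
package rpc

import "context"

// ExtendedHeader is a placeholder for the real header type.
type ExtendedHeader struct{}

type HeaderAPI interface {
	HeaderByHeight(ctx context.Context, height uint64) (*ExtendedHeader, error)
	HeaderByHash(ctx context.Context, hash []byte) (*ExtendedHeader, error)
}

type NodeAPI interface {
	// P2PInfo returns a blob of p2p info; it could be split into
	// subcommands such as net_info.
	P2PInfo(ctx context.Context) (map[string]interface{}, error)
	Config(ctx context.Context) (map[string]interface{}, error)
	// NodeType returns "full", "bridge", or "light".
	NodeType(ctx context.Context) (string, error)
	// RPCInfo returns the RPC port, version, available APIs, etc.
	RPCInfo(ctx context.Context) (map[string]interface{}, error)
}

type UserAPI interface {
	AccountBalance(ctx context.Context, account string) (uint64, error)
	// SubmitTx submits a transaction and returns its hash.
	SubmitTx(ctx context.Context, txData []byte) ([]byte, error)
}
```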

### Introduce `StateService`
`StateService` is responsible for fetching the state a user needs in order to submit a transaction (such as account
balance), preparing the transaction, and propagating it via `TxSub`. **Bridge** nodes will be responsible for listening
to `TxSub` and relaying the transactions into the Core mempool. **Light** and **full** nodes will be able to publish
transactions to `TxSub`, but do not need to listen for them.

Celestia-node's state interaction will be detailed further in a subsequent ADR.
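
A minimal sketch of what `StateService` could look like, assuming hypothetical `StateFetcher` and `TxPublisher`
interfaces; the actual state interaction will be defined in the follow-up ADR:

```go
package state

import (
	"context"
	"errors"
)

var ErrInsufficientBalance = errors.New("state: insufficient balance")

// StateFetcher abstracts fetching the account state needed to build a tx.
type StateFetcher interface {
	AccountBalance(ctx context.Context, account string) (uint64, error)
}

// TxPublisher abstracts publishing a prepared transaction to TxSub; bridge
// nodes listening on TxSub relay these into the Core mempool.
type TxPublisher interface {
	PublishTx(ctx context.Context, tx []byte) error
}

// StateService prepares transactions and propagates them via TxSub.
type StateService struct {
	fetcher   StateFetcher
	publisher TxPublisher
}

// SubmitTx checks that the account can cover the transaction, then publishes
// the prepared transaction; signing and fee handling are omitted.
func (s *StateService) SubmitTx(ctx context.Context, account string, tx []byte, cost uint64) error {
	balance, err := s.fetcher.AccountBalance(ctx, account)
	if err != nil {
		return err
	}
	if balance < cost {
		return ErrInsufficientBalance
	}
	return s.publisher.PublishTx(ctx, tx)
}
```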

### [Data Availability Sampling during `HeaderSync`](https://github.com/celestiaorg/celestia-node/issues/181)

Currently, both **light** and **full** nodes are unable to perform data availability sampling (DAS) while syncing.
They only begin sampling once the node is synced up to the head of the chain.

`HeaderSync` and the `DASer` will be refactored such that the `DASer` can perform sampling on past headers while the
node is syncing. A possible approach would be for the syncing algorithms in both the `DASer` and `HeaderSync` to align
such that headers received during sync are propagated to the `DASer` for sampling via an internal pubsub.

The `DASer` will maintain a checkpoint of the last sampled header so that it can continue sampling from that
checkpoint as new headers arrive.
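
One possible shape for this catch-up loop is sketched below, assuming a hypothetical checkpoint store and header
lookup; it only illustrates the idea of resuming sampling from the last checkpoint:

```go
package das

import "context"

// Header is a stand-in for the ExtendedHeader the DASer samples.
type Header struct {
	Height uint64
}

// Sampler performs DAS over a single header.
type Sampler interface {
	Sample(ctx context.Context, h *Header) error
}

// CheckpointStore persists the height of the last successfully sampled header.
type CheckpointStore interface {
	Load(ctx context.Context) (uint64, error)
	Save(ctx context.Context, height uint64) error
}

// catchUp samples every header from the stored checkpoint up to the given
// head, updating the checkpoint as it goes, so sampling can run alongside
// HeaderSync and resume after restarts.
func catchUp(ctx context.Context, s Sampler, cs CheckpointStore, getHeader func(uint64) (*Header, error), head uint64) error {
	last, err := cs.Load(ctx)
	if err != nil {
		return err
	}
	for height := last + 1; height <= head; height++ {
		h, err := getHeader(height)
		if err != nil {
			return err
		}
		if err := s.Sample(ctx, h); err != nil {
			return err
		}
		if err := cs.Save(ctx, height); err != nil {
			return err
		}
	}
	return nil
}
```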


<hr style="border:1px solid gray"> </hr>

## Refactoring

### `HeaderService` becomes the main component around which most other services are focused
Initially, we started with `BlockService` as the more "important" component of the devnet architecture, but overlooked
some problems with regard to sync (we initially decided that a Celestia full node would have to be started at the same
time as a Core node).

This led to an issue where we eventually needed to connect to an already-running Core node and sync from it. We were
missing a component to do that, so we implemented `HeaderExchange` over the Core client (wrapping another interface we
had previously created for `BlockService`, called `BlockFetcher`). This had to be done at the last minute because
syncing would not work otherwise, leading to last-minute solutions such as handing both the Celestia **light** and
**full** node a "trusted" hash of a header from the already-running chain so that they could sync from that point and
start listening for new headers.

#### Proposed new architecture: [`BlockService` is only responsible for reconstructing the block from Shares handed to it by the `ShareService`](https://github.com/celestiaorg/celestia-node/issues/251).
Right now, the `BlockService` is in charge of fetching new blocks from the Core node, erasure coding them, generating
the DAH (DataAvailabilityHeader), generating the `ExtendedHeader`, broadcasting the `ExtendedHeader` to the `HeaderSub`
network, and storing the block data (after some validation checks).

Instead, a **full** node will rely on `ShareService` sampling to fetch *enough* shares to reconstruct the block inside
`BlockService`. In contrast, a **bridge** node will not do block reconstruction via sampling, but will instead rely on
the `header.CoreSubscriber` implementation of `header.Subscriber` for blocks. `header.CoreSubscriber` will handle
listening for new block events from the Core node via RPC, erasure code the new block, generate the `ExtendedHeader`,
and pipe the erasure-coded block through to `BlockService` via an internal subscription.
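
The sketch below illustrates how `header.CoreSubscriber` could fan a new Core block out to both the header and block
consumers via channels; the types and channel layout are assumptions for illustration only:

```go
package header

import "context"

// RawBlock and ExtendedBlock stand in for a Core block and its erasure-coded
// counterpart; ExtendedHeader stands in for the real header type.
type RawBlock struct{ Height int64 }
type ExtendedBlock struct{ Height int64 }
type ExtendedHeader struct{ Height int64 }

// CoreSubscriber listens for new block events from the Core node over RPC
// and fans the results out to header and block consumers.
type CoreSubscriber struct {
	newBlocks <-chan *RawBlock       // fed by the Core RPC event subscription
	headers   chan<- *ExtendedHeader // consumed by HeaderSub broadcasting
	blocks    chan<- *ExtendedBlock  // internal subscription consumed by BlockService
}

func (cs *CoreSubscriber) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case raw := <-cs.newBlocks:
			eb := extend(raw)                         // erasure code the new block
			cs.headers <- makeExtendedHeader(raw, eb) // generate the ExtendedHeader
			cs.blocks <- eb                           // pipe the extended block to BlockService
		}
	}
}

func extend(b *RawBlock) *ExtendedBlock { return &ExtendedBlock{Height: b.Height} }

func makeExtendedHeader(b *RawBlock, eb *ExtendedBlock) *ExtendedHeader {
	return &ExtendedHeader{Height: b.Height}
}
```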

### `HeaderSync` optimizations
* Implement disconnect toleration

### Unbonding period handling
The **light** and **full** nodes are currently prone to long-range attacks. To mitigate this, we should introduce an
additional `trustPeriod` variable (equal to the unbonding period) that applies to headers. Suppose a node starts up and
the period between its subjective head and the objective head is greater than the unbonding period; in that case, the
**light** node must no longer trust the subjective head, specifically its `ValidatorSet`. Therefore, instead of syncing
subsequent headers on top of the untrusted subjective head, the node should request a new objective head from the
`trustedPeer` and set it as the new trusted subjective head. This approach follows the Tendermint model for
[light client attack detection](https://github.com/tendermint/spec/blob/master/spec/light-client/detection/detection_003_reviewed.md#light-client-attack-detector).
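
A minimal sketch of the trust-period check, assuming `trustPeriod` equals the unbonding period; the constant value and
function names are illustrative:

```go
package headersync

import "time"

// trustPeriod is assumed to equal the chain's unbonding period; the value
// below is an example only.
const trustPeriod = 21 * 24 * time.Hour

// subjectiveHeadTrusted reports whether the locally stored (subjective) head
// is still within the trust period, i.e. whether its ValidatorSet can still
// be relied upon.
func subjectiveHeadTrusted(subjectiveHeadTime, now time.Time) bool {
	return now.Sub(subjectiveHeadTime) < trustPeriod
}

// Usage sketch: on start-up, if the subjective head is too old, request a
// fresh objective head from the trustedPeer before syncing further.
//
//	if !subjectiveHeadTrusted(head.Time, time.Now()) {
//		head = requestObjectiveHead(trustedPeer)
//	}
```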

<hr style="border:1px solid gray"> </hr>

## Nice to have

### `ShareService` optimizations
* Implement parallelization for retrieving shares by namespace. This
  [issue](https://github.com/celestiaorg/celestia-node/issues/184) is already being worked on.
* NMT/Shares/Namespace storage optimizations (see the sketch after this list):
  * Right now, we prepend 17 additional bytes to each Share. Luckily, for each reason the prepended bytes were added,
    there is an alternative solution: the NMT node type can be determined indirectly, without serializing the type
    itself, by looking at the number of links, and the namespace of the erasured data does not need to be encoded into
    the data itself, since the namespace for each share can be recovered from the inner (non-leaf) nodes of the NMT
    tree.
* Pruning for shares.
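
Both alternatives can be illustrated with a small sketch of a hypothetical NMT node layout; this is an assumption
about how the tree could be restructured, not existing code:

```go
package nmt

// node is a hypothetical NMT node. Inner nodes already commit to the minimum
// and maximum namespace of the subtree beneath them, so leaves would not need
// the namespace prepended to their data.
type node struct {
	minNamespace []byte
	maxNamespace []byte
	children     []*node // empty for leaves
}

// isLeaf derives the node type from the number of links instead of a
// serialized type byte.
func (n *node) isLeaf() bool { return len(n.children) == 0 }

// namespaceRange returns the namespace range covered by the subtree; for a
// leaf the two values coincide, which is enough to recover the namespace of
// an erasured share without storing it inside the share itself.
func (n *node) namespaceRange() (min, max []byte) {
	return n.minNamespace, n.maxNamespace
}
```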


### [Move IPLD from the celestia-node repo into its own repo](https://github.com/celestiaorg/celestia-node/issues/111)
Since the IPLD package is almost entirely separate from the celestia-node implementation, it makes sense to remove it
from the celestia-node repository and maintain it separately. The extraction of IPLD should also include a review and
refactoring, as there are still some legacy components that are no longer necessary, and the documentation also needs
updating.

### Implement additional light node verification logic similar to the Tendermint Light Client Model
At the moment, the syncing logic for **light** nodes is simple in that each header is synced from a single peer.
Instead, the **light** node should double-check headers with a randomly chosen
["witness"](https://github.com/tendermint/tendermint/blob/02d456b8b8274088e8d3c6e1714263a47ffe13ac/light/client.go#L154-L161)
peer other than the primary peer from which it received the header, as described in the
[light client attack detector](https://github.com/tendermint/spec/blob/master/spec/light-client/detection/detection_003_reviewed.md#light-client-attack-detector)
model from Tendermint.
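
A minimal sketch of the witness cross-check, assuming simple `Peer` and hash-comparison interfaces; the full detector
(bisection, evidence handling) follows the Tendermint spec linked above:

```go
package light

import (
	"bytes"
	"context"
	"errors"
	"math/rand"
)

// Peer can serve header hashes by height; a stand-in for a real p2p peer.
type Peer interface {
	HeaderHash(ctx context.Context, height uint64) ([]byte, error)
}

var ErrConflictingHeaders = errors.New("light: primary and witness disagree")

// crossCheck fetches the header at the same height from a randomly chosen
// witness and compares it against the hash received from the primary peer.
func crossCheck(ctx context.Context, primaryHash []byte, height uint64, witnesses []Peer) error {
	if len(witnesses) == 0 {
		return errors.New("light: no witnesses available")
	}
	w := witnesses[rand.Intn(len(witnesses))]
	witnessHash, err := w.HeaderHash(ctx, height)
	if err != nil {
		return err
	}
	if !bytes.Equal(primaryHash, witnessHash) {
		// Conflicting headers: run attack detection / report evidence.
		return ErrConflictingHeaders
	}
	return nil
}
```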
