# ADR #003: March 2022 Testnet Celestia Node

<hr style="border:3px solid gray"> </hr>

## Authors

@renaynay @Wondertan

## Changelog

* 2021-11-25: initial draft

<hr style="border:2px solid gray"> </hr>

## Legend

### Celestia DA Network

Refers to the data availability "halo" network created around the Core network.

### **Bridge Node**

A **bridge** node is a **full** node that is connected to a Celestia Core node via RPC. It is either given the remote
address of a running Core node or runs a Core node as an embedded process, but the critical difference is that,
instead of reconstructing blocks by downloading enough shares from the network, it receives headers and blocks directly
from its trusted Core node, validates the blocks, erasure codes them, and produces `ExtendedHeader`s to broadcast to the
Celestia DA network.

### **Full Node**

A **full** node is the same as a **light** node, but instead of performing `LightAvailability` (DASing to verify that
a header is legitimate), it performs `FullAvailability`, which downloads enough shares from the network to fully
reconstruct the block and store it, serving shares to the rest of the network.

### **Light Node**

A **light** node listens for `ExtendedHeader`s from the DA network and performs DAS on the received headers.
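
To make the distinction concrete, the two modes can be thought of as two implementations of a single availability
interface. The following Go sketch uses assumed type names (`Root`, `Availability`, the sample count) purely for
illustration and is not the actual celestia-node code:

```go
package availability

import "context"

// Root is a stand-in for the DataAvailabilityHeader a node samples against.
type Root struct {
	RowRoots    [][]byte
	ColumnRoots [][]byte
}

// Availability checks whether the block data behind a header is retrievable.
type Availability interface {
	SharesAvailable(ctx context.Context, root *Root) error
}

// LightAvailability gains probabilistic confidence that the data was
// published by sampling a small, fixed number of random shares.
type LightAvailability struct {
	Samples int
}

func (la *LightAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	// Pick la.Samples random coordinates in the extended square and verify
	// each returned share against the row/column roots (omitted).
	return nil
}

// FullAvailability downloads enough shares to reconstruct the whole block,
// stores it, and serves the shares back to the rest of the network.
type FullAvailability struct{}

func (fa *FullAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	// Download enough of the extended square to reconstruct the block,
	// store it, and serve shares to peers (omitted).
	return nil
}
```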

<hr style="border:2px solid gray"> </hr>

## Context

This ADR describes a design for the March 2022 Celestia Testnet that was decided at the Berlin 2021 offsite. Now that
we have a basic scaffolding and structure for a Celestia node, the focus of the next engineering sprint is to continue
refactoring and improving this structure to include more features (defined later in this document).

<hr style="border:2px solid gray"> </hr>

## Decision

## New Features

### [New node type definitions](https://github.com/celestiaorg/celestia-node/issues/250)
* Introduce a standalone **full** node and rename the current full node implementation to **bridge** node.
* Remove **dev** as a node type and make it a flag on every available node type.

### Introduce bad encoding fraud proofs
Bad encoding fraud proofs will be generated by **full** nodes inside `ShareService`, upon reconstructing a block
via the sampling process.

If fraud is detected, the **full** node will generate the proof and broadcast it to the `FraudSub` gossip network and
will subsequently halt all operations. If no fraud is detected, the **full** node will continue operations without
propagating any messages to the network. Since **full** nodes reconstruct every block, they do not have to listen to
`FraudSub` as they perform the necessary encoding checks on every block.

**Light** nodes, however, will listen to `FraudSub` for bad encoding fraud proofs. **Light** nodes will verify the
fraud proofs against the relevant header hash to ensure that each fraud proof is valid.
If the fraud proof is valid, the node should immediately halt all operations. If it is invalid, the node continues
operating as usual.

Eventually, we may choose to use the reputation tracking system provided by [gossipsub](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#peer-scoring) for nodes that broadcast invalid fraud
proofs to the network, but that is not a requirement for this iteration.
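
As a rough illustration of the light-node side of this protocol, the sketch below assumes hypothetical
`BadEncodingProof` and `Subscription` types; the real proof format and FraudSub wiring are not specified here:

```go
package fraud

import (
	"context"
	"errors"
)

// BadEncodingProof is a hypothetical stand-in for a bad-encoding fraud proof;
// the real proof would carry the shares and Merkle proofs needed to re-run
// the erasure-coding check for one row or column.
type BadEncodingProof struct {
	HeaderHash []byte
}

// Validate looks up the header matching p.HeaderHash and re-runs the
// erasure-coding check against its committed row/column roots (omitted).
func (p *BadEncodingProof) Validate(ctx context.Context) error {
	return errors.New("not implemented")
}

// Subscription abstracts a FraudSub subscription.
type Subscription interface {
	Next(ctx context.Context) (*BadEncodingProof, error)
}

// listenForFraud is the light-node side of the protocol: on a valid proof,
// halt all operations; on an invalid proof, keep operating as usual.
func listenForFraud(ctx context.Context, sub Subscription, halt func()) {
	for {
		proof, err := sub.Next(ctx)
		if err != nil {
			return // subscription closed or context cancelled
		}
		if proof.Validate(ctx) == nil {
			halt() // fraud confirmed against the relevant header
			return
		}
		// Invalid proof: ignore it and continue listening.
	}
}
```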

### [Introduce an RPC structure and some basic APIs](https://github.com/celestiaorg/celestia-node/issues/169)
Implement scaffolding for RPC on all node types, such that a user can access the following methods:

`HeaderAPI`

* `Header(_height_)` -> ExtendedHeader{}
* `Header(_hash_)` -> ExtendedHeader{}

`NodeAPI`

* `P2PInfo()` -> returns a blob of p2p info (can be broken into several subcommands, such as `net_info`)
* `Config()` -> returns the node's config
* `NodeType()` -> returns the node's type (e.g. **full** | **bridge** | **light** )
* `RPCInfo()` -> RPC port, version, available APIs, etc.

`UserAPI`

* `AccountBalance(_acct_)` -> returns balance for given account
* `SubmitTx(_txdata_)` -> submits a transaction to the network

*Note: it is likely more methods will be added, but the above listed are the essential ones for this iteration.*
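
As a sketch, the methods above could be grouped into Go interfaces along the following lines. Since Go has no method
overloading, the two `Header` lookups are split into `HeaderByHeight` and `HeaderByHash`; all names and signatures here
are illustrative assumptions, not the final API:

```go
package rpc

import "context"

// ExtendedHeader is a placeholder for the real header type.
type ExtendedHeader struct{}

type HeaderAPI interface {
	HeaderByHeight(ctx context.Context, height uint64) (*ExtendedHeader, error)
	HeaderByHash(ctx context.Context, hash []byte) (*ExtendedHeader, error)
}

type NodeAPI interface {
	// P2PInfo returns a blob of p2p info; it could be split into
	// subcommands such as net_info.
	P2PInfo(ctx context.Context) (map[string]interface{}, error)
	Config(ctx context.Context) (map[string]interface{}, error)
	// NodeType returns "full", "bridge", or "light".
	NodeType(ctx context.Context) (string, error)
	// RPCInfo returns the RPC port, version, available APIs, etc.
	RPCInfo(ctx context.Context) (map[string]interface{}, error)
}

type UserAPI interface {
	AccountBalance(ctx context.Context, account string) (uint64, error)
	// SubmitTx submits a transaction and returns its hash.
	SubmitTx(ctx context.Context, txData []byte) ([]byte, error)
}
```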

### Introduce `StateService`
`StateService` is responsible for fetching the state a user needs in order to submit a transaction (such as account
balance), preparing the transaction, and propagating it via `TxSub`. **Bridge** nodes will be responsible for listening
to `TxSub` and relaying the transactions into the Core mempool. **Light** and **full** nodes will be able to publish
transactions to `TxSub`, but do not need to listen for them.

Celestia-node's state interaction will be detailed further in a subsequent ADR.
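
A minimal sketch of what `StateService` could look like, assuming hypothetical `StateFetcher` and `TxPublisher`
interfaces; the actual state interaction will be defined in the follow-up ADR:

```go
package state

import (
	"context"
	"errors"
)

var ErrInsufficientBalance = errors.New("state: insufficient balance")

// StateFetcher abstracts fetching the account state needed to build a tx.
type StateFetcher interface {
	AccountBalance(ctx context.Context, account string) (uint64, error)
}

// TxPublisher abstracts publishing a prepared transaction to TxSub; bridge
// nodes listening on TxSub relay these into the Core mempool.
type TxPublisher interface {
	PublishTx(ctx context.Context, tx []byte) error
}

// StateService prepares transactions and propagates them via TxSub.
type StateService struct {
	fetcher   StateFetcher
	publisher TxPublisher
}

// SubmitTx checks that the account can cover the transaction, then publishes
// the prepared transaction; signing and fee handling are omitted.
func (s *StateService) SubmitTx(ctx context.Context, account string, tx []byte, cost uint64) error {
	balance, err := s.fetcher.AccountBalance(ctx, account)
	if err != nil {
		return err
	}
	if balance < cost {
		return ErrInsufficientBalance
	}
	return s.publisher.PublishTx(ctx, tx)
}
```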

### [Data Availability Sampling during `HeaderSync`](https://github.com/celestiaorg/celestia-node/issues/181)

Currently, both **light** and **full** nodes are unable to perform data availability sampling (DAS) while syncing.
They only begin sampling once the node is synced up to the head of the chain.

`HeaderSync` and the `DASer` will be refactored such that the `DASer` can perform sampling on past headers while the
node is syncing. A possible approach would be for the syncing algorithms in both the `DASer` and `HeaderSync` to align
such that headers received during sync are propagated to the `DASer` for sampling via an internal pubsub.

The `DASer` will maintain a checkpoint of the last sampled header so that it can continue sampling from that
checkpoint as new headers arrive.
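
One possible shape for this catch-up loop is sketched below, assuming a hypothetical checkpoint store and header
lookup; it only illustrates the idea of resuming sampling from the last checkpoint:

```go
package das

import "context"

// Header is a stand-in for the ExtendedHeader the DASer samples.
type Header struct {
	Height uint64
}

// Sampler performs DAS over a single header.
type Sampler interface {
	Sample(ctx context.Context, h *Header) error
}

// CheckpointStore persists the height of the last successfully sampled header.
type CheckpointStore interface {
	Load(ctx context.Context) (uint64, error)
	Save(ctx context.Context, height uint64) error
}

// catchUp samples every header from the stored checkpoint up to the given
// head, updating the checkpoint as it goes, so sampling can run alongside
// HeaderSync and resume after restarts.
func catchUp(ctx context.Context, s Sampler, cs CheckpointStore, getHeader func(uint64) (*Header, error), head uint64) error {
	last, err := cs.Load(ctx)
	if err != nil {
		return err
	}
	for height := last + 1; height <= head; height++ {
		h, err := getHeader(height)
		if err != nil {
			return err
		}
		if err := s.Sample(ctx, h); err != nil {
			return err
		}
		if err := cs.Save(ctx, height); err != nil {
			return err
		}
	}
	return nil
}
```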


<hr style="border:1px solid gray"> </hr>

## Refactoring

### `HeaderService` becomes the main component around which most other services are focused
Initially, we started with `BlockService` as the more "important" component of the devnet architecture, but overlooked
some problems with regard to sync (we initially decided that a Celestia full node would have to be started at the same
time as a Core node).

This led to an issue where we eventually needed to connect to an already-running Core node and sync from it. We were
missing a component to do that, so we implemented `HeaderExchange` over the Core client (wrapping another interface we
had previously created for `BlockService`, called `BlockFetcher`). This had to be done at the last minute because
syncing would not work otherwise, leading to last-minute solutions such as handing both the Celestia **light** and
**full** node a "trusted" hash of a header from the already-running chain so that they could sync from that point and
start listening for new headers.

#### Proposed new architecture: [`BlockService` is only responsible for reconstructing the block from Shares handed to it by the `ShareService`](https://github.com/celestiaorg/celestia-node/issues/251).
Right now, the `BlockService` is in charge of fetching new blocks from the Core node, erasure coding them, generating
the DAH (DataAvailabilityHeader), generating the `ExtendedHeader`, broadcasting the `ExtendedHeader` to the `HeaderSub`
network, and storing the block data (after some validation checks).

Instead, a **full** node will rely on `ShareService` sampling to fetch *enough* shares to reconstruct the block inside
`BlockService`. In contrast, a **bridge** node will not do block reconstruction via sampling, but will instead rely on
the `header.CoreSubscriber` implementation of `header.Subscriber` for blocks. `header.CoreSubscriber` will handle
listening for new block events from the Core node via RPC, erasure code the new block, generate the `ExtendedHeader`,
and pipe the erasure-coded block through to `BlockService` via an internal subscription.
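
The sketch below illustrates how `header.CoreSubscriber` could fan a new Core block out to both the header and block
consumers via channels; the types and channel layout are assumptions for illustration only:

```go
package header

import "context"

// RawBlock and ExtendedBlock stand in for a Core block and its erasure-coded
// counterpart; ExtendedHeader stands in for the real header type.
type RawBlock struct{ Height int64 }
type ExtendedBlock struct{ Height int64 }
type ExtendedHeader struct{ Height int64 }

// CoreSubscriber listens for new block events from the Core node over RPC
// and fans the results out to header and block consumers.
type CoreSubscriber struct {
	newBlocks <-chan *RawBlock       // fed by the Core RPC event subscription
	headers   chan<- *ExtendedHeader // consumed by HeaderSub broadcasting
	blocks    chan<- *ExtendedBlock  // internal subscription consumed by BlockService
}

func (cs *CoreSubscriber) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case raw := <-cs.newBlocks:
			eb := extend(raw)                         // erasure code the new block
			cs.headers <- makeExtendedHeader(raw, eb) // generate the ExtendedHeader
			cs.blocks <- eb                           // pipe the extended block to BlockService
		}
	}
}

func extend(b *RawBlock) *ExtendedBlock { return &ExtendedBlock{Height: b.Height} }

func makeExtendedHeader(b *RawBlock, eb *ExtendedBlock) *ExtendedHeader {
	return &ExtendedHeader{Height: b.Height}
}
```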

### `HeaderSync` optimizations
* Implement disconnect toleration

### Unbonding period handling
The **light** and **full** nodes are currently prone to long-range attacks. To mitigate this, we should introduce an
additional `trustPeriod` variable (equal to the unbonding period) that applies to headers. Suppose a node starts up and
the period between its subjective head and the objective head is greater than the unbonding period; in that case, the
**light** node must no longer trust the subjective head, specifically its `ValidatorSet`. Therefore, instead of syncing
subsequent headers on top of the untrusted subjective head, the node should request a new objective head from the
`trustedPeer` and set it as the new trusted subjective head. This approach follows the Tendermint model for
[light client attack detection](https://github.com/tendermint/spec/blob/master/spec/light-client/detection/detection_003_reviewed.md#light-client-attack-detector).
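
A minimal sketch of the trust-period check, assuming `trustPeriod` equals the unbonding period; the constant value and
function names are illustrative:

```go
package headersync

import "time"

// trustPeriod is assumed to equal the chain's unbonding period; the value
// below is an example only.
const trustPeriod = 21 * 24 * time.Hour

// subjectiveHeadTrusted reports whether the locally stored (subjective) head
// is still within the trust period, i.e. whether its ValidatorSet can still
// be relied upon.
func subjectiveHeadTrusted(subjectiveHeadTime, now time.Time) bool {
	return now.Sub(subjectiveHeadTime) < trustPeriod
}

// Usage sketch: on start-up, if the subjective head is too old, request a
// fresh objective head from the trustedPeer before syncing further.
//
//	if !subjectiveHeadTrusted(head.Time, time.Now()) {
//		head = requestObjectiveHead(trustedPeer)
//	}
```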

<hr style="border:1px solid gray"> </hr>

## Nice to have

### `ShareService` optimizations
* Implement parallelization for retrieving shares by namespace. This
  [issue](https://github.com/celestiaorg/celestia-node/issues/184) is already being worked on.
* NMT/Shares/Namespace storage optimizations (see the sketch after this list):
  * Right now, we prepend 17 additional bytes to each Share. Luckily, for each reason the prepended bytes were added,
    there is an alternative solution: the NMT node type can be determined indirectly, without serializing the type
    itself, by looking at the number of links, and the namespace of the erasured data does not need to be encoded into
    the data itself, since the namespace for each share can be recovered from the inner (non-leaf) nodes of the NMT
    tree.
* Pruning for shares.
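
Both alternatives can be illustrated with a small sketch of a hypothetical NMT node layout; this is an assumption
about how the tree could be restructured, not existing code:

```go
package nmt

// node is a hypothetical NMT node. Inner nodes already commit to the minimum
// and maximum namespace of the subtree beneath them, so leaves would not need
// the namespace prepended to their data.
type node struct {
	minNamespace []byte
	maxNamespace []byte
	children     []*node // empty for leaves
}

// isLeaf derives the node type from the number of links instead of a
// serialized type byte.
func (n *node) isLeaf() bool { return len(n.children) == 0 }

// namespaceRange returns the namespace range covered by the subtree; for a
// leaf the two values coincide, which is enough to recover the namespace of
// an erasured share without storing it inside the share itself.
func (n *node) namespaceRange() (min, max []byte) {
	return n.minNamespace, n.maxNamespace
}
```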


### [Move IPLD from the celestia-node repo into its own repo](https://github.com/celestiaorg/celestia-node/issues/111)
Since the IPLD package is almost entirely separate from the celestia-node implementation, it makes sense to remove it
from the celestia-node repository and maintain it separately. The extraction of IPLD should also include a review and
refactoring, as there are still some legacy components that are no longer necessary, and the documentation also needs
updating.

### Implement additional light node verification logic similar to the Tendermint Light Client Model
At the moment, the syncing logic for **light** nodes is simple in that each header is synced from a single peer.
Instead, the **light** node should double-check headers with a randomly chosen
["witness"](https://github.com/tendermint/tendermint/blob/02d456b8b8274088e8d3c6e1714263a47ffe13ac/light/client.go#L154-L161)
peer other than the primary peer from which it received the header, as described in the
[light client attack detector](https://github.com/tendermint/spec/blob/master/spec/light-client/detection/detection_003_reviewed.md#light-client-attack-detector)
model from Tendermint.
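
A minimal sketch of the witness cross-check, assuming simple `Peer` and hash-comparison interfaces; the full detector
(bisection, evidence handling) follows the Tendermint spec linked above:

```go
package light

import (
	"bytes"
	"context"
	"errors"
	"math/rand"
)

// Peer can serve header hashes by height; a stand-in for a real p2p peer.
type Peer interface {
	HeaderHash(ctx context.Context, height uint64) ([]byte, error)
}

var ErrConflictingHeaders = errors.New("light: primary and witness disagree")

// crossCheck fetches the header at the same height from a randomly chosen
// witness and compares it against the hash received from the primary peer.
func crossCheck(ctx context.Context, primaryHash []byte, height uint64, witnesses []Peer) error {
	if len(witnesses) == 0 {
		return errors.New("light: no witnesses available")
	}
	w := witnesses[rand.Intn(len(witnesses))]
	witnessHash, err := w.HeaderHash(ctx, height)
	if err != nil {
		return err
	}
	if !bytes.Equal(primaryHash, witnessHash) {
		// Conflicting headers: run attack detection / report evidence.
		return ErrConflictingHeaders
	}
	return nil
}
```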
