# ADR #003: March 2022 Testnet Celestia Node

<hr style="border:3px solid gray"> </hr>

## Authors

@renaynay @Wondertan

## Changelog

* 2021-11-25: initial draft

<hr style="border:2px solid gray"> </hr>

## Legend

### Celestia DA Network

Refers to the data availability "halo" network created around the Core network.

### **Bridge Node**

A **bridge** node is a **full** node that is connected to a Celestia Core node via RPC. It is either given the remote
address of a running Core node or runs a Core node as an embedded process. The critical difference is that, instead of
reconstructing blocks by sampling the network for shares, it receives headers and blocks directly from its trusted Core
node, validates the blocks, and produces `ExtendedHeader`s to broadcast to the Celestia DA network.

### **Full Node**

A **full** node is the same as a **light** node, except that instead of performing `LightAvailability` (DASing to
verify that a header is legitimate), it performs `FullAvailability`, which samples the network for enough shares to
fully reconstruct the block and store it, serving shares to the rest of the network.

### **Light Node**

A **light** node listens for `ExtendedHeader`s from the DA network and performs DAS on the received headers.
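
For illustration, the split between the two availability checks could be sketched in Go roughly as follows. The
`Availability` interface and both implementations below are hypothetical stand-ins, not the actual celestia-node API.

```go
package availability

import "context"

// Root is a stand-in for a block's DataAvailabilityHeader (illustrative only).
type Root struct {
	RowRoots    [][]byte
	ColumnRoots [][]byte
}

// Availability abstracts how a node convinces itself that the data behind a
// header is available (hypothetical interface).
type Availability interface {
	SharesAvailable(ctx context.Context, root *Root) error
}

// LightAvailability samples a small random subset of shares: enough to be
// probabilistically convinced that the block data is available.
type LightAvailability struct {
	// sample fetches and verifies a single share at (row, col); plumbing omitted.
	sample func(ctx context.Context, root *Root, row, col int) error
}

func (la *LightAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	for i := 0; i < 16; i++ { // fixed sample count, for illustration only
		row, col := randomCoordinate(root)
		if err := la.sample(ctx, root, row, col); err != nil {
			return err
		}
	}
	return nil
}

// FullAvailability keeps fetching shares until the entire extended block can
// be reconstructed and stored, so it can be served to the rest of the network.
type FullAvailability struct {
	reconstruct func(ctx context.Context, root *Root) error
}

func (fa *FullAvailability) SharesAvailable(ctx context.Context, root *Root) error {
	return fa.reconstruct(ctx, root)
}

func randomCoordinate(root *Root) (row, col int) {
	// Placeholder: a real implementation draws from crypto/rand over the square size.
	return 0, 0
}
```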

<hr style="border:2px solid gray"> </hr>

## Context

This ADR describes the design for the March 2022 Celestia Testnet that we decided on at the Berlin 2021 offsite. Now
that we have basic scaffolding and structure for a Celestia node, the focus of the next engineering sprint is to
continue refactoring and improving this structure to include more features (defined later in this document).


<hr style="border:2px solid gray"> </hr>

## Decision

## New Features

### [New node type definitions](https://github.com/celestiaorg/celestia-node/issues/250)
* Introduce a standalone **full** node and rename current full node implementation to **bridge** node.
* Remove **dev** as a node type and make it a flag on every available node type.

### Introduce bad encoding fraud proofs
Bad encoding fraud proofs will be generated by **full** nodes inside `ShareService` upon reconstructing a block via
the sampling process.

If fraud is detected, the **full** node will generate the proof and broadcast it to the `FraudSub` gossip network and
will subsequently halt all operations. If no fraud is detected, the **full** node will continue operations without
propagating any messages to the network. Since **full** nodes reconstruct every block, they do not have to listen to
`FraudSub` as they perform the necessary encoding checks on every block.

**Light** nodes, however, will listen to `FraudSub` for bad encoding fraud proofs and verify each one against the
relevant header hash to ensure that the fraud proof is valid. If the fraud proof is valid, the node should immediately
halt all operations. If it is invalid, the node continues operating as usual.

Eventually, we may implement a reputation tracking system for nodes that broadcast invalid fraud proofs to the
network, but that is left for later iterations.
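
For illustration, a light node's handling of bad encoding fraud proofs could look roughly like the Go sketch below.
All types and parameters here (`BadEncodingProof`, `ExtendedHeader`, the `proofs` channel standing in for a `FraudSub`
subscription, `headerStore`, `halt`) are hypothetical and not the actual celestia-node API.

```go
package fraud

import (
	"bytes"
	"context"
	"errors"
	"log"
)

// ExtendedHeader and BadEncodingProof are hypothetical stand-ins for the real types.
type ExtendedHeader struct {
	Hash []byte
	DAH  []byte // root of the erasure-coded data square
}

type BadEncodingProof struct {
	HeaderHash []byte
	// The shares and Merkle proofs needed to re-run the encoding check
	// are omitted from this sketch.
}

// Verify checks the proof against the header it claims fraud for; the actual
// re-encoding check is elided here.
func (p *BadEncodingProof) Verify(h *ExtendedHeader) error {
	if !bytes.Equal(p.HeaderHash, h.Hash) {
		return errors.New("fraud proof references a different header")
	}
	// ... re-run the erasure-coding check over the included shares against h.DAH ...
	return nil
}

// ListenForFraud sketches a light node's FraudSub handler: it verifies each
// incoming proof against the locally stored header and halts the node if one
// is valid. `headerStore` stands in for the local header store and `halt`
// for node shutdown.
func ListenForFraud(
	ctx context.Context,
	proofs <-chan *BadEncodingProof,
	headerStore func(hash []byte) (*ExtendedHeader, error),
	halt func(),
) {
	for {
		select {
		case <-ctx.Done():
			return
		case proof := <-proofs:
			header, err := headerStore(proof.HeaderHash)
			if err != nil {
				continue // header not (yet) known; ignored in this sketch
			}
			if err := proof.Verify(header); err != nil {
				log.Printf("discarding invalid fraud proof: %v", err)
				continue // invalid proof: keep operating as usual
			}
			log.Println("valid bad encoding fraud proof received: halting")
			halt()
			return
		}
	}
}
```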

### [Introduce an RPC structure and some basic APIs](https://github.com/celestiaorg/celestia-node/issues/169)
Implement scaffolding for RPC on all node types, such that a user can access the following methods:

`HeaderAPI`

* `Header(_height_)` -> ExtendedHeader{}
* `Header(_hash_)` -> ExtendedHeader{}

`NodeAPI`

* `P2PInfo()` -> returns a blob of p2p info (can be broken into several subcommands, such as `net_info`)
* `Config()` -> returns the node's config
* `NodeType()` -> returns the node's type (e.g. **full** | **bridge** | **light** )
* `RPCInfo()` -> RPC port, version, available APIs, etc.

`UserAPI`

* `AccountBalance(_acct_)` -> returns balance for given account
* `SubmitTx(_txdata_)` -> submits a transaction to the network
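
A rough Go sketch of these APIs as interfaces is below. All names and signatures are illustrative rather than the
final API; in particular, the two `Header` lookups are split into `HeaderByHeight`/`HeaderByHash` here because Go has
no method overloading.

```go
package rpc

import "context"

// Placeholder types; the real definitions live elsewhere in the node.
type (
	ExtendedHeader struct{}
	P2PInfo        struct{}
	Config         struct{}
	RPCInfo        struct{}
	TxResponse     struct{}
)

// HeaderAPI exposes header lookups by height or hash.
type HeaderAPI interface {
	HeaderByHeight(ctx context.Context, height uint64) (*ExtendedHeader, error)
	HeaderByHash(ctx context.Context, hash []byte) (*ExtendedHeader, error)
}

// NodeAPI exposes information about the running node itself.
type NodeAPI interface {
	P2PInfo(ctx context.Context) (*P2PInfo, error)
	Config(ctx context.Context) (*Config, error)
	NodeType(ctx context.Context) (string, error)  // "bridge" | "full" | "light"
	RPCInfo(ctx context.Context) (*RPCInfo, error) // port, version, available APIs
}

// UserAPI exposes user-facing state queries and transaction submission.
type UserAPI interface {
	AccountBalance(ctx context.Context, acct string) (uint64, error)
	SubmitTx(ctx context.Context, txData []byte) (*TxResponse, error)
}
```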

### Introduce `StateService`
`StateService` is responsible for fetching the state a user needs in order to submit a transaction (such as the
account balance), preparing the transaction, and propagating it via `TxSub`. **Bridge** nodes will be responsible for
listening to `TxSub` and relaying the transactions into the Core mempool.
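
Since the details are deferred to a later ADR, the following is only a rough, hypothetical Go sketch of the two sides
described above: the user-facing `StateService` and the bridge-node relay from `TxSub` into the Core mempool. None of
these names are final.

```go
package state

import "context"

// StateService sketches the user-facing responsibilities; the real interface
// will be defined in the follow-up state ADR.
type StateService interface {
	// AccountBalance fetches the state needed to construct a transaction.
	AccountBalance(ctx context.Context, acct string) (uint64, error)
	// SubmitTx prepares a transaction and propagates it via TxSub.
	SubmitTx(ctx context.Context, txData []byte) error
}

// TxRelay sketches the bridge-node side: it reads transactions from TxSub and
// pushes each one into the Core mempool. Both function fields are placeholders
// for the actual gossip and Core RPC plumbing.
type TxRelay struct {
	NextTx          func(ctx context.Context) ([]byte, error)
	BroadcastToCore func(ctx context.Context, tx []byte) error
}

func (r *TxRelay) Run(ctx context.Context) error {
	for {
		tx, err := r.NextTx(ctx)
		if err != nil {
			return err
		}
		if err := r.BroadcastToCore(ctx, tx); err != nil {
			return err
		}
	}
}
```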

Celestia-node's state interaction will be detailed further in a subsequent ADR.

### [Data Availability Sampling during `HeaderSync`](https://github.com/celestiaorg/celestia-node/issues/181)

Currently, both **light** and **full** nodes are unable to perform data availability sampling (DAS) while syncing.
They only begin sampling once the node is synced up to the head of the chain.

`HeaderSync` and the `DASer` will be refactored such that the `DASer` will be able to perform sampling on past headers
as the node is syncing. To do this, the syncing algorithms in both the `DASer` and `HeaderSync` should align so that
headers received during sync will be propagated to the `DASer` for sampling via an internal pubsub.

The `DASer` will maintain a checkpoint of the last sampled header so that it can resume sampling from that
checkpoint whenever new headers arrive.
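
A minimal Go sketch of that flow: the channel stands in for the internal pubsub between `HeaderSync` and the `DASer`,
and all types and fields are illustrative, not the actual implementation.

```go
package das

import "context"

// Header is a placeholder for ExtendedHeader.
type Header struct {
	Height uint64
}

// DASer consumes headers handed over by HeaderSync and samples them,
// remembering the last sampled height as a checkpoint.
type DASer struct {
	headers    <-chan *Header                             // internal pubsub from HeaderSync
	sample     func(ctx context.Context, h *Header) error // DAS a single header
	checkpoint func(height uint64) error                  // persist the last sampled height
	lastHeight uint64
}

func (d *DASer) Run(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			// Persist the checkpoint so sampling can resume from here on restart.
			return d.checkpoint(d.lastHeight)
		case h := <-d.headers:
			if h.Height <= d.lastHeight {
				continue // at or below the checkpoint: already sampled
			}
			if err := d.sample(ctx, h); err != nil {
				return err
			}
			d.lastHeight = h.Height
		}
	}
}
```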


<hr style="border:1px solid gray"> </hr>

## Refactoring

### `HeaderService` becomes the main component around which most other services are focused
Initially, we treated `BlockService` as the more “important” component in the devnet architecture, but we overlooked
some problems with regard to sync (we initially decided that a Celestia full node would have to be started at the same
time as a Core node, which is the reason for embedding the Core node).

This led us to an issue where we eventually needed to connect to an already-running Core node and sync from it. We were
missing a component to do that, so we implemented `HeaderExchange` over the Core client (wrapping another interface we
had previously created for `BlockService` called `BlockFetcher`). We had to do this at the last minute because it would
not work otherwise, leading to a number of hacks and other issues (such as having to hand the Celestia full node a
“trusted” hash of a header from the already-running chain so that it can sync up to that point and start listening for
new headers).

**Proposed new architecture**:

### [`BlockService` is only responsible for reconstructing the block from Shares handed to it by the `ShareService`](https://github.com/celestiaorg/celestia-node/issues/251).
Right now, the `BlockService` is in charge of fetching new blocks from the Core node, erasure coding them, generating
the DAH, generating the `ExtendedHeader`, broadcasting the `ExtendedHeader` to the `HeaderSub` network, and storing the
block data (after some validation checks).

Instead, we should rely on `ShareService` sampling to fetch *enough* shares to reconstruct the block inside
`BlockService`.
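
A hypothetical Go sketch of the proposed split, in which `BlockService` depends only on a narrow slice of
`ShareService` and no longer talks to Core at all (all names here are illustrative):

```go
package block

import "context"

// Placeholder types for the sketch.
type (
	Share []byte
	Root  struct{} // DataAvailabilityHeader stand-in
	Block struct{ Data []Share }
)

// ShareGetter is the slice of ShareService that BlockService would depend on;
// it is expected to return enough shares to repair the extended square.
type ShareGetter interface {
	GetSharesByRoot(ctx context.Context, root *Root) ([]Share, error)
}

// BlockService only turns shares into a block under the proposed design.
type BlockService struct {
	shares ShareGetter
}

func (b *BlockService) GetBlock(ctx context.Context, root *Root) (*Block, error) {
	shares, err := b.shares.GetSharesByRoot(ctx, root)
	if err != nil {
		return nil, err
	}
	// In the real implementation the shares would be erasure-decoded and
	// validated against the DAH before the block is assembled and stored.
	return &Block{Data: shares}, nil
}
```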

### `ShareService` optimizations
* Implement parallelization for retrieving shares by namespace (see the sketch after this list). This
[issue](https://github.com/celestiaorg/celestia-node/issues/184) is already being worked on.
* NMT/Shares/Namespace storage optimizations (**TODO @WONDERTAN**)
* Pruning/GC for shares (**TODO @WONDERTAN**)
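
As referenced in the first bullet above, a minimal Go sketch of parallelizing retrieval by namespace, fetching each
relevant row concurrently; the `rowGetter` function type and the shape of `Root` are assumptions for illustration only.

```go
package share

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Placeholder types for the sketch.
type (
	Share []byte
	Root  struct{ RowRoots [][]byte }
)

// rowGetter stands in for fetching the shares of one namespace within a single
// row of the extended square (NMT proof handling omitted).
type rowGetter func(ctx context.Context, root *Root, row int, nID []byte) ([]Share, error)

// GetSharesByNamespace fetches every row concurrently instead of sequentially.
func GetSharesByNamespace(ctx context.Context, root *Root, nID []byte, getRow rowGetter) ([][]Share, error) {
	out := make([][]Share, len(root.RowRoots))
	g, ctx := errgroup.WithContext(ctx)
	for i := range root.RowRoots {
		i := i // capture the loop variable for the goroutine
		g.Go(func() error {
			shares, err := getRow(ctx, root, i, nID)
			if err != nil {
				return err
			}
			out[i] = shares
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return out, nil
}
```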

### `HeaderSync` optimizations
* Implement disconnect tolerance

### Bonding period handling
(**TODO @WONDERTAN**)

<hr style="border:1px solid gray"> </hr>

## Nice to have

### [Move IPLD from celestia-node repo into its own repo](https://github.com/celestiaorg/celestia-node/issues/111)
Since the IPLD package is pretty much entirely separate from the celestia-node implementation, it makes sense to remove
it from the celestia-node repository and maintain it separately. The extraction of IPLD should also include a review
and refactoring, as there are still some legacy components that are no longer necessary, and the documentation also
needs updating.