[optimization] Consider generating non-leaf nodes on the fly without storing #271

Wondertan · 2021-12-07T15:44:01Z

Try to imagine the request flow starting from the data root down to the leaves, and what the logic would look like for a responder that only has leaves.

Here is the naive way:

Case 1) requesting all data for a height, for instance:
Just return all leaves. The requester can recompute everything locally and compare against the root.
Case 2) requesting one leaf (e.g. by index):
Load the block of that height (all leaves), recompute the inner nodes of the corresponding Merkle tree, and respond with the leaf + the corresponding inclusion proof.
Case 3) requesting all leaves of one namespace:
Like in 2), but return a bunch of leaves together with the corresponding (recomputed) proof.
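
To make case 2 concrete, here is a minimal sketch of a responder that holds only leaves and recomputes the inner nodes per request. It assumes a plain SHA-256 binary Merkle tree with a power-of-two number of leaves and RFC 6962-style domain-separation prefixes; the real trees here are namespaced Merkle trees, so this is not the actual node format. All function names (`leaf_hash`, `node_hash`, `build_levels`, `prove`, `verify`) are made up for illustration.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # 0x00 prefix separates leaf hashes from inner-node hashes (assumption).
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # 0x01 prefix marks an inner node (assumption).
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves):
    """Recompute every tree level, from the leaf hashes up to the root."""
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two number of leaves")
    level = [leaf_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(leaves, index):
    """Responder side: return (root, sibling hashes from bottom to top)."""
    levels = build_levels(leaves)
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])  # sibling of the current node
        index //= 2
    return levels[-1][0], proof


def verify(root, leaf, index, proof):
    """Requester side: fold the proof back up and compare against the root."""
    h = leaf_hash(leaf)
    for sibling in proof:
        h = node_hash(sibling, h) if index & 1 else node_hash(h, sibling)
        index //= 2
    return h == root
```

Nothing but the leaves is kept between requests, so every `prove` call pays for hashing the whole tree again. That is the cost the naive approach trades for lower disk usage.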

A less naive approach would acknowledge the fact that always loading all leaves to respond might be undesirable as well. So what could be a middle ground? Only store some inner nodes, e.g. store the tree in packages of smaller subtrees with their roots (aka some inner nodes) and only recompute the missing inner nodes when necessary.
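
A rough sketch of that middle ground, under the same simplifying assumptions as before (plain SHA-256 binary tree, power-of-two sizes, not the namespaced format): only the roots of fixed-size subtrees are stored, and a proof request reads and rehashes a single subtree. `TiledStore`, `tile_size`, and `leaves_read` are hypothetical names, and the in-memory `leaves` list stands in for leaf data on disk.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(hashes):
    """Root of a tree whose bottom level is the given list of hashes."""
    while len(hashes) > 1:
        hashes = [node_hash(hashes[i], hashes[i + 1]) for i in range(0, len(hashes), 2)]
    return hashes[0]


def sibling_path(hashes, index):
    """Sibling hashes from the given bottom level up to (excluding) its root."""
    path = []
    while len(hashes) > 1:
        path.append(hashes[index ^ 1])
        hashes = [node_hash(hashes[i], hashes[i + 1]) for i in range(0, len(hashes), 2)]
        index //= 2
    return path


def verify(root, leaf, index, proof):
    h = leaf_hash(leaf)
    for sibling in proof:
        h = node_hash(sibling, h) if index & 1 else node_hash(h, sibling)
        index //= 2
    return h == root


class TiledStore:
    """Keeps leaves plus one stored inner node (the root) per subtree."""

    def __init__(self, leaves, tile_size):
        tiles = len(leaves) // tile_size
        for n in (tile_size, tiles):
            if n == 0 or n & (n - 1):
                raise ValueError("this sketch needs power-of-two sizes")
        self.leaves = leaves          # stands in for leaf data on disk
        self.tile_size = tile_size
        self.leaves_read = 0          # counts leaves loaded to answer proofs
        self.tile_roots = [
            merkle_root([leaf_hash(l) for l in leaves[i:i + tile_size]])
            for i in range(0, len(leaves), tile_size)
        ]

    def root(self):
        return merkle_root(self.tile_roots)

    def prove(self, index):
        tile = index // self.tile_size
        start = tile * self.tile_size
        tile_leaves = self.leaves[start:start + self.tile_size]
        self.leaves_read += len(tile_leaves)
        # Missing inner nodes inside the subtree are recomputed on the fly...
        lower = sibling_path([leaf_hash(l) for l in tile_leaves], index - start)
        # ...while the upper part of the path comes from the stored roots.
        upper = sibling_path(self.tile_roots, tile)
        return lower + upper
```

With 16 leaves and `tile_size=4`, a proof touches 4 leaves instead of 16 while only 4 inner nodes are stored. Picking the subtree height is the storage vs. IO knob.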

I think the latter is roughly how Trillian and Diem handle tree storage, if I understand correctly. IIRC, Diem treats every 4-level binary tree as one inner node, and in Trillian this is configurable. Both approaches are probably there not only to decrease storage but also to optimize for IO (which matters for large trees).

Tendermint also doesn't store inner nodes for tx data storage at all but can still return inner proof nodes for each tx (it simply recomputes everything each time, though). For the state tree, though, I do think all inner nodes are stored in a nodedb. IIRC, the only attempt to improve IO there is a cache.

The only reason we store everything is that this is how the libraries we use do things, which might not necessarily be the best for our use cases and access patterns. We don't need to worry much about it right now. But we should keep in mind that this might either become an IO bottleneck in the future or cause node operators to complain about disk usage. As far as I remember, we store inner nodes for both rows and columns, which roughly corresponds to a 4x increase relative to the raw data. Additionally, there might be a lot of padding involved to make the squares a power of two, so it's probably fair to say that on average the increase is about 8-10x the actual block data/txs. I suspect that if we don't make this efficient long-term, node operators will simply try to game it and not run the DA part / node at all.

Originally posted by @liamsi in #244 (comment)

liamsi · 2021-12-08T01:42:12Z

Another example of the approach drafted above is the Go transparency log, btw (not surprising, as it is based on Trillian): https://research.swtch.com/tlog#tiling_a_log

adlerjohn · 2021-12-13T04:59:23Z

A few thoughts:

If we're concerned with disk I/O and CPU usage (which are real costs), I think the best option is actually to store the entire branch for each leaf in a single contiguous array. Note that with a 64x64 square, we're looking at 6x32 bytes = 192 bytes per 256-byte share, which is not even doubling storage costs.
If we're concerned with disk storage, then yes, the inner nodes can be recomputed on the fly from just the leaf nodes.
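
The arithmetic behind the 192-byte figure, as a small sketch. It takes the numbers from the comment at face value (32-byte nodes, 256-byte shares, one branch per share) and assumes a power-of-two square width; `branch_overhead` is a made-up helper name.

```python
def branch_overhead(square_width, node_size=32, share_size=256):
    """Return (bytes of stored branch per share, branch bytes / share bytes)."""
    if square_width < 1 or square_width & (square_width - 1):
        raise ValueError("square width must be a power of two")
    depth = square_width.bit_length() - 1   # log2(width) sibling hashes per branch
    branch_bytes = depth * node_size
    return branch_bytes, branch_bytes / share_size


branch_bytes, ratio = branch_overhead(64)
print(branch_bytes, ratio)  # 192 0.75
```

So a share stored together with its branch takes 256 + 192 = 448 bytes, i.e. 1.75x the raw share. Storing a second branch per share (e.g. for the other axis) or using larger nodes would change the ratio accordingly.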

I imagine validator nodes won't even have to serve historical transactions or shares, so they should do the latter. For storage nodes though, or infrastructure nodes that serve lots of requests, the former is probably better.

Wondertan · 2023-06-12T11:24:09Z

Related to #2038

vgonkivs · 2024-10-17T12:13:50Z

Shwap fixed it

Wondertan mentioned this issue Dec 7, 2021

docs/adr: ADR #003 March 2022 Testnet Design for Celestia Node #244

Merged

renaynay changed the title ~~ipld: Consider generating non-leaf nodes on the fly without storing~~ [optimization] Consider generating non-leaf nodes on the fly without storing Apr 26, 2022

renaynay added the area:ipld IPLD plugin label Apr 26, 2022

renaynay added this to Celestia Node Apr 26, 2022

renaynay moved this to TODO in Celestia Node Apr 26, 2022

vgonkivs closed this as completed Oct 17, 2024
