Introduce trie level cache & recorder #157
Conversation
Started looking. Generally it feels that the recorder and cache are less independent than expected (I mean the recorder acts a bit like a cache).
I am not totally sure why the deterministic change is needed (I imagine the use case is to follow the actual proof size without calling extract_proof, which seems more straightforward this way anyway).
Co-authored-by: cheme <emericchevalier.pro@gmail.com>
Co-authored-by: David <dvdplm@gmail.com>
Co-authored-by: Andronik <write@reusable.software>
@cheme I'm open to working on some improvements to this PR in follow-ups, to make some things easier etc. However, I would really like to finally get to the point where this is merged. There has been a lot of discussion around this, and a lot of downstream projects would finally like to use a working cache in Substrate. There are also things like iteration that still don't use the cache, but that was done on purpose to come to an end. So, please give this a final review so it can be merged.
@bkchr, what do you think of https://github.com/paritytech/trie/compare/bkchr-funny-branch...cheme:bkchr-funny-branch3?expand=1? If we could avoid the new struct, it would be great.
As already discussed in person, I would like to move forward with this. There are several refactorings that could be done on top of this, such as the proposed merging. That said, @cheme could you please give the final approval :)
Will pass again over the full PR tomorrow, first thing in the morning.
Ty :)
So the PR basically allows caching instantiated nodes and adds a key-value cache.
The speed-up over a simple encoded trie node cache should come from not decoding the node multiple times.
Since the instantiated node is an internal trie structure, it indeed looks better to have this in the trie crate (still a rather big update, so the speed improvement must be worth it; I don't remember the numbers well). Note that this trie cache access is the same as node recording, and I feel like both abstractions should be united somehow.
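As a toy illustration of the speed-up argument above (decode once, reuse thereafter): the `DecodedNode`, `NodeCache`, and `decode_count` names below are made up for this example and are not trie-db code.

```rust
use std::collections::HashMap;

// Toy stand-ins: a "decoded node" and a decode step whose cost we count.
#[derive(Clone)]
struct DecodedNode {
    children: Vec<u8>,
}

fn decode(encoded: &[u8], decode_count: &mut usize) -> DecodedNode {
    *decode_count += 1; // in the real crate, decoding is the expensive part
    DecodedNode { children: encoded.to_vec() }
}

// A minimal node cache keyed by node hash: decode once, reuse thereafter.
struct NodeCache {
    nodes: HashMap<u64, DecodedNode>,
}

impl NodeCache {
    fn get_or_insert(
        &mut self,
        hash: u64,
        encoded: &[u8],
        decode_count: &mut usize,
    ) -> &DecodedNode {
        self.nodes
            .entry(hash)
            .or_insert_with(|| decode(encoded, decode_count))
    }
}

fn main() {
    let mut cache = NodeCache { nodes: HashMap::new() };
    let mut decode_count = 0;
    let encoded = [1u8, 2, 3];
    // Three accesses to the same node hash decode only once.
    for _ in 0..3 {
        let node = cache.get_or_insert(42, &encoded, &mut decode_count);
        assert_eq!(node.children, vec![1, 2, 3]);
    }
    assert_eq!(decode_count, 1);
    println!("decoded {} time(s) for 3 accesses", decode_count);
}
```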
Having the key-value cache in the trie crate is more debatable to me: my opinion is that it should be managed at the Substrate level. What it allows here is to hold a weak reference over the trie node cache, but that should still be doable by adding a new parameter to get_or_insert_node: an optional key when the node contains a value (an optional hash is also needed when the hash is known). This way the trie cache could put the weak reference into the Substrate key-value cache.
There will be an issue for inline nodes: but in that case we can copy the value (e.g. only use the weak reference for values bigger than 32 bytes) and only cache explicitly accessed content. Recorded keys could still be checked by accessing the Substrate key-value cache.
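A minimal std-only sketch of the weak-reference scheme described above (the variable names are illustrative, not the actual API): the value cache holds only a `Weak<[u8]>` into the node cache's `Arc<[u8]>`, so evicting the node drops the value too.

```rust
use std::sync::{Arc, Weak};

fn main() {
    // The node cache owns the value bytes as an Arc<[u8]> ("Bytes").
    let node_value: Arc<[u8]> = Arc::from(&b"some stored value"[..]);

    // The key-value cache only holds a weak reference ("BytesWeak"),
    // so it never keeps the bytes alive on its own.
    let cached: Weak<[u8]> = Arc::downgrade(&node_value);

    // While the node is still cached, the value can be upgraded.
    assert!(cached.upgrade().is_some());

    // Once the node is evicted (the Arc dropped), the weak reference is
    // dead and the lookup must fall back to the backing database.
    drop(node_value);
    assert!(cached.upgrade().is_none());
    println!("weak value dropped with node");
}
```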
But it seems like we are at a point where it will be good to move forward with this PR and ignore these concerns at first (assuming that the speed improvement over a simple encoded trie node cache is worth it).
Certainly the current technical debt of the trie crate is quite high, and this PR will make the situation worse. So I plan on creating the issues discussed in this review (see the draft at the end of this comment).
Whether we should push forward to implement them is a different question, as it is a fair bit of work. Future directions could be:
- keep up with this trie crate
- split from ethereum
- remove rocksdb specifics when it is no longer supported by Substrate
- improve the code (linked issues and possibly rewriting parts of triedbmut and other suitable refactors)
- rewrite. Would not make sense until the first two items above are reached, so does not sound very useful (except if we want more radix support, but I remember that being fairly easily implementable on the current crate)
- look in other directions (I am thinking of a radix tree index in parity-db with merkle hashes attached)
I did spend a bit of time on the third direction previously (got a good read-speed improvement on a PoC branch that only runs a single state); it can be a bit ambitious, so it would probably be better pursued in parallel with the first direction (just maybe not investing too much in code improvement in that case).
This PR is rather impactful, so it would be good to have a second review; I can only think of @arkpar.
Draft of issues to create related to these changes:
- Use a single in-memory node representation
`NodeOwned` and `OwnedNode` from triedbmut are two representations of a node in memory, making the code very redundant and leading to unneeded type conversions.
A single representation should be used.
Branch https://github.com/cheme/trie/commits/bkchr-funny-branch3 starts implementing it but contains other unrelated changes and lacks tests and polish; it can still be a good starting point.
- Add cache to the triedb iterator
Cache updates, as for triedbmut, should be optional.
Additionally, all hash-db accesses should be decorated with cache accesses and trie node recording in a systematic way.
- Merge cache and recorder
Cache and recorder are a bit redundant: both sit on top of hash-db access, and in the implementation there is a need to check the recorder before accessing the value cache.
Both should live under the same struct (the branch starts this), or use the same trait.
Note that it may make sense to have a cache/recorder that does not write to the cache (e.g. for triedbmut and Substrate, values are basically all accessed from a previous read, so refreshing the cache is not really needed, but they must still be recorded).
Branch https://github.com/cheme/trie/commits/bkchr-funny-branch goes in the trait direction, but using a single internal struct may be more suitable (no API changes).
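A hypothetical sketch of the "same trait" direction (the trait and method names here are invented for illustration, not from the branch): a single access trait that a recorder, a cache, or a combined struct could implement, including the record-without-caching mode mentioned above.

```rust
// Hypothetical combined interface: every node access goes through one
// trait, so caching and recording cannot get out of sync.
trait TrieAccess {
    fn on_node_accessed(&mut self, hash: u64, encoded: &[u8]);
}

// A recorder implementation just collects the encoded nodes for a proof.
#[derive(Default)]
struct Recorder {
    recorded: Vec<(u64, Vec<u8>)>,
}

impl TrieAccess for Recorder {
    fn on_node_accessed(&mut self, hash: u64, encoded: &[u8]) {
        self.recorded.push((hash, encoded.to_vec()));
    }
}

// A combined struct forwards to both a cache and a recorder, and can
// run in "record only" mode that does not refresh the cache.
#[derive(Default)]
struct CacheAndRecorder {
    recorder: Recorder,
    write_cache: bool, // false: record only, do not refresh the cache
}

impl TrieAccess for CacheAndRecorder {
    fn on_node_accessed(&mut self, hash: u64, encoded: &[u8]) {
        self.recorder.on_node_accessed(hash, encoded);
        if self.write_cache {
            // a cache update would happen here
        }
    }
}

fn main() {
    let mut access = CacheAndRecorder::default();
    access.on_node_accessed(1, b"node-a");
    access.on_node_accessed(2, b"node-b");
    assert_eq!(access.recorder.recorded.len(), 2);
    println!("recorded {} nodes", access.recorder.recorded.len());
}
```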
- Remove key-value cache
The key-value cache could be managed outside the trie cache, with slightly less fine-grained inline value caching.
- Reduce redundancy in lookup.rs
After [merge cache and recorder] is solved, the code in lookup.rs could certainly be factored to avoid redundancy, by simply always passing the cache/recorder around.
Also, removing the value cache as in [remove key value cache] will in itself remove part of the redundancy here.
Co-authored-by: cheme <emericchevalier.pro@gmail.com>
trie-db/src/node.rs (outdated)
```rust
/// Returns the size in bytes of this node.
```
And all its children?
```rust
/// A record of a visited node.
#[cfg_attr(feature = "std", derive(Debug))]
#[derive(PartialEq, Eq, Clone)]
pub struct Record<HO> {
```
nit: I'd keep the struct. Code using it looks more readable
This PR introduces a trie level cache and recorder. The cache is mainly useful to speed up access to the trie. The recorder inside the trie is required to be able to record proofs while having a cache. For the full reasoning see the following Substrate PR: paritytech/substrate#11407

The first thing this PR does is introduce the builder pattern for constructing a `TrieDB` or `TrieDBMut` instance. This makes it relatively easy to construct one of these instances.

The second big change in this PR is the introduction of the `NodeOwned` type and all other "owned" types. These are required for the cache to keep the data in memory. Before, `Node` only operated on slices of bytes, which resulted in constant decoding of data for every access to the trie. The `NodeOwned` type decodes a node once and then keeps it in memory. Alongside this type we also introduced a simple wrapper around bytes, the `Bytes` type. It is basically an `Arc<[u8]>`. With `Bytes` we also added the `BytesWeak` type. This is especially important for the `CachedValue` type to hold the value without keeping a strong reference to it. This `CachedValue` type is used to store values inside the cache.

The third big change is the introduction of `get_hash`. This function enables a user to get the hash of a value. Together with the cache, it means that we can return this hash without needing to calculate it.

`TrieCache` and `TrieRecorder` are both traits that need to be implemented by downstream users. For the trie recorder there exists one simple implementation inside the `trie-db` crate. This implementation works and can be used, but for Substrate there exists a much more sophisticated implementation to provide better performance etc. As the recorder is now at the trie level, it may be that we missed some situations to record an access. However, for Substrate we support all required code paths (hopefully).

`TrieCache` is expected to store `NodeOwned` and `CachedValue`. To ensure that we don't waste memory and that `NodeOwned` stays the owner of the values, `CachedValue` only stores a `BytesWeak`. This enables a downstream implementation (like Substrate) to have a bounded cache implementation (bounded by maximum memory usage). The idea is that the `NodeOwned`s are in an LRU cache, and when they are evicted, the corresponding `CachedValue` can still be in a different LRU cache, but without holding the entire value in memory.
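The builder pattern for constructing a trie instance can be sketched roughly as follows. The type and method names (`TrieDBBuilder`, `with_cache`, `with_recorder`) mirror the shape described in this PR, but the types below are self-contained stand-ins, not the actual `trie-db` API (which also takes a backing database and state root).

```rust
// Stand-in types sketching the builder shape described above.
struct Cache;
struct Recorder;

struct TrieDB<'a> {
    cache: Option<&'a mut Cache>,
    recorder: Option<&'a mut Recorder>,
}

struct TrieDBBuilder<'a> {
    cache: Option<&'a mut Cache>,
    recorder: Option<&'a mut Recorder>,
}

impl<'a> TrieDBBuilder<'a> {
    fn new() -> Self {
        Self { cache: None, recorder: None }
    }

    // Cache and recorder are both optional: each method stores the
    // mutable reference and returns the builder for chaining.
    fn with_cache(mut self, cache: &'a mut Cache) -> Self {
        self.cache = Some(cache);
        self
    }

    fn with_recorder(mut self, recorder: &'a mut Recorder) -> Self {
        self.recorder = Some(recorder);
        self
    }

    fn build(self) -> TrieDB<'a> {
        TrieDB { cache: self.cache, recorder: self.recorder }
    }
}

fn main() {
    let mut cache = Cache;
    let mut recorder = Recorder;
    let trie = TrieDBBuilder::new()
        .with_cache(&mut cache)
        .with_recorder(&mut recorder)
        .build();
    assert!(trie.cache.is_some() && trie.recorder.is_some());
    println!("built with cache and recorder");
}
```

Passing the cache and recorder as mutable borrows keeps their ownership with the caller, matching the idea that downstream users (like Substrate) own and bound these structures.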