---
Title: Improvements to the networking storage read requests
Number: 0
Status: Proposed
Authors:
  - Pierre Krieger
Created: 2023-02-21
Category: Networking
Requires:
Replaces:
---

## Summary

Improve the networking messages that query storage items from the remote, in order to reduce the bandwidth usage and the number of round trips of light clients.

## Motivation

Clients on the Polkadot peer-to-peer network can be divided into two categories: full nodes and light clients. So-called full nodes are nodes that store the content of the chain locally on their disk, while light clients are nodes that don't. In order to access, for example, the balance of an account, a full node can do a disk read, while a light client needs to send a network message to a full node and wait for the full node to reply with the desired value. This reply is in the form of a Merkle proof, which makes it possible for the light client to verify the correctness of the value.

Unfortunately, this network protocol suffers from several issues:

- It is not possible for the querier to check whether a key exists in the storage of the chain except by querying the value of that key. The reply will thus include the value of the key, only for that value to be discarded by the querier, which isn't interested in it. This is a waste of bandwidth.
- It is not possible for the querier to know whether a value in the storage of the chain has been modified between two blocks except by querying this value for both blocks and comparing them. Only a few storage values get modified in a block, and thus most of the time the two values will be equal. This leads to a waste of bandwidth, as the values have to be transferred anyway.
- While it is possible to ask for multiple specific storage keys at the same time, it is not possible to ask for a list of keys that start with a certain prefix. Due to the way FRAME works, storage keys are grouped by "prefix"; for example, all account balances start with the same prefix. It is thus a common necessity for a light client to obtain the list of all keys (and possibly their values) that start with a specific prefix. This is currently not possible except by performing multiple queries serially that "walk down" the trie.

Once Polkadot and Kusama have transitioned to `state_version = 1`, which modifies the format of the trie entries, it will be possible to generate Merkle proofs that contain only the hashes of values in the storage. This makes it possible to prove the existence of a key without sending its entire value (only its hash), and to prove that a value has or hasn't changed between two blocks (by sending just the hashes).

Thus, the only reason the aforementioned issues exist is that the existing networking messages don't give the querier the possibility to request this information. This is what this proposal aims to fix.

## Detailed Solution design

The protobuf schema of the networking protocol can be found here: https://github.com/paritytech/substrate/blob/5b6519a7ff4a2d3cc424d78bc4830688f3b184c0/client/network/light/src/schema/light.v1.proto

The proposal is to modify this protocol as follows:

```diff
@@ -11,6 +11,7 @@ message Request {
 		RemoteReadRequest remote_read_request = 2;
 		RemoteReadChildRequest remote_read_child_request = 4;
 		// Note: ids 3 and 5 were used in the past. It would be preferable to not re-use them.
+		RemoteReadRequestV2 remote_read_request_v2 = 6;
 	}
 }

@@ -48,6 +49,21 @@ message RemoteReadRequest {
 	repeated bytes keys = 3;
 }

+message RemoteReadRequestV2 {
+	required bytes block = 1;
+	optional ChildTrieInfo child_trie_info = 2; // Read from the main trie if missing.
+	repeated Key keys = 3;
+	optional bytes onlyKeysAfter = 4;
+	optional bool onlyKeysAfterIgnoreLastNibble = 5;
+}
+
+message ChildTrieInfo {
+	enum ChildTrieNamespace {
+		DEFAULT = 1;
+	}
+
+	required bytes hash = 1;
+	required ChildTrieNamespace namespace = 2;
+}
+
 // Remote read response.
 message RemoteReadResponse {
 	// Read proof. If missing, indicates that the remote couldn't answer, for example because
@@ -65,3 +81,8 @@ message RemoteReadChildRequest {
 	// Storage keys.
 	repeated bytes keys = 6;
 }
+
+message Key {
+	required bytes key = 1;
+	optional bool skipValue = 2; // Defaults to `false` if missing
+	optional bool includeDescendants = 3; // Defaults to `false` if missing
+}
```

Note that the field names aren't very important, as they are not sent over the wire. They can be changed at any time without any consequence. I would invite people to not discuss these field names, as they are implementation details.

This diff adds a new type of request (`RemoteReadRequestV2`).

The new `child_trie_info` field in the request makes it possible to specify which trie the request concerns. The current networking protocol uses two different structs (`RemoteReadRequest` and `RemoteReadChildRequest`) for main trie and child trie queries, while this new request makes it possible to query either. This change doesn't fix any of the issues mentioned in the previous section, but is a side change done for simplicity.

An alternative could have been to specify the `child_trie_info` for each individual `Key`. However, this would make it necessary to send the child trie hash many times over the network, which would waste bandwidth and, in my opinion, make things more complicated for no actual gain. If a querier wants to access more than one trie at the same time, it can always send one query per trie in parallel, as in the sketch below.

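To illustrate the shape of these requests, here is a minimal sketch in Rust. The structs are hand-written mirrors of the protobuf messages above, written for this example only (a real implementation would use protobuf-generated types), and the `:code` key and the flag choices are arbitrary examples:

```rust
// Hand-written mirrors of the protobuf messages above, for illustration only.
struct ChildTrieInfo {
    hash: [u8; 32],
    namespace: ChildTrieNamespace,
}

enum ChildTrieNamespace {
    Default = 1,
}

struct Key {
    key: Vec<u8>,
    skip_value: bool,          // defaults to `false` in the protobuf schema
    include_descendants: bool, // defaults to `false` in the protobuf schema
}

struct RemoteReadRequestV2 {
    block: [u8; 32],
    child_trie_info: Option<ChildTrieInfo>, // `None` means: read from the main trie
    keys: Vec<Key>,
    only_keys_after: Option<Vec<u8>>,
    only_keys_after_ignore_last_nibble: bool,
}

// A querier that needs both a main-trie key and a child-trie key sends two
// requests in parallel, one per trie.
fn example_requests(
    block: [u8; 32],
    child_trie: [u8; 32],
) -> (RemoteReadRequestV2, RemoteReadRequestV2) {
    let main = RemoteReadRequestV2 {
        block,
        child_trie_info: None,
        keys: vec![Key {
            key: b":code".to_vec(),
            skip_value: true,
            include_descendants: false,
        }],
        only_keys_after: None,
        only_keys_after_ignore_last_nibble: false,
    };
    let child = RemoteReadRequestV2 {
        block,
        child_trie_info: Some(ChildTrieInfo {
            hash: child_trie,
            namespace: ChildTrieNamespace::Default,
        }),
        keys: vec![Key {
            key: Vec::new(),
            skip_value: true,
            include_descendants: true,
        }],
        only_keys_after: None,
        only_keys_after_ignore_last_nibble: false,
    };
    (main, child)
}
```
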
If `skipValue` is `true` for a `Key`, then the value associated with this key isn't important to the querier, and the replier is encouraged to replace the value with its hash, provided that the storage item has a `state_version` equal to 1. If the storage value has a `state_version` equal to 0, then the optimization isn't possible and the replier should behave as if `skipValue` was `false`.

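A rough sketch of the replier-side decision follows. The hash function is passed in as a parameter to keep the example self-contained (Polkadot uses Blake2-256), and the detail that `state_version = 1` only replaces sufficiently large values with their hash is glossed over:

```rust
// What the proof ends up containing for a given storage value.
enum ProofValue {
    Full(Vec<u8>),
    HashOnly([u8; 32]),
}

fn value_for_proof(
    value: Vec<u8>,
    state_version: u8,
    skip_value: bool,
    blake2_256: impl Fn(&[u8]) -> [u8; 32], // stand-in for the trie's hash function
) -> ProofValue {
    if skip_value && state_version == 1 {
        // With state_version = 1 the trie commits to the hash of the value,
        // so sending only the hash keeps the proof verifiable.
        ProofValue::HashOnly(blake2_256(&value))
    } else {
        // The optimization isn't possible: behave as if skipValue was false.
        ProofValue::Full(value)
    }
}
```
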
If `includeDescendants` is `true` for a `Key`, then the replier must also include in the proof all keys that are descendants of the given key (in other words, its children, the children of its children, and so on). It must do so even if `key` itself doesn't have any storage value associated with it. The values of all of these descendants are replaced with their hashes if `skipValue` is `true`, similarly to `key` itself.

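In terms of keys, "descendant" simply means a strictly longer key that starts with the given key. A minimal sketch of that check, operating on whole bytes (a real implementation walks the trie nibble by nibble):

```rust
// A key is a descendant of `parent` if it is strictly longer than `parent`
// and starts with it.
fn is_descendant(parent: &[u8], key: &[u8]) -> bool {
    key.len() > parent.len() && key.starts_with(parent)
}
```
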
The optional `onlyKeysAfter` and `onlyKeysAfterIgnoreLastNibble` fields can provide a lower bound for the keys contained in the proof. The responder must not include in its proof any node whose key is strictly inferior to the value in `onlyKeysAfter`. If `onlyKeysAfterIgnoreLastNibble` is provided, then the last 4 bits of `onlyKeysAfter` must be ignored. This makes it possible to represent a trie branch node whose key doesn't have an even number of nibbles. If no `onlyKeysAfter` is provided, it is equivalent to being empty, meaning that the response must start with the root node of the trie.

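A minimal sketch of this lower-bound check, comparing keys as sequences of nibbles; the function names are made up for this example:

```rust
// Split a key into nibbles (the 4-bit halves of each byte), which is how the
// trie addresses keys internally.
fn to_nibbles(key: &[u8]) -> Vec<u8> {
    key.iter().copied().flat_map(|byte| [byte >> 4, byte & 0x0f]).collect()
}

// Returns whether `key` is allowed to appear in the proof given the bound.
fn passes_lower_bound(key: &[u8], only_keys_after: &[u8], ignore_last_nibble: bool) -> bool {
    let mut bound = to_nibbles(only_keys_after);
    if ignore_last_nibble {
        // Drop the last 4 bits of the bound, allowing it to designate a
        // branch node whose key has an odd number of nibbles.
        bound.pop();
    }
    // Keys strictly inferior to the bound must not be included.
    to_nibbles(key) >= bound
}
```

Note that comparing `Vec<u8>` values in Rust is lexicographic, which matches the ordering used here.
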
In order for the logic of the `onlyKeysAfter` field to work properly, the proof generation must be deterministic. In other words, the response to a specific request (specifically the proof contained in the response) should always be the same no matter which peer is queried. In order for this to be the case, the response must always contain the keys and nodes in lexicographic order. No ordering was enforced before this proposal; after this proposal, a proof is invalid if it doesn't respect the order.

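For illustration, a verifier could enforce the new ordering rule with a check along these lines (a sketch; real verification would happen while decoding the proof):

```rust
// A proof is now invalid unless its keys appear in strictly increasing
// lexicographic order.
fn keys_are_ordered(keys: &[Vec<u8>]) -> bool {
    keys.windows(2).all(|pair| pair[0] < pair[1])
}
```
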
If `onlyKeysAfterIgnoreLastNibble` is missing, it is equivalent to `false`. If `onlyKeysAfterIgnoreLastNibble` is `true` and `onlyKeysAfter` is missing or empty, then the request is invalid.

For the purpose of this networking protocol, it should be considered as if the main trie contained an entry for each default child trie, whose key is `concat(":child_storage:default:", child_trie_hash)` and whose value is equal to the trie root hash of that default child trie (see the sketch below). This behavior is consistent with what the host functions observe when querying the storage. It is already present in the existing networking protocol; in other words, this proposal doesn't change anything about the situation, but it is worth mentioning.

Also note that child tries aren't considered as descendants of the main trie when it comes to the `includeDescendants` flag. In other words, if the request concerns the main trie, no content coming from child tries is ever sent back.

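A sketch of how the key of such a virtual main-trie entry is constructed (the function name is made up for this example, and the 32-byte hash length is an assumption):

```rust
// Key under which a default child trie appears in the main trie for the
// purpose of this protocol: the ":child_storage:default:" prefix followed by
// the child trie hash. The corresponding value is the child trie's root hash.
fn default_child_trie_key(child_trie_hash: &[u8; 32]) -> Vec<u8> {
    let mut key = b":child_storage:default:".to_vec();
    key.extend_from_slice(child_trie_hash);
    key
}
```
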
## Security considerations

The main security consideration concerns the size of replies and the resources necessary to generate them. It is, for example, easily possible to ask for all keys and values of the chain, a proof which would take a very long time to generate. Since responses to this networking protocol have a maximum size, the replier should truncate proofs that would lead to the response being too large. Note that it is already possible to send a query that leads to a very large reply with the existing network protocol; the only thing that this proposal changes is that it makes such an attack less complicated to perform.

Implementers of the replier side should be careful to detect early on when a reply would exceed the maximum reply size, rather than unconditionally generating a reply, as this could consume a very large amount of CPU, disk I/O, and memory. Existing implementations might currently be accidentally protected from such an attack by the fact that requests have a maximum size, and thus that the list of keys in a query is bounded. After this proposal, this accidental protection would no longer exist.

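A sketch of the early-termination idea, assuming a hypothetical iterator that yields the proof's trie nodes in lexicographic order:

```rust
// Build a proof under a size budget. Instead of generating the full reply and
// truncating it afterwards, stop as soon as the budget is exceeded.
// Returns the (possibly truncated) proof and whether truncation happened.
fn build_proof(
    nodes_in_lexicographic_order: impl Iterator<Item = Vec<u8>>,
    max_reply_size: usize,
) -> (Vec<Vec<u8>>, bool) {
    let mut proof = Vec::new();
    let mut total_size = 0;
    for node in nodes_in_lexicographic_order {
        total_size += node.len();
        if total_size > max_reply_size {
            // Truncated: the querier will resume with `onlyKeysAfter`.
            return (proof, true);
        }
        proof.push(node);
    }
    (proof, false)
}
```
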
It is not possible for the querier to know in advance whether its query will lead to a reply that exceeds the maximum size. If the reply is too large, the replier should send back a truncated proof. The querier should then send the same request again, but with a value of `onlyKeysAfter` equal to the last key of the previous response. Because the proof generation is deterministic, these additional requests can be sent either to the same peer or to a different one.

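Putting it together, the querier-side resumption loop could look like the following sketch, reusing the mirror `RemoteReadRequestV2` struct from the earlier example; the `Response` struct and the `send_request` callback are made up for this example:

```rust
// What a decoded reply might look like in this sketch.
struct Response {
    entries: Vec<(Vec<u8>, Vec<u8>)>, // (key, value or value hash)
    truncated: bool,
}

// Keep re-sending the request, moving `onlyKeysAfter` forward, until the
// replier sends a non-truncated proof.
fn read_all(
    mut request: RemoteReadRequestV2,
    mut send_request: impl FnMut(&RemoteReadRequestV2) -> Response,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut out = Vec::new();
    loop {
        let response = send_request(&request);
        out.extend(response.entries);
        if !response.truncated {
            return out;
        }
        // Resume after the last key of the truncated reply. Determinism means
        // the continuation can be requested from any peer.
        request.only_keys_after = out.last().map(|(key, _)| key.clone());
    }
}
```
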
## Alternatives

I don't really see any alternative way of solving the issues exposed above.

Not doing this change leads to the current situation, where light clients download more data than necessary and perform more networking round trips than necessary.

This is particularly problematic in three situations:

- When querying a large number of keys, such as the list of validators. The query requires around 5 to 6 round trips, where the multiple replies overlap a lot and include the values only for them to be immediately discarded.
- When checking whether storage values have changed between two blocks. There is no way to "watch" a value, except by downloading it at every block.
- At initialization, the light client always needs to download the runtime code, which takes a long time. After this change, it will be possible to cache the runtime code on disk and check whether it has changed since then.

---

An idea here would be to allow: `optional bytes usingRoot = 5;`. This would be the state root, and could allow the server to skip all block hash/number checks.

---

That seems like an implementation detail to me. An implementation needs to find the trie/storage that is requested in its database. If it uses the trie root hash as its database index, then it's indeed faster to provide this trie root hash, but that's really not something the networking should be aware of.

This would also raise the question: what happens if the request is invalid? If an implementation uses the block hash as its database index, then what should it do if the provided root hash doesn't match? It means that some implementations would accept the request and others would reject it, unless you make it mandatory for all implementations to check that the root hash matches the block hash, which would defeat the point of providing this root hash.

---

If the root hash is present, the block hash is not checked, and the same goes for the header. In fact, it should be made mandatory that the block hash/number are not checked when this hash is provided and the client supports using it. I got this idea as a way to resume processing (since this is a network protocol, we can be a bit technical). The issue is that implementers should not be forced to implement it, and could fall back to the block number/hash if it is unsupported. A cleaner design would probably be to pass it in place of the block hash/number (like an enum) and return "unsupported" if the client doesn't implement it or doesn't allow direct state access from a root. (I initially put it next to `onlyKeysAfter`, as the intention is also to resume a query.)

---

I'm very much in favor of not doing this, in the name of simplicity/purity. This really feels like introducing a wart in the design just to save a few hundred microseconds.

---

It depends on the implementation, and for child tries it skips a full top-state query; but yeah, caching can solve this in a pretty non-deterministic way on the client side.

---

An alternative would be to just use a state root instead of the block hash/number (since we are at the network protocol level).

---

Well, actually I cannot implement it due to internals with RocksDB; maybe as a V3 :(