-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11090 +/- ##
==========================================
- Coverage 65.96% 65.79% -0.18%
==========================================
Files 698 684 -14
Lines 70810 69487 -1323
==========================================
- Hits 46709 45718 -991
+ Misses 21235 20973 -262
+ Partials 2866 2796 -70
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminds of me https://docs.substrate.io/v3/advanced/scale-codec/#compactgeneral-integers :) Too bad that one uses LE
return c.FixedBufferSize(), nil | ||
} | ||
|
||
// EncodeCompactUint64 encodes uint64 values in 2,4,6 or 9 bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you wrote, compact encoding is more suitable for small numbers (e.g. auto-incrementing ids). In which case, we can maybe save the maximum space for smaller ints, namely, for the 2 MSB:
- 00 -> 1 byte, for n<64
- 01 -> 2 bytes, for n<16384
- 10 -> 3 bytes, for n<2^22
- 11 -> 9 bytes for uint64 or 5 bytes for uint32
We'll have a the best space saving for integers up until 2^22 (4million), then it'll be worst than fixed length. And I guess most ids will stay <4M.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the 64 values in byte 1 is too small of an ID space to really worry about. But if we did take this approach I would push the third case to 4 or 5 bytes maybe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that numbers <64 will still be the most used overall, so we should save max space on those.
I would push the third case to 4 or 5 bytes maybe
Could you explain why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this spreadsheet: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c/
When we have just a few values it's kilobytes of storage difference betwen 1 and 2 bytes.
In what cases do you think numbers <64 would be most used? In the SDK we are using uint64's for proposals and votes in the x/group module and those will likely number in the thousands and millions respectively over time.
The cases where I think small numbers will be used quite often are maybe the Market
or Class
id's in the Regen credits module. But still in that module, I expect things like credit batches and orders to number in the thousands if not millions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the estimations! I see your point of KBs of diff for small IDs vs GBs of diff for lots of them.
In what cases do you think numbers <64 would be most used? ... those will likely number in the thousands and millions respectively over time.
Actually million is the order of magnitude I had in mind too for IDs on chain, which is why I proposed 1,2,3,9 (optimized for <4M).
But the million-to-billion range is where 1,2,3,9 really underperforms over other ones. And I think we should maybe consider that range to be plausible too. I'm good with 2,4,6,9 👍, seems like the best compromise overall.
I actually think the values in this PR (2,4,6,9) scale the best for uint64, even though if there are not a lot of entities something like 1,2,3,9 is a bit better for that case. But it is megabytes of storage difference when we have a few entities vs gigabytes of difference when there are a lot. I think more than 1 billion entities is not unrealistic nowadays so that storage difference may someday be meaningful I did a quick comparison of the different choices here: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, okay for 2,4,6,9 for uint64. Left 2 questions
case protoreflect.Uint64Kind, protoreflect.Fixed64Kind: | ||
return Uint64Codec{}, nil | ||
case protoreflect.Uint32Kind: | ||
return CompactUint32Codec{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it easy to make the encoding of N<2**32-1 numbers the same in both CompactUint{32,64}Codec
? I feel like people will be confused if the same number is encoded in 2 different compact ways depending if it's uint32 or uint64.
if we keep 2,4,6,9 for uint64, then only 2,4 modes will be valid for CompactUint32Codec
. It will be less space-performant for sure, but maybe worth the mental overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... these aren't really ever intended to be used by people. it's sort of like a database internal detail which you should only know if you're making a custom implementation. users should just need to know that the database made a performant and correct choice and have some guidance on when to use which data type to get which performance characteristics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's only database-internal. Storage keys relate to proofs, so IBC, UI client libs, light clients etc. Those developers will be re-implementing this, at least once per language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true... but not sure it's worth the compromise. it will mean large uin32's will need 6 bytes...
Going to go ahead and merge this. We can tweak parameters in a follow-up if we find better ones up until when this goes into production |
Description
uint64
values are used in the ORM as auto-incrementing primary keys. Always using 8 bytes for these values is a bit of a waste of space. Unfortunately, varint encoding does not support ordered prefix iteration.This PR introduces a compact, well-ordered variable length encoding for
uint32
anduint64
types.fixed32
andfixed64
integers are still encoded as 4 and 8 byte fixed-length big-endian arrays. With this, users have a choice of encoding based on what type of data they are storing. An auto-incrementing primary key should prefer the variable lengthuint64
whereas a fixed precision decimal might want to usefixed64
.See the golden test updates to see how this reduces key lengths.
This encoding works by using the first two bits to encode the buffer length (4 possible lengths). I'm not sure if my choice of 2,4,6 and 9 bytes is the right choice of 4 lenths for
uint64
- there are many alternate choices. I could have also chosen 3 bits and allowed for 8 possible lengths, but way waste an extra bit? Input on the right design parameters would be appreciated.Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!
to the type prefix if API or client breaking changeCHANGELOG.md
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...
!
in the type prefix if API or client breaking change