Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

updates rewards at epoch boundary using cached accounts #24200

Merged
merged 6 commits into from
Apr 21, 2022

Conversation

behzadnouri
Copy link
Contributor

@behzadnouri behzadnouri commented Apr 8, 2022

Problem

Updating rewards at epoch boundary is slow; and Loading vote and stake accounts from accounts-db takes a significant portion of that computation time.

Summary of Changes

  • This change bypasses accounts-db on the read path and instead uses vote and stake accounts cached in bank stakes:
    https://github.com/solana-labs/solana/blob/d2702201c/runtime/src/stakes.rs#L148-L152
  • These cached accounts are synchronized with accounts-db after each transaction, and so there should not be any change in the resulting computation:
    https://github.com/solana-labs/solana/blob/d2702201c/runtime/src/bank.rs#L4526
    Nevertheless, to avoid any chances of introducing a consensus issue, the switch to cached account is feature gated.
  • In order to maintain backward compatibility, values in stake_delegations map in Stakes struct are now generic. Stakes<Delegation> is equivalent to the old code and is used for backward compatibility in BankFieldsTo{Serialize,Deserialize}.
  • But banks cache Stakes<StakeAccount> which includes the entire stake account and StakeState deserialized from the account. Doing so, will remove the need to load the stake account from accounts-db when working with stake-delegations.
  • The change shows ~30% improvement in update_rewards_with_thread_pool_us metric.

Feature Gate Issue: #24278

@behzadnouri
Copy link
Contributor Author

@jstarry still doing more testings but you may want to take a look in case of comments.

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch 2 times, most recently from b5d9c53 to d1ff7ac Compare April 8, 2022 20:27
@codecov
Copy link

codecov bot commented Apr 10, 2022

Codecov Report

Merging #24200 (d950b39) into master (3155270) will increase coverage by 11.9%.
The diff coverage is 80.6%.

❗ Current head d950b39 differs from pull request most recent head 634b353. Consider uploading reports for the commit 634b353 to get more accurate results

@@             Coverage Diff             @@
##           master   #24200       +/-   ##
===========================================
+ Coverage    70.0%    82.0%    +11.9%     
===========================================
  Files          36      594      +558     
  Lines        2255   164444   +162189     
  Branches      322        0      -322     
===========================================
+ Hits         1580   134891   +133311     
- Misses        560    29553    +28993     
+ Partials      115        0      -115     

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from d1ff7ac to fa6e579 Compare April 11, 2022 19:42
@jeffwashington
Copy link
Contributor

what did you learn?

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from fa6e579 to 00a1173 Compare April 11, 2022 22:01
Copy link
Contributor

@jstarry jstarry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, the serialization compatibility in particular looks really good. I'm hoping that updating StakeAccount a bit will help remove a lot of the fallible methods when converting from StakeAccount to Delegation.

@jstarry
Copy link
Contributor

jstarry commented Apr 12, 2022

@behzadnouri since this PR introduces a new feature gate, please create a new feature tracker issue and link it to this PR, thanks! Details here: https://github.com/solana-labs/solana/blob/master/CONTRIBUTING.md#the-pr--issue-labels

@jstarry jstarry added the feature-gate Pull Request adds or modifies a runtime feature gate label Apr 12, 2022
@behzadnouri
Copy link
Contributor Author

@behzadnouri since this PR introduces a new feature gate, please create a new feature tracker issue and link it to this PR, thanks! Details here: https://github.com/solana-labs/solana/blob/master/CONTRIBUTING.md#the-pr--issue-labels

done.
#24278

@behzadnouri
Copy link
Contributor Author

what did you learn?

@jeffwashington As mentioned in the description, the change shows ~30% improvement in update_rewards_with_thread_pool_us metric. However that was done before some further optimizations I did regarding storing stakes in EpochStakes;

Note that this change will save both on loading accounts (by bypassing accounts-db altogether) and deserializing state from accounts (because the state is already deserialized in these cached accounts). Also does not force evicting other accounts from readonly cache.

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch 7 times, most recently from 4cafe3c to d83c161 Compare April 14, 2022 15:48
@@ -196,6 +196,12 @@ impl From<AccountSharedData> for VoteAccount {
}
}

impl From<VoteAccount> for AccountSharedData {
fn from(account: VoteAccount) -> Self {
Self::from(account.0.account.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wen stable unwrap_or_clone 🥺

};
let stake_state = StakeState::Stake(Meta::example(), Stake::example());
let mut account = Account::example();
account.data.resize(256, 0u8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the extra 56 bytes make a difference in how this thing works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it does not. reduced the size to 196 bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size_of<StakeState>() is 200, fwiw

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe because of struct paddings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's some padding, I added a constant to prevent this issue in the future: #24416

// vote_accounts_cache_miss_count is shown to be always zero.
let account = self.get_account_with_fixed_root(vote_pubkey)?;
if account.owner() == &solana_vote_program
&& VoteState::deserialize(account.data()).is_ok()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we reuse this work to construct the VoteAccount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code already has comments here that this is only for sanity check and this path should never return a valid vote-account. My intention is to remove this piece all together once it is confirmed vote_accounts_cache_miss_count is always zero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri I missed this earlier but I think that the code here will produce false positives for the vote_accounts_cache_miss_count counter. We only add vote accounts to the cache if they are initialized so it's possible that an uninitialized vote account has a stake delegation and deserializes correctly. If that happens, this counter would indicate that we had a cache miss but the cache was actually correct. This all stems from the fact that the vote state doesn't have an intuitive "uninitialized" state. Best we can do is check if the authorized voter is non-zero:

pub fn is_uninitialized(&self) -> bool {
match self {
VoteStateVersions::V0_23_5(vote_state) => {
vote_state.authorized_voter == Pubkey::default()
}
VoteStateVersions::Current(vote_state) => vote_state.authorized_voters.is_empty(),
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then shouldn't this code also check if the vote-account is initialized?
https://github.com/solana-labs/solana/blob/30d2b112e/runtime/src/bank.rs#L2848-L2888

Seems to me either we should skip uninitialized vote accounts there as well, or store them in the cache as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. I brought this up because I think this code needs to check for initialization but it isn't doing that right now.

t-nelson
t-nelson previously approved these changes Apr 15, 2022
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! just a nit

Err(stake_account::Error::InvalidDelegation(_)) => {
// This should not happen because StakeAccount<()>
// does not check if stake state is a Delegation.
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log if it happens anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

let (account, _slot) = bank_rc.accounts.load_with_fixed_root(&ancestors, pubkey)?;
Some(account)
})
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic if the stakes cache and accountsdb are inconsistent right? Seems like we should should rebuild the cache in this case, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it is a major logic bug if these are inconsistent, and can indicate bugs in the cache or accounts-db or a corrupted snapshot.
I think having this panic here will actually help to detect such inconsistencies earlier.
Should this actually happen, I think the best would be to hold on going down this path until we can debug this inconsistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having this panic here will actually help to detect such inconsistencies earlier.

This is true, can you change the unwrap to an expect with a message describing that it "can indicate bugs in the cache or accounts-db or a corrupted snapshot"

Should this actually happen, I think the best would be to hold on going down this path until we can debug this inconsistency.

You mean try this on our nodes first to see if it's happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change the unwrap to an expect with a message describing that it "can indicate bugs in the cache or accounts-db or a corrupted snapshot"

done.

Should this actually happen, I think the best would be to hold on going down this path until we can debug this inconsistency.

You mean try this on our nodes first to see if it's happening?

For all nodes which submit metrics we can monitor panics through the panic hook:
https://github.com/solana-labs/solana/blob/6c5a3ca4a/metrics/src/metrics.rs#L442-L454

So any node which runs the master branch, ours or not ours, will do. And then testnet nodes once this is backported to testnet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, sounds good

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from abb347f to 7757c08 Compare April 16, 2022 16:23
@mergify mergify bot dismissed t-nelson’s stale review April 16, 2022 16:23

Pull request has been modified.

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from 7757c08 to d10c2f9 Compare April 19, 2022 13:13
@behzadnouri
Copy link
Contributor Author

@carllin @ryoqun It will be great if you can also take a look in case I am missing something. thanks.

@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from d10c2f9 to 20ab672 Compare April 19, 2022 20:06
@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from 20ab672 to d950b39 Compare April 20, 2022 16:12
jstarry
jstarry previously approved these changes Apr 21, 2022
@behzadnouri behzadnouri force-pushed the cache-stake-accounts-lite branch from d950b39 to 56b9a41 Compare April 21, 2022 13:14
@mergify mergify bot dismissed jstarry’s stale review April 21, 2022 13:14

Pull request has been modified.

The added type does sanity checks on the Account and stores deserialized
StakeState. Following commits will use this type instead of Delegation
in Stakes.
The commit makes values in stake_delegations map in Stakes struct
generic. Stakes<Delegation> is equivalent to the old code and is used
for backward compatibility in BankFieldsTo{Serialize,Deserialize}.

But banks cache Stakes<StakeAccount> which includes the entire stake
account and StakeState deserialized from account. Doing so, will remove
the need to load stake account from accounts-db when working with
stake-delegations.
For backward compatibility, we can only serialize and deserialize
Stakes<Delegation>. However Bank caches Stakes<StakeAccount>. This type
mismatch incurs a conversion cost at epoch boundary when updating
EpochStakes.

This commit adds StakesEnum which allows EpochStakes to include either a
Stakes<StakeAccount> or Stakes<Delegation> and so bypass the conversion
cost between the two at the epoch boundary.
Loading vote and stake accounts from accounts-db takes a significant
portion of time updating rewards at epoch boundary.

This commit bypasses accounts-db and instead uses vote and stake
accounts cached in bank stakes:
https://github.com/solana-labs/solana/blob/d2702201c/runtime/src/stakes.rs#L148-L152

These cached accounts are synchronized with accounts-db after each
transaction, and so there should not be any change in the resulting
computation:
https://github.com/solana-labs/solana/blob/d2702201c/runtime/src/bank.rs#L4526

Nevertheless, to avoid any chances of introducing a consensus issue, the
switch to cached account is feature gated.
StakeAccount<Delegation> can only wrap a stake-state which is a
Delegation; whereas StakeAccount<()> wraps any account with stake state.

As a result, StakeAccount::<Delegation>::delegation() will return
Delegation instead of Option<Delegation>.
Tracking mismatches between cached vote/stake accounts and accounts-db
in preparation of activating feature for updating rewards at
epoch-boundary using cached vote/stake accounts.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature-gate Pull Request adds or modifies a runtime feature gate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants