
Reapply 3515: add tpu client next to sts #4758

Conversation

@KirillLykov commented Feb 3, 2025

Problem

Introduce TpuClientNextClient to the SendTransactionService (STS), so that the user can choose which client to use: ConnectionCacheClient or TpuClientNextClient. The default is ConnectionCacheClient, which is therefore the one used in the RPC service, meaning there are no changes to validator behavior.

Both clients now implement NotifyKeyUpdate. This is needed for the client to pick up a new validator identity when the corresponding admin RPC call is sent. It is not used as part of these changes, because validator.rs currently uses ConnectionCache directly and the notify handling is done there.
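
For orientation, here is a minimal sketch of the idea of letting STS pick between the two clients behind one interface. The trait name TransactionClient, the method send_transactions_in_batch, and the build_client helper are simplifications assumed for illustration; the actual trait and constructors in the PR differ in signature and detail.

use std::sync::Arc;

// Simplified stand-in for the client abstraction STS sends through.
trait TransactionClient {
    fn send_transactions_in_batch(&self, wire_transactions: Vec<Vec<u8>>);
}

struct ConnectionCacheClient;
struct TpuClientNextClient;

impl TransactionClient for ConnectionCacheClient {
    fn send_transactions_in_batch(&self, _wire_transactions: Vec<Vec<u8>>) {
        // forward through the existing ConnectionCache-based path
    }
}

impl TransactionClient for TpuClientNextClient {
    fn send_transactions_in_batch(&self, _wire_transactions: Vec<Vec<u8>>) {
        // forward through the tpu-client-next scheduler
    }
}

fn build_client(use_tpu_client_next: bool) -> Arc<dyn TransactionClient + Send + Sync> {
    if use_tpu_client_next {
        Arc::new(TpuClientNextClient)
    } else {
        // Default, so existing RPC/validator behavior is unchanged.
        Arc::new(ConnectionCacheClient)
    }
}

fn main() {
    // Hypothetical flag; in the PR the default keeps ConnectionCacheClient.
    let client = build_client(false);
    client.send_transactions_in_batch(vec![vec![0u8; 1]]);
}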

This PR partially adds back reverted PR #3515.

Summary of Changes

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 3be592a to 96192fd Compare February 5, 2025 19:10
@KirillLykov KirillLykov marked this pull request as ready for review February 5, 2025 19:10
@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 96192fd to f20cc08 Compare February 5, 2025 19:12
@KirillLykov (Author)

CI breaks for unrelated reasons; I fixed the CI problem in this PR: #4809

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from f20cc08 to e417014 Compare February 6, 2025 09:20
@bw-solana left a comment

Left some comments along the way.

It's a little tough to review some of the function implementations without knowing exactly how they will be used/get called

leader_forward_count: usize,
) -> ConnectionWorkersSchedulerConfig {
ConnectionWorkersSchedulerConfig {
bind: SocketAddr::new(Ipv4Addr::new(0, 0, 0, 0).into(), 0),


CC @alexpyattaev. Might want to consider how this gets set up in a multi-homed design.

@KirillLykov (Author)

There are several places where we bind the client on 0.0.0.0; I think we need to come up with a unified approach for all of them. If we change it to 127.0.0.1, some unit tests fail (not in this crate, but in ForwardingStage); I haven't checked on this yet.
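
As a rough illustration of such a unified approach, one option is a single helper that every client calls to obtain its bind address. This is only a sketch; the helper name and shape are assumptions, not something from this PR.

use std::net::{IpAddr, Ipv4Addr, SocketAddr};

// Hypothetical helper: one place deciding which local address client
// sockets bind to, instead of hard-coding 0.0.0.0 at every call site.
fn client_bind_address(bind_ip: Option<IpAddr>) -> SocketAddr {
    // Port 0 lets the OS pick an ephemeral port; the IP falls back to the
    // unspecified address (0.0.0.0) unless the caller overrides it.
    SocketAddr::new(bind_ip.unwrap_or(IpAddr::V4(Ipv4Addr::UNSPECIFIED)), 0)
}

fn main() {
    assert_eq!(client_bind_address(None).to_string(), "0.0.0.0:0");
}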

@KirillLykov (Author) commented Feb 6, 2025

It's a little tough to review some of the function implementations without knowing exactly how they will be used/get called

Yeah, I understand, but that looks like the price of small PRs.
The final PR which integrates all this code into the validator is #3893.
Why Updater is implemented: to register the client here:

agave/core/src/validator.rs

Lines 1638 to 1640 in 50fd073

if let Some(client_updater) = client_updater {
key_notifies.push(client_updater);
}

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from e417014 to 67adc40 Compare February 7, 2025 16:34
@KirillLykov KirillLykov requested a review from bw-solana February 7, 2025 16:42
@bw-solana left a comment

Logic side seems fine to me. Just one question about back-pressuring.

I think we should split out the cosmetic changes.

}

pub struct ConnectionCacheClient<T: TpuInfoWithSendStatic> {
connection_cache: Arc<ConnectionCache>,
tpu_address: SocketAddr,
tpu_peers: Option<Vec<SocketAddr>>,
leader_info_provider: Arc<Mutex<CurrentLeaderInfo<T>>>,
leader_forward_count: u64,
leader_fanout: u64,


Similar thing here... I think this naming is clearer, but maybe we can move all the cosmetic stuff into a separate PR.

My apologies. Mea culpa

{
// The channel size represents 8s worth of transactions at a rate of
// 1000 tps, assuming batch size is 64.
let (sender, receiver) = mpsc::channel(128);


Still curious about how backpressuring will work


if we're often sending batch size of 1, then this means we can only buffer ~100ms of traffic at a rate of 1k TPS, right? Is this okay?

Seems like increasing the channel to something like 1k would cost almost nothing and give us a lot of breathing room for fluctuations
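
To make the numbers above concrete, here is a small back-of-the-envelope sketch. The 128-slot capacity comes from the snippet; the batch sizes of 64 and 1 and the 1k TPS rate are the figures quoted in the comments.

fn main() {
    let capacity_batches = 128.0_f64; // bounded channel capacity from the snippet above
    let tps = 1_000.0_f64;

    // With 64-transaction batches, 128 slots hold 128 * 64 = 8192 txs,
    // i.e. roughly 8 seconds of traffic at 1k TPS.
    let secs_batch_64 = capacity_batches * 64.0 / tps;

    // With a batch size of 1, the same 128 slots hold only ~0.13 s of traffic.
    let secs_batch_1 = capacity_batches * 1.0 / tps;

    println!("batch=64: ~{secs_batch_64:.1}s of buffering, batch=1: ~{secs_batch_1:.3}s");
}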

async fn do_update_key(&self, identity: &Keypair) -> Result<(), Box<dyn std::error::Error>> {
let runtime_handle = self.runtime_handle.clone();
let config = Self::create_config(
Some(identity.insecure_clone()),


😬

@KirillLykov (Author) Feb 12, 2025

This is needed because the value is moved inside the spawned task. The NotifyKeyUpdate trait wasn't designed for an agent-based design, which creates all these workarounds.
I'm thinking about whether there is a cheap way to overcome this. I would change NotifyKeyUpdate to take Arc<Keypair> instead of &Keypair, but I'm not sure how invasive that would be.

@KirillLykov (Author)

One way to avoid using insecure_clone is to make the changes in #4985. From a security perspective, insecure_clone in this particular place doesn't make anything worse, but this workaround is proposed just for the sake of never having to use it.

Another option would be to change NotifyKeyUpdate to take Arc<Keypair>, but this is hard because the trait is public.
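
For illustration, here is a sketch of what the Arc-based variant could look like. The trait and impl below are hypothetical; the real NotifyKeyUpdate takes &Keypair, which is what forces the clone when the key must move into a spawned task.

use std::sync::Arc;
use solana_sdk::signature::Keypair;

// Hypothetical Arc-based variant of the key-update hook.
trait NotifyKeyUpdateArc {
    fn update_key(&self, identity: Arc<Keypair>) -> Result<(), Box<dyn std::error::Error>>;
}

struct ExampleClient;

impl NotifyKeyUpdateArc for ExampleClient {
    fn update_key(&self, identity: Arc<Keypair>) -> Result<(), Box<dyn std::error::Error>> {
        // The Arc is moved (or cheaply cloned) into the background task,
        // so no copy of the secret key material is ever made.
        std::thread::spawn(move || {
            let _identity = identity;
            // reconfigure the scheduler with the new identity here (illustrative)
        });
        Ok(())
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    ExampleClient.update_key(Arc::new(Keypair::new()))
}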

};
let (handle, token) = std::mem::take(&mut *lock);
token.cancel();
if let Some(handle) = handle {


could get rid of a layer of nesting with:

let Some(handle) = handle else {
    // comment on why we might not have a handle
    return;
};

error!("Failed to stop scheduler: TpuClientNext task panicked.");
return;
};
let (handle, token) = std::mem::take(&mut *lock);


would be nice to use a consistent name for the cancellation token. I see we call it cancel above and token here

@KirillLykov (Author) commented Feb 12, 2025

if we're often sending batch size of 1, then this means we can only buffer ~100ms of traffic at a rate of 1k TPS, right? Is this okay?
Seems like increasing the channel to something like 1k would cost almost nothing and give us a lot of breathing room for fluctuations

I tested the validator with a meaningful configuration (rpc transaction batch size 1 -> 64 and batch send rate ms 1 -> 10), and with an ingress of 10k txs per slot I never saturated the 128-entry channel.

But with the default values, RPC manages to saturate this channel. Making the channel as large as 1024 would mitigate the problem of a full channel. However, it looks like the overall TPS slowdown is mostly due to the higher cost of handling txs one by one (a 30% TPS slowdown overall compared with the meaningful configuration). So it looks like fixing the channel capacity will not improve overall TPS anyway.
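
For reference, the two settings mentioned here, expressed as a tiny illustrative struct; the field names mirror the discussion and are not necessarily the actual STS Config fields.

// Illustrative only: defaults vs. the "meaningful configuration" used in the test above.
struct BatchingConfig {
    batch_size: usize,       // transactions per batch
    batch_send_rate_ms: u64, // how often a batch is flushed, in ms
}

fn main() {
    let default = BatchingConfig { batch_size: 1, batch_send_rate_ms: 1 };
    let tested = BatchingConfig { batch_size: 64, batch_send_rate_ms: 10 };
    println!(
        "default: {} tx / {} ms, tested: {} tx / {} ms",
        default.batch_size, default.batch_send_rate_ms, tested.batch_size, tested.batch_send_rate_ms
    );
}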

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 7b75c4e to 27f263c Compare February 12, 2025 17:33
@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from a8a2baa to e8d6d3f Compare February 17, 2025 16:20
@KirillLykov (Author)

@bw-solana got rid of insecure_clone

@bw-solana

I tested the validator with a meaningful configuration (rpc transaction batch size 1 -> 64 and batch send rate ms 1 -> 10), and with an ingress of 10k txs per slot I never saturated the 128-entry channel.

But with the default values, RPC manages to saturate this channel. Making the channel as large as 1024 would mitigate the problem of a full channel. However, it looks like the overall TPS slowdown is mostly due to the higher cost of handling txs one by one (a 30% TPS slowdown overall compared with the meaningful configuration). So it looks like fixing the channel capacity will not improve overall TPS anyway.

Should we change the default values?

@bw-solana left a comment

Left one comment about possibly changing the default config to help users get max performance, but overall LGTM

@KirillLykov KirillLykov merged commit 216f827 into anza-xyz:master Feb 19, 2025
58 checks passed
@KirillLykov (Author)

@bw-solana yes, I think we should change them. I created an issue for that a while ago and will maybe get to it soon: #3613
