-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reapply 3515: add tpu client next to sts #4758
Reapply 3515: add tpu client next to sts #4758
Conversation
3be592a
to
96192fd
Compare
96192fd
to
f20cc08
Compare
CI breaks for unrelated reasons, fixed CI problem in this PR: #4809 |
f20cc08
to
e417014
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments along the way.
It's a little tough to review some of the function implementations without knowing exactly how they will be used/get called
leader_forward_count: usize, | ||
) -> ConnectionWorkersSchedulerConfig { | ||
ConnectionWorkersSchedulerConfig { | ||
bind: SocketAddr::new(Ipv4Addr::new(0, 0, 0, 0).into(), 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @alexpyattaev . Might want to consider how this gets setup in a multi-home design
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are several places where we bind client on 0.0.0.0
. I think we need to come up with some unified approach to all. If we change it to 127.0.0.1 some unit tests are failing (not in this crate, but in ForwardinStage), haven't checked on this yet.
Yeah, I understand but it looks like the price for the small PRs. Lines 1638 to 1640 in 50fd073
|
e417014
to
67adc40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic side seems fine to me. Just one question about back-pressuring.
I think we should split out the cosmetic changes.
} | ||
|
||
pub struct ConnectionCacheClient<T: TpuInfoWithSendStatic> { | ||
connection_cache: Arc<ConnectionCache>, | ||
tpu_address: SocketAddr, | ||
tpu_peers: Option<Vec<SocketAddr>>, | ||
leader_info_provider: Arc<Mutex<CurrentLeaderInfo<T>>>, | ||
leader_forward_count: u64, | ||
leader_fanout: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar thing here... I think this naming is more clear, but maybe we can move all the cosmetic stuff into a separate PR.
My apologies. Mea culpa
{ | ||
// The channel size represents 8s worth of transactions at a rate of | ||
// 1000 tps, assuming batch size is 64. | ||
let (sender, receiver) = mpsc::channel(128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still curious about how backpressuring will work
{ | ||
// The channel size represents 8s worth of transactions at a rate of | ||
// 1000 tps, assuming batch size is 64. | ||
let (sender, receiver) = mpsc::channel(128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're often sending batch size of 1, then this means we can only buffer ~100ms of traffic at a rate of 1k TPS, right? Is this okay?
Seems like increasing the channel to something like 1k would cost almost nothing and give us a lot of breathing room for fluctuations
async fn do_update_key(&self, identity: &Keypair) -> Result<(), Box<dyn std::error::Error>> { | ||
let runtime_handle = self.runtime_handle.clone(); | ||
let config = Self::create_config( | ||
Some(identity.insecure_clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed because the value is moved inside the spawned task. The NotifyKeyUpdate
trait wasn't designed for agent-based design which creates all these workarounds.
Thinking if there is a cheap way to overcome. I would change the NotifyKeyUpdate to take Arc<Keypair>
instead of &Keypair
but not sure how invasive it could be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to avoid using insecure_clone is to do the following changes: #4985
From security prospective, insecure_clone in this particular place doesn't do anything worse but just for the sake of avoiding this ever, we the workaround is proposed.
Another would be to fix NotifyKeyUpdate
to take Arc<Keypair>
but this is hard to because it is public.
}; | ||
let (handle, token) = std::mem::take(&mut *lock); | ||
token.cancel(); | ||
if let Some(handle) = handle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could get rid of a layer of nesting with:
let Some(handle) = handle else {
// comment on why we might not have a handle
return;
}
error!("Failed to stop scheduler: TpuClientNext task panicked."); | ||
return; | ||
}; | ||
let (handle, token) = std::mem::take(&mut *lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to use a consistent name for the cancellation token. I see we call it cancel
above and token
here
I tested using validator with meaningful configuration (rpc transaction batch size But with default values, RPC manages to saturate this channel. Having channel as large as |
7b75c4e
to
27f263c
Compare
a8a2baa
to
e8d6d3f
Compare
@bw-solana got rid of |
Should we change the default values? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one comment about possibly changing the default config to help users get max performance, but overall LGTM
@bw-solana yes, I think we should change them. I created an issue for that some time ago, maybe will do that soon #3613 |
Problem
Introduce
TpuClientNextClient
to theSendTransactionService
(STS). So user of can choose which client to use:ConnectionCacheClient
orTpuClientNextClient
. By default it isConnectionCacheClient
and, hence, it is used in the RPC service. Which means that no changes on the validator behavior.Both clients implement now
NotifyKeyUpdate
. This is need for the client to use a new validator identity when a corresponding admin rpc call is sent. Not used as part of these changes, because currently in the validator.rs we useConnectionCache
directly and this notify handling is done there.This PR partially adds back reverted PR #3515.
Summary of Changes