
Reapply 3515: add tpu client next to sts #4758

Conversation

@KirillLykov commented Feb 3, 2025

Problem

Introduce TpuClientNextClient to the SendTransactionService (STS), so that the user can choose which client to use: ConnectionCacheClient or TpuClientNextClient. The default is ConnectionCacheClient, which is therefore the one used in the RPC service, meaning there are no changes to validator behavior.

Both clients now implement NotifyKeyUpdate. This is needed for the client to pick up a new validator identity when the corresponding admin RPC call is sent. It is not used as part of these changes, because validator.rs currently uses ConnectionCache directly and the notify handling is done there.
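
For orientation, here is a minimal sketch of the idea of letting STS pick between the two clients behind one interface. The trait name TransactionClient, the method send_transactions_in_batch, and the build_client helper are simplifications assumed for illustration; the actual trait and constructors in the PR differ in signature and detail.

use std::sync::Arc;

// Simplified stand-in for the client abstraction STS sends through.
trait TransactionClient {
    fn send_transactions_in_batch(&self, wire_transactions: Vec<Vec<u8>>);
}

struct ConnectionCacheClient;
struct TpuClientNextClient;

impl TransactionClient for ConnectionCacheClient {
    fn send_transactions_in_batch(&self, _wire_transactions: Vec<Vec<u8>>) {
        // forward through the existing ConnectionCache-based path
    }
}

impl TransactionClient for TpuClientNextClient {
    fn send_transactions_in_batch(&self, _wire_transactions: Vec<Vec<u8>>) {
        // forward through the tpu-client-next scheduler
    }
}

fn build_client(use_tpu_client_next: bool) -> Arc<dyn TransactionClient + Send + Sync> {
    if use_tpu_client_next {
        Arc::new(TpuClientNextClient)
    } else {
        // Default, so existing RPC/validator behavior is unchanged.
        Arc::new(ConnectionCacheClient)
    }
}

fn main() {
    // Hypothetical flag; in the PR the default keeps ConnectionCacheClient.
    let client = build_client(false);
    client.send_transactions_in_batch(vec![vec![0u8; 1]]);
}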

This PR partially adds back reverted PR #3515.

Summary of Changes

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 3be592a to 96192fd Compare February 5, 2025 19:10
@KirillLykov KirillLykov marked this pull request as ready for review February 5, 2025 19:10
@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 96192fd to f20cc08 Compare February 5, 2025 19:12
@KirillLykov (Author)

CI breaks for unrelated reasons; I fixed the CI problem in this PR: #4809

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from f20cc08 to e417014 Compare February 6, 2025 09:20
@bw-solana left a comment

Left some comments along the way.

It's a little tough to review some of the function implementations without knowing exactly how they will be used/get called

leader_forward_count: usize,
) -> ConnectionWorkersSchedulerConfig {
ConnectionWorkersSchedulerConfig {
bind: SocketAddr::new(Ipv4Addr::new(0, 0, 0, 0).into(), 0),


CC @alexpyattaev. Might want to consider how this gets set up in a multi-homed design.

@KirillLykov (Author)

There are several places where we bind the client on 0.0.0.0; I think we need to come up with a unified approach for all of them. If we change it to 127.0.0.1, some unit tests fail (not in this crate, but in ForwardingStage); I haven't checked on this yet.
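
As a rough illustration of such a unified approach, one option is a single helper that every client calls to obtain its bind address. This is only a sketch; the helper name and shape are assumptions, not something from this PR.

use std::net::{IpAddr, Ipv4Addr, SocketAddr};

// Hypothetical helper: one place deciding which local address client
// sockets bind to, instead of hard-coding 0.0.0.0 at every call site.
fn client_bind_address(bind_ip: Option<IpAddr>) -> SocketAddr {
    // Port 0 lets the OS pick an ephemeral port; the IP falls back to the
    // unspecified address (0.0.0.0) unless the caller overrides it.
    SocketAddr::new(bind_ip.unwrap_or(IpAddr::V4(Ipv4Addr::UNSPECIFIED)), 0)
}

fn main() {
    assert_eq!(client_bind_address(None).to_string(), "0.0.0.0:0");
}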

@KirillLykov (Author) commented Feb 6, 2025

It's a little tough to review some of the function implementations without knowing exactly how they will be used/get called

Yeah, I understand, but that looks like the price of small PRs.
The final PR which integrates all this code into the validator is #3893.
Why Updater is implemented: to register the client here:

agave/core/src/validator.rs

Lines 1638 to 1640 in 50fd073

if let Some(client_updater) = client_updater {
key_notifies.push(client_updater);
}

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from e417014 to 67adc40 Compare February 7, 2025 16:34
@KirillLykov KirillLykov requested a review from bw-solana February 7, 2025 16:42
@bw-solana left a comment

Logic side seems fine to me. Just one question about back-pressuring.

I think we should split out the cosmetic changes.

}

pub struct ConnectionCacheClient<T: TpuInfoWithSendStatic> {
connection_cache: Arc<ConnectionCache>,
tpu_address: SocketAddr,
tpu_peers: Option<Vec<SocketAddr>>,
leader_info_provider: Arc<Mutex<CurrentLeaderInfo<T>>>,
leader_forward_count: u64,
leader_fanout: u64,


Similar thing here... I think this naming is clearer, but maybe we can move all the cosmetic stuff into a separate PR.

My apologies. Mea culpa

{
// The channel size represents 8s worth of transactions at a rate of
// 1000 tps, assuming batch size is 64.
let (sender, receiver) = mpsc::channel(128);


Still curious about how backpressuring will work


if we're often sending batch size of 1, then this means we can only buffer ~100ms of traffic at a rate of 1k TPS, right? Is this okay?

Seems like increasing the channel to something like 1k would cost almost nothing and give us a lot of breathing room for fluctuations
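
To make the numbers above concrete, here is a small back-of-the-envelope sketch. The 128-slot capacity comes from the snippet; the batch sizes of 64 and 1 and the 1k TPS rate are the figures quoted in the comments.

fn main() {
    let capacity_batches = 128.0_f64; // bounded channel capacity from the snippet above
    let tps = 1_000.0_f64;

    // With 64-transaction batches, 128 slots hold 128 * 64 = 8192 txs,
    // i.e. roughly 8 seconds of traffic at 1k TPS.
    let secs_batch_64 = capacity_batches * 64.0 / tps;

    // With a batch size of 1, the same 128 slots hold only ~0.13 s of traffic.
    let secs_batch_1 = capacity_batches * 1.0 / tps;

    println!("batch=64: ~{secs_batch_64:.1}s of buffering, batch=1: ~{secs_batch_1:.3}s");
}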

async fn do_update_key(&self, identity: &Keypair) -> Result<(), Box<dyn std::error::Error>> {
let runtime_handle = self.runtime_handle.clone();
let config = Self::create_config(
Some(identity.insecure_clone()),


😬

@KirillLykov (Author) Feb 12, 2025

This is needed because the value is moved inside the spawned task. The NotifyKeyUpdate trait wasn't designed for an agent-based design, which creates all these workarounds.
I'm thinking about whether there is a cheap way to overcome this. I would change NotifyKeyUpdate to take Arc<Keypair> instead of &Keypair, but I'm not sure how invasive that would be.

@KirillLykov (Author)

One way to avoid using insecure_clone is to make the changes in #4985. From a security perspective, insecure_clone in this particular place doesn't make anything worse, but this workaround is proposed just for the sake of never having to use it.

Another option would be to change NotifyKeyUpdate to take Arc<Keypair>, but this is hard because the trait is public.
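
For illustration, here is a sketch of what the Arc-based variant could look like. The trait and impl below are hypothetical; the real NotifyKeyUpdate takes &Keypair, which is what forces the clone when the key must move into a spawned task.

use std::sync::Arc;
use solana_sdk::signature::Keypair;

// Hypothetical Arc-based variant of the key-update hook.
trait NotifyKeyUpdateArc {
    fn update_key(&self, identity: Arc<Keypair>) -> Result<(), Box<dyn std::error::Error>>;
}

struct ExampleClient;

impl NotifyKeyUpdateArc for ExampleClient {
    fn update_key(&self, identity: Arc<Keypair>) -> Result<(), Box<dyn std::error::Error>> {
        // The Arc is moved (or cheaply cloned) into the background task,
        // so no copy of the secret key material is ever made.
        std::thread::spawn(move || {
            let _identity = identity;
            // reconfigure the scheduler with the new identity here (illustrative)
        });
        Ok(())
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    ExampleClient.update_key(Arc::new(Keypair::new()))
}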

};
let (handle, token) = std::mem::take(&mut *lock);
token.cancel();
if let Some(handle) = handle {


could get rid of a layer of nesting with:

let Some(handle) = handle else {
    // comment on why we might not have a handle
    return;
};

error!("Failed to stop scheduler: TpuClientNext task panicked.");
return;
};
let (handle, token) = std::mem::take(&mut *lock);


would be nice to use a consistent name for the cancellation token. I see we call it cancel above and token here

@KirillLykov (Author) commented Feb 12, 2025

if we're often sending batch size of 1, then this means we can only buffer ~100ms of traffic at a rate of 1k TPS, right? Is this okay?
Seems like increasing the channel to something like 1k would cost almost nothing and give us a lot of breathing room for fluctuations

I tested the validator with a meaningful configuration (rpc transaction batch size 1 -> 64 and batch send rate ms 1 -> 10), and with an ingress of 10k txs per slot I never saturated the 128-entry channel.

But with the default values, RPC manages to saturate this channel. Making the channel as large as 1024 would mitigate the problem of a full channel. However, it looks like the overall TPS slowdown is mostly due to the higher cost of handling txs one by one (a 30% TPS slowdown overall compared with the meaningful configuration). So it looks like fixing the channel capacity will not improve overall TPS anyway.
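
For reference, the two settings mentioned here, expressed as a tiny illustrative struct; the field names mirror the discussion and are not necessarily the actual STS Config fields.

// Illustrative only: defaults vs. the "meaningful configuration" used in the test above.
struct BatchingConfig {
    batch_size: usize,       // transactions per batch
    batch_send_rate_ms: u64, // how often a batch is flushed, in ms
}

fn main() {
    let default = BatchingConfig { batch_size: 1, batch_send_rate_ms: 1 };
    let tested = BatchingConfig { batch_size: 64, batch_send_rate_ms: 10 };
    println!(
        "default: {} tx / {} ms, tested: {} tx / {} ms",
        default.batch_size, default.batch_send_rate_ms, tested.batch_size, tested.batch_send_rate_ms
    );
}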

@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from 7b75c4e to 27f263c Compare February 12, 2025 17:33
@KirillLykov KirillLykov force-pushed the klykov/add-tpu-client-next-to-sts-again branch from a8a2baa to e8d6d3f Compare February 17, 2025 16:20
@KirillLykov (Author)

@bw-solana got rid of insecure_clone

@bw-solana

I tested the validator with a meaningful configuration (rpc transaction batch size 1 -> 64 and batch send rate ms 1 -> 10), and with an ingress of 10k txs per slot I never saturated the 128-entry channel.

But with the default values, RPC manages to saturate this channel. Making the channel as large as 1024 would mitigate the problem of a full channel. However, it looks like the overall TPS slowdown is mostly due to the higher cost of handling txs one by one (a 30% TPS slowdown overall compared with the meaningful configuration). So it looks like fixing the channel capacity will not improve overall TPS anyway.

Should we change the default values?

@bw-solana left a comment

Left one comment about possibly changing the default config to help users get max performance, but overall LGTM

@KirillLykov KirillLykov merged commit 216f827 into anza-xyz:master Feb 19, 2025
58 checks passed
@KirillLykov (Author)

@bw-solana yes, I think we should change them. I created an issue for that a while ago and will maybe get to it soon: #3613
