Bad tail latency under high contention #243

Open

drauziooppenheimer opened this issue Feb 27, 2025 · 6 comments

@drauziooppenheimer

I've been doing some load testing with bb8 and deadpool, and I've observed that under high concurrent load (e.g., 1000 concurrent Tokio tasks) with a limited number of connections (50 in my tests), bb8 distributes connections poorly: some tasks experience significant delays when trying to acquire a connection from the pool, while deadpool's acquisition latency stays consistent and well behaved.
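
My test harness looks roughly like this (simplified sketch; the connection string, query, and percentile math are placeholders for what the real test does):

use std::time::Instant;

use bb8::Pool;
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder connection string; the real test points at a local Postgres.
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=postgres dbname=bench",
        NoTls,
    )?;
    let pool = Pool::builder().max_size(50).build(manager).await?;

    // 1000 concurrent tasks competing for 50 connections.
    let mut handles = Vec::with_capacity(1000);
    for _ in 0..1000 {
        let pool = pool.clone();
        handles.push(tokio::spawn(async move {
            let start = Instant::now();
            let conn = pool.get().await.expect("failed to get connection");
            let acquired_in = start.elapsed();
            // Hold the connection for a small amount of work.
            conn.simple_query("SELECT 1").await.expect("query failed");
            acquired_in
        }));
    }

    let mut times = Vec::with_capacity(handles.len());
    for handle in handles {
        times.push(handle.await?);
    }
    times.sort();
    let pct = |p: f64| times[((times.len() as f64 * p) as usize).min(times.len() - 1)];
    println!("p50={:?} p99={:?} p99.9={:?}", pct(0.5), pct(0.99), pct(0.999));
    Ok(())
}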

Here are the measured connection acquisition times for both pools in this scenario:

bb8

  • Average: 29ms
  • p50: 2.8ms
  • p90: 6.7ms
  • p95: 115ms
  • p99: 780ms
  • p99.9: 1800ms
  • p99.99: 3742ms

deadpool

  • Average: 32ms
  • p50: 31ms
  • p90: 34ms
  • p95: 35ms
  • p99: 40ms
  • p99.9: 119ms
  • p99.99: 171ms
@djc changed the title from "Connection Acquisition Under High Concurrency" to "Bad tail latency under high contention" on Feb 27, 2025
@djc
Owner

djc commented Feb 27, 2025

What is your goal with this issue? Why not just use deadpool? How self-contained is your test program?

IIRC deadpool uses a Semaphore while bb8 uses a Mutex. There might also be other ways to track contention, although that's probably hard to do without adversely affecting fairness.
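
The useful property of tokio::sync::Semaphore here is that waiters are queued in FIFO order, so tasks are served in arrival order instead of racing for the Mutex after each wake-up. Very roughly, the pattern looks like this (illustrative sketch, not bb8's actual code):

use tokio::sync::Semaphore;

// Illustrative only: cap concurrent checkouts at the pool's max_size and let
// the semaphore's FIFO waiter queue decide which task gets the next free slot.
async fn checkout<T>(slots: &Semaphore, get_conn: impl std::future::Future<Output = T>) -> T {
    // Waiters queue up fairly; the slot is released when `_permit` is dropped.
    let _permit = slots.acquire().await.expect("pool closed");
    get_conn.await
}

// e.g. a single `Semaphore::new(max_size as usize)` shared by all callers.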

@drauziooppenheimer
Author

I created the issue to seek guidance about the potential problem, so I can try to submit a PR to improve it.

As for using deadpool: for my use case bb8 is much better, as it maintains a minimum number of connections in the pool and checks connection health before adding a connection back to the pool; deadpool does not support those features.
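
(To be concrete, these are the builder options I mean; values are illustrative and the exact setter signatures may vary between bb8 versions:)

// Illustrative values; `manager` is whatever ManageConnection impl is in use.
let pool = bb8::Pool::builder()
    .max_size(50)
    // Keep a minimum number of idle connections open in the pool.
    .min_idle(Some(5))
    // Check connection health on checkout from the pool.
    .test_on_check_out(true)
    .build(manager)
    .await?;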

@djc
Owner

djc commented Feb 27, 2025

Ah, cool. So do you feel this is enough guidance? Feel free to ask more questions!

@drauziooppenheimer
Author

I think that's enough, thank you so much! I'll play around with it this weekend and see what I can get.

@drauziooppenheimer
Author

I put together a proof of concept to validate your assumption about the semaphore, and you're correct. I made the following changes:

  1. Add a semaphore to the SharedPool struct:
pub(crate) struct SharedPool<M>
where
    M: ManageConnection + Send,
{
    pub(crate) statics: Builder<M>,
    pub(crate) manager: M,
    pub(crate) internals: Mutex<PoolInternals<M>>,
    pub(crate) notify: Arc<Notify>,
    pub(crate) statistics: AtomicStatistics,
    pub(crate) semaphore: Semaphore,
}
  2. Initialize the semaphore with max_size:
    pub(crate) fn new(statics: Builder<M>, manager: M) -> Self {
        Self {
            semaphore: Semaphore::new(statics.max_size as usize),
            statics,
            manager,
            internals: Mutex::new(PoolInternals::default()),
            notify: Arc::new(Notify::new()),
            statistics: AtomicStatistics::default(),
        }
    }
  3. Update the get(&self) -> Result<PooledConnection<'_, M>, RunError<M::Error>> function to first acquire a permit from the semaphore before locking the mutex (I used RunError::TimedOut here just to validate the approach):
pub(crate) async fn get(&self) -> Result<PooledConnection<'_, M>, RunError<M::Error>> {
    let mut kind = StatsGetKind::Direct;
    let mut wait_time_start = None;

    let future = async {
        // Wait in line for a permit before touching the pool internals;
        // tokio's Semaphore queues waiters fairly (FIFO).
        let _permit = self
            .inner
            .semaphore
            .acquire()
            .await
            .map_err(|_| RunError::TimedOut)?;

        let getting = self.inner.start_get();
...

With that change, these are the numbers I got:

  1. Without Semaphore
  • avg: 100.01ms
  • p50: 0.0005ms
  • p90: 0.0015ms
  • p95: 552.50ms
  • p99: 2673.37ms
  • p99.9: 5524.42ms
  • p99.99: 7827.25ms
  • execution time: 14.05s
  2. With Semaphore
  • avg: 131.53ms
  • p50: 129.39ms
  • p90: 138.36ms
  • p95: 141.93ms
  • p99: 172.41ms
  • p99.9: 371.70ms
  • p99.99: 386.01ms
  • execution time: 13.91s

Do you see any concerns or problems with this solution?

@djc
Owner

djc commented Feb 28, 2025

Glad to see it worked out, nice work!

I think we should probably replace the Approval stuff with semaphore permits, rather than just adding the semaphore on top.
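
Something along these lines, I think (very rough sketch, not the actual bb8 types): the checked-out connection carries its permit, so returning or dropping the connection is what frees the slot, and the separate approval bookkeeping goes away.

use std::sync::{Arc, Mutex};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Very rough sketch -- not bb8's actual internals. The permit lives inside the
// checked-out connection, so dropping/returning the connection frees the slot.
struct CheckedOut<C> {
    conn: C,
    _permit: OwnedSemaphorePermit,
}

struct PoolCore<C> {
    slots: Arc<Semaphore>, // max_size permits
    idle: Mutex<Vec<C>>,   // idle connections ready for reuse
}

impl<C> PoolCore<C> {
    async fn check_out(&self, connect: impl std::future::Future<Output = C>) -> CheckedOut<C> {
        // FIFO wait for a free slot; no thundering herd on the mutex.
        let permit = Arc::clone(&self.slots)
            .acquire_owned()
            .await
            .expect("pool semaphore closed");
        // Take an idle connection if one exists; don't hold the lock across .await.
        let reused = self.idle.lock().unwrap().pop();
        let conn = match reused {
            Some(conn) => conn,
            // Holding a permit already guarantees we stay under max_size.
            None => connect.await,
        };
        CheckedOut { conn, _permit: permit }
    }
}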
