Bad tail latency under high contention #243

Open

drauziooppenheimer opened this issue Feb 27, 2025 · 6 comments

@drauziooppenheimer

I've been doing some load testing with bb8 and deadpool, and I've observed that under high concurrent load (e.g., 1000 concurrent Tokio tasks) with a limited number of connections (50 in my tests), bb8 distributes connections poorly: some tasks experience significant delays when trying to acquire a connection from the pool, while deadpool's acquisition latency stays consistent and well behaved.
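
My test harness looks roughly like this (simplified sketch; the connection string, query, and percentile math are placeholders for what the real test does):

use std::time::Instant;

use bb8::Pool;
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder connection string; the real test points at a local Postgres.
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=postgres dbname=bench",
        NoTls,
    )?;
    let pool = Pool::builder().max_size(50).build(manager).await?;

    // 1000 concurrent tasks competing for 50 connections.
    let mut handles = Vec::with_capacity(1000);
    for _ in 0..1000 {
        let pool = pool.clone();
        handles.push(tokio::spawn(async move {
            let start = Instant::now();
            let conn = pool.get().await.expect("failed to get connection");
            let acquired_in = start.elapsed();
            // Hold the connection for a small amount of work.
            conn.simple_query("SELECT 1").await.expect("query failed");
            acquired_in
        }));
    }

    let mut times = Vec::with_capacity(handles.len());
    for handle in handles {
        times.push(handle.await?);
    }
    times.sort();
    let pct = |p: f64| times[((times.len() as f64 * p) as usize).min(times.len() - 1)];
    println!("p50={:?} p99={:?} p99.9={:?}", pct(0.5), pct(0.99), pct(0.999));
    Ok(())
}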

Here are the measured connection acquisition times for both pools in this scenario:

bb8

  • Average: 29ms
  • p50: 2.8ms
  • p90: 6.7ms
  • p95: 115ms
  • p99: 780ms
  • p99.9: 1800ms
  • p99.99: 3742ms

deadpool

  • Average: 32ms
  • p50: 31ms
  • p90: 34ms
  • p95: 35ms
  • p99: 40ms
  • p99.9: 119ms
  • p99.99: 171ms
@djc changed the title from "Connection Acquisition Under High Concurrency" to "Bad tail latency under high contention" on Feb 27, 2025
@djc
Owner

djc commented Feb 27, 2025

What is your goal with this issue? Why not just use deadpool? How self-contained is your test program?

IIRC deadpool uses a Semaphore while bb8 uses a Mutex. There might also be other ways to track contention, although that's probably hard to do without adversely affecting fairness.
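
The useful property of tokio::sync::Semaphore here is that waiters are queued in FIFO order, so tasks are served in arrival order instead of racing for the Mutex after each wake-up. Very roughly, the pattern looks like this (illustrative sketch, not bb8's actual code):

use tokio::sync::Semaphore;

// Illustrative only: cap concurrent checkouts at the pool's max_size and let
// the semaphore's FIFO waiter queue decide which task gets the next free slot.
async fn checkout<T>(slots: &Semaphore, get_conn: impl std::future::Future<Output = T>) -> T {
    // Waiters queue up fairly; the slot is released when `_permit` is dropped.
    let _permit = slots.acquire().await.expect("pool closed");
    get_conn.await
}

// e.g. a single `Semaphore::new(max_size as usize)` shared by all callers.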

@drauziooppenheimer
Author

I created the issue to seek guidance about the potential problem, so I can try to submit a PR to improve it.

As for using deadpool: for my use case bb8 is much better, as it maintains a minimum number of connections in the pool and checks connection health before adding a connection back to the pool; deadpool does not support those features.
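
(To be concrete, these are the builder options I mean; values are illustrative and the exact setter signatures may vary between bb8 versions:)

// Illustrative values; `manager` is whatever ManageConnection impl is in use.
let pool = bb8::Pool::builder()
    .max_size(50)
    // Keep a minimum number of idle connections open in the pool.
    .min_idle(Some(5))
    // Check connection health on checkout from the pool.
    .test_on_check_out(true)
    .build(manager)
    .await?;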

@djc
Owner

djc commented Feb 27, 2025

Ah, cool. So do you feel this is enough guidance? Feel free to ask more questions!

@drauziooppenheimer
Author

I think that's enough, thank you so much! I'll play around with it this weekend and see what I can get.

@drauziooppenheimer
Author

I put together a proof of concept to validate your assumption about the semaphore, and you're correct. I made the following changes:

  1. Add a semaphore to the SharedPool struct:
pub(crate) struct SharedPool<M>
where
    M: ManageConnection + Send,
{
    pub(crate) statics: Builder<M>,
    pub(crate) manager: M,
    pub(crate) internals: Mutex<PoolInternals<M>>,
    pub(crate) notify: Arc<Notify>,
    pub(crate) statistics: AtomicStatistics,
    pub(crate) semaphore: Semaphore,
}
  2. Initialize the semaphore with max_size:
    pub(crate) fn new(statics: Builder<M>, manager: M) -> Self {
        Self {
            semaphore: Semaphore::new(statics.max_size as usize),
            statics,
            manager,
            internals: Mutex::new(PoolInternals::default()),
            notify: Arc::new(Notify::new()),
            statistics: AtomicStatistics::default(),
        }
    }
  3. Update the get(&self) -> Result<PooledConnection<'_, M>, RunError<M::Error>> function to first acquire a permit from the semaphore before locking the mutex (I used RunError::TimedOut here just to validate the approach):
pub(crate) async fn get(&self) -> Result<PooledConnection<'_, M>, RunError<M::Error>> {
    let mut kind = StatsGetKind::Direct;
    let mut wait_time_start = None;

    let future = async {
        // Wait in line for a permit before touching the pool internals;
        // tokio's Semaphore queues waiters fairly (FIFO).
        let _permit = self
            .inner
            .semaphore
            .acquire()
            .await
            .map_err(|_| RunError::TimedOut)?;

        let getting = self.inner.start_get();
...

With that change, these are the numbers I got:

  1. Without Semaphore
  • avg: 100.01ms
  • p50: 0.0005ms
  • p90: 0.0015ms
  • p95: 552.50ms
  • p99: 2673.37ms
  • p99.9: 5524.42ms
  • p99.99: 7827.25ms
  • execution time: 14.05s
  2. With Semaphore
  • avg: 131.53ms
  • p50: 129.39ms
  • p90: 138.36ms
  • p95: 141.93ms
  • p99: 172.41ms
  • p99.9: 371.70ms
  • p99.99: 386.01ms
  • execution time: 13.91s

Do you see any concerns or problems with this solution?

@djc
Owner

djc commented Feb 28, 2025

Glad to see it worked out, nice work!

I think we should probably replace the Approval stuff with semaphore permits, rather than just adding the semaphore on top.
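
Something along these lines, I think (very rough sketch, not the actual bb8 types): the checked-out connection carries its permit, so returning or dropping the connection is what frees the slot, and the separate approval bookkeeping goes away.

use std::sync::{Arc, Mutex};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Very rough sketch -- not bb8's actual internals. The permit lives inside the
// checked-out connection, so dropping/returning the connection frees the slot.
struct CheckedOut<C> {
    conn: C,
    _permit: OwnedSemaphorePermit,
}

struct PoolCore<C> {
    slots: Arc<Semaphore>, // max_size permits
    idle: Mutex<Vec<C>>,   // idle connections ready for reuse
}

impl<C> PoolCore<C> {
    async fn check_out(&self, connect: impl std::future::Future<Output = C>) -> CheckedOut<C> {
        // FIFO wait for a free slot; no thundering herd on the mutex.
        let permit = Arc::clone(&self.slots)
            .acquire_owned()
            .await
            .expect("pool semaphore closed");
        // Take an idle connection if one exists; don't hold the lock across .await.
        let reused = self.idle.lock().unwrap().pop();
        let conn = match reused {
            Some(conn) => conn,
            // Holding a permit already guarantees we stay under max_size.
            None => connect.await,
        };
        CheckedOut { conn, _permit: permit }
    }
}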
