Connection pool stuck - some (not all) connections stuck in Failed after failure #990

mgoodfellow · 2021-05-17T07:02:30Z

Hi,

Thanks for the great library!

Background, similarities to existing issue

We have been trying to track down an issue that has been causing us an issue with the Connection Pool. Unfortunately, a lot of my evidence is anecdotal, but I can try to run you through what we have tried, and what we have noticed while trying to get to the bottom of it.

We updated a few weeks ago to 1.3.2 (since upgraded to 1.3.8 which has made things better), and we started getting connection pool errors stating "All pooled connections in use". It appears this happens on a sync call on a slow running API call which does similar to:

DB fetch
3rd party API call (slow 1-3s)
DB fetch
DB commit

We found that on this particular API call it wasn't an async connector call, so we changed that and since got "Connect timeout" when this failure happens.

Our setup:

4x pods running on a k8s cluster, handling about 50req/s, most API calls make a couple of DB calls (UoW style), and API perf is normally 100-150ms on average for most calls. Connection string params of interest are: ConnectionTimeout=5;DefaultCommandTimeout=5;ConnectionReset=true;MinimumPoolSize=25

What happens:

We get a failure mode where one or more pods "get stuck", their req/s slows right down, response time goes through the roof, and they eat up all the connections in the pool. They can be running for days with no apparent issues, and this happens all within the space of a minute. Luckily for us, we have a health check API call that K8s is calling that also pings the DB. This times out as well and causes K8s to reset the pods which are failing.

The issues we have suffered sound very similar to:#979

After we started getting these issues, we started rolling back library dependencies, and we rolled back to Connector 1.0.2, but the same issue then occurred. We moved on thinking the issue was elsewhere.

In hindsight, we made a change to the ThreadPool on app startup, to help our StackExchange.Redis library throughput:

ThreadPool.SetMinThreads(250, 250)

I only mention this as this is the one change we cannot easily rollback and confirm if it is that due to needing that change on our production platform for other reasons. I'm also aware issues with the async thread starvation was mentioned in the above issue. Maybe its useful information.

I don't really have anything extra to add over the excellent debug information provided in the above issue, however, last night a new failure mode occurred, hence I'm opening a new issue.

The new failure mode from last night

We had an actual very brief connection failure between the K8s cluster and the DB it seems, and all connections were used up:

This is the 4x pods going from their min connections to max connections as they recover the network blip. Then winding back down after.

However, one of the pods got stuck in a new failure mode which might help point more evidence towards the above issue, or might just be muddying the water.

The MySQL Connector logs are showing this every minute on the failed pod:

The application logs are full of:

In the range of 100s a min (10s buckets):

But the real kicker is that that pod was still serving traffic as it obviously had some working connections which meant the health check was passing. Looking at the logs, it was happily serving traffic all night.

It seems a bunch of connections had died, and got stuck as unrecoverable, but some still worked.

Aside from the log noise, it strikes me as this is an unrecoverable failure mode? As this is all closely related to the connection pool and the handling of connections in it, I also wonder if it would provide insight as to other issues occurring within it.

The text was updated successfully, but these errors were encountered:

mgoodfellow · 2021-05-17T07:11:49Z

We get a failure mode where one or more pods "get stuck", their req/s slows right down, response time goes through the roof, and they eat up all the connections in the pool.

I just wanted to add that this happens more frequently under load. Primarily we have reduced the frequency of failures by doubling the number of pods from 2-4 which effectively halves the number of requests they are dealing with. This reduced the frequency of failures from almost daily to < once a week.

Last point is that when we have that failure, its only ever one pod at a time that crashes across the cluster, so that leads me to rule out network layer issues.

mgoodfellow · 2021-05-21T19:01:06Z

Actually, I wonder if this is NHibernate and MySQLConnector not playing nicely together?

We use NHibernate along with the MySQLConnector driver. Although NHibernate is fully async as of v5, I'm not 100% sure that is strictly true. At a guess I would imagine NHibernate to be relying on ADO features for compatibility across DB drivers. This means connection opening etc. would be synchronous (and potentially even commits/rollbacks). I wonder if this could cause this runaway issue which seems to feel like threadpool starvation.

In the MySQLConnector best practices it mentions specific async replacements https://mysqlconnector.net/tutorials/best-practices/

The NHibernate MySQLConnector driver consists of a simple interface on how to create connections etc. as per: https://github.com/nhibernate/NHibernate.MySqlConnector/blob/master/src/NHibernate.Driver.MySqlConnector/MySqlConnectorDriver.cs

This means under the hood, it might well be mixing async/sync when talking to the connector. Possibly only async on the readers.

I will try to dig in and investigate how NHibernate actually handles the connections under the hood when I have some time.

mgoodfellow · 2021-05-21T19:24:13Z

It does seem that both BeginTransaction and Commit/Rollback will always be the ADO compatible sync variants:

https://github.com/nhibernate/nhibernate-core/blob/master/src/NHibernate/Async/Transaction/AdoNetTransactionFactory.cs#L54

For now we have reverted back to the MySQL.Data connector library, and will see if that improves the situation

bgrainger · 2021-05-21T19:41:29Z

A similar exception message (but with a different "state": ClearingPendingCancellation instead of Failed) was reported in BackgroundConnectionResetHelper: #993

It seems plausible that any exception in that background-reset code path could end up "leaking" a connection, which would eventually deadlock the application (on no free connections).

I'd need to investigate what does happen in the failure case, and see if failures can be deliberately injected for testing.

bgrainger · 2021-06-05T22:04:09Z

It seems plausible that any exception in that background-reset code path could end up "leaking" a connection

Most exceptions are caught in TryResetConnectionAsync, so I'm not able to force a leak in this scenario.

rhssiiy · 2021-06-10T12:19:28Z

I faced the same problem and was able to reproduce it. The key – thread pool starvation, Command Timeout and DeferConnectionReset=false.

Minimal working code:

    public class DB : IDisposable, IAsyncDisposable
    {
        public readonly MySqlConnection c;

        public DB()
        {
            c = new MySqlConnection("server=/var/lib/mysql/mysql.sock; protocol=unix; database=xxx; user=xxx; password=xxx; Pooling=true; Min Pool Size=0; Max Pool Size=10; Connection Lifetime=300; DeferConnectionReset=false;");
        }

        public void Dispose() => c.Dispose();
        public async ValueTask DisposeAsync() => await c.DisposeAsync();
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Task> tasks = new();

            for (int i = 0; i < 100; i++)
            {
                int j = i;
                tasks.Add(Task.Run(async () => await Runner(j)));
            }

            Thread.Sleep(TimeSpan.FromSeconds(3));

            for (int i = 0; i < 100; i++)
            {
                tasks.Add(Task.Run(() => Thread.Sleep(TimeSpan.FromSeconds(30))));
            }

            Task.WaitAll(tasks.ToArray());

            Console.WriteLine("Now connection pool is full of dead connections");
            Console.ReadLine();

            for (int i = 0; i < 100; i++)
            {
                int j = i;
                tasks.Add(Task.Run(async () => await Runner(j)));
            }

            Task.WaitAll(tasks.ToArray());

            Console.WriteLine("No connection has been established. App is dead");
            Console.ReadLine();
        }

        static async Task Runner(int i)
        {
            Console.WriteLine($"started {i}");

            try
            {
                await using DB db = new();
                await db.c.ExecuteAsync("select sleep(100)");
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
            }

            Console.WriteLine($"stopped {i}");
        }

cproj:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
    <ServerGarbageCollection>true</ServerGarbageCollection>
    <ThreadPoolMinThreads>2</ThreadPoolMinThreads>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Dapper" Version="2.0.90" />
    <PackageReference Include="MySqlConnector" Version="1.3.9" />
  </ItemGroup>

</Project>

Reproducible on linux, on windows - doesn't.

It started due of a slow disk and a lot of network traffic. I disabled flushing log after each record and set DeferConnectionReset=true and the problem seems to be gone.

bgrainger · 2021-07-05T17:11:01Z

This should be improved in 1.3.10, which sets DeferConnectionReset=true by default. #1013 will remove this code.

bgrainger mentioned this issue Jul 5, 2021

Remove BackgroundConnectionResetHelper (used by DeferConnectionReset=false) #1013

Closed

bgrainger closed this as completed Jul 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Connection pool stuck - some (not all) connections stuck in Failed after failure #990

Connection pool stuck - some (not all) connections stuck in Failed after failure #990

mgoodfellow commented May 17, 2021

mgoodfellow commented May 17, 2021

mgoodfellow commented May 21, 2021

mgoodfellow commented May 21, 2021

bgrainger commented May 21, 2021

bgrainger commented Jun 5, 2021

rhssiiy commented Jun 10, 2021

bgrainger commented Jul 5, 2021

Connection pool stuck - some (not all) connections stuck in Failed after failure #990

Connection pool stuck - some (not all) connections stuck in Failed after failure #990

Comments

mgoodfellow commented May 17, 2021

Background, similarities to existing issue

The new failure mode from last night

mgoodfellow commented May 17, 2021

mgoodfellow commented May 21, 2021

mgoodfellow commented May 21, 2021

bgrainger commented May 21, 2021

bgrainger commented Jun 5, 2021

rhssiiy commented Jun 10, 2021

bgrainger commented Jul 5, 2021