Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major data corruption issue with PR #956 and connection pooling #1344

Closed
agclark27 opened this issue Oct 13, 2021 · 10 comments · Fixed by #1352
Closed

Major data corruption issue with PR #956 and connection pooling #1344

agclark27 opened this issue Oct 13, 2021 · 10 comments · Fixed by #1352
Labels
💥 Regression Issues that are regressions introduced from earlier PRs. ✔️ Repro Available Issues that are reproducible with repro provided.

Comments

@agclark27
Copy link

We have identified a major bug that appears to have been introduced in PR #956. When using connection pooling under high concurrency, cancelled SqlCommand objects may return the connection to the pool with data still in flight. In the reproduction example below, which occurs in 4.0.0-preview2 but is not reproducible prior to the PR in 4.0.0-preview1, we spawn 100 threads that are opening connections, executing a simple SQL statement that passes in a parameter set to a unique GUID and gets it echoed back through a second parameter, and closes the connection, with a low cancellation timeout so that some will fail and some will succeed. We are finding that the parameter values from a previous connection are being returned in a new, unrelated command object. This suggests that the changes made in PR #956 are no longer fully resetting the connection and in-flight data may come into a new command using a connection retrieved from the pool. This could have significant data corruption implications where output parameters from an unrelated command are returned.

Reproduction Code:

    private static void Exec()
    {
      var cancellationToken = new System.Threading.CancellationTokenSource(50);
      var connection = new Microsoft.Data.SqlClient.SqlConnection("Data Source=your-data-source;Integrated Security=SSPI;Trust Server Certificate=true;Max Pool Size=256;Min Pool Size=0;");
      var expectedGuid = System.Guid.NewGuid();
      var cm = new Microsoft.Data.SqlClient.SqlCommand();
      cm.Connection = connection;
      cm.CommandType = System.Data.CommandType.Text;
      cm.CommandText = "select @id2 = @id;";
      cm.Parameters.Add(new Microsoft.Data.SqlClient.SqlParameter("@id", System.Data.SqlDbType.UniqueIdentifier) { Value = expectedGuid });
      cm.Parameters.Add(new Microsoft.Data.SqlClient.SqlParameter("@id2", System.Data.SqlDbType.UniqueIdentifier) { Direction = System.Data.ParameterDirection.Output });
      cm.CommandTimeout = 2;
      try
      {
        connection.Open();
        var task = cm.ExecuteNonQueryAsync(cancellationToken.Token);
        task.Wait();
      }
      catch (System.Exception)
      { 
        //ignore cancellations
      }
      finally
      {
        connection.Close();
      }
      if (cm.Parameters["@id2"].Value == null) return;
      else if ((System.Guid)cm.Parameters["@id2"].Value != expectedGuid) throw new System.InvalidOperationException(string.Format("OH NO: Expected {0}, Got {1}", expectedGuid, ((System.Guid)cm.Parameters["@id2"].Value).ToString()));
    }

    public static void Main(string[] args)
    {
      var threads = new System.Collections.Generic.List<System.Threading.Thread>();
      for (int i = 0; i < 100; i++)
      {
        var t = new System.Threading.Thread(() => { for (int j = 0; j < 1000; j++) Exec(); });
        t.Start();
        threads.Add(t);
      }
      for (int i = 0; i < threads.Count; i++) threads[i].Join();
      System.Console.WriteLine("END");
      System.Console.ReadKey();
    }

Just a few seconds into execution, you'll see it throw the InvalidOperationException with something like:

System.InvalidOperationException: 'OH NO: Expected f90b93e5-e651-4543-a663-562998fca20d, Got 50f1c311-9e34-4573-9637-e3e3f0a4737b'

The output parameter is clearing being populated with data from another past iteration of the connection.

@JRahnama
Copy link
Contributor

@agclark27 thanks for bringing this up. We will look into it shortly.

CC: @Wraith2

@JRahnama JRahnama added the ✔️ Repro Available Issues that are reproducible with repro provided. label Oct 13, 2021
@JRahnama
Copy link
Contributor

I ran the repro and at some point I got the exception and rest again worked fine and after a second one more exception again and rest worked fine.
image

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 13, 2021

I think this is going to be a case where you'll only see it in the traces. It's really hard to debug this when it's running at high speed. It's probably worth reverting that specific PR and looking to see if this repro still works.

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Oct 13, 2021

Edit: It's not consistently reproducible, so might be just that.. Trying to create a consistent repro.

@JRahnama
Copy link
Contributor

also with previous versions I tested with preview1 and 3.0.1 and it worked fine.

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 13, 2021

also with previous versions I tested with preview1 and 3.0.1 and it worked fine.

Sure. And I agree that it's likely to be #956 but we should really bisect by just removing that one PR and retesting so make sure of that because there are other changes between 3 and 4.

@johnnypham
Copy link
Contributor

also with previous versions I tested with preview1 and 3.0.1 and it worked fine.

Sure. And I agree that it's likely to be #956 but we should really bisect by just removing that one PR and retesting so make sure of that because there are other changes between 3 and 4.

I haven't been able to reproduce it when #956 is reverted.

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 14, 2021

Then it looks like the safest route is to revert that PR and see if a new version can be worked out before 4.0.0. Reverting to the previous bad but known behaviour seems safer than having the possibility of data return on the wrong connection.

@JRahnama
Copy link
Contributor

@Wraith2 something to consider is the code is trying to get the value when connection is closed. It might be related to this as well.

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 14, 2021

The repro doesn't look like it's doing anything illegal. It's likely to be an issue similar to the ones we've seen in the past where a connection is returned to to pool without being completely closed. Whatever the cause I think reverting that one PR is probably the best place to start.

@cheenamalhotra cheenamalhotra linked a pull request Oct 16, 2021 that will close this issue
@cheenamalhotra cheenamalhotra added the 💥 Regression Issues that are regressions introduced from earlier PRs. label Oct 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💥 Regression Issues that are regressions introduced from earlier PRs. ✔️ Repro Available Issues that are reproducible with repro provided.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants