Database-agnostic way to detect transient database errors #34817

roji · 2020-04-10T11:37:54Z

Provider tracking issues

This new API has been merged, following are tracking issues for implementation in the different providers:

SqlClient: .NET 5.0 | Implement SqlException.IsTransient SqlClient#649
Microsoft.Data.Sqlite: Implement SqliteException.IsTransient efcore#21598
Npgsql: Override SqlState and IsTransient from DbException npgsql/npgsql#3058
MySqlConnector: Implement MySqlException.IsTransient mysql-net/MySqlConnector#849

Summary

A recurring theme for database programming has been building automated retry policies, but System.Data doesn't currently allow database drivers to expose whether a database error can be expected to be transient or not. This leads to database-specific logic for detecting transient exceptions being built again and again outside of the driver.

API Proposal

class DbException
{
    public virtual bool IsTransient => false;
}

Notes

IsTransient should be optimistic, i.e. return true in all cases where a simple retry of the operation - without any other change - may be successful. We prefer to err on the side of transience, since the downside is only a few extra retries.
In at least some cases, if an operation transiently fails within a transaction, the entire transaction must be rolled back and retried (including all previous operations). This would be out of scope of this proposal: DbException would only inform the consumers that the exception is transient, but the actual handling (rolling back, reissuing previous commands) would be the consumer's responsibility.

Resources and examples

EF Core's SQL Server transient exception detector.
Npgsql's NpgsqlException.IsTransient property, which is likely to resemble the API for this proposal.
SqlClient's internal retry mechanism, important especially for Azure SQL. See also this conversation on possibly evolving that mechanism.

Old notes on SupportsTransienceDetection (removed from proposal)

SupportsTransienceDetection can be used to signal that IsTransient has been implemented by a driver, and can be expected to be true where relevant. Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case.
It's a bit problematic for to be on DbException:
- This makes it hard to do a simple check on startup (one needs an actual DbException instance thrown from the driver).
- This feature flag would ideally be in a general metadata/feature discovery facility (see ADO.NET metadata/feature/capability discovery API #28763), but we don't have that at the moment.
- We could put this flag on DbProviderFactory, but that would turn it into a general place for feature flags as opposed to a factory (current flags such as CanCreateDataAdapter are about factory methods on the type, so much more appropriate).
- As the problem isn't very acute, having this property on DbException is probably OK.

/cc @David-Engel @cheenamalhotra @bgrainger @bricelam @ajcvickers @stephentoub @terrajobst @mgravell @FransBouma

Edit history

Date	Modification
2020-04-10	API proposal
2020-04-18	Removed SupportsTransienceDetection

FransBouma · 2020-04-10T13:35:39Z

Is SupportsTransienceDetection really necessary? All I'm interested in as a consumer of the exception is whether it's a transient error. Now I have a list of transient error codes per database in my transient error recovery strategy classes and that's not very flexible (if error codes are added I have to maintain this list plus I have to look up these codes for all databases supported). So I just want to know 'is this exception a transient error?' If so, use the strategy, if not, throw the exception further up like a normal exception.

If IsTransient is false, whether that's because it's not a transient error or because the driver/ado.net provider doesn't support this, the net end result is the same: the recovery strategy is bypassed and the exception is rethrown as a normal exception.

Perhaps I'm overlooking a use case tho :)

roji · 2020-04-10T13:54:54Z

You're right that it's not terribly useful (i.e. "Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case.").

The main case I had in mind was to be able to surface that information to a user, or possibly to fail at startup if transience isn't supported by a provider. I'm not sure how relevant that really is, if people think it's useless we can also drop it.

FransBouma · 2020-04-10T14:23:35Z

Ah now I understand it. :) Not sure how that information can be used in a practical way however: a recovery strategy of some sort has to be implemented regardless: a re-entrant aware system that can deal with retrying the commands in a given form, and if that strategy is built around 'IsTransient' it will simply bubble up all exceptions (as they're never 'transient') which is effectively the same as having code checking SupportsTransienceDetection and switching based on that between no recovery strategy and a recovery strategy. Except perhaps a tiny performance gain in the 'no recovery strategy' scenario perhaps.

roji · 2020-04-10T15:02:20Z

[...] and switching based on that between no recovery strategy and a recovery strategy. Except perhaps a tiny performance gain in the 'no recovery strategy' scenario perhaps.

Yeah, I wouldn't really expect anyone to switch on SupportsTransienceDetectionfor recovery/no-recovery (and I definitely don't think there's any real perf to be gained). But in theory someone could try to do something at a higher level if they know there's no transience detection. I admit I don't really have a compelling use-case - it's more a general idea of allowing consumers to know whether it's been implemented or not, for any reason they might have.

I'm definitely open to removing this - let's see what other people think.

saurabh500 · 2020-04-10T19:08:27Z

While working on SqlClient, I realized that the classification of Transient errors is a combination of recommendations offered by the database and the nature of the customer workload and the customer's own understanding of their infrastructure running the applications and databases.

A driver can classify an error as transient, if it is recommended. And such recommendations are usually made for a very small subset of errors. At least this is what my observation is, while interacting with SqlClient customers.

For many other errors, it is the application that decides whether it should be transient or not.

Consider SqlConnection.Open() in SqlClient, it retries on a bunch of errors while connecting to Azure SQL DB. But those errors are returned by the server. However there are many cases in which customers want to retry on networking specific errors, which may be caused by broken connections or caused by failure in TCP paths.
Some customers want to have retries on such errors and some customers want to detect these ASAP and have no retries.

Similar arguments go for Command Execution where the customers want to retry on some errors and fail fast on some errors. E.g. Deadlock is a problem that some customer applications want to retry on, because in their workload it is transient and for some customer applications, deadlocks need investigation.

Basically the known transient errors in SqlClient are much fewer and this proposal makes more sense with a mechanism via which the customers can tell us if they want a particular error to be treated as transient and retried on, vs errors that we fail fast with.

Otherwise the client's may not be authoritatively able to classify errors as transient and in case of client considering errors as transient may be undesirable.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I wonder if the above situations apply for other drivers, they probably might.

An API ecosystem where customers have the ability to specify what errors should be treated transient would be lot more powerful.

roji · 2020-04-10T20:26:37Z

Thanks for these thoughts @saurabh500.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I think there may be a bit of confusion here between the transient retrying implemented inside of SqlClient, and external retrying implemented outside of any specific driver, and which would leverage the proposed IsTransient property. Anything happening inside SqlClient is specific to that driver, and although you could add an API to make its retry mechanism more flexible, I don't think that would be applicable for a database-agnostic approach. Npgsql (as well as most other drivers AFAIK) doesn't consider retrying to be within its scope of responsibility and doesn't contain any logic internally for doing so - NpgsqlConnection.Open (almost) always immediately throws if any I/O error occurs. The user is expected to handle retrying according to their needs (typically using something like Polly), which obviates any need for an API to tell the driver what to retry and what not to retry. The proposed IsTransient property is meant to help with writing such an external retrying strategy.

It definitely makes sense that different customers may have different needs around retrying, and around which errors should be classified as transient. From a database-agnostic perspective, doing anything here would require some sort of database-agnostic classification/categorization system: networking errors, transaction deadlock errors, etc. If such a classification existed, driver users would be able to react differently to types of errors in a database-agnostic way. However, I'm skeptical of whether such a system can work; as I wrote in dotnet/SqlClient#518 (comment), database errors and error codes in general seem very divergent across databases, and I'd be very wary for us to go into the business of standardizing database error categories for the purpose of transience.

(As a aside note, the standard SQLSTATE is precisely an attempt to categorize database errors in a database-agnostic way (although SQLSTATE isn't really concerned with transience as a first-level concept). As I wrote in dotnet/SqlClient#518 (comment), I'm not aware of this being implemented in SQL Server, for example.)

Some further thoughts:

There is one definition of transience which seems clear and unambiguous to me, regardless of specific customer needs - whether retrying the same operation may succeed without doing any other change. Under this definition, all network errors are transient, as are any locking issues which could disappear on a second attempt.
It seems to me that this definition has enough value for the general/default case; at the end of the day, consumers are always free to ignore the proposed IsTransient property and define whatever retry policy they want. However, if in the typical case customers really do need to configure and tweak exactly which error type they want to retry, then I guess there's little value in this proposal and a non-agnostic approach is the only possibility (as today). It would be good to see what others think about this.

Let me know what you think or if I've misunderstood your points.

saurabh500 · 2020-04-10T20:49:18Z

I am not asking for ADO.Net base classes to be database aware or to categorize errors. That would not be desirable and may not serve all databases and their ways of categorizing errors.
I should have not used too many references to SqlClient in my explanation :)

For a driver to say whether DbException.IsTransient should be set true or false is prior knowledge. So a driver knows that there are errors that are transient since those errors are documented. It is only for those errors that the driver can definitely set IsTransient to true. For all other errors, there should be a way for the consumer of ado.net driver to add to the list of transient errors.

If the customer can feed in any more errors codes that they want to consider transient, then the driver can set the IsTransient property to true for those additional errors along with the errors it knows about and any transient error handling framework like Polly can simply retry since the information is plumbed into the exception from the driver.

Does this make sense?

saurabh500 · 2020-04-10T21:11:29Z

I had a chat with Shay:

The fundamental problem is that every driver / database has errors which need to be interpreted differently, they may be in different formats, an error code may be overloaded with lot more information than just being a code.

Hence providing a DB agnostic way of getting these errors from the ADO.Net layer, may not satisfy all providers.

The recommendation would be to have the driver expose a mechanism (say via Connection String) to take in a list of errors that the driver user wants to consider transient. Then the driver can simply make use of DbException.IsTransient to express that the error is to be considered transient.

Basically the mechanism of ingestion of the custom error will be dependent on the driver and ADO.Net DbException will provide a way for the driver to express if the error is transient or not.

saurabh500 · 2020-04-10T21:12:55Z

@roji, I hope this captures the discussion. Please add to it if I missed anything.

roji · 2020-04-10T21:23:40Z

Yeah, I think that makes sense.

I admit that for Npgsql (where IsTransient already exists) I haven't received requests to tweak its meaning. However, if for a particular driver it makes sense to expose a configuration interface (via connection strings or via a special API), which affects what the driver returns via InTransient, then it's by all means free to do so (again, this would be specific to that driver since errors aren't agnostic).

I'll just remark that in that case, instead of configuring the driver to modify its IsTransient values, a user could just implement their logic wherever the IsTransient is being consulted (since things like Polly are almost always configurable). That is, say I want to treat error code X as transient, although by default the driver doesn't treat it as such; I could modify my external retry strategy (which is under my direct control) to recognize both IsTransient and code X. But it may indeed still make sense to influence IsTransient in the driver in some scenarios.

(so all good)

rgarrison12345 · 2020-04-11T02:50:17Z

Thanks everyone for the conversation on this. I've been reading through and this is the first time a suggestion of mine has made it to an API proposal. What is the next step for this to take this from a proposal to an actionable item? This looks to appear to be adding a property to DbException with a default implementation so this shouldn't be a breaking change. I wouldn't think this would require any heavy testing because this is just a property on the abstract class.

Assuming the conversation regarding this API proposal is for the most part agreed upon.

roji · 2020-04-11T09:01:59Z

@rgarrison12345 let's give this a bit more time to let other people comment and post their views; it's always good to have a consensus from multiple people involved in .NET database programming and drivers. Having said that, this indeed is a small non-breaking change, which I think we'll be able to get in to .NET 5.0.

roji · 2020-04-12T10:31:58Z

A couple more thoughts on in-driver and external resilience from the conversation with @saurabh500 (this doesn't have a direct bearing on this issue):

Azure SQL seems to require clients to implement retrying strategies, because apparently transient errors there are a normal occurrence, as opposed to other scenarios. I'm not aware of the same thing with other cloud database services (e.g. Citus for PostgreSQL), and am interested if anyone has more info on this.
As a result, and in order to provide a good experience for Azure SQL out of the box, SqlClient implements a retrying strategy inside the driver, rather than pushing the problem to the user. Given the Azure SQL behavior, this of course makes total sense, but the general model raises some problems:
- Retrying policies are very open-ended and configurable (how many times to retry, what kind of back-off...) - this is why products such as Polly exist solely to handle this. Internalizing retrying into the driver introduces all this complexity/configurability inside the driver, and likely can never be as configurable/extensible as a dedicated product etc.
- This also forces all language drivers to implement retrying, which seems like quite a burden. I'm curious, do non-.NET SQL Server drivers implement the same kind of resilience logic as SqlClient because of Azure SQL? @David-Engel do you have any info here?

FransBouma · 2020-04-12T10:50:30Z

It's new to me that SQLClient does the retrying itself on Azure, is this behavior introduced in Microsoft.Data.SqlClient? As I had to deliberately implement retry strategies for azure for System.Data.SqlClient.

roji · 2020-04-12T11:50:17Z

I think the relevant info is in https://docs.microsoft.com/en-us/azure/sql-database/sql-database-connectivity-issues#net-sqlconnection-parameters-for-connection-retry, but @saurabh500, @David-Engel and @cheenamalhotra can provide more info.

saurabh500 · 2020-04-12T18:05:05Z

@FransBouma The behavior was introduced in System.Data.SqlClient.SqlConnection.Open() in 4.6.1 and has been carried over to Microsoft.Data.SqlClient. The link above from @roji provides the connection string parameters that can be used to control this feature.

@roji
Whether this specific feature is available in other drivers or not, I am not sure about it. @David-Engel and @cheenamalhotra will definitely have more information

However the various driver teams try to keep the feature offerings in different drivers the same, so that all languages and platforms get the SQL Server/Azure SQL DB related benefits and they are equally easy to use in all the cases. E.g. If ODBC driver offers Always Encrypted feature, then all other drivers would, or if ADO.Net offers Azure Active Directory auth, then all other drivers would eventually try to offer the feature.

I am not sure why you think this is a burden? Knowing that all drivers on all platforms where SQL Server drivers are available, are trying to offer the same features is quite nice.

Retrying policies are very open-ended and configurable (how many times to retry, what kind of back-off...) - this is why products such as Polly exist solely to handle this. Internalizing retrying into the driver introduces all this complexity/configurability inside the driver, and likely can never be as configurable/extensible as a dedicated product etc.

I agree that drivers shouldn't go to the extent of offering what Polly or other transient fault handling frameworks offer, I do believe that offering some kind of default strategy to allow for common failure use cases in drivers is a great way of not bringing in another dependency in the application framework to make sure that a connection.Open() succeeds. It made sense for SqlClient and we did see reduction in cases where failures were reported to us, because legacy applications didn't have a retry policy in place while being moved to Azure.
There would be many cases where the driver doesn't offer the retry feature which can satisfy the needs to the application, then the developers can turn off the driver feature (in case of SqlClient usign a connection string parameter) and move to transient handling frameworks that they would prefer, or use transient fault handling framework in conjunction with Driver's retry handling if that serves the purpose.

roji · 2020-04-12T18:27:42Z

I am not sure why you think this is a burden? Knowing that all drivers on all platforms where SQL Server drivers are available, are trying to offer the same features is quite nice.

Don't get me wrong, given that Azure SQL DB works this way, I think it does make sense for SqlClient (and other drivers) to implement this, otherwise it seems that it would be unusable with Azure on its own. However, what is basically happening is that Azure SQL DB is passing a mandatory resiliency handling burden onto its clients (i.e. onto drivers), where it would have been better to it to simple handle them internally, exposing a reliable service. There may be technical architectural reasons why things have to be this way, but I'm not aware of other cloud data services where things work this way (granted, I don't know a huge deal about this).

Let's put it this way - setting aside Azure SQL DB for a moment (where apparently resilience is particularly problematic), would it make sense to have in-driver retrying for regular, on-premise where networking issues are likely to be much more rare? I'm asking this because I've never received requests for such a feature for Npgsql, and I'm not aware of other drivers doing this apart from SqlClient (I may simply be unaware though).

One last reservation I have about this, is that in-driver resiliency can only go so far. For example, as per the documentation, if a command has started executing and fails, no retry attempt will happen. This is reasonable behavior to expect from the driver (because re-executing commands can be tricky, considering double execution, transactions...), but if transient failures really are common then users must deal with this as well. In other words, driver resiliency coverage seems like it would necessarily be limited, no?

In any case, this discussion is quite academic. I do think SqlClient is doing things right given the Azure SQL DB resiliency situation, it's just that in an ideal world Azure SQL DB would be more reliable and not shift this burden to the driver, that's all. And as you pointed out, users can out-in (or out) of in-driver retrying, in any case.

saurabh500 · 2020-04-12T19:28:44Z

I am curious as to where this discussion is going. What are you planning to achieve from this discussion about retries?
I have put in a bunch of historical data points here about what how we handled it with Azure SQL DB and I feel that I am giving the wrong perception that Azure SQL DB is full of problems with with transient errors. Certainly not the picture I was trying to paint :)

I know about SqlClient and I am only trying to tell you what problems we have faced in the past, my recommendation about how such a common theme can be incorporated into ADO.Net programming.
However your paraphrasing of my comments are a bit discouraging and painting what I work on or I have worked on, in the wrong light. :)

roji · 2020-04-12T21:51:31Z

Sorry @saurabh500, that's not at all what I wanted to express...

I really think SqlClient is doing the right thing here, and am not criticizing any of it. I also admit most of the above isn't very relevant in a concrete way to this proposal (which is just about adding IsTransient to DbException).

The above comes only from my surprise in learning that Azure SQL DB passes the burden of resilience checks to the client (i.e. driver), that's all - I'm simply not used to this from a cloud data service. Please accept my apologies if any of the above cast your work in a negative light - that was really the last thing I wanted to do. Your comments helped me personally understand the landscape and customer needs better.

I guess at this point I should stop posting general thoughts that don't make this issue move forward (unless you feel there's additional value in continuing this conversation). Sorry once again.

David-Engel · 2020-04-13T16:50:07Z

I'm curious, do non-.NET SQL Server drivers implement the same kind of resilience logic as SqlClient because of Azure SQL? @David-Engel do you have any info here?

@roji SqlClient is the most feature-rich in terms of this type of functionality, but yes, other drivers do implement resilience logic. ODBC has automatic connection open retry logic, for example and automatic reconnection of idle connections which have been disconnected by network devices. Saurabh touched on what I believe is the biggest use case for these features: existing/legacy applications which were built on the assumption of local resources where "blips" rarely occurred (whether that's a network hiccup, fail over event, provisioning delay, etc) and where it is burdensome to make code changes. Enabling those to be migrated to cloud databases without suffering from those cloud "blips" has very high value to customers.

roji · 2020-04-13T16:59:19Z

Thanks @David-Engel, makes complete sense.

Let me know if you guys think that the IsTransient property proposed here makes sense as-is (as well as SupportsTransienceDetection).

FransBouma · 2020-04-13T18:34:42Z

Transient errors and retries aren't only connection open / reconnect oriented, i.e. if you have issued 3 commands and the 4th fails due to error 1205 (result streaming) you have to retry all 4, if they're in the same transaction. I have no idea if SqlClient's connectionstring retry logic takes care of that (I doubt it).

Hence it's a nice idea, but to get full resilience you really have to perform the retry logic at the app level. (IMHO)

saurabh500 · 2020-04-13T19:15:41Z

You are right @FransBouma There is no retry around command execution in SqlClient.

What would a retry framework do in case IsTransient is set to false and there is no feature detection (i.e. no SupportsTransienceDetection property)? Will the retry framework fallback to the user configured retry strategy?

roji · 2020-06-09T11:25:40Z

@olle-j if you want to implement your custom logic on what you consider transient, how would such an interface be helpful to you? You're going to have to implement a transience detector in any case, outside of the driver, and implement your custom logic there. See EF Core's current transient exception detector for SQL Server for an example - the idea would be to move a lot of that logic (possibly even all of it) into SqlClient.

Error 208 is not an transient error. But I know that this error in combination with the message "Invalid object name 'sys.sysschobjs'" when using temporary tables is an transient error in Azure. It will never make its way up to the official transient implementation, but this is one of the reasons I need a custom detection strategy today.

If indeed this error is transient on Azure when using temporary tables, then I'd expect SqlClient to flag it as such...

If I could decorate an Exception with this interface I would be good to go.

Once I again, if you're implementing your custom strategy, then why would you need to decorate anything? Your strategy would contain the logic to identify whatever you consider as transient.

olle-j · 2020-06-09T13:03:06Z

The framework that handles the retry (any framework really, EF, Polly etc.) can handle IsTransient. It pushes the logic down in the stack. I can of cause make my own transient detector and handle all kinds of errors, but that should not be needed if I only can check IsTransient. Polly would as an example have one line of code, regardless if its created in the SqlClient or not.

I don't want a custom strategy. I only want to handle IsTransient. But its nice to be able to throw these by other frameworks than SqlClient. Like you say. It push the implementation down in the stack.

Does your suggestion take into account TransactionScope? That is where I have to implement custom detection like Polly today, since they span over multiple SqlConnections.

roji · 2020-06-09T17:47:27Z

@olle-j I'm still not really understanding what you're asking for - would you have us change other .NET BCL exceptions to implement the interface you're proposing? I don't see how that would make sense, since a SocketException, for example, isn't transient or non-transient on its own - that's up to the specific application to decide based on context. The reason DbException is different, is that it's an abstraction over multiple database drivers, and it's important for these drivers to pass information on transience through that exception.

I don't want a custom strategy. I only want to handle IsTransient.

Once again, I simply don't think that makes sense. Different exceptions are transient to different applications in different contexts.

Does your suggestion take into account TransactionScope? That is where I have to implement custom detection like Polly today, since they span over multiple SqlConnections.

What exact relationship do you see between an exception being transient and TransactionScope? Once again, the point is only to expose information on DbException which could be used by Polly (in addition to other logic). What Polly (or any other retry strategy) should do when a transient (or non-transient) exception is thrown is out of scope here, in my mind.

terrajobst · 2020-06-30T19:09:27Z

Video

Makes sense as proposed.

namespace System.Data.Common
{
    public partial class DbException
    {
        public virtual bool IsTransient => false;
    }
}

Closes dotnet#35601 Closes dotnet#34817

Closes #35601 Closes #34817

bgrainger · 2020-07-12T23:02:19Z

Now I have a list of transient error codes per database in my transient error recovery strategy classes and that's not very flexible (if error codes are added I have to maintain this list plus I have to look up these codes for all databases supported).

@FransBouma Would you be willing to share the list of MySQL error codes you consider to be transient for MySQL? I think MySqlConnector could adopt them, instead of coming up with a different list (that you'd have to augment because you couldn't trust that IsTransient is sufficient for your needs).

mysql-net/MySqlConnector#849

FransBouma · 2020-07-13T08:39:24Z

@bgrainger
I don't have such a list, we only ship a transient error recovery class for SQL Server with boilerplate code for users if they want to add support for their database for that.
Basically any error that doesn't kill the transaction / connection is usable for transient error recovery, however to figure out which ones that are is a painful process with some databases.

roji · 2020-07-13T08:54:30Z

Basically any error that doesn't kill the transaction / connection is usable for transient error recovery [...]

I tend to be a bit more conservative on this, e.g. a unique constraint violation isn't something I'd consider transient, since the exact same operation would in principle produce the same failure (even if the same operation but with a different ID would succeed). The way I'm thinking about this is for the development of automatic retry strategies (Polly, EF Core, LLBLGen Pro).

It's indeed a good idea for us to end up with the same concept of transience across providers (that's the whole point of the feature), so let's continue having this conversation.

FransBouma · 2020-07-13T08:58:02Z

Doesn't a UC violation terminate the transaction? For SQL Server we use these:

switch(Convert.ToInt32(errorNumber))
{
	// Time out. 
	case -2:
	// DBNETLIB Error Code: 20
	// The instance of SQL Server you attempted to connect to does not support encryption.
	case 20:
	// SQL Error Code: 64
	// A connection was successfully established with the server, but then an error occurred during the login process. 
	// (provider: TCP Provider, error: 0 - The specified network name is no longer available.) 
	case 64:
	// SQL Error Code: 233
	// The client was unable to establish a connection because of an error during connection initialization process before login. 
	// Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy 
	// to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. 
	// (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
	case 233:
	// SQL Error Code: 1205
	// Dead lock situation. 
	case 1205:
	// SQL Error Code: 10053
	// A transport-level error has occurred when receiving results from the server.
	// An established connection was aborted by the software in your host machine.
	case 10053:
	// SQL Error Code: 10054
	// A transport-level error has occurred when sending the request to the server. 
	// (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
	case 10054:
	// SQL Error Code: 10060
	// A network-related or instance-specific error occurred while establishing a connection to SQL Server. 
	// The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server 
	// is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed 
	// because the connected party did not properly respond after a period of time, or established connection failed 
	// because connected host has failed to respond.)"}
	case 10060:
	// SQL Error Code: 10928
	// Resource ID: %d. The %s limit for the database is %d and has been reached. For more information,
	case 10928:
	// SQL Error Code: 10929
	// Resource ID: %d. The %s minimum guarantee is %d, maximum limit is %d and the current usage for the database is %d.
	// However, the server is currently too busy to support requests greater than %d for this database.
	case 10929:
	// SQL Error Code: 40197
	// The service has encountered an error processing your request. Please try again.
	case 40197:
	// SQL Error Code: 40143
	// The service has encountered an error processing your request. Please try again.
	case 40143:
	// SQL Error Code: 40501
	// The service is currently busy. Retry the request after 10 seconds.
	case 40501:
	// SQL Error Code: 40613
	// Database XXXX on server YYYY is not currently available. Please retry the connection later. If the problem persists, contact customer 
	// support, and provide them the session tracing ID of ZZZZZ.
	case 40613:
		return true;

so basically everything related to creating a connection and keeping it alive. SQL Server documentation is pretty good on this, but I found other database documentation pretty vague on which errors do and which don't terminate a transaction for instance.

roji · 2020-07-13T10:32:41Z

The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can't do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I'm not sure that IsTransient should mean the same thing as "does it terminate the transaction"... I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I'd expect an ORM/retry policy to check this flag, and if it's true, roll back the ongoing transaction (if any) and re-attempt.

It's true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn't seem terribly important to me as transient exceptions aren't supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

KalleOlaviNiemitalo · 2020-07-13T10:48:16Z

Our application also needs to know whether a different command on a new database connection could succeed, or the entire database or server is down. The application currently recognizes error codes for this purpose but we could perhaps make it instead check which method threw the exception.

roji · 2020-07-13T13:32:34Z

@KalleOlaviNiemitalo detecting that the database/server is down is notoriously difficult (e.g. a network issue could indicate the server is down (or not)), am curious what you're current logic for this is...

But either way this seems to be orthogonal (and so out of scope) for this feature, which is only about identifying transient errors - or do you see things otherwise?

KalleOlaviNiemitalo · 2020-07-13T13:51:16Z

@roji Our application does not need to distinguish between a network issue, the server being down, and the database being down. It needs to know that those errors do not depend on the database commands that it was running in the transaction.

I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers.

FransBouma · 2020-07-13T14:23:37Z

The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can't do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I'm not sure that IsTransient should mean the same thing as "does it terminate the transaction"... I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I'd expect an ORM/retry policy to check this flag, and if it's true, roll back the ongoing transaction (if any) and re-attempt.

It's true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn't seem terribly important to me as transient exceptions aren't supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

I think what you describe is very sensible and in practice it likely leads to redoing the whole transaction again anyway, but I think the key difference with e.g. a UC violation in the middle of a transaction is that the error is outside of the scope of the data, i.e. there's nothing to 'correct' on the client / app side; the only thing it can do is retry the whole unit of work.

An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data.

Would that help determining when the IsTransient flag should be set or not?

roji · 2020-07-13T16:37:04Z

@ KalleOlaviNiemitalo

I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers.

Yeah, IsTransient unfortunately won't solve everything requiring applications to look at provider-specific error numbers. Note that there's SqlState which is also being added to DbException (#35601), which could also help (e.g. for recognizing unique constraint violations across databases).

@FransBouma

An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data.

There could conceivably be some variation here across products... If retrying the UoW means retrying the exact same statements (with the same ID values), then yeah, the same error should be reproduced; that's why I think that UC/FK violations shouldn't be treated as transient. But it's possible that for some ORM, retrying would also mean generating new IDs, in which case the UC/FK violations would no longer occur.

I do think that in the general case data issues shouldn't be treated as transient - am interested in any other views though.

roji added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 10, 2020

roji self-assigned this Apr 10, 2020

Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Apr 10, 2020

roji mentioned this issue Apr 10, 2020

SqlErrorCode Enumeration dotnet/SqlClient#518

Closed

roji added area-System.Data and removed area-Meta labels Apr 10, 2020

ajcvickers added untriaged New issue has not been triaged by the area owner and removed untriaged New issue has not been triaged by the area owner labels Jun 22, 2020

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review labels Jun 30, 2020

rgarrison12345 mentioned this issue Jul 7, 2020

Add DbException.IsTransient property #38837

Closed

roji added a commit to roji/runtime that referenced this issue Jul 12, 2020

Add SqlState and IsTransient to DbException

4de88bd

Closes dotnet#35601 Closes dotnet#34817

roji mentioned this issue Jul 12, 2020

Add SqlState and IsTransient to DbException #39157

Merged

roji closed this as completed in #39157 Jul 12, 2020

roji added a commit that referenced this issue Jul 12, 2020

Add SqlState and IsTransient to DbException (#39157)

1b8d1dc

Closes #35601 Closes #34817

This was referenced Jul 12, 2020

.NET 5.0 | Implement SqlException.IsTransient dotnet/SqlClient#649

Open

Implement SqliteException.IsTransient dotnet/efcore#21598

Closed

Implement MySqlException.IsTransient mysql-net/MySqlConnector#849

Closed

roji mentioned this issue Jul 13, 2020

Provide a RetryingExecutionStrategy that relies on the new DbException.IsTransient dotnet/efcore#21606

Open

UsaRandom mentioned this issue Jul 28, 2020

Remove direct dependency on Microsoft.Data.SqlClient quartznet/quartznet#903

Closed

DavoudEshtehari mentioned this issue Sep 25, 2020

Configurable Retry Logic - Preview 1 dotnet/SqlClient#693

Merged

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

Database-agnostic way to detect transient database errors #34817

Database-agnostic way to detect transient database errors #34817

Comments

roji commented Apr 10, 2020 • edited Loading

Provider tracking issues

Summary

API Proposal

Notes

Resources and examples

Old notes on SupportsTransienceDetection (removed from proposal)

Edit history

FransBouma commented Apr 10, 2020

roji commented Apr 10, 2020

FransBouma commented Apr 10, 2020

roji commented Apr 10, 2020

saurabh500 commented Apr 10, 2020

roji commented Apr 10, 2020 • edited Loading

saurabh500 commented Apr 10, 2020

saurabh500 commented Apr 10, 2020 • edited Loading

saurabh500 commented Apr 10, 2020

roji commented Apr 10, 2020

rgarrison12345 commented Apr 11, 2020 • edited Loading

roji commented Apr 11, 2020

roji commented Apr 12, 2020

FransBouma commented Apr 12, 2020

roji commented Apr 12, 2020

saurabh500 commented Apr 12, 2020

roji commented Apr 12, 2020

saurabh500 commented Apr 12, 2020

roji commented Apr 12, 2020

David-Engel commented Apr 13, 2020

roji commented Apr 13, 2020

FransBouma commented Apr 13, 2020

saurabh500 commented Apr 13, 2020

roji commented Jun 9, 2020

olle-j commented Jun 9, 2020 • edited Loading

roji commented Jun 9, 2020 • edited Loading

terrajobst commented Jun 30, 2020 • edited Loading

bgrainger commented Jul 12, 2020

FransBouma commented Jul 13, 2020

roji commented Jul 13, 2020

FransBouma commented Jul 13, 2020

roji commented Jul 13, 2020

KalleOlaviNiemitalo commented Jul 13, 2020

roji commented Jul 13, 2020

KalleOlaviNiemitalo commented Jul 13, 2020

FransBouma commented Jul 13, 2020

roji commented Jul 13, 2020

roji commented Apr 10, 2020 •

edited

Loading

roji commented Apr 10, 2020 •

edited

Loading

saurabh500 commented Apr 10, 2020 •

edited

Loading

rgarrison12345 commented Apr 11, 2020 •

edited

Loading

olle-j commented Jun 9, 2020 •

edited

Loading

roji commented Jun 9, 2020 •

edited

Loading

terrajobst commented Jun 30, 2020 •

edited

Loading