-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Database-agnostic way to detect transient database errors #34817
Comments
Is If IsTransient is false, whether that's because it's not a transient error or because the driver/ado.net provider doesn't support this, the net end result is the same: the recovery strategy is bypassed and the exception is rethrown as a normal exception. Perhaps I'm overlooking a use case tho :) |
You're right that it's not terribly useful (i.e. "Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case."). The main case I had in mind was to be able to surface that information to a user, or possibly to fail at startup if transience isn't supported by a provider. I'm not sure how relevant that really is, if people think it's useless we can also drop it. |
Ah now I understand it. :) Not sure how that information can be used in a practical way however: a recovery strategy of some sort has to be implemented regardless: a re-entrant aware system that can deal with retrying the commands in a given form, and if that strategy is built around 'IsTransient' it will simply bubble up all exceptions (as they're never 'transient') which is effectively the same as having code checking |
Yeah, I wouldn't really expect anyone to switch on I'm definitely open to removing this - let's see what other people think. |
While working on SqlClient, I realized that the classification of Transient errors is a combination of recommendations offered by the database and the nature of the customer workload and the customer's own understanding of their infrastructure running the applications and databases. A driver can classify an error as transient, if it is recommended. And such recommendations are usually made for a very small subset of errors. At least this is what my observation is, while interacting with SqlClient customers. For many other errors, it is the application that decides whether it should be transient or not. Consider SqlConnection.Open() in SqlClient, it retries on a bunch of errors while connecting to Azure SQL DB. But those errors are returned by the server. However there are many cases in which customers want to retry on networking specific errors, which may be caused by broken connections or caused by failure in TCP paths. Similar arguments go for Command Execution where the customers want to retry on some errors and fail fast on some errors. E.g. Deadlock is a problem that some customer applications want to retry on, because in their workload it is transient and for some customer applications, deadlocks need investigation. Basically the known transient errors in SqlClient are much fewer and this proposal makes more sense with a mechanism via which the customers can tell us if they want a particular error to be treated as transient and retried on, vs errors that we fail fast with. Otherwise the client's may not be authoritatively able to classify errors as transient and in case of client considering errors as transient may be undesirable. One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors. I wonder if the above situations apply for other drivers, they probably might. An API ecosystem where customers have the ability to specify what errors should be treated transient would be lot more powerful. |
Thanks for these thoughts @saurabh500.
I think there may be a bit of confusion here between the transient retrying implemented inside of SqlClient, and external retrying implemented outside of any specific driver, and which would leverage the proposed IsTransient property. Anything happening inside SqlClient is specific to that driver, and although you could add an API to make its retry mechanism more flexible, I don't think that would be applicable for a database-agnostic approach. Npgsql (as well as most other drivers AFAIK) doesn't consider retrying to be within its scope of responsibility and doesn't contain any logic internally for doing so - NpgsqlConnection.Open (almost) always immediately throws if any I/O error occurs. The user is expected to handle retrying according to their needs (typically using something like Polly), which obviates any need for an API to tell the driver what to retry and what not to retry. The proposed IsTransient property is meant to help with writing such an external retrying strategy. It definitely makes sense that different customers may have different needs around retrying, and around which errors should be classified as transient. From a database-agnostic perspective, doing anything here would require some sort of database-agnostic classification/categorization system: networking errors, transaction deadlock errors, etc. If such a classification existed, driver users would be able to react differently to types of errors in a database-agnostic way. However, I'm skeptical of whether such a system can work; as I wrote in dotnet/SqlClient#518 (comment), database errors and error codes in general seem very divergent across databases, and I'd be very wary for us to go into the business of standardizing database error categories for the purpose of transience. (As a aside note, the standard SQLSTATE is precisely an attempt to categorize database errors in a database-agnostic way (although SQLSTATE isn't really concerned with transience as a first-level concept). As I wrote in dotnet/SqlClient#518 (comment), I'm not aware of this being implemented in SQL Server, for example.) Some further thoughts:
Let me know what you think or if I've misunderstood your points. |
I am not asking for ADO.Net base classes to be database aware or to categorize errors. That would not be desirable and may not serve all databases and their ways of categorizing errors. For a driver to say whether DbException.IsTransient should be set true or false is prior knowledge. So a driver knows that there are errors that are transient since those errors are documented. It is only for those errors that the driver can definitely set IsTransient to true. For all other errors, there should be a way for the consumer of ado.net driver to add to the list of transient errors. If the customer can feed in any more errors codes that they want to consider transient, then the driver can set the IsTransient property to true for those additional errors along with the errors it knows about and any transient error handling framework like Polly can simply retry since the information is plumbed into the exception from the driver. Does this make sense? |
I had a chat with Shay: The fundamental problem is that every driver / database has errors which need to be interpreted differently, they may be in different formats, an error code may be overloaded with lot more information than just being a code. Hence providing a DB agnostic way of getting these errors from the ADO.Net layer, may not satisfy all providers. The recommendation would be to have the driver expose a mechanism (say via Connection String) to take in a list of errors that the driver user wants to consider transient. Then the driver can simply make use of DbException.IsTransient to express that the error is to be considered transient. Basically the mechanism of ingestion of the custom error will be dependent on the driver and ADO.Net DbException will provide a way for the driver to express if the error is transient or not. |
@roji, I hope this captures the discussion. Please add to it if I missed anything. |
Yeah, I think that makes sense. I admit that for Npgsql (where IsTransient already exists) I haven't received requests to tweak its meaning. However, if for a particular driver it makes sense to expose a configuration interface (via connection strings or via a special API), which affects what the driver returns via InTransient, then it's by all means free to do so (again, this would be specific to that driver since errors aren't agnostic). I'll just remark that in that case, instead of configuring the driver to modify its IsTransient values, a user could just implement their logic wherever the IsTransient is being consulted (since things like Polly are almost always configurable). That is, say I want to treat error code X as transient, although by default the driver doesn't treat it as such; I could modify my external retry strategy (which is under my direct control) to recognize both IsTransient and code X. But it may indeed still make sense to influence IsTransient in the driver in some scenarios. (so all good) |
Thanks everyone for the conversation on this. I've been reading through and this is the first time a suggestion of mine has made it to an API proposal. What is the next step for this to take this from a proposal to an actionable item? This looks to appear to be adding a property to DbException with a default implementation so this shouldn't be a breaking change. I wouldn't think this would require any heavy testing because this is just a property on the abstract class. Assuming the conversation regarding this API proposal is for the most part agreed upon. |
@rgarrison12345 let's give this a bit more time to let other people comment and post their views; it's always good to have a consensus from multiple people involved in .NET database programming and drivers. Having said that, this indeed is a small non-breaking change, which I think we'll be able to get in to .NET 5.0. |
A couple more thoughts on in-driver and external resilience from the conversation with @saurabh500 (this doesn't have a direct bearing on this issue):
|
It's new to me that SQLClient does the retrying itself on Azure, is this behavior introduced in Microsoft.Data.SqlClient? As I had to deliberately implement retry strategies for azure for System.Data.SqlClient. |
I think the relevant info is in https://docs.microsoft.com/en-us/azure/sql-database/sql-database-connectivity-issues#net-sqlconnection-parameters-for-connection-retry, but @saurabh500, @David-Engel and @cheenamalhotra can provide more info. |
@FransBouma The behavior was introduced in System.Data.SqlClient.SqlConnection.Open() in 4.6.1 and has been carried over to Microsoft.Data.SqlClient. The link above from @roji provides the connection string parameters that can be used to control this feature. @roji However the various driver teams try to keep the feature offerings in different drivers the same, so that all languages and platforms get the SQL Server/Azure SQL DB related benefits and they are equally easy to use in all the cases. E.g. If ODBC driver offers Always Encrypted feature, then all other drivers would, or if ADO.Net offers Azure Active Directory auth, then all other drivers would eventually try to offer the feature. I am not sure why you think this is a burden? Knowing that all drivers on all platforms where SQL Server drivers are available, are trying to offer the same features is quite nice.
I agree that drivers shouldn't go to the extent of offering what Polly or other transient fault handling frameworks offer, I do believe that offering some kind of default strategy to allow for common failure use cases in drivers is a great way of not bringing in another dependency in the application framework to make sure that a |
Don't get me wrong, given that Azure SQL DB works this way, I think it does make sense for SqlClient (and other drivers) to implement this, otherwise it seems that it would be unusable with Azure on its own. However, what is basically happening is that Azure SQL DB is passing a mandatory resiliency handling burden onto its clients (i.e. onto drivers), where it would have been better to it to simple handle them internally, exposing a reliable service. There may be technical architectural reasons why things have to be this way, but I'm not aware of other cloud data services where things work this way (granted, I don't know a huge deal about this). Let's put it this way - setting aside Azure SQL DB for a moment (where apparently resilience is particularly problematic), would it make sense to have in-driver retrying for regular, on-premise where networking issues are likely to be much more rare? I'm asking this because I've never received requests for such a feature for Npgsql, and I'm not aware of other drivers doing this apart from SqlClient (I may simply be unaware though). One last reservation I have about this, is that in-driver resiliency can only go so far. For example, as per the documentation, if a command has started executing and fails, no retry attempt will happen. This is reasonable behavior to expect from the driver (because re-executing commands can be tricky, considering double execution, transactions...), but if transient failures really are common then users must deal with this as well. In other words, driver resiliency coverage seems like it would necessarily be limited, no? In any case, this discussion is quite academic. I do think SqlClient is doing things right given the Azure SQL DB resiliency situation, it's just that in an ideal world Azure SQL DB would be more reliable and not shift this burden to the driver, that's all. And as you pointed out, users can out-in (or out) of in-driver retrying, in any case. |
I am curious as to where this discussion is going. What are you planning to achieve from this discussion about retries? I know about SqlClient and I am only trying to tell you what problems we have faced in the past, my recommendation about how such a common theme can be incorporated into ADO.Net programming. |
Sorry @saurabh500, that's not at all what I wanted to express... I really think SqlClient is doing the right thing here, and am not criticizing any of it. I also admit most of the above isn't very relevant in a concrete way to this proposal (which is just about adding IsTransient to DbException). The above comes only from my surprise in learning that Azure SQL DB passes the burden of resilience checks to the client (i.e. driver), that's all - I'm simply not used to this from a cloud data service. Please accept my apologies if any of the above cast your work in a negative light - that was really the last thing I wanted to do. Your comments helped me personally understand the landscape and customer needs better. I guess at this point I should stop posting general thoughts that don't make this issue move forward (unless you feel there's additional value in continuing this conversation). Sorry once again. |
@roji SqlClient is the most feature-rich in terms of this type of functionality, but yes, other drivers do implement resilience logic. ODBC has automatic connection open retry logic, for example and automatic reconnection of idle connections which have been disconnected by network devices. Saurabh touched on what I believe is the biggest use case for these features: existing/legacy applications which were built on the assumption of local resources where "blips" rarely occurred (whether that's a network hiccup, fail over event, provisioning delay, etc) and where it is burdensome to make code changes. Enabling those to be migrated to cloud databases without suffering from those cloud "blips" has very high value to customers. |
Thanks @David-Engel, makes complete sense. Let me know if you guys think that the IsTransient property proposed here makes sense as-is (as well as SupportsTransienceDetection). |
Transient errors and retries aren't only connection open / reconnect oriented, i.e. if you have issued 3 commands and the 4th fails due to error 1205 (result streaming) you have to retry all 4, if they're in the same transaction. I have no idea if SqlClient's connectionstring retry logic takes care of that (I doubt it). Hence it's a nice idea, but to get full resilience you really have to perform the retry logic at the app level. (IMHO) |
You are right @FransBouma There is no retry around command execution in SqlClient. What would a retry framework do in case |
@olle-j if you want to implement your custom logic on what you consider transient, how would such an interface be helpful to you? You're going to have to implement a transience detector in any case, outside of the driver, and implement your custom logic there. See EF Core's current transient exception detector for SQL Server for an example - the idea would be to move a lot of that logic (possibly even all of it) into SqlClient.
If indeed this error is transient on Azure when using temporary tables, then I'd expect SqlClient to flag it as such...
Once I again, if you're implementing your custom strategy, then why would you need to decorate anything? Your strategy would contain the logic to identify whatever you consider as transient. |
The framework that handles the retry (any framework really, EF, Polly etc.) can handle IsTransient. It pushes the logic down in the stack. I can of cause make my own transient detector and handle all kinds of errors, but that should not be needed if I only can check IsTransient. Polly would as an example have one line of code, regardless if its created in the SqlClient or not. I don't want a custom strategy. I only want to handle IsTransient. But its nice to be able to throw these by other frameworks than SqlClient. Like you say. It push the implementation down in the stack. Does your suggestion take into account TransactionScope? That is where I have to implement custom detection like Polly today, since they span over multiple SqlConnections. |
@olle-j I'm still not really understanding what you're asking for - would you have us change other .NET BCL exceptions to implement the interface you're proposing? I don't see how that would make sense, since a SocketException, for example, isn't transient or non-transient on its own - that's up to the specific application to decide based on context. The reason DbException is different, is that it's an abstraction over multiple database drivers, and it's important for these drivers to pass information on transience through that exception.
Once again, I simply don't think that makes sense. Different exceptions are transient to different applications in different contexts.
What exact relationship do you see between an exception being transient and TransactionScope? Once again, the point is only to expose information on DbException which could be used by Polly (in addition to other logic). What Polly (or any other retry strategy) should do when a transient (or non-transient) exception is thrown is out of scope here, in my mind. |
namespace System.Data.Common
{
public partial class DbException
{
public virtual bool IsTransient => false;
}
} |
@FransBouma Would you be willing to share the list of MySQL error codes you consider to be transient for MySQL? I think MySqlConnector could adopt them, instead of coming up with a different list (that you'd have to augment because you couldn't trust that |
@bgrainger |
I tend to be a bit more conservative on this, e.g. a unique constraint violation isn't something I'd consider transient, since the exact same operation would in principle produce the same failure (even if the same operation but with a different ID would succeed). The way I'm thinking about this is for the development of automatic retry strategies (Polly, EF Core, LLBLGen Pro). It's indeed a good idea for us to end up with the same concept of transience across providers (that's the whole point of the feature), so let's continue having this conversation. |
Doesn't a UC violation terminate the transaction? For SQL Server we use these:
so basically everything related to creating a connection and keeping it alive. SQL Server documentation is pretty good on this, but I found other database documentation pretty vague on which errors do and which don't terminate a transaction for instance. |
The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can't do anything aside from rolling the transaction back (or rolling back to a savepoint). But I'm not sure that IsTransient should mean the same thing as "does it terminate the transaction"... I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I'd expect an ORM/retry policy to check this flag, and if it's true, roll back the ongoing transaction (if any) and re-attempt. It's true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn't seem terribly important to me as transient exceptions aren't supposed to be that common for it to matter (am I wrong?). Does this make sense to everyone? Am I missing something? |
Our application also needs to know whether a different command on a new database connection could succeed, or the entire database or server is down. The application currently recognizes error codes for this purpose but we could perhaps make it instead check which method threw the exception. |
@KalleOlaviNiemitalo detecting that the database/server is down is notoriously difficult (e.g. a network issue could indicate the server is down (or not)), am curious what you're current logic for this is... But either way this seems to be orthogonal (and so out of scope) for this feature, which is only about identifying transient errors - or do you see things otherwise? |
@roji Our application does not need to distinguish between a network issue, the server being down, and the database being down. It needs to know that those errors do not depend on the database commands that it was running in the transaction. I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers. |
I think what you describe is very sensible and in practice it likely leads to redoing the whole transaction again anyway, but I think the key difference with e.g. a UC violation in the middle of a transaction is that the error is outside of the scope of the data, i.e. there's nothing to 'correct' on the client / app side; the only thing it can do is retry the whole unit of work. An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data. Would that help determining when the IsTransient flag should be set or not? |
@ KalleOlaviNiemitalo
Yeah, IsTransient unfortunately won't solve everything requiring applications to look at provider-specific error numbers. Note that there's SqlState which is also being added to DbException (#35601), which could also help (e.g. for recognizing unique constraint violations across databases).
There could conceivably be some variation here across products... If retrying the UoW means retrying the exact same statements (with the same ID values), then yeah, the same error should be reproduced; that's why I think that UC/FK violations shouldn't be treated as transient. But it's possible that for some ORM, retrying would also mean generating new IDs, in which case the UC/FK violations would no longer occur. I do think that in the general case data issues shouldn't be treated as transient - am interested in any other views though. |
Provider tracking issues
This new API has been merged, following are tracking issues for implementation in the different providers:
Summary
A recurring theme for database programming has been building automated retry policies, but System.Data doesn't currently allow database drivers to expose whether a database error can be expected to be transient or not. This leads to database-specific logic for detecting transient exceptions being built again and again outside of the driver.
API Proposal
Notes
Resources and examples
Old notes on SupportsTransienceDetection (removed from proposal)
/cc @David-Engel @cheenamalhotra @bgrainger @bricelam @ajcvickers @stephentoub @terrajobst @mgravell @FransBouma
Edit history
The text was updated successfully, but these errors were encountered: