-
Notifications
You must be signed in to change notification settings - Fork 673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dial DNS lookup error not retried #2295
Comments
I'm working on a fix. I will create a PR when I can validate in prod that this fixes our issue. |
Hi @a-canya , Thanks for reaching out.
This error indicates that DNS resolution succeeds but not able to establish a connection. This is often for transient reasons. From an SDK design perspective we want to retry connection timeouts since they are often solved with a retry.
This one is a DNS lookup failure, which often indicates more persistent issues than mere connection timeouts. Hence its not retryable by default. The SDK tries to accommodate transient error by default, and not general networking errors. It aims to be an isolated layer on top of the standard go library for networking purposes and let customers configure their own networking stack errors (perhaps with a custom retry strategy implemented by the user) Im inclined to close this issue but if you have any clarification / case for why this should be changed, I'd be happy to discuss it with the team. Thanks, |
FYI @a-canya I've taken a look at your patch. I'm fine with taking the change to look at |
Hi, thanks for you quick answers!
I believe the eror is a connection timeout while doing a DNS lookup, which should be retryable. I moved the check outside the switch because just looking at net.DNSError we might not be able to determine if the error is retryable, and in this case it's better to keep trying other options than to just assume that it is not retryable. The log I got shows an error that wraps a long chain of errors of different types (apart from a net.DNSError, there is also a net.OpError). If
My teammate is trying to recreate the same situation where this error occurred with increased logs. |
OK I see that now. There is definitely a subtle difference between terminating at Your patch looks solid then if you'd like to PR it - I imagine we'd take it regardless of whether it actually fixes your issue, since you've made a real behavioral improvement regardless. |
Hi, thanks @lucix-aws ! I will open the PR. FYI the error happened again in our system and I managed to get more logs:
As suspected, |
Resolved by #2300 |
|
Describe the bug
Using the standard retryer (
retry.NewStandard()
), I have observed that the following error is retried:whereas this error is not retried:
Expected Behavior
Both errors are retried.
Current Behavior
Error with DNS lookup is not retried.
Reproduction Steps
I don't know how to reproduce the error. Happened in production a few times.
Possible Solution
In https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/retryable_error.go#L100, there is a cast to *net.DNSError which was introduced in #2083.
This has several issues.
IsTemporary
field is only guaranteed to be valid if true. There is also a method (Temporary()
) which also considers that the error might be a time out (which I think might be my case). For reference see net.DNSError. In any case, if false the code should fall through to the other checks.Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
v1.21.0
Compiler and Version used
go version go1.20.2 linux/amd64
Operating System and version
alpine:latest
The text was updated successfully, but these errors were encountered: