
Add retry to eks cluster delete #23366

Merged — merged 4 commits into hashicorp:main from eddiezane:ez/add-retry-to-eks-cluster-delete on Mar 1, 2022
Conversation

eddiezane (Contributor) commented Feb 24, 2022

EKS delete cluster calls can fail due to an in-progress cluster upgrade operation (customer-initiated, or automated scaling by EKS) or a service outage.

This PR helps mitigate delete cluster failures by adding a retry with backoff.
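
As a rough illustration of the approach (a sketch, not the exact diff in this PR; the helper name deleteClusterWithRetry is hypothetical, while the SDK calls themselves are real):

// Sketch assuming aws-sdk-go v1 and terraform-plugin-sdk v2.
package eks

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/eks"
	"github.com/hashicorp/aws-sdk-go-base/tfawserr"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// deleteClusterWithRetry (hypothetical name, for illustration) wraps the
// DeleteCluster call in the SDK's retry helper so an in-progress cluster
// operation does not immediately fail the destroy.
func deleteClusterWithRetry(ctx context.Context, conn *eks.EKS, name string, timeout time.Duration) error {
	input := &eks.DeleteClusterInput{Name: aws.String(name)}

	// RetryContext re-invokes the function with backoff until it returns nil,
	// a non-retryable error, or the timeout elapses.
	return resource.RetryContext(ctx, timeout, func() *resource.RetryError {
		_, err := conn.DeleteClusterWithContext(ctx, input)

		// An in-progress upgrade or scaling operation surfaces as
		// ResourceInUseException; keep retrying until it clears.
		if tfawserr.ErrCodeEquals(err, eks.ErrCodeResourceInUseException) {
			return resource.RetryableError(err)
		}
		if err != nil {
			return resource.NonRetryableError(err)
		}
		return nil
	})
}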

The suite is currently running; I will update this when it finishes. The one failure below looks like a flake (an unrelated subnet CIDR conflict).

Output from acceptance testing:
$  make testacc TESTS=TestAccEKSCluster PKG=eks AWS_PROFILE=burner ACCTEST_PARALLELISM=5 ACCTEST_TIMEOUT=5000m

==> Checking that code complies with gofmt requirements...                                                                                                                                     
TF_ACC=1 go test ./internal/service/eks/... -v -count 1 -parallel 5 -run='TestAccEKSCluster'  -timeout 5000m
=== RUN   TestAccEKSClusterAuthDataSource_basic
=== PAUSE TestAccEKSClusterAuthDataSource_basic
=== RUN   TestAccEKSClusterDataSource_basic
=== PAUSE TestAccEKSClusterDataSource_basic
=== RUN   TestAccEKSCluster_basic
=== PAUSE TestAccEKSCluster_basic
=== RUN   TestAccEKSCluster_disappears
=== PAUSE TestAccEKSCluster_disappears
=== RUN   TestAccEKSCluster_Encryption_create
=== PAUSE TestAccEKSCluster_Encryption_create
=== RUN   TestAccEKSCluster_Encryption_update
=== PAUSE TestAccEKSCluster_Encryption_update
=== RUN   TestAccEKSCluster_Encryption_versionUpdate
=== PAUSE TestAccEKSCluster_Encryption_versionUpdate
=== RUN   TestAccEKSCluster_version
=== PAUSE TestAccEKSCluster_version
=== RUN   TestAccEKSCluster_logging
=== PAUSE TestAccEKSCluster_logging
=== RUN   TestAccEKSCluster_tags
=== PAUSE TestAccEKSCluster_tags
=== RUN   TestAccEKSCluster_VPC_securityGroupIDs
=== PAUSE TestAccEKSCluster_VPC_securityGroupIDs
=== RUN   TestAccEKSCluster_VPC_endpointPrivateAccess
=== PAUSE TestAccEKSCluster_VPC_endpointPrivateAccess
=== RUN   TestAccEKSCluster_VPC_endpointPublicAccess
=== PAUSE TestAccEKSCluster_VPC_endpointPublicAccess
=== RUN   TestAccEKSCluster_VPC_publicAccessCIDRs
=== PAUSE TestAccEKSCluster_VPC_publicAccessCIDRs
=== RUN   TestAccEKSCluster_Network_serviceIPv4CIDR
=== PAUSE TestAccEKSCluster_Network_serviceIPv4CIDR
=== RUN   TestAccEKSCluster_Network_ipFamily
=== PAUSE TestAccEKSCluster_Network_ipFamily
=== RUN   TestAccEKSClustersDataSource_basic
=== PAUSE TestAccEKSClustersDataSource_basic
=== CONT  TestAccEKSClusterAuthDataSource_basic
=== CONT  TestAccEKSCluster_tags
=== CONT  TestAccEKSClustersDataSource_basic
=== CONT  TestAccEKSCluster_VPC_publicAccessCIDRs
=== CONT  TestAccEKSCluster_Encryption_update
--- PASS: TestAccEKSClusterAuthDataSource_basic (5.64s)
=== CONT  TestAccEKSCluster_VPC_endpointPublicAccess
--- PASS: TestAccEKSClustersDataSource_basic (805.65s)
=== CONT  TestAccEKSCluster_VPC_endpointPrivateAccess
--- PASS: TestAccEKSCluster_tags (855.32s)
=== CONT  TestAccEKSCluster_VPC_securityGroupIDs
--- PASS: TestAccEKSCluster_VPC_publicAccessCIDRs (1150.25s)
=== CONT  TestAccEKSCluster_Network_ipFamily
--- PASS: TestAccEKSCluster_VPC_securityGroupIDs (881.04s)
=== CONT  TestAccEKSCluster_version
--- PASS: TestAccEKSCluster_VPC_endpointPublicAccess (1834.17s)
=== CONT  TestAccEKSCluster_logging
--- PASS: TestAccEKSCluster_Network_ipFamily (1465.34s)
=== CONT  TestAccEKSCluster_disappears
--- PASS: TestAccEKSCluster_logging (941.18s)
=== CONT  TestAccEKSCluster_Encryption_create
--- PASS: TestAccEKSCluster_VPC_endpointPrivateAccess (2414.62s)
=== CONT  TestAccEKSCluster_Encryption_versionUpdate
--- PASS: TestAccEKSCluster_disappears (799.48s)
=== CONT  TestAccEKSCluster_basic
--- PASS: TestAccEKSCluster_Encryption_create (839.34s)
=== CONT  TestAccEKSCluster_Network_serviceIPv4CIDR
    cluster_test.go:495: Step 6/9 error: Error running apply: exit status 1

        Error: error associating EC2 Subnet (subnet-0f7909314a41e5e17) IPv6 CIDR block (2600:1f13:894:aa00::/64): InvalidSubnet.Conflict: The CIDR '2600:1f13:894:aa00::/64' conflicts with another subnet
                status code: 400, request id: 1917bc0e-92cc-4e93-833d-deb32cf0d515

          with aws_subnet.test[0],
          on terraform_plugin_test.tf line 48, in resource "aws_subnet" "test":
          48: resource "aws_subnet" "test" {

--- FAIL: TestAccEKSCluster_Network_serviceIPv4CIDR (133.53s)
=== CONT  TestAccEKSClusterDataSource_basic
--- PASS: TestAccEKSCluster_Encryption_update (3925.55s)
--- PASS: TestAccEKSCluster_basic (779.18s)
--- PASS: TestAccEKSCluster_version (2541.70s)
--- PASS: TestAccEKSClusterDataSource_basic (838.88s)
--- PASS: TestAccEKSCluster_Encryption_versionUpdate (2773.93s)
FAIL
FAIL    github.com/hashicorp/terraform-provider-aws/internal/service/eks        5994.229s
FAIL
make: *** [GNUmakefile:36: testacc] Error 1

I tested this manually by creating an EKS cluster using https://github.com/hashicorp/learn-terraform-provision-eks-cluster. I deployed the sample as-is, then updated the config to use Kubernetes 1.21 and built a copy of the provider with waitClusterUpdateSuccessful commented out. I applied the change to start the cluster upgrade, then ran the destroy and observed that the delete retries successfully while the upgrade is in progress.

Output
$ terraform destroy                                                                                              
╷
│ Warning: Provider development overrides are in effect
│
│ The following provider development overrides are set in the CLI configuration:
│  - hashicorp/aws in /tmp/GOPATH/bin
│
│ The behavior may therefore not match any released version of the provider and applying changes may cause the state to become incompatible with published releases.
╵
random_string.suffix: Refreshing state... [id=cOG3qldS]
module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-0e2c849d939ff6725]
module.eks.aws_iam_policy.cluster_elb_sl_role_creation[0]: Refreshing state... [id=a..
module.eks.aws_iam_role.workers[0]: Destruction complete after 1s
module.eks.aws_eks_cluster.this[0]: Destroying... [id=education-eks-cOG3qldS]
aws_security_group.worker_group_mgmt_one: Destruction complete after 1s
aws_security_group.worker_group_mgmt_two: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_ingress_self[0]: Destruction complete after 2s
...
...
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 10s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 20s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 30s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 40s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 50s elapsed]
...
...
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m20s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m30s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m40s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m50s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 27m0s elapsed]
module.eks.aws_eks_cluster.this[0]: Destruction complete after 27m5s
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182825030700000007]
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSVPCResourceControllerPolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182824921900000005]
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182824922100000006]
...
...
random_string.suffix: Destroying... [id=cOG3qldS]
random_string.suffix: Destruction complete after 0s

Destroy complete! Resources: 53 destroyed.

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for pull request followers and do not help prioritize the request.

Signed-off-by: Eddie Zaneski <eddiezane@gmail.com>
github-actions bot added the service/eks, needs-triage, and size/S labels on Feb 24, 2022
github-actions bot left a comment

Welcome @eddiezane 👋

It looks like this is your first Pull Request submission to the Terraform AWS Provider! If you haven't already done so, please check out our CONTRIBUTING guide and FAQ to make sure your contribution adheres to best practice and has all the necessary elements in place for a successful approval.

Also take a look at our FAQ, which details how we prioritize Pull Requests for inclusion.

Thanks again, and welcome to the community! 😃

anGie44 added the enhancement label and removed the needs-triage label on Feb 24, 2022
anGie44 self-assigned this on Feb 24, 2022
anGie44 (Contributor) left a comment

Hi @eddiezane, thank you for this PR! Overall this looks like a great workaround to include in this resource. I've pushed up the changelog entry, since the change enhances the current delete functionality, and will refactor to use tfresource.RetryWhen since we only need to retry on ErrCodeResourceInUseException.
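
(For illustration, a retry-when style helper might look like the sketch below; the signature is assumed, not the provider's actual tfresource.RetryWhen API.)

package tfresource

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// retryWhen is a hypothetical sketch built on resource.RetryContext: it
// retries f with the SDK's backoff while shouldRetry reports the error as
// transient.
func retryWhen(ctx context.Context, timeout time.Duration, f func() error, shouldRetry func(error) bool) error {
	return resource.RetryContext(ctx, timeout, func() *resource.RetryError {
		if err := f(); err != nil {
			// For this resource, shouldRetry would match only
			// eks.ErrCodeResourceInUseException.
			if shouldRetry(err) {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
}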

eddiezane (Contributor, Author)

Thanks @anGie44!

I went with RetryConfigContext because it didn't seem like RetryWhen applied any backoff. I just took another look, and it seems the wait will never exceed 10 seconds. I don't know the EKS rate limits, but if that cap is intentional then it LGTM.

https://github.com/hashicorp/terraform-plugin-sdk/blob/45133e6e2aebbe0aca05427cbcd360f968979e98/helper/resource/state.go#L182-L198
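
To make that concrete, here is a small standalone approximation of the linked wait logic (not the SDK's actual code): the delay doubles per attempt but is clamped at 10 seconds.

package main

import (
	"fmt"
	"time"
)

func main() {
	wait := 100 * time.Millisecond // illustrative starting delay
	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, wait)
		wait *= 2
		if wait > 10*time.Second {
			wait = 10 * time.Second // the 10-second cap discussed above
		}
	}
}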

anGie44 (Contributor) commented Feb 28, 2022

Oh, that's a great point @eddiezane. I overlooked that, and it doesn't hurt to use the RetryConfig in case of possible rate-limit-exceeded throttling exceptions.

github-actions bot added the linter label on Feb 28, 2022
anGie44 modified the milestones: v4.3.0 → v4.4.0 on Mar 1, 2022
anGie44 merged commit d44680f into hashicorp:main on Mar 1, 2022
eddiezane deleted the ez/add-retry-to-eks-cluster-delete branch on March 1, 2022 03:30
github-actions bot commented Mar 4, 2022

This functionality has been released in v4.4.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

github-actions bot left a comment

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on May 12, 2022