
Add retry to eks cluster delete #23366

Merged — merged 4 commits into hashicorp:main from eddiezane:ez/add-retry-to-eks-cluster-delete on Mar 1, 2022
Conversation

eddiezane (Contributor) commented Feb 24, 2022

EKS delete cluster calls can fail due to an in-progress cluster upgrade operation (customer-initiated, or automated scaling by EKS) or a service outage.

This PR helps mitigate delete cluster failures by adding a retry with backoff.
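
As a rough illustration of the approach (a sketch, not the exact diff in this PR; the helper name deleteClusterWithRetry is hypothetical, while the SDK calls themselves are real):

// Sketch assuming aws-sdk-go v1 and terraform-plugin-sdk v2.
package eks

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/eks"
	"github.com/hashicorp/aws-sdk-go-base/tfawserr"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// deleteClusterWithRetry (hypothetical name, for illustration) wraps the
// DeleteCluster call in the SDK's retry helper so an in-progress cluster
// operation does not immediately fail the destroy.
func deleteClusterWithRetry(ctx context.Context, conn *eks.EKS, name string, timeout time.Duration) error {
	input := &eks.DeleteClusterInput{Name: aws.String(name)}

	// RetryContext re-invokes the function with backoff until it returns nil,
	// a non-retryable error, or the timeout elapses.
	return resource.RetryContext(ctx, timeout, func() *resource.RetryError {
		_, err := conn.DeleteClusterWithContext(ctx, input)

		// An in-progress upgrade or scaling operation surfaces as
		// ResourceInUseException; keep retrying until it clears.
		if tfawserr.ErrCodeEquals(err, eks.ErrCodeResourceInUseException) {
			return resource.RetryableError(err)
		}
		if err != nil {
			return resource.NonRetryableError(err)
		}
		return nil
	})
}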

The suite is currently running; I will update this when it finishes. The one failure below looks like a flake (an unrelated subnet CIDR conflict).

Output from acceptance testing:
$  make testacc TESTS=TestAccEKSCluster PKG=eks AWS_PROFILE=burner ACCTEST_PARALLELISM=5 ACCTEST_TIMEOUT=5000m

==> Checking that code complies with gofmt requirements...                                                                                                                                     
TF_ACC=1 go test ./internal/service/eks/... -v -count 1 -parallel 5 -run='TestAccEKSCluster'  -timeout 5000m
=== RUN   TestAccEKSClusterAuthDataSource_basic
=== PAUSE TestAccEKSClusterAuthDataSource_basic
=== RUN   TestAccEKSClusterDataSource_basic
=== PAUSE TestAccEKSClusterDataSource_basic
=== RUN   TestAccEKSCluster_basic
=== PAUSE TestAccEKSCluster_basic
=== RUN   TestAccEKSCluster_disappears
=== PAUSE TestAccEKSCluster_disappears
=== RUN   TestAccEKSCluster_Encryption_create
=== PAUSE TestAccEKSCluster_Encryption_create
=== RUN   TestAccEKSCluster_Encryption_update
=== PAUSE TestAccEKSCluster_Encryption_update
=== RUN   TestAccEKSCluster_Encryption_versionUpdate
=== PAUSE TestAccEKSCluster_Encryption_versionUpdate
=== RUN   TestAccEKSCluster_version
=== PAUSE TestAccEKSCluster_version
=== RUN   TestAccEKSCluster_logging
=== PAUSE TestAccEKSCluster_logging
=== RUN   TestAccEKSCluster_tags
=== PAUSE TestAccEKSCluster_tags
=== RUN   TestAccEKSCluster_VPC_securityGroupIDs
=== PAUSE TestAccEKSCluster_VPC_securityGroupIDs
=== RUN   TestAccEKSCluster_VPC_endpointPrivateAccess
=== PAUSE TestAccEKSCluster_VPC_endpointPrivateAccess
=== RUN   TestAccEKSCluster_VPC_endpointPublicAccess
=== PAUSE TestAccEKSCluster_VPC_endpointPublicAccess
=== RUN   TestAccEKSCluster_VPC_publicAccessCIDRs
=== PAUSE TestAccEKSCluster_VPC_publicAccessCIDRs
=== RUN   TestAccEKSCluster_Network_serviceIPv4CIDR
=== PAUSE TestAccEKSCluster_Network_serviceIPv4CIDR
=== RUN   TestAccEKSCluster_Network_ipFamily
=== PAUSE TestAccEKSCluster_Network_ipFamily
=== RUN   TestAccEKSClustersDataSource_basic
=== PAUSE TestAccEKSClustersDataSource_basic
=== CONT  TestAccEKSClusterAuthDataSource_basic
=== CONT  TestAccEKSCluster_tags
=== CONT  TestAccEKSClustersDataSource_basic
=== CONT  TestAccEKSCluster_VPC_publicAccessCIDRs
=== CONT  TestAccEKSCluster_Encryption_update
--- PASS: TestAccEKSClusterAuthDataSource_basic (5.64s)
=== CONT  TestAccEKSCluster_VPC_endpointPublicAccess
--- PASS: TestAccEKSClustersDataSource_basic (805.65s)
=== CONT  TestAccEKSCluster_VPC_endpointPrivateAccess
--- PASS: TestAccEKSCluster_tags (855.32s)
=== CONT  TestAccEKSCluster_VPC_securityGroupIDs
--- PASS: TestAccEKSCluster_VPC_publicAccessCIDRs (1150.25s)
=== CONT  TestAccEKSCluster_Network_ipFamily
--- PASS: TestAccEKSCluster_VPC_securityGroupIDs (881.04s)
=== CONT  TestAccEKSCluster_version
--- PASS: TestAccEKSCluster_VPC_endpointPublicAccess (1834.17s)
=== CONT  TestAccEKSCluster_logging
--- PASS: TestAccEKSCluster_Network_ipFamily (1465.34s)
=== CONT  TestAccEKSCluster_disappears
--- PASS: TestAccEKSCluster_logging (941.18s)
=== CONT  TestAccEKSCluster_Encryption_create
--- PASS: TestAccEKSCluster_VPC_endpointPrivateAccess (2414.62s)
=== CONT  TestAccEKSCluster_Encryption_versionUpdate
--- PASS: TestAccEKSCluster_disappears (799.48s)
=== CONT  TestAccEKSCluster_basic
--- PASS: TestAccEKSCluster_Encryption_create (839.34s)
=== CONT  TestAccEKSCluster_Network_serviceIPv4CIDR
    cluster_test.go:495: Step 6/9 error: Error running apply: exit status 1

        Error: error associating EC2 Subnet (subnet-0f7909314a41e5e17) IPv6 CIDR block (2600:1f13:894:aa00::/64): InvalidSubnet.Conflict: The CIDR '2600:1f13:894:aa00::/64' conflicts with another subnet
                status code: 400, request id: 1917bc0e-92cc-4e93-833d-deb32cf0d515

          with aws_subnet.test[0],
          on terraform_plugin_test.tf line 48, in resource "aws_subnet" "test":
          48: resource "aws_subnet" "test" {

--- FAIL: TestAccEKSCluster_Network_serviceIPv4CIDR (133.53s)
=== CONT  TestAccEKSClusterDataSource_basic
--- PASS: TestAccEKSCluster_Encryption_update (3925.55s)
--- PASS: TestAccEKSCluster_basic (779.18s)
--- PASS: TestAccEKSCluster_version (2541.70s)
--- PASS: TestAccEKSClusterDataSource_basic (838.88s)
--- PASS: TestAccEKSCluster_Encryption_versionUpdate (2773.93s)
FAIL
FAIL    github.com/hashicorp/terraform-provider-aws/internal/service/eks        5994.229s
FAIL
make: *** [GNUmakefile:36: testacc] Error 1

I tested this manually by creating an EKS cluster using https://github.com/hashicorp/learn-terraform-provision-eks-cluster. I deployed the sample as-is, then updated the config to use Kubernetes 1.21 and built a copy of the provider with waitClusterUpdateSuccessful commented out. I applied the change to start the cluster upgrade, then ran the destroy and observed that the delete retries successfully while the upgrade is in progress.

Output
$ terraform destroy                                                                                              
╷
│ Warning: Provider development overrides are in effect
│
│ The following provider development overrides are set in the CLI configuration:
│  - hashicorp/aws in /tmp/GOPATH/bin
│
│ The behavior may therefore not match any released version of the provider and applying changes may cause the state to become incompatible with published releases.
╵
random_string.suffix: Refreshing state... [id=cOG3qldS]
module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-0e2c849d939ff6725]
module.eks.aws_iam_policy.cluster_elb_sl_role_creation[0]: Refreshing state... [id=a..
module.eks.aws_iam_role.workers[0]: Destruction complete after 1s
module.eks.aws_eks_cluster.this[0]: Destroying... [id=education-eks-cOG3qldS]
aws_security_group.worker_group_mgmt_one: Destruction complete after 1s
aws_security_group.worker_group_mgmt_two: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 1s
module.eks.aws_security_group_rule.workers_ingress_self[0]: Destruction complete after 2s
...
...
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 10s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 20s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 30s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 40s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 50s elapsed]
...
...
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m20s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m30s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m40s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 26m50s elapsed]
module.eks.aws_eks_cluster.this[0]: Still destroying... [id=education-eks-cOG3qldS, 27m0s elapsed]
module.eks.aws_eks_cluster.this[0]: Destruction complete after 27m5s
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182825030700000007]
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSVPCResourceControllerPolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182824921900000005]
module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy[0]: Destroying... [id=education-eks-cOG3qldS20220224182823855600000003-20220224182824922100000006]
...
...
random_string.suffix: Destroying... [id=cOG3qldS]
random_string.suffix: Destruction complete after 0s

Destroy complete! Resources: 53 destroyed.

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for pull request followers and do not help prioritize the request.

Signed-off-by: Eddie Zaneski <eddiezane@gmail.com>
github-actions bot added the service/eks, needs-triage, and size/S labels on Feb 24, 2022
github-actions bot left a comment

Welcome @eddiezane 👋

It looks like this is your first Pull Request submission to the Terraform AWS Provider! If you haven't already done so, please check out our CONTRIBUTING guide and FAQ to make sure your contribution adheres to best practice and has all the necessary elements in place for a successful approval.

Also take a look at our FAQ, which details how we prioritize Pull Requests for inclusion.

Thanks again, and welcome to the community! 😃

anGie44 added the enhancement label and removed the needs-triage label on Feb 24, 2022
anGie44 self-assigned this on Feb 24, 2022
anGie44 (Contributor) left a comment

Hi @eddiezane, thank you for this PR! Overall this looks like a great workaround to include in this resource. I've pushed up the changelog entry, since the change enhances the current delete functionality, and will refactor to use tfresource.RetryWhen since we only need to retry on ErrCodeResourceInUseException.
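
(For illustration, a retry-when style helper might look like the sketch below; the signature is assumed, not the provider's actual tfresource.RetryWhen API.)

package tfresource

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// retryWhen is a hypothetical sketch built on resource.RetryContext: it
// retries f with the SDK's backoff while shouldRetry reports the error as
// transient.
func retryWhen(ctx context.Context, timeout time.Duration, f func() error, shouldRetry func(error) bool) error {
	return resource.RetryContext(ctx, timeout, func() *resource.RetryError {
		if err := f(); err != nil {
			// For this resource, shouldRetry would match only
			// eks.ErrCodeResourceInUseException.
			if shouldRetry(err) {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
}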

eddiezane (Contributor, Author)

Thanks @anGie44!

I went with RetryConfigContext because it didn't seem like RetryWhen applied any backoff. I just took another look, and it seems the wait will never exceed 10 seconds. I don't know the EKS rate limits, but if that cap is intentional then it LGTM.

https://github.com/hashicorp/terraform-plugin-sdk/blob/45133e6e2aebbe0aca05427cbcd360f968979e98/helper/resource/state.go#L182-L198
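
To make that concrete, here is a small standalone approximation of the linked wait logic (not the SDK's actual code): the delay doubles per attempt but is clamped at 10 seconds.

package main

import (
	"fmt"
	"time"
)

func main() {
	wait := 100 * time.Millisecond // illustrative starting delay
	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, wait)
		wait *= 2
		if wait > 10*time.Second {
			wait = 10 * time.Second // the 10-second cap discussed above
		}
	}
}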

anGie44 (Contributor) commented Feb 28, 2022

Oh, that's a great point @eddiezane. I overlooked that, and it doesn't hurt to use the RetryConfig in case of possible rate-limit-exceeded throttling exceptions.

github-actions bot added the linter label on Feb 28, 2022
anGie44 modified the milestones: v4.3.0 → v4.4.0 on Mar 1, 2022
anGie44 merged commit d44680f into hashicorp:main on Mar 1, 2022
eddiezane deleted the ez/add-retry-to-eks-cluster-delete branch on March 1, 2022 03:30
github-actions bot commented Mar 4, 2022

This functionality has been released in v4.4.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

github-actions bot left a comment

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on May 12, 2022