Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource/aws_autoscaling_group: add instance_refresh block #13791

Merged

Conversation

roberth-k
Copy link
Contributor

@roberth-k roberth-k commented Jun 17, 2020

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Closes #13785

Release note for CHANGELOG:

ENHANCEMENTS:

* resource/aws_autoscaling_group: add `instance_refresh` block

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccAWSAutoScalingGroup_'
--- PASS: TestAccAWSAutoScalingGroup_launchTempPartitionNum (57.76s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_LaunchTemplateSpecification_LaunchTemplateName (88.14s)
--- PASS: TestAccAWSAutoScalingGroup_serviceLinkedRoleARN (96.66s)
--- PASS: TestAccAWSAutoScalingGroup_classicVpcZoneIdentifier (104.83s)
--- PASS: TestAccAWSAutoScalingGroup_MaxInstanceLifetime (116.14s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_LaunchTemplateSpecification_Version (116.58s)
--- PASS: TestAccAWSAutoScalingGroup_withMetrics (117.32s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_UpdateToZeroOnDemandBaseCapacity (120.95s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_Override_InstanceType (123.72s)
--- PASS: TestAccAWSAutoScalingGroup_WithLoadBalancer_ToTargetGroup (379.68s)
--- PASS: TestAccAWSAutoScalingGroup_WithLoadBalancer (425.49s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotMaxPrice (141.26s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotInstancePools (86.65s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotAllocationStrategy (57.79s)
--- PASS: TestAccAWSAutoScalingGroup_VpcUpdates (119.12s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_Override_WeightedCapacity (174.51s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandPercentageAboveBaseCapacity (88.72s)
--- PASS: TestAccAWSAutoScalingGroup_emptyAvailabilityZones (95.30s)
--- PASS: TestAccAWSAutoScalingGroup_autoGeneratedName (58.22s)
--- PASS: TestAccAWSAutoScalingGroup_LaunchTemplate_IAMInstanceProfile (59.49s)
--- PASS: TestAccAWSAutoScalingGroup_enablingMetrics (210.73s)
--- PASS: TestAccAWSAutoScalingGroup_ALB_TargetGroups (232.50s)
--- PASS: TestAccAWSAutoScalingGroup_launchTemplate (89.84s)
--- PASS: TestAccAWSAutoScalingGroup_suspendingProcesses (244.34s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy (53.03s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandAllocationStrategy (90.53s)
--- PASS: TestAccAWSAutoScalingGroup_initialLifecycleHook (269.89s)
--- PASS: TestAccAWSAutoScalingGroup_namePrefix (89.34s)
--- PASS: TestAccAWSAutoScalingGroup_basic (277.90s)
--- PASS: TestAccAWSAutoScalingGroup_terminationPolicies (162.21s)
--- PASS: TestAccAWSAutoScalingGroup_TargetGroupArns (296.46s)
--- PASS: TestAccAWSAutoScalingGroup_withPlacementGroup (180.03s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandBaseCapacity (116.18s)
--- PASS: TestAccAWSAutoScalingGroup_launchTemplate_update (185.73s)
--- PASS: TestAccAWSAutoScalingGroup_ALB_TargetGroups_ELBCapacity (348.21s)
--- PASS: TestAccAWSAutoScalingGroup_tags (276.82s)
--- PASS: TestAccAWSAutoScalingGroup_InstanceRefresh_Enabled (452.57s)
--- PASS: TestAccAWSAutoScalingGroup_InstanceRefresh_Triggers (477.33s)
--- PASS: TestAccAWSAutoScalingGroup_LoadBalancers (786.57s)

The following block will be added to aws_autoscaling_group:

resource "aws_autoscaling_group" "default" {
  instance_refresh {
    instance_warmup_seconds = 300
    min_healthy_percentage = 90
    strategy = "Rolling"
  }
}

If the block is present, start an ASG Instance Refresh with the given parameters when the ASG is updated. The refresh is started only when a property is changed that would need a rollout to take effect (e.g. launch_template). If there is already an active refresh during an update, it is cancelled.

The instance refresh is "fire and forget": the resource does not wait for the refresh to complete. This comes partly down to the difficulty of choosing a sensible timeout (highly dependent on ASG size, warmup settings, healthy percentages), and partly down to difficulties related to the representation of a rollout in Terraform state. Since the ASG update and Instance Refresh are separate operations, the resource might initially fail the apply, but then report a misleading success on the second run. Workarounds would require an opinionated solution, such as using a separate resource so that the identity of an Instance Refresh could be tracked individually.

@ghost ghost added size/M Managed by automation to categorize the size of a PR. service/autoscaling Issues and PRs that pertain to the autoscaling service. needs-triage Waiting for first response or review from a maintainer. size/L Managed by automation to categorize the size of a PR. and removed size/M Managed by automation to categorize the size of a PR. labels Jun 17, 2020
@Vivalldi
Copy link

Should wait_for_completion be true by default?

I'm in favor of having the wait enabled by default. However, I don't think we need the wait_for_completion flag as we already have wait_for_completion_timeout. If we follow the behavior of wait_for_capacity_timeout we should be good. Wait for the specified time in wait_for_completion_timeout. If it's set to "0" don't wait.

@roberth-k
Copy link
Contributor Author

roberth-k commented Jun 18, 2020

@Vivalldi Consistency with the conventions of the other timeout field settles it.

I'm currently looking at the following remaining issues:

  • Handling repeat updates. At present: if the first refresh fails, a subsequent update of the resource will succeed even though the refresh hasn't happened. This is misleading, if not dangerous.
  • Handling conflicting updates. Only one Instance Refresh can be active at a time. The resource might want to start a refresh while another one is still in progress. Should the resource fail or wait? (Interestingly: the Instance Refresh appears to capture the latest updates to an ASG even when they have been made mid-refresh)
  • When the refresh is cancelled after a timeout, should the resource wait for the cancellation to complete?

With the above plus acceptance tests, it will be a few more days before this is ready.

Edit: waiting and timeouts have been dropped

@roberth-k roberth-k force-pushed the f-aws_autoscaling_group-instance_refresh branch from 8e84b3a to 12119c2 Compare June 20, 2020 20:51
@ghost ghost added documentation Introduces or discusses updates to documentation. size/XL Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. and removed size/L Managed by automation to categorize the size of a PR. labels Jun 20, 2020
@roberth-k roberth-k changed the title [WIP] aws_autoscaling_group: add instance_refresh block resource/aws_autoscaling_group: add instance_refresh block Jun 20, 2020
@roberth-k roberth-k marked this pull request as ready for review June 20, 2020 21:11
@roberth-k roberth-k requested a review from a team June 20, 2020 21:11
@Vivalldi
Copy link

PR looks good to me. Appreciate all the work you've put in.

Out of sheer curiosy, what's the reasoning behind dropping waiting and timeouts?

@roberth-k
Copy link
Contributor Author

@Vivalldi I couldn't find a way of managing failures and timeouts such that the outcome is obvious to the user. Not unsolvable, but it would have been a distraction in this PR.

The link between an ASG's state and a particular Instance Refresh is weak, so realistically the provider would need a new resource in order to be able to declaratively start and track the outcome of a single refresh. An update to the ASG would cause ForceNew by way of a null_resource-esque trigger map. To do this inside of aws_autoscaling_group would have required refactoring it using partial state.

@flosell
Copy link
Contributor

flosell commented Jun 26, 2020

On waiting for completion: I think that's an extremely valuable feature, given that (at least as far as I can tell) one of the main use-cases for instance_refresh would be managing deployments. Essentially, what we are saying with instance_refresh is "ensure my autoscaling group and all the instances attached to it are configured as specified in the code". This operation is only complete once the instance refresh is done.

I don't have a strong opinion on adding this in a separate PR though (might even help creating one) as long as this PR doesn't make it hard to do so.

Would be curious @roberth-k what troubles you ran into, to me a behaviour like @Vivalldi mentioned, using wait_for_completion sounds reasonable. I'm thinking the DescribeInstanceRefreshes might help get around the misleading success on second run.

@roberth-k
Copy link
Contributor Author

@flosell

Two main complications:

  1. Terraform won't store the state of aws_autoscaling_group when the refresh fails (unless the resource is refactored using partial state).
  2. DescribeInstanceRefreshes doesn't offer enough information to reliably say which ASG update a refresh pertains to.

Hence if the first terraform apply fails (e.g. instances failed to start, or maybe the timeout was too short), the resource doesn't maintain the ID of the refresh it triggered. On the second run, terraform apply notices that the resource diff is unchanged (the ASG itself was updated after all), and gives the green light even though the ASG may have been left in an inconsistent state.

(Note: CloudFormation doesn't have this issue, as it would roll back the entire ASG on failure)

This can happen more often than it seems, as there's no one size fits all default to the timeout. A large ASG with a long warmup time could take ages to refresh.

Read at diff time might notice that there is an ongoing (or failed) refresh, but wouldn't be able to tell whether it was the previous terraform apply, or perhaps whether an external agent had triggered it for its own reasons. DescribeInstanceRefreshes only provides the Instance Refresh ID and its start and end times, but DescribeAutoScalingGroups doesn't provide an UpdatedTime.

A separate aws_autoscaling_instance_refresh resource can work around these issues, as it would be able to track the lifecycle of a specific instance refresh separately from the ASG.

@roberth-k
Copy link
Contributor Author

Superseded by #14442.

@roberth-k roberth-k closed this Jul 31, 2020
@ghost
Copy link

ghost commented Aug 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Aug 31, 2020
@gdavison
Copy link
Contributor

gdavison commented Dec 9, 2020

Hi @roberth-k. Thanks for both of your PRs for Instance Refresh. We've decided to base the Instance Refresh feature implementation on this PR, not #14442.

The Instance Refresh operation is triggered as part of the changes made to the resource, so it fits better as a configuration of the resource itself. A Refresh can also happen concurrently with a scaling operation, so we don't want to wait for scaling to complete before triggering the Refresh.

@gdavison gdavison reopened this Dec 9, 2020
@gdavison gdavison merged commit ebfcec2 into hashicorp:master Dec 18, 2020
@breathingdust breathingdust removed the needs-triage Waiting for first response or review from a maintainer. label Sep 17, 2021
@roberth-k roberth-k deleted the f-aws_autoscaling_group-instance_refresh branch January 16, 2022 20:23
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Introduces or discusses updates to documentation. service/autoscaling Issues and PRs that pertain to the autoscaling service. size/XL Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ASG Instance Refresh
5 participants