resource/aws_autoscaling_group: add instance_refresh block #13791

roberth-k · 2020-06-17T12:29:46Z

Community Note

Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Release note for CHANGELOG:

ENHANCEMENTS:

* resource/aws_autoscaling_group: add `instance_refresh` block

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccAWSAutoScalingGroup_'
--- PASS: TestAccAWSAutoScalingGroup_launchTempPartitionNum (57.76s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_LaunchTemplateSpecification_LaunchTemplateName (88.14s)
--- PASS: TestAccAWSAutoScalingGroup_serviceLinkedRoleARN (96.66s)
--- PASS: TestAccAWSAutoScalingGroup_classicVpcZoneIdentifier (104.83s)
--- PASS: TestAccAWSAutoScalingGroup_MaxInstanceLifetime (116.14s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_LaunchTemplateSpecification_Version (116.58s)
--- PASS: TestAccAWSAutoScalingGroup_withMetrics (117.32s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_UpdateToZeroOnDemandBaseCapacity (120.95s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_Override_InstanceType (123.72s)
--- PASS: TestAccAWSAutoScalingGroup_WithLoadBalancer_ToTargetGroup (379.68s)
--- PASS: TestAccAWSAutoScalingGroup_WithLoadBalancer (425.49s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotMaxPrice (141.26s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotInstancePools (86.65s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_SpotAllocationStrategy (57.79s)
--- PASS: TestAccAWSAutoScalingGroup_VpcUpdates (119.12s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_LaunchTemplate_Override_WeightedCapacity (174.51s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandPercentageAboveBaseCapacity (88.72s)
--- PASS: TestAccAWSAutoScalingGroup_emptyAvailabilityZones (95.30s)
--- PASS: TestAccAWSAutoScalingGroup_autoGeneratedName (58.22s)
--- PASS: TestAccAWSAutoScalingGroup_LaunchTemplate_IAMInstanceProfile (59.49s)
--- PASS: TestAccAWSAutoScalingGroup_enablingMetrics (210.73s)
--- PASS: TestAccAWSAutoScalingGroup_ALB_TargetGroups (232.50s)
--- PASS: TestAccAWSAutoScalingGroup_launchTemplate (89.84s)
--- PASS: TestAccAWSAutoScalingGroup_suspendingProcesses (244.34s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy (53.03s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandAllocationStrategy (90.53s)
--- PASS: TestAccAWSAutoScalingGroup_initialLifecycleHook (269.89s)
--- PASS: TestAccAWSAutoScalingGroup_namePrefix (89.34s)
--- PASS: TestAccAWSAutoScalingGroup_basic (277.90s)
--- PASS: TestAccAWSAutoScalingGroup_terminationPolicies (162.21s)
--- PASS: TestAccAWSAutoScalingGroup_TargetGroupArns (296.46s)
--- PASS: TestAccAWSAutoScalingGroup_withPlacementGroup (180.03s)
--- PASS: TestAccAWSAutoScalingGroup_MixedInstancesPolicy_InstancesDistribution_OnDemandBaseCapacity (116.18s)
--- PASS: TestAccAWSAutoScalingGroup_launchTemplate_update (185.73s)
--- PASS: TestAccAWSAutoScalingGroup_ALB_TargetGroups_ELBCapacity (348.21s)
--- PASS: TestAccAWSAutoScalingGroup_tags (276.82s)
--- PASS: TestAccAWSAutoScalingGroup_InstanceRefresh_Enabled (452.57s)
--- PASS: TestAccAWSAutoScalingGroup_InstanceRefresh_Triggers (477.33s)
--- PASS: TestAccAWSAutoScalingGroup_LoadBalancers (786.57s)

The following block will be added to aws_autoscaling_group:

resource "aws_autoscaling_group" "default" {
  instance_refresh {
    instance_warmup_seconds = 300
    min_healthy_percentage = 90
    strategy = "Rolling"
  }
}

If the block is present, start an ASG Instance Refresh with the given parameters when the ASG is updated. The refresh is started only when a property is changed that would need a rollout to take effect (e.g. launch_template). If there is already an active refresh during an update, it is cancelled.

The instance refresh is "fire and forget": the resource does not wait for the refresh to complete. This comes partly down to the difficulty of choosing a sensible timeout (highly dependent on ASG size, warmup settings, healthy percentages), and partly down to difficulties related to the representation of a rollout in Terraform state. Since the ASG update and Instance Refresh are separate operations, the resource might initially fail the apply, but then report a misleading success on the second run. Workarounds would require an opinionated solution, such as using a separate resource so that the identity of an Instance Refresh could be tracked individually.

Vivalldi · 2020-06-18T14:59:13Z

Should wait_for_completion be true by default?

I'm in favor of having the wait enabled by default. However, I don't think we need the wait_for_completion flag as we already have wait_for_completion_timeout. If we follow the behavior of wait_for_capacity_timeout we should be good. Wait for the specified time in wait_for_completion_timeout. If it's set to "0" don't wait.

roberth-k · 2020-06-18T20:07:40Z

@Vivalldi Consistency with the conventions of the other timeout field settles it.

I'm currently looking at the following remaining issues:

~~Handling repeat updates. At present: if the first refresh fails, a subsequent update of the resource will succeed even though the refresh hasn't happened. This is misleading, if not dangerous.~~
Handling conflicting updates. Only one Instance Refresh can be active at a time. The resource might want to start a refresh while another one is still in progress. Should the resource fail or wait? (Interestingly: the Instance Refresh appears to capture the latest updates to an ASG even when they have been made mid-refresh)
~~When the refresh is cancelled after a timeout, should the resource wait for the cancellation to complete?~~

With the above plus acceptance tests, it will be a few more days before this is ready.

Edit: waiting and timeouts have been dropped

Vivalldi · 2020-06-20T23:42:23Z

PR looks good to me. Appreciate all the work you've put in.

Out of sheer curiosy, what's the reasoning behind dropping waiting and timeouts?

roberth-k · 2020-06-21T00:22:21Z

@Vivalldi I couldn't find a way of managing failures and timeouts such that the outcome is obvious to the user. Not unsolvable, but it would have been a distraction in this PR.

The link between an ASG's state and a particular Instance Refresh is weak, so realistically the provider would need a new resource in order to be able to declaratively start and track the outcome of a single refresh. An update to the ASG would cause ForceNew by way of a null_resource-esque trigger map. To do this inside of aws_autoscaling_group would have required refactoring it using partial state.

flosell · 2020-06-26T10:27:43Z

On waiting for completion: I think that's an extremely valuable feature, given that (at least as far as I can tell) one of the main use-cases for instance_refresh would be managing deployments. Essentially, what we are saying with instance_refresh is "ensure my autoscaling group and all the instances attached to it are configured as specified in the code". This operation is only complete once the instance refresh is done.

I don't have a strong opinion on adding this in a separate PR though (might even help creating one) as long as this PR doesn't make it hard to do so.

Would be curious @roberth-k what troubles you ran into, to me a behaviour like @Vivalldi mentioned, using wait_for_completion sounds reasonable. I'm thinking the DescribeInstanceRefreshes might help get around the misleading success on second run.

roberth-k · 2020-06-26T11:23:41Z

@flosell

Two main complications:

Terraform won't store the state of aws_autoscaling_group when the refresh fails (unless the resource is refactored using partial state).
DescribeInstanceRefreshes doesn't offer enough information to reliably say which ASG update a refresh pertains to.

Hence if the first terraform apply fails (e.g. instances failed to start, or maybe the timeout was too short), the resource doesn't maintain the ID of the refresh it triggered. On the second run, terraform apply notices that the resource diff is unchanged (the ASG itself was updated after all), and gives the green light even though the ASG may have been left in an inconsistent state.

(Note: CloudFormation doesn't have this issue, as it would roll back the entire ASG on failure)

This can happen more often than it seems, as there's no one size fits all default to the timeout. A large ASG with a long warmup time could take ages to refresh.

Read at diff time might notice that there is an ongoing (or failed) refresh, but wouldn't be able to tell whether it was the previous terraform apply, or perhaps whether an external agent had triggered it for its own reasons. DescribeInstanceRefreshes only provides the Instance Refresh ID and its start and end times, but DescribeAutoScalingGroups doesn't provide an UpdatedTime.

A separate aws_autoscaling_instance_refresh resource can work around these issues, as it would be able to track the lifecycle of a specific instance refresh separately from the ASG.

roberth-k · 2020-07-31T22:45:46Z

Superseded by #14442.

ghost · 2020-08-31T17:11:13Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

gdavison · 2020-12-09T20:18:39Z

Hi @roberth-k. Thanks for both of your PRs for Instance Refresh. We've decided to base the Instance Refresh feature implementation on this PR, not #14442.

The Instance Refresh operation is triggered as part of the changes made to the resource, so it fits better as a configuration of the resource itself. A Refresh can also happen concurrently with a scaling operation, so we don't want to wait for scaling to complete before triggering the Refresh.

roberth-k force-pushed the f-aws_autoscaling_group-instance_refresh branch from 8e84b3a to 12119c2 Compare June 20, 2020 20:51

roberth-k changed the title ~~[WIP] aws_autoscaling_group: add instance_refresh block~~ resource/aws_autoscaling_group: add instance_refresh block Jun 20, 2020

roberth-k marked this pull request as ready for review June 20, 2020 21:11

roberth-k requested a review from a team June 20, 2020 21:11

Vivalldi approved these changes Jun 21, 2020

View reviewed changes

dpiddockcmp mentioned this pull request Jul 3, 2020

Instance Refresh for EC2 Auto Scaling terraform-aws-modules/terraform-aws-eks#929

Closed

4 tasks

roberth-k added 2 commits July 7, 2020 09:26

resource/aws_autoscaling_group: add instance_refresh block

e4dc2a4

resource/aws_autoscaling_group: fix typos in test comments

908c553

roberth-k force-pushed the f-aws_autoscaling_group-instance_refresh branch from 12119c2 to 908c553 Compare July 7, 2020 08:26

roberth-k mentioned this pull request Jul 13, 2020

Support ASG Instance Refresh #13785

Closed

anGie44 force-pushed the master branch from 54db70d to 48ba2f7 Compare July 21, 2020 18:26

roberth-k mentioned this pull request Jul 31, 2020

New resource: aws_autoscaling_instance_refresh #14442

Closed

roberth-k closed this Jul 31, 2020

ghost locked and limited conversation to collaborators Aug 31, 2020

gdavison reopened this Dec 9, 2020

gdavison merged commit ebfcec2 into hashicorp:master Dec 18, 2020

breathingdust removed the needs-triage Waiting for first response or review from a maintainer. label Sep 17, 2021

roberth-k deleted the f-aws_autoscaling_group-instance_refresh branch January 16, 2022 20:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resource/aws_autoscaling_group: add instance_refresh block #13791

resource/aws_autoscaling_group: add instance_refresh block #13791

roberth-k commented Jun 17, 2020 •

edited

Loading

Vivalldi commented Jun 18, 2020

roberth-k commented Jun 18, 2020 •

edited

Loading

Vivalldi commented Jun 20, 2020

roberth-k commented Jun 21, 2020

flosell commented Jun 26, 2020

roberth-k commented Jun 26, 2020

roberth-k commented Jul 31, 2020

ghost commented Aug 31, 2020

gdavison commented Dec 9, 2020

resource/aws_autoscaling_group: add instance_refresh block #13791

resource/aws_autoscaling_group: add instance_refresh block #13791

Conversation

roberth-k commented Jun 17, 2020 • edited Loading

Community Note

Vivalldi commented Jun 18, 2020

roberth-k commented Jun 18, 2020 • edited Loading

Vivalldi commented Jun 20, 2020

roberth-k commented Jun 21, 2020

flosell commented Jun 26, 2020

roberth-k commented Jun 26, 2020

roberth-k commented Jul 31, 2020

ghost commented Aug 31, 2020

gdavison commented Dec 9, 2020

roberth-k commented Jun 17, 2020 •

edited

Loading

roberth-k commented Jun 18, 2020 •

edited

Loading