Troubleshooting Instances in a Scaling Group


I came across a question asking for two different ways to safely perform maintenance on an instance in an Auto Scaling group. I knew about putting an instance into Standby mode, but I wasn't aware that individual scaling processes could also be suspended directly.

Standby Mode

In an Auto Scaling Group (ASG), you may need to perform maintenance on an instance or troubleshoot it. Either activity can degrade the instance or the application running on it. The standard approach is to put the instance into Standby mode via the console or CLI.

Standby Lifecycle

An instance that goes into Standby mode is not removed from the Auto Scaling Group, but by default the desired capacity of the group is decremented. This prevents the ASG from launching a replacement.

If the instance is registered with a load balancer, its connections are drained (if connection draining is configured), after which the instance is deregistered from the load balancer so that it no longer serves traffic.

At this point, it is possible to troubleshoot or perform maintenance on the instance. 

Exiting Standby mode increments the desired capacity back to its original value and registers the instance with the load balancer again.
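The Standby lifecycle described above could be scripted roughly as follows. This is only a minimal sketch, assuming a boto3 Auto Scaling client whose enter_standby and exit_standby calls correspond to the CLI commands; the helper name standby and the group and instance identifiers in the usage comment are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def standby(client, group_name, instance_id):
    """Hold one instance in Standby for the duration of a `with` block.

    `client` is assumed to be a boto3 Auto Scaling client, e.g.
    boto3.client("autoscaling"). The helper name is illustrative only.
    """
    # Decrement desired capacity so the group does not launch a replacement.
    client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=True,
    )
    try:
        yield instance_id
    finally:
        # Return the instance to service even if the maintenance step fails.
        client.exit_standby(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
        )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   with standby(boto3.client("autoscaling"), "my-asg", "i-0123456789abcdef0"):
#       ...  # troubleshoot or patch the instance here
```

The sketch does not wait for the instance to finish entering Standby before yielding, so a real script would probably want to poll the instance's lifecycle state before starting maintenance.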


Suspending Processes

An ASG runs a set of processes (for example Launch, Terminate, HealthCheck, and ReplaceUnhealthy), and each of them can be suspended individually via the CLI or console.

The ReplaceUnhealthy process is responsible for terminating unhealthy instances and replacing them. Suspending this process allows us to perform maintenance without worrying about the instance getting replaced. If a load balancer is attached, it still runs its own health checks and stops routing traffic to an instance that fails them.
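Suspending and later resuming ReplaceUnhealthy might look something like the sketch below. It assumes a boto3 Auto Scaling client and its suspend_processes and resume_processes calls; the helper name replace_unhealthy_suspended and the group name in the usage comment are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def replace_unhealthy_suspended(client, group_name):
    """Suspend ReplaceUnhealthy on a group for the duration of a `with` block.

    `client` is assumed to be a boto3 Auto Scaling client, e.g.
    boto3.client("autoscaling"). The helper name is illustrative only.
    """
    processes = ["ReplaceUnhealthy"]
    client.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=processes,
    )
    try:
        yield
    finally:
        # Resume the process even if the maintenance step raises.
        client.resume_processes(
            AutoScalingGroupName=group_name,
            ScalingProcesses=processes,
        )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   with replace_unhealthy_suspended(boto3.client("autoscaling"), "my-asg"):
#       ...  # perform maintenance on the instance
```

One caveat with this sketch: it always resumes the process on exit, so if ReplaceUnhealthy was already suspended before the block started, it will end up resumed afterwards.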

I consider this method a little extreme, as the suspension affects the replacement of all unhealthy instances in the group, not just the one under maintenance. Nevertheless, it is still a valid answer to the question posed at the beginning of this post.
