At 10:32 PM (Oct 10, 2018) a spike of exceptions during an auto-scaling event triggered an alert. We began investigating and discovered that new servers were failing to become “healthy” in our autoscaling groups. As this issue was isolated to new machines and did not appear to be affecting customers, we proceeded immediately to roll out a fix.
We were alerted by a customer at 10:46 PM (14 minutes after the first alert) that they were experiencing delays in jobs being assigned to agents. We checked our background queues and noticed a spike in new builds that exceeded our current processing capacity, and since new machines couldn’t be added to the fleet, we were under-provisioned to handle the change in load.
At 10:52 PM (6 minutes after being alerted by the customer) we reported the incident on StatusPage. We deployed a fix to our autoscaling group, and the system returned to normal at 11:07 PM.
Delays in jobs being assigned to agents, as well as delayed external notifications of build results.
As new machines are added to our autoscaling group, they “self-deploy” the latest version of the application. They do this by downloading the latest copy of the source code using “git fetch”, and then running “git checkout” to switch to the version the rest of the fleet is running. Once this process completes, they are marked as “healthy”, added to the group, and start serving requests.
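The fetch-and-checkout flow above can be sketched as a small script. This is only an illustration, not our actual deploy code: the throwaway local “origin” repository stands in for our real git remote, and the target SHA would really come from whatever version the rest of the fleet is running.

```shell
set -e  # any git failure aborts, and the machine is never marked "healthy"
work=$(mktemp -d)

# Stand-in for the upstream repository (the real remote is elsewhere).
git init -q "$work/origin"
git -C "$work/origin" -c user.email=a@b -c user.name=ci commit -q --allow-empty -m "v1"
target_sha=$(git -C "$work/origin" rev-parse HEAD)  # "the version the fleet runs"

# Stand-in for a new machine self-deploying on boot.
git clone -q "$work/origin" "$work/machine"
git -C "$work/machine" fetch -q origin            # download the latest source
git -C "$work/machine" checkout -q "$target_sha"  # pin to the fleet's version

# Only after every step succeeds would the machine report itself healthy.
echo "healthy at $target_sha"
```

Because the whole sequence runs under `set -e`, a failure in any git step leaves the machine stuck before the final “healthy” signal, which is exactly the failure mode in this incident.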
In the latest version of the application, there was a reference to a git submodule commit that no longer existed. This caused the git commands during the self-deploy process to fail, so the machines never transitioned to “healthy” and were not added to the group. As a result, we were under-provisioned to handle any change in workload.
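The failure mode can be reproduced in miniature. The sketch below (illustrative names; the submodule path “lib” and the dangling SHA are made up) records a submodule pointer to a commit that doesn’t exist in the submodule’s history, then shows that “git submodule update” fails, just as it did during self-deploy:

```shell
set -e
work=$(mktemp -d)
g() { git -c user.email=a@b -c user.name=ci "$@"; }

# A real submodule repo whose history does NOT contain the referenced commit.
g init -q "$work/lib"
g -C "$work/lib" commit -q --allow-empty -m "lib v1"

# Parent repo: record a gitlink to a commit that exists nowhere.
# (gitlinks are just SHAs; git lets you commit one without the object existing)
g init -q "$work/app"
cd "$work/app"
missing_sha=0123456789012345678901234567890123456789
printf '[submodule "lib"]\n\tpath = lib\n\turl = %s\n' "$work/lib" > .gitmodules
g update-index --add --cacheinfo 160000,"$missing_sha",lib
g add .gitmodules
g commit -q -m "reference a dangling submodule commit"

# The self-deploy step now fails, exactly like the incident:
if git -c protocol.file.allow=always submodule update --init lib 2>/dev/null; then
  echo "unexpected success"
else
  echo "submodule update failed: $missing_sha not found"
fi
```

Git happily commits the dangling pointer in the parent repo, because submodule commits are never validated against the submodule’s actual history; the breakage only surfaces later, on whichever machine next tries to check the reference out.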
We fixed the git submodule reference and pushed out a new copy of our application to all servers.
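The shape of that fix is to repoint the submodule at a commit that actually exists upstream and commit the updated reference. The sketch below is self-contained and illustrative (the “lib” submodule name is made up, and a throwaway local repo stands in for the real remote), not our actual repair commands:

```shell
set -e
work=$(mktemp -d)
g() { git -c user.email=a@b -c user.name=ci -c protocol.file.allow=always "$@"; }

# Minimal parent repo with a working submodule, standing in for our app repo.
g init -q "$work/lib"
g -C "$work/lib" commit -q --allow-empty -m "lib v1"
g init -q "$work/app"
cd "$work/app"
g submodule --quiet add "$work/lib" lib
g commit -q -m "add submodule"

# The fix: check out a commit that exists on the submodule's remote,
# then commit the updated gitlink so every machine can resolve it.
git -C lib fetch -q origin
git -C lib checkout -q origin/HEAD   # any valid, published commit
g add lib
g commit -q --allow-empty -m "Fix dangling submodule reference"
```

Once the parent repo’s gitlink points at a published commit again, the self-deploy process can complete and new machines transition to “healthy”.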
Investigate alternative approaches to our self-deploy process, and accelerate our plans to move to artifact-based deployments.