Delayed builds
Incident Report for Buildkite
Postmortem

Overview

At 10:32 PM (Oct 10, 2018) a spike of exceptions during an auto-scaling event triggered an alert. We began investigating and discovered that new servers were failing to become “healthy” in our autoscaling groups. As this issue was isolated to new machines and did not appear to be affecting customers, we proceeded immediately to roll out a fix.

We were alerted by a customer at 10:46 PM (14 minutes after the first alert) that they were experiencing delays in jobs being assigned to agents. We checked our background queues and noticed a spike in new builds that our current capacity was unable to process. Because new machines couldn’t be added to the fleet, we were under-provisioned to handle the change in load.

At 10:52 PM (6 minutes after being alerted by a customer) we reported it to StatusPage. We deployed a fix to our autoscaling group and the system returned to normal at 11:07 PM.

What was affected

Jobs were delayed in being assigned to agents, as were external notifications of build results.

Root cause

As new machines are added to our autoscaling group, they “self-deploy” the latest version of the application. They do this by downloading the latest copy of the source code using “git fetch”, and then running “git checkout” to switch to the version the rest of the fleet is running. Once this process completes, they are marked as “healthy”, added to the group, and start serving requests.
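The self-deploy sequence described above can be sketched roughly as follows. This is a minimal illustration, not our actual deploy script: the function names, the `origin` remote, and the explicit submodule update step are assumptions made for the example.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type: a function that runs one command and raises on failure.
Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def self_deploy(target_version: str, run: Runner = run_checked) -> bool:
    """Fetch and check out target_version.

    Returns True only if every step succeeds, i.e. only then should the
    machine be reported as "healthy" to the autoscaling group.
    """
    steps = [
        ["git", "fetch", "origin"],  # assumed remote name
        ["git", "checkout", "--force", target_version],
        # Assumed step: how submodules are updated is not described above.
        ["git", "submodule", "update", "--init", "--recursive"],
    ]
    for step in steps:
        try:
            run(step)
        except (subprocess.CalledProcessError, OSError) as exc:
            # Any failing git command leaves the machine unhealthy.
            print(f"self-deploy failed at {' '.join(step)}: {exc}")
            return False
    return True
```

The point of the sketch is that a single failing git command is enough to keep a new machine out of the group, so the health of the whole scaling path depends on the repository being fetchable at that version.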

In the latest version of the application, there was a reference to a git submodule commit that no longer existed. This caused the git commands in the self-deploy process to fail, so new machines never transitioned to “healthy” and were not added to the group. Because of this, we were under-provisioned to handle any changes in workload.
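One way to catch a dangling submodule reference before it reaches the fleet is to check, at build time, that every pinned submodule commit still resolves. The sketch below is an illustration under assumptions rather than a description of our tooling: the helper names are hypothetical, and it assumes each submodule has already been cloned and fetched locally.

```python
import subprocess
from typing import Dict


def parse_gitlinks(ls_tree_output: str) -> Dict[str, str]:
    """Map submodule path -> pinned commit SHA from `git ls-tree -r` output."""
    pins: Dict[str, str] = {}
    for line in ls_tree_output.splitlines():
        # Each line is "<mode> <type> <object>\t<path>".
        meta, _, path = line.partition("\t")
        fields = meta.split()
        # Mode 160000 marks a gitlink, i.e. a submodule pinned to a commit.
        if len(fields) == 3 and fields[0] == "160000":
            pins[path] = fields[2]
    return pins


def commit_exists(repo_dir: str, sha: str) -> bool:
    """True if sha resolves to a commit object in the repository at repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "cat-file", "-e", f"{sha}^{{commit}}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A build step could feed the output of `git ls-tree -r HEAD` to `parse_gitlinks` and fail the build if `commit_exists` returns False for any submodule path.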

Changes we’ve made

We fixed the git submodule reference and pushed out a new copy of our application to all servers.

Changes we're going to make

Investigate alternative approaches to our self-deploy process, and accelerate our plans to move to artifact-based deployments.

Posted Oct 11, 2018 - 14:49 AEDT

Resolved
This incident has been resolved.
Posted Oct 11, 2018 - 11:06 AEDT
Monitoring
We’ve got builds running as normal again. We’re keeping an eye on things while it levels out.
Posted Oct 11, 2018 - 10:07 AEDT
Identified
Some builds are lagging behind, and we’re working on a fix.
Posted Oct 11, 2018 - 09:52 AEDT
This incident affected: Job Queue.