Delayed builds

Incident Report for Buildkite

Postmortem

Overview

At 10:32 PM UTC (Oct 10, 2018) a spike of exceptions during an autoscaling event triggered an alert. We began investigating and discovered that new servers were failing to become “healthy” in our autoscaling groups. As the issue was isolated to new machines and did not appear to be affecting customers, we proceeded immediately to roll out a fix.

We were alerted by a customer at 10:46 PM (14 minutes after the first alert) that they were experiencing delays in jobs being assigned to agents. We checked our background queues and found a spike in new builds that our existing capacity could not process. Because new machines couldn’t be added to the fleet, we were under-provisioned to handle the increase in load.

At 10:52 PM (6 minutes after being alerted by a customer) we reported it to StatusPage. We deployed a fix to our autoscaling group and the system returned to normal at 11:07 PM.

What was affected

Jobs were delayed in being assigned to agents, and external notifications of build results were delayed as well.

Root cause

As new machines are added into our autoscaling group, they “self-deploy” the latest version of the application. They do this by downloading the latest copy of the source code using “git fetch” and then running “git checkout” to switch to the version the rest of the fleet is running. Once this process completes, they are marked as “healthy”, added to the group, and start serving requests.
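The self-deploy flow described above might look roughly like the following Python sketch. It is an illustration only, not our actual deploy tooling: the function names, the "origin" remote, and the explicit submodule update step are assumptions made for the example. The command runner is injectable so the flow can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and reports whether it succeeded.
Runner = Callable[[Sequence[str]], bool]


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd, check=False).returncode == 0


def self_deploy(version: str, run: Runner = run_command) -> str:
    """Hypothetical self-deploy: fetch, check out the fleet's version,
    and sync submodules. Returns "healthy" only if every step succeeds."""
    steps = [
        ["git", "fetch", "origin"],  # assumed remote name
        ["git", "checkout", version],
        # Assumption: submodules are synced as part of the deploy.
        ["git", "submodule", "update", "--init", "--recursive"],
    ]
    for step in steps:
        if not run(step):
            # Any failed git command leaves the machine out of the group.
            return "unhealthy"
    return "healthy"
```

In this sketch a single failing step is enough to keep a new machine from ever being marked “healthy”, which is the failure mode described in this report.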

In the latest version of the application, there was a reference to a git submodule commit that no longer existed. This caused the git commands in the self-deploy process to fail, so the new machines never transitioned to “healthy” and were not added to the group. As a result, we were under-provisioned to handle any changes in workload.
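One way to catch this class of problem before a release reaches the fleet is to check that every submodule commit referenced by the release actually exists. The sketch below is a hypothetical pre-deploy check, not something we ran during this incident. It assumes the output of “git ls-tree -r <revision>” is passed in as text and that each submodule has already been cloned and fetched locally; all function names are made up for the example.

```python
import subprocess
from typing import Callable, Dict


def parse_gitlinks(ls_tree_output: str) -> Dict[str, str]:
    """Extract submodule entries (mode 160000) from `git ls-tree -r` output.

    Each line has the form "<mode> <type> <sha>\t<path>".
    Returns a mapping of submodule path to pinned commit SHA.
    """
    gitlinks = {}
    for line in ls_tree_output.splitlines():
        meta, _, path = line.partition("\t")
        fields = meta.split()
        if len(fields) == 3 and fields[0] == "160000":
            gitlinks[path] = fields[2]
    return gitlinks


def commit_exists(submodule_path: str, sha: str) -> bool:
    """Check whether the commit is present in the local submodule clone.

    Assumption: the submodule has been fetched before this runs.
    """
    result = subprocess.run(
        ["git", "-C", submodule_path, "cat-file", "-e", sha + "^{commit}"],
        check=False,
    )
    return result.returncode == 0


def missing_submodule_commits(
    ls_tree_output: str,
    exists: Callable[[str, str], bool] = commit_exists,
) -> Dict[str, str]:
    """Return the submodule paths whose pinned commit cannot be found."""
    return {
        path: sha
        for path, sha in parse_gitlinks(ls_tree_output).items()
        if not exists(path, sha)
    }
```

A non-empty result from missing_submodule_commits would be a signal to block the release, since machines trying to self-deploy it would fail in the same way.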

Changes we’ve made

We fixed the git submodule reference and pushed out a new copy of our application to all servers.

Changes we’re going to make

Investigate alternative approaches to our self-deploy process and accelerate our plans to move to artifact-based deployments.

Posted Oct 11, 2018 - 03:49 UTC

Resolved

This incident has been resolved.

Posted Oct 11, 2018 - 00:06 UTC

Monitoring

We’ve got builds running as normal again. We’re keeping an eye on things while it levels out.

Posted Oct 10, 2018 - 23:07 UTC

Identified

Some builds are lagging behind, and we’re working on a fix.

Posted Oct 10, 2018 - 22:52 UTC

This incident affected: Job Queue.