buildkite.com timing out
Incident Report for Buildkite
Postmortem

Timeline

On Monday 2021-03-22 01:41 UTC, a merged commit which had passed automated tests prevented our production Rails puma web server processes from starting.

As this rolled out across our fleet, we stopped successfully serving requests, and an automated pager alerted our team within three minutes. After several minutes of investigation, our team published an incident to https://buildkitestatus.com/

The broken commit was identified early and rolled back. However, the rollback did not apply consistently across all servers, which led to extended downtime while this was diagnosed and resolved.

While the rollback was only partially applied, systems appeared to be operational but under heavy load, which was mistakenly attributed to queued requests and background work. In response, the team manually scaled out the server fleet beyond the normal maximum, but this did not alleviate the high CPU load. Eventually it was determined that the broken commit was still running on a subset of the fleet, causing restart loops and high CPU.
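
A partial rollback of this kind can be surfaced by comparing the commit each server reports against the commit that is expected to be live. The following is a minimal sketch of that comparison, not a description of Buildkite's tooling: the helper name hosts_off_target, the host names, and the commit identifiers are all hypothetical, and how each host reports its running commit is left out.

```ruby
# Hypothetical helper: given a map of host name => commit currently running,
# return the sorted list of hosts that are NOT on the expected commit.
def hosts_off_target(running_by_host, expected_commit)
  running_by_host.reject { |_host, commit| commit == expected_commit }.keys.sort
end

# Made-up fleet state: web-2 is still on the broken commit after the rollback.
fleet = {
  "web-1" => "good123",
  "web-2" => "bad456",
  "web-3" => "good123"
}

stragglers = hosts_off_target(fleet, "good123")
puts "rollback incomplete on: #{stragglers.join(', ')}" unless stragglers.empty?
```

A check like this distinguishes "the fleet is healthy but busy" from "part of the fleet is still on the old commit", which is the distinction that was missed during the incident.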

At 02:52 UTC the remainder of the fleet was updated to a good commit, fully restoring operations.

Summary

We would like to apologise to everybody who was impacted by this service interruption.

We determined that the failure was caused by a puma plugin being changed in our Gemfile from a published gem to a git source. Combined with puma's prune_bundler option, this prevented puma from starting in production, in a way which did not manifest when puma was started during our automated testing.
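
For readers unfamiliar with these settings, the fragment below illustrates the general shape of such a change. It is an assumption-laden sketch: the plugin name puma-plugin-example, its git URL, and the worker count are invented for illustration, and the report does not name the actual plugin or show the real configuration.

```ruby
# Gemfile (illustrative only; "puma-plugin-example" is a made-up gem name)
#
# Before: the plugin came from a published gem.
#   gem "puma-plugin-example", "~> 1.0"
#
# After: the same plugin is pulled from a git source instead.
gem "puma-plugin-example", git: "https://example.com/acme/puma-plugin-example.git"

# config/puma.rb (illustrative only)
workers 2        # assumed value
prune_bundler    # puma option named in the report
plugin :example  # hypothetical plugin, loaded by puma at startup
```

One plausible reading, which the report does not confirm, is that prune_bundler starts the puma master process outside the full Bundler environment, where a gem installed from a git source may not be resolvable the way a published gem is.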

To prevent this happening again, we are overhauling our deployment process to incorporate automatic rollback in the case of post-deploy health check failure. This will prevent an error like this from impacting production traffic, and also make the rollback process itself more resilient.
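
The idea of rolling back automatically when a post-deploy health check fails can be sketched as follows. This is a simplified illustration under stated assumptions, not Buildkite's deployment process: deploy_with_rollback is a hypothetical function, and the deploy and healthy callables stand in for real deployment and health-check tooling.

```ruby
# Sketch only. `deploy` is a callable that releases a given version, and
# `healthy` is a callable that returns true once the new release serves
# requests correctly. Both are hypothetical stand-ins supplied by the caller.
def deploy_with_rollback(new_version:, previous_version:, deploy:, healthy:, attempts: 3, wait: 0)
  deploy.call(new_version)

  attempts.times do |i|
    return { status: :deployed, version: new_version } if healthy.call
    sleep(wait) if i < attempts - 1 # pause between checks, not after the last one
  end

  # Every health check failed: restore the last known-good version.
  deploy.call(previous_version)
  { status: :rolled_back, version: previous_version }
end

# Usage with stand-in callables: the health check never passes, so the
# previous version is redeployed.
log = []
result = deploy_with_rollback(
  new_version: "abc123",
  previous_version: "def456",
  deploy: ->(version) { log << version },
  healthy: -> { false }
)
puts result[:status]
```

A real implementation would also need to confirm that the rollback itself reached every server, since an inconsistently applied rollback was what extended this incident.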

Posted Mar 22, 2021 - 06:22 UTC

Resolved
Operations have returned to normal.
Posted Mar 22, 2021 - 03:21 UTC
Monitoring
We believe that operations have returned to normal and we will continue to monitor.
Posted Mar 22, 2021 - 03:05 UTC
Identified
Error rates and latency remain high while queues continue to be processed.
Posted Mar 22, 2021 - 02:33 UTC
Monitoring
A broken commit has been identified and rolled back. Performance may remain degraded while queues are processed.
Posted Mar 22, 2021 - 02:05 UTC
Update
We are continuing to investigate this issue.
Posted Mar 22, 2021 - 01:53 UTC
Investigating
Buildkite systems are currently timing out. The team is investigating.
Posted Mar 22, 2021 - 01:49 UTC
This incident affected: Web, Agent API, REST API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Hipchat Notifications, Webhook Notifications).