On Monday 2021-03-22 01:41 UTC, a merged commit which had passed automated tests prevented our production Rails puma web server processes from starting.
As this rolled out across our fleet, we stopped successfully serving requests, and an automated pager alerted our team within three minutes. After several minutes of investigation, our team published an incident to https://buildkitestatus.com/
The broken commit was identified early and rolled back. However, the rollback did not apply consistently across all servers, which led to extended downtime while this was diagnosed and resolved.
While the rollback was only partially applied, systems appeared to be operational but under heavy load, which was mistakenly attributed to queued requests and background work. In response, the team manually scaled out the server fleet beyond the normal maximum, but this did not alleviate the high CPU load. Eventually it was determined that the broken commit was still running on a subset of the fleet, causing restart loops and high CPU.
At 02:52 UTC the remainder of the fleet was updated to a good commit, fully restoring operations.
We would like to apologise to everybody who was impacted by this service interruption.
It was determined that the failure was caused by a puma plugin being changed in our Gemfile from a published gem to a git source, combined with puma's prune_bundler option. The combination failed in a way which did not manifest when puma was started during our automated testing.
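For readers unfamiliar with this setup, the combination looks roughly like the fragment below. The plugin name and repository URL are hypothetical placeholders, not the ones involved in the incident. One plausible explanation, which is our assumption rather than something established above, is that prune_bundler makes puma's master process re-execute outside the Bundler environment, where a gem installed from a git source is not visible to RubyGems.

```ruby
# Gemfile
# Hypothetical plugin name and URL, for illustration only.
# Before the change, the plugin came from a published gem:
#   gem "puma-plugin-example"
# After the change, it came from a git source:
gem "puma-plugin-example", git: "https://example.com/puma-plugin-example.git"

# config/puma.rb
# prune_bundler and plugin are real puma configuration options;
# the plugin name is the same hypothetical placeholder as above.
prune_bundler
plugin :example
```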
To prevent this happening again, we are overhauling our deployment process to incorporate automatic rollback in the case of post-deploy health check failure. This will prevent an error like this from impacting production traffic, and also make the rollback process itself more resilient.
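As a minimal sketch of the idea, assuming a deploy step and a health check that can be called as plain Ruby lambdas, automatic rollback might look like the following. The method names, parameters, and return values are hypothetical and do not describe our actual deployment tooling.

```ruby
# Poll the health check up to `attempts` times, pausing `interval`
# seconds between tries. Returns true as soon as one check passes.
def healthy?(check, attempts:, interval:)
  attempts.times do |i|
    return true if check.call
    sleep(interval) if i < attempts - 1
  end
  false
end

# Deploy new_rev, then verify it. If the health check never passes,
# redeploy previous_rev and verify that the rollback itself took effect.
def deploy_with_rollback(new_rev:, previous_rev:, deploy:, check:, attempts: 5, interval: 2)
  deploy.call(new_rev)
  if healthy?(check, attempts: attempts, interval: interval)
    return { revision: new_rev, rolled_back: false }
  end

  deploy.call(previous_rev)
  unless healthy?(check, attempts: attempts, interval: interval)
    raise "rollback to #{previous_rev} did not pass health checks"
  end
  { revision: previous_rev, rolled_back: true }
end
```

The rollback is verified with the same health check as the deploy, so a rollback that does not take effect is reported immediately rather than being mistaken for load. In a real fleet the check would need to cover every server.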