On 2021-12-22 between 12:08 UTC and 12:28 UTC the Web, API, REST API, Agent API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications) were unavailable or in a degraded state. For a further 64 minutes (until 13:32 UTC) some endpoints of the Agent API experienced elevated latency.
Buildkite operates in the AWS us-east-1 region across two availability zones: us-east-1c and us-east-1d (as those zone names are mapped for our AWS account).
At 2021-12-22 12:08 UTC, AWS experienced a power outage in one data center within our us-east-1c availability zone.
This caused several database servers to automatically fail over to hot standbys in us-east-1d. The database failovers went smoothly and completed by 2021-12-22 12:10 UTC.
Our EC2 Auto Scaling groups and Fargate services also detected the failure and started bringing up replacement instances and tasks in us-east-1d. However, our Auto Scaling groups are configured to use On-Demand instances of only a single type. us-east-1d had a shortage of instances of that type, so we could not launch enough replacement capacity and were unable to serve all requests.
By 2021-12-22 12:28 UTC (20 minutes from the power outage) we were able to serve all requests, although in some cases (particularly the agent API) latency was significantly higher than normal.
By 2021-12-22 13:32 UTC (1 hour 24 minutes from the power outage) latency on requests had also returned to a normal range.
Over the past few months we've been refactoring our VPC to support our growth and to prepare for expanding into a third Availability Zone. This involved new larger subnets for the two existing zones, reserving IP space for the planned third zone, and shifting existing EC2/ECS compute to the new subnets.
We plan to complete the expansion in 2022. With three zones, a similar outage would knock out ~⅓ of our capacity rather than roughly half, and the healthy zones would each carry a smaller portion of the increased load.
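The arithmetic behind that claim can be sketched as follows. This is a simplified model, not a description of our actual capacity planning: it assumes capacity is split evenly across zones and that total load stays constant during the outage. The function name `zone_loss_impact` is made up for this illustration.

```python
from fractions import Fraction
from typing import Tuple


def zone_loss_impact(zones: int) -> Tuple[Fraction, Fraction]:
    """Return (fraction of capacity lost, load multiplier per surviving zone)
    when one of `zones` equally sized zones fails.

    Assumption: capacity is evenly balanced and total load is unchanged.
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    lost = Fraction(1, zones)
    # The same total load is spread over one fewer zone.
    multiplier = Fraction(zones, zones - 1)
    return lost, multiplier


for n in (2, 3):
    lost, multiplier = zone_loss_impact(n)
    print(f"{n} zones: lose {lost} of capacity, "
          f"each surviving zone carries {float(multiplier):.2f}x its usual load")
```

Under those assumptions, the surviving zone in a two-zone setup has to absorb double its usual load, while each surviving zone in a three-zone setup absorbs one and a half times its usual load.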
We will also investigate whether adding more instance types to the Auto Scaling group would help ensure that rapid scale-up in the healthy zones remains possible. However, it's possible our slow-but-steady shift towards Fargate will make this less necessary.
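As a rough illustration of what "additional instance types" could look like, the sketch below builds the kind of `MixedInstancesPolicy` structure that EC2 Auto Scaling accepts. It is not our actual configuration: the launch template name and the instance types are placeholders, and the helper `mixed_instances_policy` is hypothetical. It keeps 100% On-Demand capacity and uses the `prioritized` allocation strategy, so the first listed type is preferred and later types act as fallbacks.

```python
from typing import Dict, List


def mixed_instances_policy(template_name: str, instance_types: List[str]) -> Dict:
    """Build a MixedInstancesPolicy dict listing several acceptable instance types.

    Assumption: `template_name` and `instance_types` are illustrative placeholders.
    """
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # Listed in priority order: earlier types are tried first.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Stay entirely On-Demand; only the set of instance types widens.
            "OnDemandPercentageAboveBaseCapacity": 100,
            "OnDemandAllocationStrategy": "prioritized",
        },
    }


# Placeholder values, chosen only for the example.
policy = mixed_instances_policy(
    "example-template", ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
)
```

A dict like this could be passed as the `MixedInstancesPolicy` argument to boto3's `update_auto_scaling_group`. Whether the fallback types have spare capacity in a given zone during a regional incident is a separate question that the policy itself cannot answer.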