Request Timeouts

Incident Report for Buildkite

Postmortem

Service Impact

On 2021-12-22 between 12:08 UTC and 12:28 UTC the Web, API, REST API, Agent API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications) were unavailable or in a degraded state. For an additional 64 minutes (until 13:32 UTC) some endpoints of the Agent API experienced elevated latency.

Incident Timeline

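Buildkite operates in the AWS us-east-1 region across two availability zones, us-east-1c and us-east-1d (as those zones are named in our AWS account; AWS maps zone names to physical locations separately for each account).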

At 2021-12-22 12:08 UTC, AWS experienced a power outage in one data center within our us-east-1c availability zone.

This caused several database servers to automatically fail over to hot standbys in us-east-1d. The database failovers went smoothly and completed by 2021-12-22 12:10 UTC.

Our EC2 Auto Scaling groups and Fargate services also detected the failure and started bringing up replacement instances and tasks in us-east-1d. However, our Auto Scaling groups are configured to use On-Demand instances of only a single type. us-east-1d had a shortage of instances of that type, so replacement capacity could not be launched quickly enough and we were unable to serve all requests.

By 2021-12-22 12:28 UTC (20 minutes from the power outage) we were able to serve all requests, although in some cases (particularly the agent API) latency was significantly higher than normal.

By 2021-12-22 13:32 UTC (1 hour 24 minutes from the power outage) latency on requests had also returned to a normal range.

Changes we’re making

Over the past few months we've been refactoring our VPC to support our growth and to prepare for expanding into a third Availability Zone. This involved new larger subnets for the two existing zones, reserving IP space for the planned third zone, and shifting existing EC2/ECS compute to the new subnets.

We plan to complete the expansion in 2022. With three zones, a similar outage would knock out ~⅓ of our capacity, and each healthy zone would carry a smaller share of the increased load.
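As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes an idealized case that the report does not state: load is spread evenly across zones and one entire zone is lost. The function name `single_zone_outage` is hypothetical.

```python
def single_zone_outage(total_zones: int) -> tuple[float, float]:
    """Return (fraction of capacity lost, load multiplier per healthy zone).

    Assumes load is spread evenly across zones and exactly one zone fails.
    """
    if total_zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    capacity_lost = 1 / total_zones
    # The failed zone's share is redistributed evenly over the survivors.
    load_multiplier = total_zones / (total_zones - 1)
    return capacity_lost, load_multiplier


for zones in (2, 3):
    lost, multiplier = single_zone_outage(zones)
    print(f"{zones} zones: lose {lost:.0%} of capacity, "
          f"each healthy zone carries {multiplier:.2f}x its usual load")
```

Under these assumptions, two zones means losing 50% of capacity with the surviving zone carrying 2.00x its usual load, while three zones means losing 33% with each surviving zone carrying 1.50x.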

We will also investigate whether adding more instance types to the Auto Scaling groups will help ensure that rapid scale-up in the healthy zones remains possible. However, it's possible our slow-but-steady shift towards Fargate will make this less necessary.
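For readers unfamiliar with the option being investigated, the following Python sketch builds the kind of `MixedInstancesPolicy` structure that the EC2 Auto Scaling API accepts for a group with several instance types. It is an illustration only, not our configuration: the launch template name and the instance types are placeholders, and the helper `build_mixed_instances_policy` is hypothetical.

```python
def build_mixed_instances_policy(launch_template_name, instance_types):
    """Build a MixedInstancesPolicy dict that lists several instance types.

    All names passed in are illustrative placeholders.
    """
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Each override is an alternative type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Keep everything On-Demand, as in the incident description.
            "OnDemandPercentageAboveBaseCapacity": 100,
            # Try the types in the order listed above.
            "OnDemandAllocationStrategy": "prioritized",
        },
    }


policy = build_mixed_instances_policy(
    "example-web-template",  # placeholder name
    ["m5.large", "m5a.large", "m4.large"],  # placeholder types
)
print([o["InstanceType"] for o in policy["LaunchTemplate"]["Overrides"]])
```

A dict like this could be passed as the `MixedInstancesPolicy` argument when creating or updating a group through the AWS SDK. With more than one type listed, a shortage of a single type in a zone would not by itself block scale-up.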

Posted Dec 24, 2021 - 01:54 UTC

Resolved

This incident has been resolved.

Posted Dec 22, 2021 - 14:04 UTC

Monitoring

We are seeing our systems operating normally. We are monitoring for any additional unexpected issues from the us-east-1 region.

Posted Dec 22, 2021 - 13:42 UTC

Update

We provisioned new capacity in new availability zones and are monitoring the results.

Posted Dec 22, 2021 - 13:14 UTC

Identified

The issue has been identified. AWS is reporting an increased error rate in one of our availability zones.

Posted Dec 22, 2021 - 12:48 UTC

Update

We are continuing to investigate the issue. We are experiencing an unexpected database failover.

Posted Dec 22, 2021 - 12:35 UTC

Investigating

We are investigating an increase in timeouts and error rates.

Posted Dec 22, 2021 - 12:15 UTC

This incident affected: Web, Agent API, REST API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).