Agent and web connectivity issue
Incident Report for Buildkite
Postmortem

On Saturday 31st August, 2019 at 11:11PM AEST (UTC+10) we received an alert about elevated error rates on the Agent API. From 11:07PM, when the first errors began, agents were unable to connect to the API; existing build jobs would have completed running but may not have reported results, and new build jobs would not have been started. Some dashboard pages were also unavailable. We quickly identified a problem connecting to one of our Redis clusters in AWS ElastiCache, which appeared to be caused by a general outage in our primary AWS region, us-east-1. After further diagnosis and several attempted solutions, we implemented a workaround by Sunday 1st September, 2019 at 12:57AM AEST which reduced error rates. From 12:59AM agents began reconnecting and resumed processing build jobs. Full service was restored by 1:10AM. Underlying services were restored over the following hours, and our workaround was removed by 2:10PM that day without further service interruption.

Timeline

11:07pm — First error reports
11:11pm — High error rates trigger an alert; the on-call engineer responds
11:18pm — On-call engineer diagnoses a Redis connectivity issue related to the AWS outage
11:36pm — An additional engineer comes online
11:46pm — A second additional engineer comes online
11:57pm — Affected Redis instance is restarted but remains unavailable
12:11am — A new Redis instance is added in another availability zone; it is also unavailable
12:40am — An unrelated Redis instance is discovered to still be available
12:47am — App is switched to use the unrelated Redis instance as a workaround (see the sketch after this timeline)
12:57am — Workaround deployed, error rate starts subsiding
1:10am — Full service restored, error rate nominal
2:10pm — Original Redis service confirmed restored, workaround removed without interruption
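
In practice, the 12:47am workaround amounted to repointing the application at a different Redis endpoint. A minimal sketch of that kind of switch, assuming the endpoint is read from an environment variable (the variable name, endpoint, and redis-py client below are illustrative, not our actual configuration):

```python
import os
import redis

# Hypothetical sketch: the Redis endpoint the app connects to comes from
# configuration, so switching to a healthy cluster is a config change plus
# a deploy rather than a code change. REDIS_URL and the default value are
# illustrative only.
redis_url = os.environ.get(
    "REDIS_URL",
    "redis://primary-cluster.example.cache.amazonaws.com:6379/0",
)

# Short timeouts keep a dead endpoint from hanging requests while the
# switch is rolled out.
client = redis.Redis.from_url(redis_url, socket_connect_timeout=2, socket_timeout=2)

# After REDIS_URL is repointed at the available instance and the app is
# redeployed, new connections go to the working cluster.
client.ping()
```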

Root Cause

A power failure in the us-east-1 region caused connectivity failures to our Redis instances across multiple availability zones.

What We’re Doing

We're investigating how to increase the resiliency of our Redis infrastructure, and how to fail more gracefully when it's unavailable.
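
One approach we're exploring for the second part, sketched below with illustrative names and timeouts rather than our actual implementation, is to put short timeouts on Redis calls and serve a fallback value when the cluster is unreachable, so requests degrade rather than fail outright:

```python
import redis

# Illustrative endpoint and timeouts; not our production configuration.
client = redis.Redis(
    host="redis.example.internal",
    socket_connect_timeout=1,  # fail fast instead of hanging the request
    socket_timeout=1,
)

def get_cached(key, fallback=None):
    """Return the cached value, or the fallback if Redis is unreachable."""
    try:
        return client.get(key)
    except redis.exceptions.RedisError:
        # Redis is down or timing out: serve the fallback so the request
        # can still complete, possibly with reduced functionality.
        return fallback
```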

Posted Sep 15, 2019 - 00:21 UTC

Resolved
Connectivity issues with the Agent API have been resolved and traffic has returned to a normal level.
Posted Aug 31, 2019 - 15:15 UTC
Monitoring
We've restored connectivity to the Agent API and things appear to be functioning again.
Posted Aug 31, 2019 - 15:02 UTC
Identified
We've identified a connectivity issue with our Redis infrastructure caused by an AWS outage, and we're working on a fix.
Posted Aug 31, 2019 - 14:01 UTC
Investigating
We're investigating degraded network connectivity for agents and the web dashboard, possibly related to an ongoing AWS EC2 incident.
Posted Aug 31, 2019 - 13:46 UTC
This incident affected: Web, Agent API, and REST API.