On Saturday 31st August, 2019 at 11:11PM AEST (UTC+10) we received an alert about elevated error rates on the Agent API. From 11:07PM, when the first errors began, agents were unable to connect to the API, existing build jobs completed but may not have reported their results, and new build jobs were not started. Some dashboard pages were also unavailable. We quickly identified a problem connecting to one of our Redis clusters in AWS ElastiCache, which appeared to be caused by a general outage in our primary AWS region, us-east-1. After further diagnosis and several attempted fixes, we deployed a workaround by Sunday 1st September, 2019 at 12:57AM AEST, which reduced error rates. From 12:59AM agents began reconnecting and resumed processing build jobs, and full service was restored by 1:10AM. The underlying AWS services recovered over the following hours, and our workaround was removed by 2:10PM that day without further service interruption.
11:07pm — First error reports
11:11pm — High error rates trigger an alert, on-call engineer responds
11:18pm — On-call engineer diagnoses a Redis issue related to the AWS outage
11:36pm — Additional engineer comes online
11:46pm — Additional engineer comes online
11:57pm — Affected Redis instance is restarted, but remains unavailable
12:11am — Another Redis instance is added in a different availability zone, also unavailable
12:40am — An unrelated Redis instance is discovered to still be available
12:47am — App is switched to use the unrelated Redis instance as a workaround
12:57am — Workaround deployed, error rate starts subsiding
1:10am — Full service restored, error rate nominal
2:10pm — Original Redis service confirmed restored, workaround removed without interruption
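The workaround at 12:47am amounted to pointing the app at the first Redis endpoint that still responded. A minimal sketch of that selection logic, with hypothetical endpoint names and a stubbed health check in place of a real PING:

```python
# Sketch of the failover applied as a workaround: try each configured
# Redis endpoint in order and use the first one that responds.
# Endpoint names are hypothetical; the health check is stubbed.

def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health check passes."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy Redis endpoint available")

# Simulated incident state: the original cluster and its replacement
# are both unreachable, but an unrelated instance still responds.
DOWN = {"redis-primary.internal", "redis-replacement.internal"}

def fake_ping(endpoint):
    return endpoint not in DOWN

endpoints = [
    "redis-primary.internal",      # original cluster (down)
    "redis-replacement.internal",  # new instance, other AZ (also down)
    "redis-unrelated.internal",    # unrelated instance (still up)
]

print(pick_endpoint(endpoints, fake_ping))  # redis-unrelated.internal
```

In production this decision was made by an engineer and a deploy rather than automatically, which is part of why roughly two hours elapsed between first errors and the workaround.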
A power failure in the us-east-1 region caused connectivity failures to our Redis instances across multiple availability zones.
We're investigating how to increase the resiliency of our Redis infrastructure, and how to fail more gracefully when it is unavailable.
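One shape "failing more gracefully" could take is wrapping Redis reads so that a connection failure degrades to a default value instead of surfacing an error to agents. A minimal sketch, using a stub client rather than a real Redis connection:

```python
# Sketch of graceful degradation when Redis is unreachable: catch the
# connection error at the call site and serve a fallback value rather
# than failing the whole request. The client here is a stub standing in
# for a real Redis client.

class UnavailableRedis:
    """Stub client simulating an unreachable Redis cluster."""
    def get(self, key):
        raise ConnectionError("Redis unreachable")

def get_with_fallback(client, key, default=None):
    """Read a key, returning `default` if Redis cannot be reached."""
    try:
        return client.get(key)
    except ConnectionError:
        # Degraded mode: answer with a default instead of an error.
        return default

print(get_with_fallback(UnavailableRedis(), "agent:state", default="unknown"))
```

Whether a given code path can safely degrade like this depends on what the cached data is used for, so this is a per-call-site decision rather than a blanket policy.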