Buildkite dashboard unavailable

Incident Report for Buildkite

Postmortem

On Wednesday, 29th August 2018 at 22:38:47 UTC, our primary AWS RDS PostgreSQL database suffered a critical error in its underlying storage and performed a failover. During the failover, the Buildkite Dashboard and all APIs were momentarily unavailable. The failover completed by 22:40 UTC, and service was fully restored by 22:41 UTC. No data was lost: agents retry operations that fail until they succeed.
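
To illustrate why retrying until success protects against data loss during a short failover, here is a minimal Python sketch of a retry loop with capped exponential backoff and jitter. It is not the Buildkite agent's actual code; `TransientError`, `retry_until_success`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for a failure that is safe to retry."""


def retry_until_success(
    operation: Callable[[], T],
    base_delay: float = 1.0,   # assumed value, seconds
    max_delay: float = 60.0,   # assumed cap, seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` repeatedly until it returns without a TransientError."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            # Exponential backoff, capped at max_delay, with full jitter so
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
            attempt += 1
```

A loop like this is only safe when the retried operation is idempotent, so that a request which reached the server before the failure can be repeated without side effects.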

We have confirmed with AWS that the underlying storage hardware was at fault and that the failover performed as designed. By the time we were on deck to investigate, the incident had resolved itself.

Some of our web backend hosts became unhealthy after losing their database connections, causing a minute or two of partial degradation on the Buildkite dashboard. We already had plans to make these hosts more resilient to database connectivity issues, and we will be deploying those updates.
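
One common way to make a host more tolerant of dropped database connections is to discard a dead connection and reconnect once before giving up. The Python sketch below shows that general pattern only; it does not describe the updates Buildkite deployed, and `ConnectionLost`, `ReconnectingClient`, and the `connect` callable are hypothetical names.

```python
from typing import Any, Callable, Optional


class ConnectionLost(Exception):
    """Hypothetical error raised when the database connection drops."""


class ReconnectingClient:
    """Holds one connection and replaces it if it is found to be dead."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        self._connect = connect          # assumed factory for new connections
        self._conn: Optional[Any] = None

    def run(self, query: Callable[[Any], Any]) -> Any:
        """Run `query` against the connection, reconnecting at most once."""
        for attempt in range(2):
            if self._conn is None:
                self._conn = self._connect()
            try:
                return query(self._conn)
            except ConnectionLost:
                # Drop the dead connection so the next use opens a fresh one.
                self._conn = None
                if attempt == 1:
                    raise
```

With this shape, a host that loses its connection during a failover recovers on the next request once the database is reachable again, rather than staying unhealthy until it is replaced.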

We apologise for the interruption. We are disappointed to have had an outage, but the underlying service handled the failure gracefully and restored service quickly. Thank you for your patience and understanding.

Posted Sep 03, 2018 - 06:38 UTC

Resolved

Our primary RDS database is running normally again.

Posted Aug 30, 2018 - 01:35 UTC

Monitoring

Service is fully restored and seems stable. There was no indication of abnormal load on our RDS instance leading up to the failure, so we can only assume it was an underlying hardware failure and that the failover worked as designed to restore service. We'll be monitoring for a while and asking AWS for a root cause.

Posted Aug 29, 2018 - 23:07 UTC

Identified

Our underlying production RDS instance failed over. We're opening a ticket with AWS to investigate the cause. Service appears to be fully restored, and the failover should not have lost any data. Agents should retry any operations that were in progress.

Posted Aug 29, 2018 - 23:01 UTC

Update

Several backend instances became unhealthy in our dashboard autoscaling group. New instances have come online to compensate and service seems to be restored. We're still investigating what's happened.

Posted Aug 29, 2018 - 22:57 UTC

Investigating

We are investigating an issue with buildkite.com being unavailable.

Posted Aug 29, 2018 - 22:56 UTC

This incident affected: Web, Agent API, REST API, and Job Queue.