Buildkite dashboard unavailable
Incident Report for Buildkite

On Wednesday, 29th August 2018 at 22:38:47 UTC, our primary AWS RDS PostgreSQL database suffered a critical error in its underlying storage and failed over. During the failover, the Buildkite Dashboard and all APIs were momentarily unavailable. The failover completed by 22:40 UTC, and service was fully restored by 22:41 UTC. No data was lost: agents retry failed operations until they succeed.
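
To illustrate the retry-until-success behaviour described above, here is a minimal sketch in Python. It is not the Buildkite agent's actual implementation: the function name `retry_until_success`, the backoff parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_success(
    operation: Callable[[], T],
    base_delay: float = 0.5,   # hypothetical starting delay, in seconds
    max_delay: float = 30.0,   # hypothetical cap on the delay between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` repeatedly until it succeeds, backing off between attempts.

    Only ConnectionError is treated as retryable here (an assumption);
    any other exception propagates to the caller.
    """
    delay = base_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            # The API was unreachable (for example, during a database failover).
            # Wait, then try again with a longer delay, up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Because the client keeps retrying until the call succeeds, an outage of a minute or two delays the operation instead of dropping it. Retrying is only safe if a repeated request cannot be applied twice, so a design like this assumes the operations are idempotent.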

We have confirmed with AWS that the underlying storage hardware was at fault, and that the failover performed as designed. By the time we were on deck to investigate, the incident had resolved itself.

Some of our web backend hosts became unhealthy after losing their database connections, causing a minute or two of partial degradation on the Buildkite dashboard. We already had plans to make these hosts more resilient to database connectivity issues, and will be deploying updates.

We apologise for the interruption. We were disappointed to have an outage, but the underlying service handled the failure gracefully and restored service quickly. Thank you for your patience and understanding.

Posted 10 months ago. Sep 03, 2018 - 16:38 AEST

Our primary RDS database is running normally again.
Posted 10 months ago. Aug 30, 2018 - 11:35 AEST
Service is fully restored and seems stable. There was no indication of abnormal load on our RDS instance leading up to the failure, so we can only assume it was an underlying hardware failure and that failover worked as designed to restore service. We'll be monitoring for a while, and asking AWS for a root cause.
Posted 10 months ago. Aug 30, 2018 - 09:07 AEST
Our underlying production RDS instance failed over. We're opening a ticket with AWS to investigate the cause. Service appears to be fully restored, and the failover should not have lost any data. Agents should retry any operations that were in progress.
Posted 10 months ago. Aug 30, 2018 - 09:01 AEST
Several backend instances became unhealthy in our dashboard autoscaling group. New instances have come online to compensate and service seems to be restored. We're still investigating what's happened.
Posted 10 months ago. Aug 30, 2018 - 08:57 AEST
We are investigating an issue with the Buildkite dashboard being unavailable.
Posted 10 months ago. Aug 30, 2018 - 08:56 AEST
This incident affected: Web, Agent API, REST API, and Job Queue.