On Wednesday, 29th August 2018 at 22:38:47 UTC, our primary AWS RDS PostgreSQL database suffered a critical error in its underlying storage and failed over. During the failover, the Buildkite Dashboard and all APIs were momentarily unavailable. The failover completed by 22:40 UTC, and service was fully restored by 22:41 UTC. No data was lost: agents retry operations that fail until they succeed.
We have confirmed with AWS that the underlying storage hardware was at fault and that the failover worked as designed. By the time we were on deck to investigate, the incident had resolved itself.
Some of our web backend hosts did become unhealthy after losing their database connections, causing a minute or two of partial degradation on the Buildkite Dashboard. We already had plans to make these hosts more resilient to database connectivity issues and will be deploying those updates.
We apologise for the interruption. We are disappointed to have had an outage, but the underlying service handled the failure gracefully and restored service quickly. Thank you for your patience and understanding.