Buildkite UI is down

Incident Report for Buildkite

Postmortem

Impact

The web dashboard was mostly unavailable for 68 minutes, from 15:56 UTC to 17:04 UTC.

Our API, agent connections, webhooks, Test Analytics data collection, and builds triggered externally (e.g. via a pull request) were all unaffected.

What happened

At around 10:00 UTC we attempted a configuration change targeting the hosts that serve our web dashboard. The change was intended to move database connections from a small pool on each host to a large pool shared across the fleet. The targeting did not work as expected, and we rolled the change back with no immediate customer impact. The rollback was incomplete, however, and the part of the change left in place prevented new hosts from starting successfully.

Our infrastructure scales in and out automatically. As load increased through the US work day, the hosts already running could not meet demand on their own. CPU became saturated on these correctly configured hosts, and new hosts failed to launch because of the configuration failure. Together, these caused the dashboard outage.

At 15:56 UTC our monitoring began to detect elevated response times, and an engineer was paged at 16:00 UTC. The problem was identified at 16:49 UTC, and we started rolling out a fix by 17:03 UTC.
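The intervals between these milestones can be worked out directly from the times stated in this report. The short sketch below does that arithmetic; the helper `at` and the milestone labels are illustrative names only, and the 17:04 recovery time is taken from the Impact section.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse an HH:MM UTC time on the day of the incident (Oct 06, 2022)."""
    return datetime.strptime(f"2022-10-06 {hhmm}", "%Y-%m-%d %H:%M")


# Times as stated in this report (all UTC).
detected = at("15:56")
milestones = {
    "engineer paged": at("16:00"),
    "problem identified": at("16:49"),
    "fix rollout started": at("17:03"),
    "dashboard recovered": at("17:04"),
}

# Minutes elapsed between first detection and each milestone.
minutes_after_detection = {
    name: int((when - detected).total_seconds() // 60)
    for name, when in milestones.items()
}

for name, minutes in minutes_after_detection.items():
    print(f"{name}: {minutes} min after detection")
```

Most of the 68 minutes fell between the page and the identification of the problem, which is the gap the monitoring improvement below is meant to shorten.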

How we’re becoming more resilient to issues like this

We will improve monitoring to ensure an engineer is paged when hosts repeatedly fail to start.
Our current method of configuring secrets is manual. As part of our move to immutable infrastructure, we already have plans to improve on this.
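
As an illustration of the first item, a paging rule for repeated launch failures could look roughly like the sketch below. This is not our actual monitoring configuration: `LaunchEvent`, `should_page`, the 15-minute window, and the threshold of three failures are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence


class LaunchEvent(NamedTuple):
    """One host launch attempt (hypothetical record shape)."""
    at: datetime      # when the launch attempt finished
    succeeded: bool   # whether the host started successfully


def should_page(
    events: Sequence[LaunchEvent],
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed look-back window
    threshold: int = 3,                         # assumed failure count
) -> bool:
    """Page when hosts repeatedly fail to start.

    Returns True if at least `threshold` launches failed within the
    look-back window and no launch succeeded in that same window.
    """
    recent = [e for e in events if now - window <= e.at <= now]
    failures = sum(1 for e in recent if not e.succeeded)
    successes = len(recent) - failures
    return failures >= threshold and successes == 0
```

Requiring zero successes in the window keeps a single flaky launch from paging anyone, while a run of failures with no healthy launches, as happened in this incident, would raise an alert before load outgrows the existing hosts.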

Posted Oct 07, 2022 - 08:59 UTC

Resolved

This incident has been resolved.

Posted Oct 06, 2022 - 17:31 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 06, 2022 - 17:20 UTC

Update

We have identified the root cause and are working on rolling out the fix.

Posted Oct 06, 2022 - 17:03 UTC

Update

We are continuing to investigate this issue.

Posted Oct 06, 2022 - 16:50 UTC

Update

We are continuing to investigate this issue.

Posted Oct 06, 2022 - 16:35 UTC

Investigating

Our team is aware of the issue and we are currently investigating it.

Posted Oct 06, 2022 - 16:15 UTC

This incident affected: Web.