The web dashboard was mostly unavailable for 68 minutes, from 15:56 UTC to 17:04 UTC.
Our API, agent connections, webhooks, Test Analytics data collection, and builds triggered externally (e.g. via a pull request) were all unaffected.
At around 10:00 UTC we attempted a reconfiguration targeting the hosts that serve our web dashboard. This change was intended to move database connections from a small pool on each host to a large shared pool for the fleet. The configuration targeting did not work as expected, and we rolled the change back with no immediate customer impact. However, the rollback was incomplete: the leftover configuration prevented new hosts from starting successfully.
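For illustration, here is a minimal sketch of the two pooling modes involved. It uses SQLAlchemy-style configuration, and the connection strings and pooler endpoint are hypothetical; it is not our actual application code.

```python
# A minimal sketch (not our actual code) of the two pooling modes,
# using SQLAlchemy-style configuration as a stand-in for our setup.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Before: each web host keeps its own small pool of direct
# connections to the database.
per_host_engine = create_engine(
    "postgresql://app@db-primary/dashboard",  # hypothetical DSN
    pool_size=5,       # small per-host pool
    max_overflow=0,
)

# After (intended): hosts connect through a shared, fleet-wide pooler
# (e.g. a PgBouncer-style endpoint), so per-host pooling is unnecessary.
shared_pool_engine = create_engine(
    "postgresql://app@pooler.internal/dashboard",  # hypothetical pooler endpoint
    poolclass=NullPool,  # rely on the external shared pool instead
)
```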
Our infrastructure scales in and out automatically. As load increased through the US work day, we needed additional hosts to meet demand, but the leftover configuration meant new hosts failed to launch. CPU became saturated on the correctly configured hosts that remained, causing the dashboard outage.
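As a rough illustration of why new hosts never came into service: a host whose database configuration is broken fails its startup check and never passes health checks, so the autoscaler cannot add capacity. The sketch below is hypothetical, with an assumed endpoint and check; it is not our boot process.

```python
# Hypothetical sketch of a startup check that a misconfigured host
# would fail, preventing it from registering behind the load balancer.
import sys
from sqlalchemy import create_engine, text

DATABASE_URL = "postgresql://app@pooler.internal/dashboard"  # hypothetical

def startup_check() -> None:
    engine = create_engine(DATABASE_URL)
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception as exc:
        # A host with broken leftover configuration exits here,
        # so it never comes into service.
        sys.exit(f"database configuration check failed: {exc}")

if __name__ == "__main__":
    startup_check()
```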
At 15:56 UTC our monitoring began to detect elevated response times, and an engineer was paged at 16:00. The root cause was identified at 16:49, and we started rolling out a fix by 17:03.