On 2022-01-08 between 09:25 UTC and 10:07 UTC, the Buildkite dashboard was unavailable.
This incident began at 09:25 UTC, when latency for the Buildkite dashboard rose to 60 seconds, hitting the request timeout. Our investigation uncovered a denial-of-service attack on the buildkite.com domain.
A significant number of the requests targeted our public build page. This page included a slow SQL query, which caused requests to time out. Because these requests took up to 60 seconds each, they exhausted our HTTP request capacity. The slow query also caused high load on one of our non-primary databases.
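To illustrate why slow requests exhaust request capacity, here is a rough sketch of the arithmetic: with a fixed pool of workers that each handle one request at a time, the maximum throughput is the worker count divided by the time per request. The worker count and the normal latency below are hypothetical numbers chosen for illustration, not figures from this incident; only the 60-second timeout comes from the report.

```python
def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second for a fixed worker pool,
    assuming each worker handles one request at a time."""
    if workers <= 0 or seconds_per_request <= 0:
        raise ValueError("workers and seconds_per_request must be positive")
    return workers / seconds_per_request


# Hypothetical pool size (an assumption, not Buildkite's real capacity).
HYPOTHETICAL_WORKERS = 300

# Hypothetical normal latency of 250 ms per request.
normal = max_throughput(HYPOTHETICAL_WORKERS, 0.25)

# Every request runs until the 60-second timeout described above.
degraded = max_throughput(HYPOTHETICAL_WORKERS, 60)

print(f"normal: {normal:.0f} req/s, degraded: {degraded:.0f} req/s")
```

Under these assumed numbers, the same pool serves a small fraction of its usual traffic once every request runs to the timeout, so even a modest request rate against the slow page can crowd out all other requests.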
In response, we fixed the slow query at 09:54 UTC. At 10:04 UTC we put mitigations in place to block the source of the attack, including blocking a range of IP addresses, and provisioned additional capacity to help absorb the load. At 10:07 UTC, the dashboard became accessible again.
As part of our reliability review, we’re investigating adding rate limits and a Web Application Firewall for the Buildkite dashboard to help prevent future DoS attacks.
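As a general illustration of what a per-client rate limit can look like, here is a minimal token-bucket sketch. It is not Buildkite's implementation or a description of what will be deployed; the class names (`TokenBucket`, `PerClientLimiter`), the per-IP keying, and the rate and capacity values are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        # Refill based on the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client key (hypothetically, the client IP address)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str) -> bool:
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_ip] = bucket
        return bucket.allow()


# Example: at most 5 requests per second per client, with bursts of up to 10.
limiter = PerClientLimiter(rate=5.0, capacity=10.0)
if not limiter.allow("203.0.113.7"):   # documentation-range IP, not a real client
    print("429 Too Many Requests")
```

A sketch like this keeps all state in one process and never evicts idle buckets, so a real deployment would more likely enforce limits at a shared layer such as a load balancer, CDN, or the firewall itself.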