Increased error rate
Incident Report for Buildkite
Postmortem

Service Impact

On 2022-01-08 between 09:25 UTC and 10:07 UTC, the Buildkite dashboard was unavailable.

Incident Timeline

This incident began at 09:25 UTC, when latency for the Buildkite dashboard rose to 60 seconds and requests began hitting the request timeout. Our investigation found a denial of service attack against the buildkite.com domain.

A significant number of the requests targeted our public build page. This page included a slow SQL query, which caused requests to time out. Because each request took up to 60 seconds, this exhausted our HTTP request capacity. It also caused high load on one of our non-primary databases.
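To see why 60-second requests exhaust capacity so quickly, Little's law is a useful rule of thumb: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The sketch below applies it with hypothetical traffic figures. The request rate and normal latency are assumptions for illustration, not measurements from this incident.

```python
def requests_in_flight(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time per request."""
    return requests_per_second * latency_seconds


# Hypothetical figures, NOT taken from the incident.
ASSUMED_RPS = 40            # assumed steady request rate
ASSUMED_NORMAL_LATENCY = 0.25  # assumed healthy latency, in seconds
TIMEOUT_LATENCY = 60        # the 60-second request timeout described above

normal = requests_in_flight(ASSUMED_RPS, ASSUMED_NORMAL_LATENCY)
degraded = requests_in_flight(ASSUMED_RPS, TIMEOUT_LATENCY)

print(f"normal:   {normal:.0f} concurrent requests")
print(f"degraded: {degraded:.0f} concurrent requests")
```

Under these assumed numbers, the same traffic that normally occupies about 10 request slots would need about 2,400 once every request runs to the 60-second timeout.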

In response, at 09:54 UTC we fixed the slow query. At 10:04 UTC we put mitigations in place to block the source of the attack, including blocking a range of IP addresses, and provisioned additional capacity to help absorb the load. At 10:07 UTC, the dashboard became accessible again.

Changes we’re making

As part of our reliability review, we’re investigating adding rate limits and a Web Application Firewall for the Buildkite dashboard to help prevent future DoS attacks.
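For illustration, rate limits of this kind are often built as a token bucket: each client earns tokens at a fixed rate, and a request is served only if a token is available. The sketch below is a minimal, hypothetical example. The class name, refill rate and capacity are assumptions and do not describe Buildkite's implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter: allows bursts up to `capacity`,
    then a sustained `rate` requests per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = capacity    # start full
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Example: one bucket per client address (assumed values).
buckets: dict[str, TokenBucket] = {}


def allow_request(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=5.0, capacity=10.0))
    return bucket.allow()
```

A request rejected by `allow_request` would typically receive an HTTP 429 response rather than reaching the slow code path.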

Posted Jan 12, 2022 - 05:03 UTC

Resolved
This incident has been resolved.
Posted Jan 08, 2022 - 10:34 UTC
Monitoring
We've identified and mitigated a denial of service attack and we are monitoring the results.
Posted Jan 08, 2022 - 10:17 UTC
Identified
We identified a high web error rate and a fix is being implemented.
Posted Jan 08, 2022 - 10:04 UTC
Investigating
We are currently investigating this issue.
Posted Jan 08, 2022 - 09:50 UTC
This incident affected: Web and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications).