Elevated error rates
Incident Report for Buildkite
Postmortem

On Monday, 20th September 2021, between 14:00 and 16:11 UTC, the Buildkite platform experienced elevated error rates: less than 0.01% overall, but about 2% on the REST and GraphQL APIs. Once the issue was escalated, the root cause was quickly resolved, the errors subsided, and service returned to normal. We have adjusted our alerting so that we can respond faster if these issues recur.

What happened

During this time, a small percentage of requests were unable to reserve a primary database connection from our connection pool, which caused the errors. The day before, we had made some configuration changes modelled on changes already made to our other databases. Despite extensive testing, a careful rollout during high load, and several alerts in place, the changes still had an unforeseen impact during a later period of peak load.

The overall error rate across the platform was not high enough to trigger alerts, so we were unfortunately slow to respond. Several customer workloads were acutely affected. Reports received around 15:45 UTC were escalated quickly through customer support at 15:59 UTC to an on-call engineer, who identified and resolved the issue by 16:11 UTC.
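
To illustrate how a platform-wide error rate can stay below an alert threshold while individual APIs are visibly failing, here is a minimal sketch. The request counts, service names, and the 0.5% threshold are all hypothetical, chosen only to reproduce the proportions described above (under 0.01% overall, about 2% on the APIs). They are not Buildkite's real traffic figures or alert settings.

```python
from dataclasses import dataclass


@dataclass
class ServiceCounts:
    """Request and error counts for one service over an alert window."""
    name: str
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero for idle services.
        return self.errors / self.requests if self.requests else 0.0


def overall_error_rate(services):
    """Error rate aggregated across every service."""
    total_requests = sum(s.requests for s in services)
    total_errors = sum(s.errors for s in services)
    return total_errors / total_requests if total_requests else 0.0


def breaching_services(services, threshold):
    """Names of services whose own error rate exceeds the threshold."""
    return [s.name for s in services if s.error_rate > threshold]


# Hypothetical numbers for illustration only.
services = [
    ServiceCounts("other_traffic", requests=10_000_000, errors=0),
    ServiceCounts("rest_api", requests=20_000, errors=400),
    ServiceCounts("graphql_api", requests=10_000, errors=200),
]

THRESHOLD = 0.005  # hypothetical 0.5% alert threshold

# The aggregate check stays quiet while the per-service check fires.
aggregate_alert = overall_error_rate(services) > THRESHOLD
per_service_alerts = breaching_services(services, THRESHOLD)
```

With these assumed numbers, the aggregate check does not fire, while the per-service check flags both APIs. This is the general reason to evaluate error-rate alerts per service as well as platform-wide.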

What we have changed

We have tuned our existing alerts to align more closely with expected error rates and latency, so that they would have caught this issue, and we expect them to alert us sooner if a similar incident happens again. We will also change our rollout strategy for the rest of this work, using smaller stages with more time between them to build confidence and prove scalability.

Posted Sep 23, 2021 - 07:40 UTC

Resolved
This incident has been resolved.
Posted Sep 20, 2021 - 16:50 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 20, 2021 - 16:33 UTC
Identified
The issue has been identified and a fix has been implemented.
Posted Sep 20, 2021 - 16:18 UTC
Update
We are continuing to investigate this issue.
Posted Sep 20, 2021 - 16:11 UTC
Investigating
We've noticed elevated error rates and we're on it.
Posted Sep 20, 2021 - 16:07 UTC
This incident affected: Agent API.