Increased REST and GraphQL API Latency

Incident Report for Buildkite

Postmortem

Service Impact

Between August 16th and September 24th, 2025, some customers experienced elevated API latency and intermittent errors when using Buildkite’s REST and GraphQL APIs.

On September 23rd at 18:04 UTC the impact became severe and universal; alarms were raised and several customers reached out to us directly. The impact varied by customer and endpoint, with common pain points including creating builds and fetching build lists.

Incident Summary

A combination of increased API usage and changing workload patterns drove higher-than-usual network throughput. We discovered unexpected and undocumented throughput limits on our database proxy service. During the incident, this led to intermittent request latency and timeouts.

Increasing the CPU and memory resources available to our proxy service also raised its network throughput limits, and error rates and latencies began to decrease.

At this point we determined that this throughput limit had been the major contributing factor to the inconsistent latencies that some customers had experienced since August 16th.

Changes we’re making

The nature of this issue meant that customers experienced intermittent latency spikes, usually driven by the amount of data that needed to move between our applications and our databases. This contributed significantly to the time between the initial report of this incident, its confirmation, and its resolution. We're implementing better customer-specific observability and monitoring as a result.

Limitations in our compute platform mean that these network throughput limits are not something we can reliably detect or alert on. We're evaluating moving our proxy service to an alternative compute solution that provides better observability, and we've pre-emptively scaled up other network-sensitive services.

We are also optimizing requests that return large result sets to reduce load and response sizes where practical.
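One way clients can reduce response sizes on list endpoints in the meantime is to request smaller pages. Buildkite's REST API paginates list endpoints via `page`/`per_page` query parameters and RFC 5988 `Link` response headers; below is a minimal client-side sketch of parsing those headers to walk pages (the URLs shown are illustrative, not from this incident):

```python
import re

def parse_link_header(link_header: str) -> dict:
    """Parse an RFC 5988 Link header into a {rel: url} mapping."""
    links = {}
    for part in link_header.split(","):
        # Each part looks like: <https://…>; rel="next"
        match = re.search(r'<([^>]+)>;\s*rel="([^"]+)"', part)
        if match:
            url, rel = match.groups()
            links[rel] = url
    return links

# Example Link header of the shape returned by paginated REST endpoints:
header = (
    '<https://api.buildkite.com/v2/organizations/acme/builds?page=2&per_page=30>; rel="next", '
    '<https://api.buildkite.com/v2/organizations/acme/builds?page=5&per_page=30>; rel="last"'
)
links = parse_link_header(header)
# A client would fetch links["next"] repeatedly until no "next" rel remains.
```

Requesting a smaller `per_page` trades more round trips for smaller individual responses, which keeps each request's payload well under any per-connection throughput ceiling.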

Posted Nov 10, 2025 - 03:07 UTC

Resolved

We have implemented further mitigations and seen error rates and latency return to acceptable levels. We will continue to investigate the issue to better understand the causes.
Posted Sep 24, 2025 - 01:41 UTC

Monitoring

p95 latency on our REST and GraphQL APIs has spiked a number of times, as high as 26 seconds, and we saw an increase in error rates during this time. We're continuing to implement mitigations, and have seen latency and error rates return to baseline over the last 30 minutes. We are continuing to monitor the issue.
Posted Sep 24, 2025 - 00:42 UTC

Investigating

We've seen a recurrence of the latency issue and we're continuing to investigate.
Posted Sep 23, 2025 - 23:29 UTC

Monitoring

We have seen recovery on the impacted services and are investigating possible root causes.
Posted Sep 23, 2025 - 21:48 UTC

Update

We are continuing to investigate this issue.
Posted Sep 23, 2025 - 21:04 UTC

Update

We have applied mitigations and are still investigating the cause of the increased latency.
Posted Sep 23, 2025 - 20:23 UTC

Investigating

We are investigating reports of increased latency on our REST and GraphQL APIs.
Posted Sep 23, 2025 - 19:57 UTC
This incident affected: REST API.