Between August 16th and September 24th, 2025, some customers experienced elevated API latency and intermittent errors when using Buildkite’s REST and GraphQL APIs.
On September 23rd at 18:04 UTC, the impact became severe and universal; alarms were raised and several customers reached out to us directly. The degree of impact varied by customer and endpoint, with common pain points including creating builds and fetching build lists.
A combination of increased API usage and changing workload patterns drove higher‑than‑usual network throughput. While investigating, we discovered unexpected and undocumented throughput limits on our database proxy service. During the incident, hitting these limits led to intermittent request latency and timeouts.
Increasing the CPU and memory resources available to our proxy service also raised its network throughput limits, and the error rates and latencies began to decrease.
We discovered at this time that this throughput limit had been the major contributing factor to the inconsistent latencies that some customers had experienced since August 16th.
The nature of this issue meant that customers experienced intermittent latency spikes, usually as a result of the amount of data that needed to travel between our applications and our databases. This intermittency contributed significantly to the time between the initial report of the incident, its confirmation, and its resolution. We’re implementing better customer-specific observability and monitoring as a result.
Limitations in our compute platform mean that we cannot reliably detect or alert on these network throughput limits. We’re evaluating moving our proxy service to an alternative compute solution so that we can provide better observability, and we’ve scaled up other network-sensitive services as a preventative measure.
We are also optimizing requests that return large result sets to reduce load and response sizes where practical.