From 00:02 to 00:27 UTC on March 14th, Buildkite Pipelines experienced increased latency and error rates impacting all customers. Between 00:02 and 00:45, some customers experienced severe performance degradation, including periods during which no builds progressed.
Similar to the March 5th incident, the primary impact seen by customers was a delay in the time it took for a job to be started by an agent, as agents experienced latency and elevated error rates when communicating with the Buildkite API. The graph below shows the average latency customers experienced between when a build was created and when the first job in that build started.
Additionally, the Buildkite website experienced increased latency and error rates during this time.
Many customers have scheduled builds set to run at 00:00 UTC, which results in a spike in the number of builds created and processed at that time each day. March 14th was no exception, but on this occasion the expected spike arrived on top of already exceptional load on one of our database instances.
This combined load caused high enough concurrency that the database experienced excessive LWLock:lockmanager contention. When the database reaches this critical point, it enters a state from which it can only recover by shedding load.
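For anyone wanting to spot this kind of contention on their own Postgres instances, the waits are visible in pg_stat_activity (the LWLock is reported as LockManager on recent Postgres versions, and surfaced as LWLock:lockmanager by tools such as RDS Performance Insights). A minimal sketch using psycopg, with the connection string as a placeholder:

```python
# Sketch: snapshot what active backends are waiting on. During this kind of
# incident, the LWLock / LockManager row dominates the output.
import psycopg

WAIT_PROFILE = """
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
"""

# Placeholder connection string -- point this at the instance under load.
with psycopg.connect("postgresql://localhost/pipelines") as conn, conn.cursor() as cur:
    cur.execute(WAIT_PROFILE)
    for wait_type, wait_event, backends in cur.fetchall():
        print(f"{backends:4d}  {wait_type}:{wait_event}")
```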
Our engineers were paged automatically at 00:09 UTC and confirmed that the database had entered this state. Load shedding from the affected database instance began at 00:19 UTC. In previous incidents the chosen mechanism of load shedding (temporarily halting background processing for the affected database instance) has had the intended effect very quickly, dropping load on the database within 1 minute.
This time the effect was slower, with load on the database recovering over the following 8 minutes. Once database load recovered, performance was restored at 00:27 UTC for customers not on that database instance, and our engineers began re-enabling background processing for customers on the impacted database instance, completing that by 00:45 UTC.
Excessive LWLock:lockmanager contention has been a common occurrence in past Buildkite Pipelines incidents. Before horizontal sharding, our early efforts to scale our Pipelines database included partitioning some key high-volume Postgres tables, including our builds table. Most queries on these tables have been optimized using partition pruning, but for certain queries that is not possible. For those queries, “non-fast-path locks” must be acquired for every partition (and every index on those partitions) to find the relevant builds.
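To illustrate the difference (with a hypothetical builds table, a time-based partition key, and column names invented for the example): a query that filters on the partition key is pruned to a single partition and fits comfortably within the small fixed number of fast-path relation locks each backend gets (16 in most Postgres versions), while a query that filters only on another column must lock every partition and every index on those partitions, pushing the excess through the shared lock manager; that is where the contention appears under high concurrency. A sketch:

```python
# Sketch: compare plans for a pruned vs a non-pruned query against a
# hypothetical partitioned "builds" table (all names invented for illustration).
import psycopg

QUERIES = {
    # Filters on the (assumed) partition key, so the planner prunes to one
    # partition and the backend only needs a handful of locks.
    "pruned": """
        EXPLAIN SELECT * FROM builds
        WHERE created_at >= now() - interval '1 day' AND state = 'scheduled'
    """,
    # No partition-key filter: every partition, and every index on those
    # partitions, must be locked. Only a handful of relation locks per backend
    # stay on the fast path; the rest go through the shared lock manager,
    # which is where LWLock:lockmanager contention shows up under load.
    "not pruned": """
        EXPLAIN SELECT * FROM builds
        WHERE uuid = '00000000-0000-0000-0000-000000000000'
    """,
}

# Placeholder connection string.
with psycopg.connect("postgresql://localhost/pipelines") as conn, conn.cursor() as cur:
    for label, sql_text in QUERIES.items():
        cur.execute(sql_text)
        plan = "\n".join(row[0] for row in cur.fetchall())
        print(f"--- {label} ---\n{plan}\n")
```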
In this incident, the existing workload on that database instance was already performing such queries at a rate that was unusually high compared with our other databases, and the addition of similar queries from scheduled builds at midnight UTC tipped the load over the edge.
Our efforts to horizontally shard the Pipelines databases have given us a path to remove these now-unused partitions, so our first response was to fast-track our plan to detach those partitions from the impacted database shards. This work was completed in the hours following the incident.
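The mechanics of that change are standard Postgres DDL. A rough sketch of its shape, with partition names invented for illustration; ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY (Postgres 14+) avoids blocking concurrent queries on the parent table, but cannot run inside a transaction block, hence the autocommit connection:

```python
# Sketch: detach partitions that are no longer needed, so non-pruned queries
# stop having to lock them. Partition names here are invented.
import psycopg
from psycopg import sql

UNUSED_PARTITIONS = ["builds_part_00", "builds_part_01"]  # hypothetical names

# DETACH ... CONCURRENTLY cannot run inside a transaction block, so connect
# with autocommit enabled. Placeholder connection string.
with psycopg.connect("postgresql://localhost/pipelines", autocommit=True) as conn:
    with conn.cursor() as cur:
        for name in UNUSED_PARTITIONS:
            cur.execute(
                sql.SQL("ALTER TABLE builds DETACH PARTITION {} CONCURRENTLY").format(
                    sql.Identifier(name)
                )
            )
            print(f"detached {name}")
```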
This has significantly reduced the maximum number of locks used by the database compared with the week prior:
We’ve since rolled this change out to all of our database shards.
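The lock usage itself can be measured directly from pg_locks, which also records whether each lock was taken via the fast path. The exact metric behind the graph above comes from our own instrumentation, but a sketch of that style of measurement looks like this:

```python
# Sketch: count relation locks currently held, split into fast-path locks and
# those that went through the shared lock manager. Sampled over time, the peak
# of this count is the kind of "maximum number of locks" figure shown above.
import psycopg

LOCKS = """
SELECT fastpath, count(*)
FROM pg_locks
WHERE locktype = 'relation'
GROUP BY fastpath;
"""

# Placeholder connection string.
with psycopg.connect("postgresql://localhost/pipelines") as conn, conn.cursor() as cur:
    cur.execute(LOCKS)
    for fastpath, total in cur.fetchall():
        label = "fast-path" if fastpath else "via lock manager"
        print(f"{total} relation locks {label}")
```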
The load this database was under before 00:00 UTC was unusual compared to other database instances, but this partitioning change means that particular load is no longer a concern.
Separately, we are always reviewing our platform’s traffic patterns, and this load has highlighted one opportunity to further optimize the performance of queries for annotations.
We also reviewed how quickly we responded to this incident and have implemented a new monitor that would have triggered 7 minutes earlier, allowing us to respond faster should a single database instance experience excessive load.
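We have not reproduced the monitor's exact definition here, but conceptually it watches for sustained lock-manager pressure on a single database instance. A hypothetical sketch of that kind of check, with the query, thresholds, and alerting hook all standing in as placeholders:

```python
# Hypothetical sketch of the kind of check such a monitor could run: page when
# too many backends sit waiting on the lock manager for several checks in a row.
# Thresholds, interval, and alert hook are placeholders, not our production setup.
import time
import psycopg

THRESHOLD = 20        # waiting backends considered excessive (placeholder)
SUSTAINED = 3         # consecutive breaches before alerting (placeholder)
INTERVAL_SECONDS = 10

WAITING = """
SELECT count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('LockManager', 'lock_manager');  -- name varies by Postgres version
"""

breaches = 0
with psycopg.connect("postgresql://localhost/pipelines", autocommit=True) as conn:
    with conn.cursor() as cur:
        while True:
            cur.execute(WAITING)
            waiting = cur.fetchone()[0]
            breaches = breaches + 1 if waiting >= THRESHOLD else 0
            if breaches >= SUSTAINED:
                # In production this would page an on-call engineer.
                print(f"ALERT: {waiting} backends waiting on the lock manager")
                breaches = 0
            time.sleep(INTERVAL_SECONDS)
```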
These changes reduce the likelihood of a single database instance entering an unhealthy state and improve our ability to respond quickly should it happen. However, our top priority remains improving load isolation between customer workloads by leveraging our newly sharded databases. We have seen how isolated background workers have had a sizable positive impact on the stability of our platform, and we are now working on bringing that isolation to our web servers. Our Agent API serves all communication between your agents and our platform, placing it in the critical path for ensuring your builds complete without interruption. As such, it is our first target for isolation, and we will be sharing more information about this in the coming weeks.