Degraded performance due to latency
Incident Report for Buildkite
Postmortem

Service Impact

Last week Buildkite experienced two incidents of degraded performance, each lasting multiple hours:

  • 2023-08-15 20:23 until 22:44 UTC (141 minutes)
  • 2023-08-16 18:22 until 22:22 UTC (240 minutes)

During these times, customers experienced elevated latency and error rates, and were not able to reliably run builds.

Incident Summary

During the incidents, Buildkite engineers were automatically paged to respond and investigate. They identified overall high load and lock contention on our largest database, but no single clear cause. Multiple mitigations were applied: a recent table partitioning change was reverted in case it was a factor, low-priority asynchronous work was paused to shed load, and traffic from bots crawling public pipelines, which was found to be contributing to load, was rate limited. Further load-shedding mitigations were prepared, but the incidents subsided before they were required.
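As an illustration only, here is a minimal sketch of the kind of per-client rate limiting that can be applied to bot traffic on public pipelines. It is a generic token bucket, not our production implementation; the client identity, limits, and in-memory storage are all assumptions.

    import time

    # Hypothetical per-client token bucket: `rate` requests per second,
    # with bursts of up to `burst`. Limits and client identity are assumptions.
    class TokenBucket:
        def __init__(self, rate: float, burst: float):
            self.rate = rate
            self.burst = burst
            self.tokens = burst
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill for the time elapsed since the last call, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # over the limit: the web tier would answer HTTP 429

    # One bucket per client identity (for example, a crawler's IP address).
    _buckets: dict[str, TokenBucket] = {}

    def allow_request(client_id: str) -> bool:
        bucket = _buckets.setdefault(client_id, TokenBucket(rate=1.0, burst=10.0))
        return bucket.allow()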

After the incidents, our engineering team swarmed on further investigation and mitigation. The primary and replica instances of our main database were upgraded to machines with 33% more CPUs and RAM and higher IO and memory bandwidth. Our application layer was upgraded to 90% more CPU cores, to reduce the thread preemption that was contributing to database lock contention. Substantial read load was shifted from the primary database to a replica.
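As a rough sketch of how read load can be shifted to a replica (not our actual application code; the connection strings, table, and column names are assumptions), latency-tolerant read-only queries are routed to the replica while writes stay on the primary:

    import psycopg2

    # Hypothetical connection strings and table/column names; real code would
    # use a connection pool rather than connecting per query.
    PRIMARY_DSN = "host=db-primary dbname=app"
    REPLICA_DSN = "host=db-replica dbname=app"

    def run_query(sql, params, readonly):
        """Route read-only queries to the replica and everything else to the primary."""
        conn = psycopg2.connect(REPLICA_DSN if readonly else PRIMARY_DSN)
        try:
            with conn, conn.cursor() as cur:  # `with conn` commits or rolls back
                cur.execute(sql, params)
                return cur.fetchall() if readonly else None
        finally:
            conn.close()

    # A latency-tolerant read that can be served from the replica.
    def list_recent_builds(pipeline_id):
        return run_query(
            "SELECT id, state, created_at FROM builds"
            " WHERE pipeline_id = %s ORDER BY created_at DESC LIMIT 50",
            (pipeline_id,),
            readonly=True,
        )

    # Writes always go to the primary.
    def create_build(pipeline_id, commit_sha):
        run_query(
            "INSERT INTO builds (pipeline_id, commit_sha) VALUES (%s, %s)",
            (pipeline_id, commit_sha),
            readonly=False,
        )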

We also worked closely with our vendor to analyze the performance anomaly and to assess the impact of the partitioning change that had been in place on those two nights. The increased lock contention observed on the database was consistent with the extra index locking required by attaching that partition. We believe this partitioning change, combined with overall high load, was a significant factor in the degraded performance.
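For background on why attaching a partition can add lock contention, the sketch below shows a common PostgreSQL-style pattern for attaching with minimal locking; the table names, column, and bounds are hypothetical, and this is not the specific change we ran. Pre-building the matching index and a validated CHECK constraint means the ATTACH PARTITION step does not need to build indexes or scan the new partition while it holds locks on the busy parent table.

    import psycopg2

    # Illustrative only: `jobs`, `jobs_2023_09`, and `created_at` are hypothetical
    # names, and this is not the migration we ran.
    STEPS = [
        # 1. Matching index, built without locking the parent.
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS jobs_2023_09_created_at_idx"
        " ON jobs_2023_09 (created_at)",
        # 2. CHECK constraint implying the partition bound, validated while the
        #    table is still detached, so ATTACH can skip its own validation scan.
        "ALTER TABLE jobs_2023_09 ADD CONSTRAINT jobs_2023_09_bound CHECK"
        " (created_at >= '2023-09-01' AND created_at < '2023-10-01')",
        # 3. The attach itself is now close to metadata-only.
        "ALTER TABLE jobs ATTACH PARTITION jobs_2023_09"
        " FOR VALUES FROM ('2023-09-01') TO ('2023-10-01')",
        # 4. The now-redundant CHECK constraint can be dropped.
        "ALTER TABLE jobs_2023_09 DROP CONSTRAINT jobs_2023_09_bound",
    ]

    def run(dsn: str) -> None:
        conn = psycopg2.connect(dsn)
        conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
        try:
            with conn.cursor() as cur:
                for statement in STEPS:
                    cur.execute(statement)
        finally:
            conn.close()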

Changes we're making

Tactically, in the short to medium term, the much improved metrics collected during peak periods since the incidents give us confidence that this combination of upgrades and mitigations will prevent the issue from happening again, and we will be closely monitoring the situation. Over a 14-day window, the lock contention metrics show the level of contention during the week prior to the incidents and the very significant decrease after the mitigations landed on August 17th.
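For illustration, one simple way to sample the lock contention behind that trend (a generic sketch assuming a PostgreSQL-compatible database, not our actual monitoring pipeline) is to count backends currently waiting on a lock:

    import psycopg2

    # Generic sketch: count backends currently waiting on a lock, one simple
    # proxy for the lock contention trend described above.
    LOCK_WAITERS_SQL = (
        "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'"
    )

    def sample_lock_waiters(dsn: str) -> int:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(LOCK_WAITERS_SQL)
                return cur.fetchone()[0]
        finally:
            conn.close()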

Strategically, our engineering teams have been working hard on multiple streams of work towards a scalable, sharded database architecture. We have now completed the application changes necessary to seamlessly support multiple database shards, launched a second (empty) database shard, and have been placing new signups on this second shard for the past few weeks. Our current focus is migrating existing customers from the original database shard to new destination shards. We are developing migration strategies with the goal of minimizing customer disruption, and plan to discuss these with you soon.
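To make the routing model concrete, here is a minimal sketch of organization-to-shard routing; the shard catalogue, mapping store, and names are assumptions for illustration, not our implementation. Each organization is pinned to a shard at signup, and the application looks up the shard before opening a connection.

    import psycopg2

    # Hypothetical shard catalogue and org-to-shard mapping; in practice the
    # mapping would live in a small routing store rather than an in-memory dict.
    SHARDS = {
        1: "host=db-shard-1 dbname=app",  # original shard
        2: "host=db-shard-2 dbname=app",  # new shard receiving signups
    }
    NEW_SIGNUP_SHARD = 2

    org_shard: dict[int, int] = {}

    def assign_shard(org_id: int) -> int:
        """Pin a newly signed-up organization to the current signup shard."""
        org_shard[org_id] = NEW_SIGNUP_SHARD
        return NEW_SIGNUP_SHARD

    def connect_for_org(org_id: int):
        """Open a connection to whichever shard holds this organization's data."""
        shard_id = org_shard.get(org_id, 1)  # existing orgs stay on the original shard
        return psycopg2.connect(SHARDS[shard_id])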

Once we have spread our customers across the new database shards, we expect much improved performance and reliability: with these database bottlenecks eliminated, there will be no other architectural issues in the way of scaling Buildkite with customer demand.

Posted Aug 22, 2023 - 04:49 UTC

Resolved
Latency has reduced to normal levels and system performance has been restored.
Posted Aug 16, 2023 - 23:55 UTC
Monitoring
Latency has reduced to normal levels. We will continue to deploy improvements to reduce database contention throughout the day.
Posted Aug 16, 2023 - 23:23 UTC
Update
Latency has reduced to normal levels. We will continue to deploy improvements to reduce database contention throughout the day.
Posted Aug 16, 2023 - 22:50 UTC
Update
Latency has improved but is still elevated. We are deploying another change to reduce database lock contention.
Posted Aug 16, 2023 - 22:12 UTC
Update
We have blocked some problematic patterns of bot requests manually and will shortly block them automatically. We are working on more comprehensive fixes at the same time. We will provide another status update in 30 minutes.
Posted Aug 16, 2023 - 22:00 UTC
Update
We have blocked some problematic patterns of bot requests manually and will shortly block them automatically. We are working on more comprehensive fixes at the same time.
Posted Aug 16, 2023 - 21:28 UTC
Update
We continue to apply mitigations to shed database load. We will provide another status update in 30 minutes.
Posted Aug 16, 2023 - 20:54 UTC
Update
We continue to apply mitigations to shed database load. We will provide another status update in 30 minutes.
Posted Aug 16, 2023 - 20:22 UTC
Update
We continue to apply mitigations to shed database load. We will provide another status update in 30 minutes.
Posted Aug 16, 2023 - 19:52 UTC
Update
Latency has reduced somewhat but the system has not yet recovered. We continue to apply mitigations to shed database load.
Posted Aug 16, 2023 - 19:31 UTC
Update
Latency on the Agent API has further degraded. We continue to apply mitigations to shed database load.
Posted Aug 16, 2023 - 19:26 UTC
Update
We continue to apply mitigations to shed database load. We will provide another status update in 30 minutes.
Posted Aug 16, 2023 - 19:23 UTC
Update
We are applying mitigations to shed load on the database. We will provide a status update on progress every 30 minutes.
Posted Aug 16, 2023 - 18:54 UTC
Update
We are observing high load on the database, causing performance issues with the API and Web. We are taking action to shed database load and continue to investigate the root cause.
Posted Aug 16, 2023 - 18:46 UTC
Investigating
We are currently investigating this issue.
Posted Aug 16, 2023 - 18:31 UTC
This incident affected: Web, Agent API, and REST API.