Last week Buildkite experienced two incidents of degraded performance, each lasting multiple hours. During these periods, customers experienced elevated latency and error rates, and were unable to reliably run builds.
During the incidents, Buildkite engineers were automatically paged to respond and investigate. They identified overall high load and lock contention on our largest database, but no single clear cause. Multiple mitigations were applied: a recent table partitioning change was reverted in case it was a factor, low-priority asynchronous work was paused to shed load, and traffic from bots crawling public pipelines was identified as contributing to the load and rate limited. Further load-shedding mitigations were prepared, but the incidents subsided before they were required.
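To illustrate the bot mitigation, here is a minimal sketch of a fixed-window rate limiter in front of public pipeline pages. The limits, keys, and behaviour are hypothetical and simplified, not Buildkite's actual implementation.

```python
# Minimal sketch of per-client rate limiting for public pipeline pages.
# The window size, budget, and client key are hypothetical.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30  # hypothetical per-client budget per window

# Per-client window start time and request count (keyed by e.g. IP address).
_windows = defaultdict(lambda: [0.0, 0])

def allow_request(client_key: str) -> bool:
    """Return True if the client is within its budget for the current window."""
    now = time.monotonic()
    window = _windows[client_key]
    if now - window[0] >= WINDOW_SECONDS:
        window[0], window[1] = now, 0  # start a fresh window
    window[1] += 1
    return window[1] <= MAX_REQUESTS_PER_WINDOW

# Requests over budget would get an HTTP 429 instead of reaching the database.
```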
After the incidents, our engineering team swarmed on further investigation and mitigation. Our main primary and replica databases were upgraded to machines with 33% more CPUs and RAM, as well as higher I/O and memory bandwidth. Our application layer was given 90% more CPU cores to reduce the thread preemption that contributes to database lock contention. Substantial read load was shifted from the primary database to a replica.
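To give a sense of what shifting read load looks like at the application layer, here is a minimal sketch that routes read-only queries to a replica. The connection strings, table name, and routing logic are hypothetical, not Buildkite's actual code.

```python
# Minimal sketch of read/write splitting between a primary and a replica.
# Connection strings and the example query are hypothetical.
import psycopg2

PRIMARY_DSN = "postgresql://db-primary.internal/buildkite"    # hypothetical
REPLICA_DSN = "postgresql://db-replica-1.internal/buildkite"  # hypothetical

primary = psycopg2.connect(PRIMARY_DSN)
replica = psycopg2.connect(REPLICA_DSN)

def connection_for(read_only: bool):
    # Reads that can tolerate a little replication lag go to the replica,
    # shedding load from the primary; all writes stay on the primary.
    return replica if read_only else primary

with connection_for(read_only=True).cursor() as cur:
    cur.execute("SELECT state, count(*) FROM builds GROUP BY state")
    rows = cur.fetchall()
```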
We also worked closely with our vendor to analyze the performance anomaly and assess the impact of the partitioning change that had been in place on both nights. The increased lock contention observed on the database was consistent with the extra index locking required when attaching that partition. We believe this partitioning change, combined with overall high load, was a significant factor in the degraded performance.
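For context on where that locking comes from, here is a sketch of the kind of operation involved, assuming PostgreSQL declarative partitioning. The table and partition names are illustrative only, and the details vary with PostgreSQL version and schema.

```python
# Sketch of attaching a time-range partition in PostgreSQL.
# Table and partition names are illustrative, not Buildkite's schema.
import psycopg2

conn = psycopg2.connect("postgresql://db-primary.internal/buildkite")  # hypothetical
conn.autocommit = True

with conn.cursor() as cur:
    # Attaching takes locks on the parent and the new partition, and any
    # indexes on the parent must be matched or built on the partition,
    # which can contend with a busy write workload.
    cur.execute("""
        ALTER TABLE job_events
        ATTACH PARTITION job_events_2023_08
        FOR VALUES FROM ('2023-08-01') TO ('2023-09-01');
    """)
```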
Tactically, in the short to medium term, we’re confident this combination of upgrades and mitigations will prevent the issue from happening again, based on the vastly improved metrics collected during peak periods since the incidents. We will be closely monitoring the situation. In this 14-day graph you can see the historic level of lock contention during the week prior to the incidents, as well as the very significant decrease in lock contention resulting from the mitigations applied after August 17th.
Strategically, our engineering teams have been working hard on multiple streams of work towards a scalable, sharded database architecture. We have now completed the application changes necessary to seamlessly support multiple database shards, launched a second (initially empty) database shard, and have been placing new signups on this second shard for the past few weeks. Our current focus is migrating existing customers from the original database shard to new destination shards. We are developing migration strategies with the goal of minimizing any customer disruption, and plan to discuss these with you soon.
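As a rough illustration of the routing involved, here is a minimal sketch of a shard directory that places new signups on the new shard while existing organizations stay where they are until migrated. The shard names, connection strings, and assignment policy are hypothetical.

```python
# Minimal sketch of routing organizations to database shards via a directory.
# Shard names, DSNs, and the assignment policy are hypothetical.
import psycopg2

SHARDS = {
    "shard-0": "postgresql://shard-0.internal/buildkite",  # original database
    "shard-1": "postgresql://shard-1.internal/buildkite",  # new, initially empty
}

# Directory recording which shard holds each organization's data.
_org_to_shard: dict[str, str] = {}

def assign_new_organization(org_id: str) -> str:
    # New signups land on the newest shard; existing organizations keep
    # their current shard until they are explicitly migrated.
    _org_to_shard[org_id] = "shard-1"
    return _org_to_shard[org_id]

def connection_for_org(org_id: str):
    shard = _org_to_shard.get(org_id, "shard-0")
    return psycopg2.connect(SHARDS[shard])
```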
Once we have spread our customers across the new database shards, we will see much improved performance and reliability: with these database bottlenecks eliminated, there are no other architectural issues in the way of scaling Buildkite with customer demand.