High job dispatch latency
Incident Report for Buildkite
Postmortem

Over April and May 2022, Buildkite had five distinct periods of significantly degraded performance, all linked by the same root cause: spikes in storage write latency on our main database that were not correlated with changes in our workload.

We have been working closely and extensively with AWS to diagnose and mitigate the underlying issue. As a result, since May 19th we have stopped experiencing the non-workload-correlated latency spikes that were causing these incidents.

What happened

During each incident, we observed a 10x increase in storage write latency to our main database without any correlation to changes in our workload. Because this was an issue we hadn’t seen before, the first incident took us some time to identify, escalate to AWS, and work with them on a mitigation. Once we realized there was potential for it to recur, we added additional monitoring to detect the issue faster, added an additional engineer to the on-call roster, and developed a custom escalation process with AWS. Together, those actions meant each incident was shorter than the previous one.

The increase in write latency meant Buildkite stayed online, but latency on most requests was highly elevated. Mostly-read requests (like the buildkite.com web dashboard) would typically load, just slowly. Some operations required for builds to run at full speed were particularly slow, resulting in drastically reduced build throughput for many customers. The key performance issues were pipeline uploads, job accept/start/finish by the agent, and the server-side “dispatch” process, where Buildkite matches the pool of ready-to-run jobs with the pool of available agents. These processes are all write-heavy and therefore suffered when storage write latency increased.
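
As a rough illustration of why dispatch suffers when writes slow down, the sketch below shows the general shape of a dispatch loop: every matched job needs several row writes inside a transaction, so a 10x latency penalty on each write compounds quickly. The model names, states, and columns are hypothetical placeholders, not Buildkite’s actual implementation or schema.

    # Purely illustrative: Job, Agent, and DispatchEvent are hypothetical
    # models, not Buildkite's actual schema.
    class Dispatcher
      def dispatch_ready_jobs
        Job.where(state: "scheduled").find_each do |job|
          Job.transaction do
            # Lock an idle agent so concurrent dispatchers skip it.
            agent = Agent.where(state: "idle").lock("FOR UPDATE SKIP LOCKED").first
            next unless agent

            job.update!(state: "assigned", agent_id: agent.id)          # write 1
            agent.update!(state: "busy")                                # write 2
            DispatchEvent.create!(job_id: job.id, agent_id: agent.id)   # write 3
          end
        end
      end
    end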

The size and write throughput of this database significantly constrained our ability to work around the incidents our vendor was experiencing. We explored failing over to new Availability Zones, creating new read replicas for promotion, and accelerating a possible migration to Amazon Aurora. However, we had reason to believe the issue would have been present on any replacement RDS PostgreSQL instance, and a zero- or low-downtime migration to Aurora wasn’t viable with our multi-terabyte tables and write throughput: the disk on our primary would fill with write-ahead log (WAL) before any Aurora replica was online and up to date.
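
To make the WAL constraint concrete, the sketch below shows one way to measure how many bytes of WAL a streaming replica still has to replay. It assumes a standard PostgreSQL 10+ streaming-replication setup and an existing ActiveRecord connection to the primary; pg_stat_replication and pg_wal_lsn_diff are standard PostgreSQL features.

    # Minimal sketch: bytes of WAL a streaming replica has yet to replay,
    # measured from the primary (assumes PostgreSQL 10+).
    lag_bytes = ActiveRecord::Base.connection.select_value(<<~SQL)
      SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
      FROM pg_stat_replication
      ORDER BY 1 DESC
      LIMIT 1
    SQL

    # If the primary generates WAL faster than a catching-up replica can
    # replay it, this gap only grows and the retained WAL fills the disk.
    puts "replica lag: #{lag_bytes.to_i} bytes"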

The final incident was on May 19th, and shortly after that AWS confirmed that the underlying issue had been resolved. The additional monitoring we established to detect the issue confirms that there have been no spikes in write latency uncorrelated with workload since May 19th, and we have returned to standard staffing levels on our on-call roster.

How we’re becoming more resilient to issues like this

As identified in our reliability review earlier this year, we’ve been working across multiple teams and projects to improve our database dependencies. AWS have been heavily involved in some of the projects, offering their expertise at running PostgreSQL at scale and helping to refine our plans. As each one completes, our database performance improves and we gain new options for responding to any future incidents.

We have provisioned a new read replica (in addition to the Multi-AZ standby instance) to give us greater failover control and shift load from our primary instance.
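
As a sketch of how a replica like this can take read traffic off the primary in a Rails 6 application, the snippet below uses Rails’ built-in multiple-database roles; the :primary and :primary_replica connection names and the Build query are placeholders for illustration.

    # Sketch of routing reads to a replica with Rails 6 multiple-database roles.
    # The connection names are placeholders defined in config/database.yml.
    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true

      connects_to database: { writing: :primary, reading: :primary_replica }
    end

    # Read-only work can then be pinned to the replica explicitly:
    ApplicationRecord.connected_to(role: :reading) do
      Build.where(state: "finished").count   # served by the replica, not the primary
    end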

We have moved a large write-heavy table to a separate database, reducing the main database storage by 24% from 45TB to 34TB. The support for multiple databases introduced in Rails 6.0 made this an accessible option for our application.
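
A minimal sketch of that Rails 6 pattern is below, assuming a hypothetical :metrics database and JobEvent model rather than the table we actually moved.

    # Sketch of giving one write-heavy table its own database using Rails 6
    # multiple-database support. :metrics and JobEvent are hypothetical
    # placeholders, not the table that was actually moved.
    class MetricsRecord < ApplicationRecord
      self.abstract_class = true

      # :metrics is defined alongside the main database in config/database.yml
      connects_to database: { writing: :metrics, reading: :metrics }
    end

    class JobEvent < MetricsRecord
      # Reads and writes for this model now go to the separate database,
      # removing its storage and write traffic from the primary.
    end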

We are partitioning our large time-series tables (e.g. builds, jobs, artifacts) to improve the efficiency of caches, indexes, and VACUUM operations. Builds older than a few days are rarely accessed, so partitioning builds and their dependent data means old data can (mostly) stay on disk, with in-memory caches left for recent data.
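
The sketch below shows the kind of declarative range partitioning PostgreSQL supports, written as a simplified Rails migration. It assumes PostgreSQL 11 or later (needed for a primary key on a partitioned table); the table and columns are placeholders, and partitioning an existing multi-terabyte table in place is considerably more involved than this.

    # Simplified sketch of PostgreSQL declarative range partitioning for a
    # time-series table. Names are placeholders, not the real schema.
    class CreatePartitionedBuilds < ActiveRecord::Migration[6.0]
      def up
        execute <<~SQL
          CREATE TABLE builds_partitioned (
            id         bigserial,
            state      text        NOT NULL,
            created_at timestamptz NOT NULL,
            PRIMARY KEY (id, created_at)  -- partition key must be part of the PK
          ) PARTITION BY RANGE (created_at);

          -- Recent partitions stay hot in memory; older ones are rarely touched.
          CREATE TABLE builds_2022_05 PARTITION OF builds_partitioned
            FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');
          CREATE TABLE builds_2022_06 PARTITION OF builds_partitioned
            FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
        SQL
      end

      def down
        execute "DROP TABLE builds_partitioned CASCADE"
      end
    end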

We are building processes to delete data that no longer needs to be accessed, to reduce our overall storage size.
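
One common shape for such a process, as a sketch: batched deletes that keep individual transactions short. The Artifact model and the one-year window here are hypothetical placeholders, not an actual Buildkite retention policy.

    # Illustrative sketch of a batched purge job; the model and retention
    # window are hypothetical, not an actual retention policy.
    class PurgeUnneededData
      def run
        Artifact
          .where("created_at < ?", 1.year.ago)
          .in_batches(of: 10_000) do |batch|
            batch.delete_all   # small batches keep locks and transactions short
          end
      end
    end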

We will continue to invest in evolving our architecture to support our business growth. We intend to evaluate Amazon Aurora as a replacement for Amazon RDS for PostgreSQL, and explore further database sharding strategies.

Conclusion

We’re sorry for the disruption these events caused. These incidents accelerated some of the work we had already planned to improve the reliability of our databases, and we have adjusted our priorities to ensure we’re working to meet and exceed our commitment to consistent reliability.

Posted Jun 23, 2022 - 07:10 UTC

Resolved
This issue was escalated to AWS, who applied manual remediations to the primary RDS instance that powers job dispatching. We continue to work with AWS to identify and remediate the root cause of the repeated high write latency.
Posted May 19, 2022 - 23:02 UTC
Monitoring
The issue has been identified and service is beginning to return to normal. We are monitoring.
Posted May 19, 2022 - 22:49 UTC
Investigating
We are experiencing elevated latency in dispatching build jobs to agents.
Posted May 19, 2022 - 22:44 UTC
This incident affected: Job Queue.