Degraded performance on Notifications and Job Dispatches
Incident Report for Buildkite
Postmortem

Interim report: Degraded performance on notifications and job dispatches

We would like to share this interim report on the degraded performance for notifications and job dispatches you experienced at the start of this month.

Summary

From Tuesday 31st January until Monday 6th of February, Buildkite experienced varying levels of degraded performance of our asynchronous tasks.

The key contributing factor was increased load on our primary database.

The primary customer impacts were:

Job Dispatches: Buildkite Agents waiting for pipeline jobs to be assigned via the Agent API saw delays of up to 70 seconds between the 1st and 2nd of February.

Build status updates: Delays here result in builds taking longer to reflect their final status and have a knock-on effect on subsequent notifications. We saw delays of up to 12 minutes between the 31st of January and 3rd of February.

Commit statuses: Reflecting a green tick on GitHub and other version control systems. This is often used as a permissions gate to allow merging a Pull Request. Between 30th of January and 3rd of February we saw delays of up to 52 minutes, where normally we see sub-second latency.

Job notifications: Used to reflect step status in GitHub. Some customers use this to assign pipeline jobs to agents using the acquire job feature, so delays result in jobs taking longer to start. Between January 30th and February 6th we saw delays of up to 5 hours.

See Appendix for a full breakdown of impact.

We implemented a series of mitigations between January 31st and February 6th to reduce load on our primary database, including moving some query workloads to a replica database. This reduced contention and load on the primary database, which in turn caused performance to return to satisfactory levels.

Technical Background

To explain how these delays occurred, we first need to explain in a bit more detail how builds work. The fundamental building block for pipelines is a job, and a build typically consists of many jobs. Here’s a simplified flow chart of Job and Build state machines.

Build states:

Some of these transitions happen synchronously when an API request is made, for instance when an agent reports a job as finished we immediately update the database record to reflect this. Others, such as assigning jobs to an agent or transitioning a running build to Passed or Failed, happen asynchronously.

Regardless of whether the transition happens synchronously or asynchronously, we schedule an asynchronous task to fire off any notifications from that job or build. This task checks to see what notifications are configured and then schedules more tasks to actually send the corresponding webhooks. We send on average 50-100 notifications per build, depending on the time of day. Generating each notification then creates many SQL queries.
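
The two-step fan-out described above can be sketched roughly as follows. This is a minimal illustration only, not Buildkite's implementation: the in-memory TaskQueue, the CONFIGURED table, and the task names fan_out and send are all hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskQueue:
    """In-memory stand-in for an asynchronous task queue (hypothetical)."""
    tasks: deque = field(default_factory=deque)

    def enqueue(self, name, **payload):
        self.tasks.append((name, payload))


# Hypothetical notification settings, keyed by pipeline.
CONFIGURED = {
    "pipeline-a": ["github_commit_status", "slack", "webhook"],
    "pipeline-b": [],
}


def on_state_change(queue, pipeline, event):
    """Step 1: every job or build transition schedules one fan-out task."""
    queue.enqueue("fan_out", pipeline=pipeline, event=event)


def fan_out(queue, pipeline, event):
    """Step 2: look up configured notifications and schedule one send task each."""
    for kind in CONFIGURED.get(pipeline, []):
        queue.enqueue("send", kind=kind, pipeline=pipeline, event=event)


queue = TaskQueue()
on_state_change(queue, "pipeline-a", "job.finished")
name, payload = queue.tasks.popleft()   # a worker picks up the fan-out task
fan_out(queue, **payload)
print([p["kind"] for _, p in queue.tasks])
# prints ['github_commit_status', 'slack', 'webhook']
```

The point of the sketch is the multiplication: one state change becomes one fan-out task, which in turn becomes one send task per configured notification.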

Over the course of the week we saw our highest ever load in terms of pipeline jobs (4% higher than our previous highest hourly load) and volume of outbound notifications.

At times of peak job creation, normally mid-morning PST, the system suffered a severe performance cliff. As we failed to keep up with the rate of notifications being sent, a backlog accumulated. This resulted in higher attempted concurrency of jobs and therefore database contention, which in turn slowed down our throughput. This reinforcing feedback cycle meant once we experienced the performance cliff it often took us many hours to recover.
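
A toy model can illustrate why this kind of feedback cycle is slow to recover from. The sketch below is not derived from our production data: the formula (throughput shrinking as the backlog grows) and every number in it are assumptions chosen only to show the shape of the problem.

```python
def simulate(arrivals, capacity, contention):
    """Return the backlog after each tick.

    Assumed model: effective throughput falls as the backlog grows,
    standing in for database contention from higher attempted concurrency.
    """
    backlog = 0.0
    history = []
    for arriving in arrivals:
        throughput = capacity / (1.0 + contention * backlog)
        backlog = max(0.0, backlog + arriving - throughput)
        history.append(backlog)
    return history


def ticks_to_recover(history, peak_ticks):
    """Ticks after the peak ends until the backlog first reaches zero."""
    for tick, backlog in enumerate(history[peak_ticks:], start=1):
        if backlog == 0.0:
            return tick
    return None  # never recovered within the simulated window


PEAK_TICKS = 10
# Arrivals exceed capacity during the peak, then drop well below it.
arrivals = [120.0] * PEAK_TICKS + [60.0] * 100

ideal = simulate(arrivals, capacity=100.0, contention=0.0)
contended = simulate(arrivals, capacity=100.0, contention=0.001)

print("peak backlog:", max(ideal), "vs", round(max(contended), 1))
print("ticks to recover:", ticks_to_recover(ideal, PEAK_TICKS),
      "vs", ticks_to_recover(contended, PEAK_TICKS))
```

In this model the contended run builds a larger backlog during the same peak and takes longer to drain it afterwards, even though arrivals and nominal capacity are identical in both runs.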

Mitigations

Our first priority was to limit customer impact by splitting notification tasks from other higher priority tasks. This gave us the capability to disable specific types of notifications and to better understand the impact this had.
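
As a rough sketch of what splitting tasks by type makes possible, the example below keeps one queue per task type, serves them in strict priority order, and lets an operator switch a type off. The Dispatcher class, the queue names, and the strict-priority policy are assumptions for illustration and do not describe our actual task system.

```python
from collections import deque


class Dispatcher:
    """One queue per task type, served highest priority first (hypothetical)."""

    def __init__(self, priorities):
        # priorities: queue names ordered from highest to lowest priority
        self.order = list(priorities)
        self.queues = {name: deque() for name in self.order}
        self.disabled = set()

    def enqueue(self, queue, task):
        self.queues[queue].append(task)

    def disable(self, queue):
        """Stop serving a queue; its tasks stay queued until re-enabled."""
        self.disabled.add(queue)

    def enable(self, queue):
        self.disabled.discard(queue)

    def next_task(self):
        """Return (queue, task) from the highest-priority enabled queue, or None."""
        for name in self.order:
            if name in self.disabled:
                continue
            if self.queues[name]:
                return name, self.queues[name].popleft()
        return None


d = Dispatcher(["job_dispatch", "build_status", "job_notifications"])
d.enqueue("job_notifications", "notify step 1")
d.enqueue("job_dispatch", "assign job 42")

d.disable("job_notifications")
print(d.next_task())  # ('job_dispatch', 'assign job 42')
print(d.next_task())  # None: notifications are held back while disabled
d.enable("job_notifications")
print(d.next_task())  # ('job_notifications', 'notify step 1')
```

Disabled tasks are held rather than dropped, so they can be worked through later once the system is stable.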

We then added rate limiting capabilities for notifications (these already existed for creating builds), which enabled us to recover from database contention more quickly.
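
One common way to build this kind of rate limiting is a token bucket. The sketch below is a generic example of that technique, not a description of the limiter we deployed; the class name, parameters, and injectable clock are assumptions made so the example is self-contained.

```python
import time


class TokenBucket:
    """Allow roughly `rate` tasks per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the example can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False to signal 'retry later'."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=50, capacity=100)
if bucket.try_acquire():
    pass  # send the notification now
else:
    pass  # leave it on the queue and try again later
```

Capping how fast notification tasks start keeps a backlog from turning into a burst of concurrent database work, at the cost of the backlog draining at a bounded rate.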

Interim Conclusion

We continue to work on understanding how we came to be in this situation and are devising strategies to prevent it from happening again.

At Buildkite, we value honesty and transparency, especially when we fall short of expectations. While we understand that this recent experience may have been frustrating for you, I want to assure you that our engineering and support teams went above and beyond to manage and resolve this complex incident. As someone new to Buildkite, it was really great to see them work and solve problems under pressure. We are confident that we have addressed the issue at hand and are taking steps to ensure it doesn't happen again.

We're currently in the process of gathering all the necessary information for a full incident report, which will include a detailed account of how the events unfolded, as well as a roadmap outlining our plans for future capacity and scalability mitigations. This report will be completed within the coming weeks and will be shared with you.

Thank you for being a Buildkite customer.

Daniel Vydra

CTO, Buildkite

Appendix

Customer Impact (UTC)

Job dispatch (Assigning jobs to agents via the agent API. Delays in this result in jobs taking longer to start):

2023-02-01 19:25 - 2023-02-02 01:30 Dispatch times vary from 30-70 seconds

2023-02-02 18:22 - 18:38 Dispatch times spike up to 70 seconds

Build status updates (Reflecting Job finished in the UI and API. Delays in this result in builds taking longer to reflect their final status and subsequent notifications being sent.):

2023-01-31 19:40 - 2023-02-01 00:20 Latency remains above 5 minutes and spikes as high as 12 minutes

2023-02-01 19:00 - 2023-02-01 22:30 Latency varies between 1 and 5 minutes

2023-02-03 18:30 - 22:20 Latency varies between 5 seconds (within normal levels) and 60 seconds (above normal levels)

GitHub commit status notifications (Reflecting a green tick on GitHub. This is often used as a permissions gate to allow users to merge a Pull Request. Delays here often result in customers not being able to deploy their changes as quickly):

2023-01-30 21:12 - 23:34 Latency of up to 50 minutes

2023-01-31 19:47 - 2023-02-01 02:30 Latency remains above 5 minutes and spikes as high as 20 minutes

2023-02-01 19:00 - 2023-02-02 01:50 Latency spikes up to 52 minutes

2023-02-02 18:23 - 18:40 Latency spikes up to 9 minutes

2023-02-03 22:15 - 22:55 Latency spikes up to 30 minutes

Job notifications (Used to reflect step status in GitHub. Some customers use this to assign jobs to agents using the acquire job feature, so delays result in jobs taking longer to start):

2023-01-30 21:12 - 23:34 Latency of up to 50 minutes

2023-01-31 19:47 - 2023-02-01 02:30 Latency remains above 5 minutes and spikes as high as 20 minutes

2023-02-01 19:00 - 2023-02-02 01:50 Latency spikes up to 52 minutes

2023-02-02 18:23 - 19:23 Latency spikes up to 9 minutes

2023-02-03 19:05 - 2023-02-04 03:40 Latency spikes up to 5 hours

2023-02-06 19:23 - 23:05 A small number of notifications are delayed by up to 26 minutes.

Agent status update (When an agent stops responding to our heartbeats we mark it as lost and any jobs assigned to it are marked failed. Customers may manually or automatically retry jobs in this state. Delays in this feature result in jobs taking longer to be marked as failed, which lengthens the turnaround time to correct them):

2023-01-31 17:30 - 2023-02-01 02:00 Latency peaks at 5 minutes

2023-02-01 17:30 - 2023-02-02 02:00 Latency peaks at 5 minutes

2023-02-02 18:00 - 2023-02-02 20:00 Latency peaks at 4 minutes

2023-02-03 18:30 - 2023-02-03 22:00 Latency peaks at 5 minutes

Posted Feb 16, 2023 - 04:10 UTC

Resolved
We have resolved issues with notifications and are now working on a backlog of tasks to update various statistics in the API and website.
We continue to investigate and work on longer-term mitigations for the ongoing issues.
Posted Feb 04, 2023 - 04:06 UTC
Monitoring
We have rolled out the change to automatically handle throttling for system stability and are now processing through the backlog of jobs. We will continue to monitor the system to ensure stability.
Posted Feb 04, 2023 - 03:30 UTC
Update
We are continuing to manually process the backlog of jobs in chunks, which allows us to throttle job notifications as we work through it. We are now rolling out a change to automatically handle throttling for system stability.
Posted Feb 04, 2023 - 03:01 UTC
Update
We are continually working to improve the processing of job event notifications while balancing system stability. We are currently manually processing the backlog of jobs in chunks, which allows us to throttle job notifications as we work through it.
Posted Feb 04, 2023 - 02:22 UTC
Update
We continue work to improve the processing of job event notifications while balancing system stability. We are working on ways to throttle job notifications to enable us to work through the backlog.
Posted Feb 04, 2023 - 01:40 UTC
Update
We are seeing GitHub commit statuses and build statuses returning to normal. We continue work to improve the processing of job event notifications while balancing system stability.
Posted Feb 04, 2023 - 01:02 UTC
Update
We are seeing improvements across the system, and are continuing work to restore notifications while balancing notification processing with system stability.
Posted Feb 04, 2023 - 00:16 UTC
Identified
We are seeing improvements across the system, and are continuing to restore notifications.
Posted Feb 03, 2023 - 23:32 UTC
Update
We continue to prioritize dispatch latency over notifications and we are working on restoring notifications.
Posted Feb 03, 2023 - 22:56 UTC
Update
We continue to prioritize dispatch latency over notifications while we investigate this issue.
Posted Feb 03, 2023 - 22:19 UTC
Update
We are currently prioritizing dispatch latency over notifications while we continue to investigate.
Posted Feb 03, 2023 - 21:30 UTC
Update
We are investigating elevated latency on notifications and job dispatch.
Posted Feb 03, 2023 - 20:57 UTC
Investigating
Notifications are delayed and we're investigating this.
Posted Feb 03, 2023 - 20:17 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Hipchat Notifications, Webhook Notifications) and Job Queue.