Agents not processing jobs
Incident Report for Buildkite
Postmortem

Service Impact

On 2023-10-13 between 08:45 and 11:28 UTC (2h43m) some customers were unable to run builds because jobs were not being assigned to available agents. For customers using Clusters, some queues may have kept working while others were impacted.

Incident Summary

Buildkite received reports from several customers that jobs were not running despite available agents. Our monitoring did not alert us to any issues, there were no high-volume errors, and our metrics and tracing showed that, for most customers, jobs were being dispatched to agents, although in lower total volume than usual. No code or operational changes correlated with the timing of the behavior change. We paged more engineers to investigate why some customers were experiencing this issue. After some time, the problem was traced back to a process responsible for recovering from failures, errors, and timeouts in assigning jobs to agents. That code had been recently updated, but the change did not cause a visible problem until some job assignments failed and were unable to self-heal as they normally would. The change was behind a feature flag, which we were able to revert right away, after which the system recovered immediately for all customers.

Actions

We have added automated alerting on statistical anomalies in the volume of jobs being assigned to agents, which we believe would have flagged this issue and reduced the time it took us to identify it. Additionally, we have increased the visibility of long-running job assignments.
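
As an illustration only, one common way to flag a statistical anomaly in assignment volume is to compare the latest count against a recent baseline using a z-score. The sketch below is not Buildkite's implementation. The function name, the per-interval counts, the threshold of 3 standard deviations, and the minimum sample size are all assumptions chosen for the example. It checks only for unusually low volume, since that was the symptom in this incident.

```python
from statistics import mean, stdev


def is_volume_anomalous(history, current, threshold=3.0, min_samples=5):
    """Return True if `current` is unusually low compared with `history`.

    `history` is a list of job-assignment counts for recent intervals
    (hypothetical data shape). `threshold` is the number of sample standard
    deviations below the mean that counts as an anomaly (assumed value).
    """
    # Too little history to form a baseline: do not alert.
    if len(history) < min_samples:
        return False

    baseline = mean(history)
    spread = stdev(history)

    # A perfectly flat baseline has no spread, so any drop is notable.
    if spread == 0:
        return current < baseline

    z_score = (current - baseline) / spread
    return z_score < -threshold


# Example with made-up counts of assignments per minute.
recent_counts = [100, 104, 98, 101, 97, 103, 99, 102]
drop_detected = is_volume_anomalous(recent_counts, 60)
normal_detected = is_volume_anomalous(recent_counts, 99)
```

A check like this alerts on a drop in throughput even when there are no errors to count, which matches the situation described in the summary. A production detector would also need to account for daily and weekly traffic patterns.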

We have also enhanced the observability of our feature flag system to make it easier for incident responders to quickly see which flags were recently changed, which would also have reduced the time to identify the issue in this case.
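
A minimal sketch of the kind of view this describes: given a log of flag changes, list those made within a window before a chosen time, newest first. The `FlagChange` record, its fields, the flag names, and the seven-day default window are hypothetical and do not reflect Buildkite's feature flag system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FlagChange:
    """One entry in a hypothetical feature-flag change log."""
    flag: str
    changed_at: datetime
    new_value: bool


def recent_flag_changes(changes, as_of, window=timedelta(days=7)):
    """Return changes made within `window` before `as_of`, newest first."""
    cutoff = as_of - window
    recent = [c for c in changes if cutoff <= c.changed_at <= as_of]
    return sorted(recent, key=lambda c: c.changed_at, reverse=True)


# Example with made-up flag names and timestamps.
incident_start = datetime(2023, 10, 13, 8, 45, tzinfo=timezone.utc)
change_log = [
    FlagChange("example-old-flag", datetime(2023, 9, 1, 0, 0, tzinfo=timezone.utc), True),
    FlagChange("example-flag-a", datetime(2023, 10, 10, 2, 0, tzinfo=timezone.utc), True),
    FlagChange("example-flag-b", datetime(2023, 10, 12, 23, 30, tzinfo=timezone.utc), True),
]
candidates = recent_flag_changes(change_log, incident_start)
candidate_names = [c.flag for c in candidates]
```

The default window is deliberately wide. In this incident no change correlated with the time the behavior started, so a view limited to the last few minutes or hours before the incident might not have surfaced the relevant flag.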

Posted Oct 18, 2023 - 03:48 UTC

Resolved
Jobs are being dispatched to agents for all customers and the system is performing normally. We will continue to monitor closely, and will share details once we've completed a detailed investigation.
Posted Oct 13, 2023 - 12:18 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 13, 2023 - 11:56 UTC
Identified
We have identified the cause of the issue and are beginning to see jobs being processed.
Posted Oct 13, 2023 - 11:39 UTC
Update
We are continuing to investigate this issue.
Posted Oct 13, 2023 - 11:02 UTC
Update
We have received reports that agents are not picking up any work and are continuing to investigate this issue.
Posted Oct 13, 2023 - 10:31 UTC
Investigating
We have received reports of elevated wait times for job dispatch to agents and are investigating the issue.
Posted Oct 13, 2023 - 10:05 UTC
This incident affected: Job Queue.