Increased dispatch latency

Incident Report for Buildkite

Postmortem

Service Impact

Between approximately 18:00 and 22:50 UTC on February 26, 2026, a subset of customers experienced increased latency when dispatching jobs to agents. Affected customers observed agents sitting idle for several minutes despite matching jobs waiting in the queue. Job dispatch eventually succeeded, but with significantly elevated latency. The impact was concentrated on specific database shards, though customers on multiple shards were affected over the course of the incident.

Incident Summary

A database maintenance task designed to improve job ordering performance was running across all production database shards. The task itself generated significant database load, which impacted normal job dispatch and pipeline upload operations. As a result, dispatch operations queued up, producing the observed delays in matching jobs to agents.

The issue was compounded by a connection pooling service that had several containers running on underperforming infrastructure, which reduced the available database throughput.

Contributing factors:

  • The maintenance task competed with concurrent dispatch operations for scarce database resources
  • The task was running simultaneously across all database shards, amplifying the impact
  • A connection pooling service had degraded capacity due to infrastructure imbalance

Changes we're making

  • The maintenance task has been paused; in future it will be run during low-traffic periods and on individual shards rather than across all shards simultaneously
  • The connection pooling service has been rebalanced to ensure consistent performance
  • We are improving our monitoring and dashboards to enable faster identification of lock contention issues during incidents
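The per-shard, low-traffic rollout described above could be sketched roughly as follows. This is a minimal illustrative sketch, not Buildkite's actual tooling: the shard names, window boundaries, and `run_task` callback are all hypothetical.

```python
from datetime import time

# Hypothetical low-traffic window (illustrative values, not Buildkite's).
LOW_TRAFFIC_START = time(3, 0)   # 03:00 UTC
LOW_TRAFFIC_END = time(6, 0)     # 06:00 UTC


def in_low_traffic_window(now: time) -> bool:
    """True when `now` falls inside the low-traffic maintenance window."""
    return LOW_TRAFFIC_START <= now < LOW_TRAFFIC_END


def run_maintenance(shards, now, run_task):
    """Run `run_task` on each shard sequentially, one shard at a time,
    and only inside the low-traffic window. Returns the shards that ran.

    Running shards sequentially (rather than all at once) bounds the
    extra database load to a single shard at any moment.
    """
    if not in_low_traffic_window(now):
        return []  # outside the window: defer the whole batch
    completed = []
    for shard in shards:  # never all shards concurrently
        run_task(shard)
        completed.append(shard)
    return completed
```

The key design point is serialization: even if every shard eventually needs the task, only one shard carries the extra load at a time, so dispatch on the remaining shards is unaffected.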
Posted Mar 04, 2026 - 02:35 UTC

Resolved

We have seen a full recovery of service, and have a good understanding of the underlying cause. We will publish a post-incident review next week.
Posted Feb 27, 2026 - 02:30 UTC

Monitoring

We've seen recovery for the remaining subset of customers. We will continue to monitor.
Posted Feb 27, 2026 - 00:57 UTC

Investigating

We're seeing ongoing latency impact for a subset of customers. Some customers are seeing signs of improvement, but we are continuing to investigate the issue.
Posted Feb 26, 2026 - 23:41 UTC

Monitoring

We're seeing signs of recovery and will continue to monitor.
Posted Feb 26, 2026 - 19:36 UTC

Identified

Some customers are experiencing increased latency for jobs being assigned to agents. We have identified the cause and are working on mitigations.
Posted Feb 26, 2026 - 19:10 UTC
This incident affected: Job Queue.