Delayed notifications

Incident Report for Buildkite

Postmortem

Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. The Prometheus service used by our EKS autoscaling path ran out of storage, so some EKS-based workers could not autoscale correctly while queues were growing. We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the ECS services we rolled back to were not ready to take the full load immediately, and a premature scale-down of the remaining EKS workloads caused a second, shorter period of impact.

Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS. This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers. After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic. When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers. We resolved it by completing the rollback and scaling the remaining affected ECS services.
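
The ordering problem in this second window can be written as a precondition check. The following is a minimal Python sketch under stated assumptions, not a description of Buildkite's tooling: `ServiceState`, `safe_to_scale_down`, and `required_workers` are hypothetical names, and the inputs are assumed to come from measured state rather than from what operators believe to be true.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ServiceState:
    """Measured snapshot of one worker service on one platform (hypothetical shape)."""
    running_workers: int
    serving_traffic: bool


def safe_to_scale_down(source: ServiceState, destination: ServiceState,
                       required_workers: int) -> Tuple[bool, str]:
    """Decide whether the source service may be scaled down.

    Scale-down is allowed only when the destination already runs enough
    workers for the current load and the source is confirmed to be idle.
    """
    if destination.running_workers < required_workers:
        return False, (
            f"destination has {destination.running_workers} of "
            f"{required_workers} required workers"
        )
    if source.serving_traffic:
        return False, "source is still serving traffic"
    return True, "ok"


# Example: the source (EKS) is still busy and the destination (ECS) is too small.
eks = ServiceState(running_workers=40, serving_traffic=True)
ecs = ServiceState(running_workers=10, serving_traffic=True)
allowed, reason = safe_to_scale_down(eks, ecs, required_workers=40)
print(allowed, reason)
```

In this sketch the check fails closed: a reconciliation step would skip the scale-down and report the reason instead of proceeding on an assumption about where traffic is flowing.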

Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

Changes we're making

We have made the following immediate changes:

Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
Moved affected notification-processing workloads back to known-good ECS capacity.
Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.
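
To illustrate what alerting before storage exhaustion can look like, here is a hedged Python sketch that fits a straight line to recent disk-usage samples and estimates the time remaining until the volume is full. The function name `hours_until_full`, the sample format, and the idea of paging below some number of hours are assumptions for illustration, not the monitoring we deployed.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]],
                     capacity_bytes: float) -> Optional[float]:
    """Estimate hours until a volume fills, using a least-squares growth rate.

    samples: (unix_seconds, used_bytes) pairs covering at least two timestamps.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    # Least-squares slope: growth rate in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    # Project forward from the most recent sample.
    _, latest_used = max(samples)
    remaining = max(capacity_bytes - latest_used, 0.0)
    return remaining / slope / 3600.0


# Example: usage grows by 100 bytes per hour on a 500-byte volume.
history = [(0.0, 100.0), (3600.0, 200.0), (7200.0, 300.0)]
print(hours_until_full(history, capacity_bytes=500.0))
```

An alert built on an estimate like this fires on the trend rather than on a fixed percentage, which leaves time to resize storage before a dependent system such as autoscaling is affected.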

We are also making the following reliability improvements:

Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.
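
Customer-level latency monitoring, as described in the list above, differs from fleet-wide monitoring in that each customer is evaluated separately, so a small group of badly affected customers is not hidden by a healthy overall average. The sketch below shows one possible shape for that check in Python. The function name, the 95th-percentile choice, and the input format are all assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def customers_over_threshold(deliveries: Iterable[Tuple[str, float]],
                             threshold_seconds: float,
                             percentile: float = 0.95) -> Dict[str, float]:
    """Return {customer_id: latency} for customers whose delivery latency at
    the given percentile exceeds the threshold.

    deliveries: (customer_id, latency_seconds) pairs.
    Uses the nearest-rank percentile within each customer's own samples.
    """
    by_customer = defaultdict(list)
    for customer_id, latency in deliveries:
        by_customer[customer_id].append(latency)

    slow = {}
    for customer_id, latencies in by_customer.items():
        latencies.sort()
        rank = max(1, math.ceil(percentile * len(latencies)))
        value = latencies[rank - 1]
        if value > threshold_seconds:
            slow[customer_id] = value
    return slow


# Example: customer "a" has one very slow delivery, customer "b" does not.
observed = [("a", 1.0), ("a", 2.0), ("a", 100.0), ("b", 1.0), ("b", 2.0)]
print(customers_over_threshold(observed, threshold_seconds=30.0))
```

The same grouping could be keyed by notification type instead of customer to separate hot-path notifications from general notification work.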

Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, our updates did not convey that notification delivery latency affects customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

Updating the status page earlier when notification latency is likely to affect customer workflows
Making status page updates clearer about the customer-visible impact, not just the affected internal service
Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
Using customer-level notification latency monitoring to help identify affected customers sooner

Posted Jun 09, 2026 - 22:21 UTC

Resolved

This incident has been resolved.

Posted May 28, 2026 - 21:18 UTC

Identified

We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.

Posted May 28, 2026 - 20:47 UTC

Investigating

We are investigating delays to build and job notifications for a subset of customers.

Posted May 28, 2026 - 20:20 UTC

This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).