Delayed notifications

Incident Report for Buildkite

Postmortem

Service Impact

A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.

Incident Summary

We are in the process of migrating our underlying compute platform from AWS Fargate to AWS EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.

Between 15:42 and 17:33, our EKS Prometheus server needed more memory than was available on the host where it was running. Autoscaling operations had increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, and the Kubernetes control plane restarted it shortly afterwards. While Prometheus was down, the metrics used for application autoscaling were unavailable, so the affected services were not triggered to scale up, resulting in the observed delays. Prometheus exceeded the host's available memory again soon after each restart, so the cycle repeated.
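
The effect of missing metrics on scaling can be illustrated with a minimal sketch. It is a simplified model under assumed numbers, not the production autoscaler: the function names, the queue-depth metric, and the target of 100 queued items per replica are all hypothetical. A sample of `None` stands for an interval in which Prometheus was down and no metric was available.

```python
import math
from typing import List, Optional


def desired_replicas(current: int,
                     queue_depth: Optional[float],
                     target_per_replica: float = 100.0) -> int:
    """Hypothetical scaling rule driven by a single queue-depth metric."""
    if queue_depth is None:
        # No metric available: there is nothing to act on, so the
        # replica count stays where it is while the backlog keeps growing.
        return current
    return max(1, math.ceil(queue_depth / target_per_replica))


def simulate(samples: List[Optional[float]], start: int = 2) -> List[int]:
    """Apply the rule to a series of metric samples and record replicas."""
    replicas = start
    history = []
    for sample in samples:
        replicas = desired_replicas(replicas, sample)
        history.append(replicas)
    return history


# Two intervals without metrics, during which the backlog grows unseen.
history = simulate([200.0, None, None, 900.0])
print(history)
```

In this model the replica count stays flat across the two intervals without metrics and only jumps once a sample arrives again, which is the kind of delayed scale-up described above.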

The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.

Changes we're making

We have already made the following changes to our rollout of EKS for production workloads:

  • Upsized the underlying system nodes.
  • Set higher requests and limits for the Prometheus server so it can handle more product load.
  • Reviewed and set any missing requests and limits for all new EKS resources, ensuring that EKS has all the required information to prevent accidental resource contention.
  • Added more observability and monitors for EKS pod and node health to help us identify root causes quickly during future incidents.
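
A review for missing requests and limits, as mentioned in the list above, could be sketched roughly as follows. This is an illustrative check, not the tooling actually used: the function name, the example container names, and the resource values are hypothetical, and the policy of requiring both CPU and memory in both `requests` and `limits` is an assumption. The input mirrors the `containers[].resources` layout of a Kubernetes pod spec, represented here as plain dictionaries.

```python
from typing import Any, Dict, List

SECTIONS = ("requests", "limits")   # assumed policy: both must be set
RESOURCES = ("cpu", "memory")       # assumed policy: both must be covered


def missing_resource_settings(pod_spec: Dict[str, Any]) -> List[str]:
    """Return one message per missing request or limit in a pod spec."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        for section in SECTIONS:
            values = resources.get(section) or {}
            for resource in RESOURCES:
                if resource not in values:
                    problems.append(f"{name}: missing {section}.{resource}")
    return problems


# Hypothetical pod spec with illustrative values only.
example_spec = {
    "containers": [
        {
            "name": "prometheus-server",
            "resources": {
                "requests": {"cpu": "500m", "memory": "4Gi"},
                "limits": {"memory": "8Gi"},
            },
        },
        {"name": "sidecar"},
    ]
}

for problem in missing_resource_settings(example_spec):
    print(problem)
```

A check like this could run in CI against rendered manifests, so that a workload without explicit requests and limits is flagged before it reaches a cluster.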

We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.

Posted May 25, 2026 - 07:04 UTC

Resolved

The incident is resolved.
Posted May 20, 2026 - 17:39 UTC

Monitoring

We are seeing recovery across affected customers and continue to monitor.
Posted May 20, 2026 - 17:26 UTC

Identified

We have identified the issue, applied mitigations, and are monitoring recovery.

We have determined that only a subset of customers are affected by the notification latency.
Posted May 20, 2026 - 17:06 UTC

Investigating

We are investigating delays to notifications across all customers.
Posted May 20, 2026 - 16:40 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).