Increased latency and error rates

Incident Report for Buildkite

Postmortem

Service Impact

A subset of customers experienced elevated latency in notification delivery.

Incident Summary

While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The discovered issue did not impact performance or availability, but would have impaired our ability to detect such problems if they occurred.

Out of an abundance of caution, we decided to revert the migration and moved those services back to the original infrastructure on AWS Fargate.

When migrating to EKS, we scale down and disable automatic scaling on Fargate. This allows us to quickly migrate back by scaling up Fargate. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work.
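A post-rollback check for the missed step described above could be sketched as follows. This is an illustrative sketch under stated assumptions, not the tooling used during this incident: `ServiceState`, `rollback_problems`, the `min_count` threshold, and the service name are all hypothetical, and the state is modelled as plain data rather than read from any AWS API.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of one service's configuration on Fargate."""
    name: str
    desired_count: int
    autoscaling_enabled: bool


def rollback_problems(state: ServiceState, min_count: int = 1) -> list[str]:
    """Return the reasons a rollback is incomplete (empty list if it is complete)."""
    problems = []
    # Step 1 of the rollback: the service must be scaled back up.
    if state.desired_count < min_count:
        problems.append(
            f"{state.name}: desired count {state.desired_count} is below {min_count}"
        )
    # Step 2 of the rollback: autoscaling must be re-enabled.
    if not state.autoscaling_enabled:
        problems.append(f"{state.name}: autoscaling is still disabled")
    return problems


# Example: scaled back up, but autoscaling was never re-enabled.
state = ServiceState(name="notifications-worker", desired_count=4, autoscaling_enabled=False)
for problem in rollback_problems(state):
    print(problem)
```

Running a check like this as the final step of a rollback would flag a service that has capacity now but cannot grow with incoming work.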

We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.

Between 09:17 and 10:17 UTC, a small subset of our customers were impacted. Because the migration is performed in small batches, not all customers were affected. Each affected customer experienced a limited outage of notification services lasting between 35 and 58 minutes within this window.

Changes we're making

  • We are simplifying the runbook used to roll back migrations in the event of incidents.
  • We are adding more verification steps to the migration process.
Posted May 28, 2026 - 04:21 UTC

Resolved

We think the impact from the issue is over.
Posted May 26, 2026 - 10:38 UTC

Update

We see processing time for all affected services has returned to normal as of 20 minutes ago.
Posted May 26, 2026 - 10:37 UTC

Monitoring

We've identified the problem and have completed the remediation steps. We are now monitoring as service resumes.
Posted May 26, 2026 - 10:08 UTC

Identified

We're observing increased latency and error rates for a subset of our customers. We're currently remediating and will provide status updates as they become available.
Posted May 26, 2026 - 09:56 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).