Delayed notifications

Incident Report for Buildkite

Postmortem

This incident was caused by insufficient capacity in the group of workers that services the outbound notification queues; as a result, notification queue latency reached 8 minutes. During a blue/green rollout of these workers, autoscaling was manually disabled so that the number of workers serving the queue could be increased in preparation for migrating background job processing between the blue and green sides of the deployment. After the background workload was shifted to the new workers, autoscaling was not re-enabled, which prevented the deployment from scaling up to maximum capacity. We’re improving the documentation and tooling for this blue/green deployment process to prevent a recurrence and to improve visibility into the status of autoscaling.
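
As an illustration only, the failure mode described above (a manual "disable autoscaling" step with no guaranteed "re-enable" step) can be guarded against by pairing the two steps in code. The sketch below is not Buildkite's actual tooling: `WorkerGroup`, `autoscaling_paused`, and `autoscaling_problems` are hypothetical names, and the worker group is an in-memory stand-in rather than a real cloud API.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class WorkerGroup:
    """Hypothetical stand-in for one side (blue or green) of a worker deployment."""
    name: str
    autoscaling_enabled: bool = True
    worker_count: int = 4


@contextmanager
def autoscaling_paused(group: WorkerGroup, pinned_workers: int):
    """Pin the worker count for a migration, then always restore autoscaling."""
    previous = group.autoscaling_enabled
    group.autoscaling_enabled = False
    group.worker_count = pinned_workers
    try:
        yield group
    finally:
        # Runs even if the migration step raises, so autoscaling cannot be
        # left disabled by a forgotten manual step.
        group.autoscaling_enabled = previous


def autoscaling_problems(groups):
    """Return the names of groups whose autoscaling is currently disabled."""
    return [g.name for g in groups if not g.autoscaling_enabled]


# Example: pin the green side while background jobs are shifted across.
green = WorkerGroup("notifications-green")
with autoscaling_paused(green, pinned_workers=12):
    pass  # shift background jobs from blue to green here (assumed step)
```

A check like `autoscaling_problems` could also run on a schedule, so that a group left with autoscaling disabled is surfaced rather than discovered through queue latency.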

Posted Apr 04, 2023 - 03:46 UTC

Resolved

This incident has been resolved.
Posted Apr 04, 2023 - 02:50 UTC

Identified

We've identified an issue with notification delays and have scaled up capacity to deal with it.
Posted Apr 04, 2023 - 02:40 UTC

Investigating

We are investigating delays to build and job notifications, such as commit statuses and other webhooks.
Posted Apr 04, 2023 - 02:26 UTC

This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).