Notifications Service Issue
Incident Report for Buildkite
Postmortem

Service Impact

Between 2022-11-16 23:15 UTC and 2022-11-17 02:00 UTC, Webhooks, Event Bridge, SCM Commit Statuses, Slack notifications, and Datadog integrations were delayed by up to 45 minutes.

Incident Summary

This incident was triggered by a long-running migration task that was archiving static data from our relational datastore to S3. The migration task enqueued background jobs to be processed by an asynchronous worker pool.

These jobs were enqueued with a higher priority than the background jobs that deliver notifications, rather than into our lowest-priority queue. As a result, our asynchronous worker pool processed the migration jobs ahead of the notifications queues, which raised latency on the notifications queues. To reduce the latency of sending notifications, we paused processing of the higher-priority queue that contained the migration jobs. Pausing that queue allowed jobs from the notifications queues to be processed and increased the capacity available to the asynchronous worker pool responsible for processing notifications.
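The effect of the priority mix-up can be illustrated with a minimal sketch. It assumes a single worker that always takes the highest-priority job first. The priority levels, job names, and the figure of 1000 migration jobs are hypothetical and are not taken from our systems.

```python
import heapq
from itertools import count

# Hypothetical priority levels: a lower number is served first.
HIGH, NOTIFICATIONS, LOWEST = 0, 1, 2


def drain(jobs):
    """Return job names in the order a priority-ordered worker would process them."""
    seq = count()  # tie-breaker that keeps FIFO order within one priority
    heap = [(priority, next(seq), name) for priority, name in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


def jobs_ahead_of(order, name):
    """Count how many jobs are processed before the named job."""
    return order.index(name)


# What happened: migration jobs enqueued above the notifications priority.
migration_as_high = [(HIGH, f"archive-{i}") for i in range(1000)]
migration_as_high.append((NOTIFICATIONS, "webhook"))

# What was intended: migration jobs in the lowest-priority queue.
migration_as_lowest = [(LOWEST, f"archive-{i}") for i in range(1000)]
migration_as_lowest.append((NOTIFICATIONS, "webhook"))

print(jobs_ahead_of(drain(migration_as_high), "webhook"))    # 1000
print(jobs_ahead_of(drain(migration_as_lowest), "webhook"))  # 0
```

In the first case the notification waits behind every migration job. In the second it is processed immediately.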

Changes we're making

Currently our notifications queues are processed by two different worker pools: a dedicated pool that only processes notifications queues, and a shared pool that processes multiple queues, including notifications queues. To prevent this kind of incident from recurring, we plan to fully isolate the processing of notifications queues to a single worker pool and provision that pool with additional capacity.
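As a rough sketch of the difference between the current and planned layouts, the mapping of worker pools to queues can be modelled as below. The pool names, queue names, and the `is_isolated` check are hypothetical illustrations, not our actual configuration.

```python
# Hypothetical pool-to-queue layouts; the real pool and queue names differ.
current = {
    "dedicated": ["notifications"],
    "shared": ["migrations", "default", "notifications"],
}
planned = {
    "notifications": ["notifications"],
    "shared": ["migrations", "default"],
}


def pools_serving(layout, queue):
    """List the pools that process the given queue."""
    return sorted(pool for pool, queues in layout.items() if queue in queues)


def is_isolated(layout, queue):
    """True when exactly one pool serves the queue and that pool serves nothing else."""
    pools = pools_serving(layout, queue)
    return len(pools) == 1 and layout[pools[0]] == [queue]


print(pools_serving(current, "notifications"))  # ['dedicated', 'shared']
print(is_isolated(current, "notifications"))    # False
print(is_isolated(planned, "notifications"))    # True
```

With the planned layout, other background work can no longer consume capacity that notifications depend on, so the dedicated pool must be sized to carry the full notifications load on its own.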

Posted Nov 22, 2022 - 04:44 UTC

Resolved
The Notifications service is operating at normal levels and the queues caused by the incident have been cleared.
Posted Nov 17, 2022 - 02:31 UTC
Monitoring
The notifications service is now operating as expected. We continue to monitor the service to ensure there is no further degradation.
Posted Nov 17, 2022 - 02:05 UTC
Update
Queue latency is now down to 10 minutes. We continue to work on reducing it further.
Posted Nov 17, 2022 - 01:58 UTC
Update
We have prioritised service notifications over other background work. We expect this to further reduce latency in delivering notifications.
Posted Nov 17, 2022 - 01:42 UTC
Update
We are working on further scaling up the processing of background jobs in order to reduce the queue size caused by this outage.
Posted Nov 17, 2022 - 01:16 UTC
Identified
We’ve identified an increased number of background jobs consuming capacity shared between notifications and other background jobs. We are provisioning increased dedicated capacity for notifications jobs.
Posted Nov 17, 2022 - 00:52 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).