Between 2022-11-16 23:15 UTC and 2022-11-17 02:00 UTC, Webhooks, Event Bridge, SCM Commit Statuses, Slack notifications, and Datadog integrations were delayed by up to 45 minutes.
This incident was triggered by a long-running migration task that was archiving static data from our relational datastore to S3. The migration task enqueues background jobs into a queue to be processed by an asynchronous worker pool.
These jobs were enqueued with a higher priority than the background jobs that deliver notifications, rather than into our lowest priority queue. As a result, our asynchronous worker pool processed the migration jobs ahead of the notifications queues, which elevated the latency of the notifications queues.
To reduce the latency of sending notifications, we paused processing of the higher priority queue containing the low priority migration jobs. This allowed jobs from the notifications queues to be processed and increased the capacity of the asynchronous worker pool responsible for processing notifications.
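The effect of the mis-prioritised queue can be illustrated with a minimal sketch. It assumes a strict-priority worker that always takes the highest-priority job first (lower number means higher priority). The queue names, priority values, and job counts are hypothetical and are not taken from our production system.

```python
import heapq
from itertools import count


def drain(jobs, paused=frozenset()):
    """Process jobs in strict priority order (lower number = higher priority).

    jobs: iterable of (priority, queue_name) tuples.
    paused: queue names the workers skip entirely.
    Returns the queue names in the order their jobs were processed.
    """
    seq = count()  # tie-breaker that keeps FIFO order within one priority
    heap = [(priority, next(seq), queue)
            for priority, queue in jobs
            if queue not in paused]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, queue = heapq.heappop(heap)
        order.append(queue)
    return order


# Hypothetical workload: migration jobs mistakenly given the higher priority.
jobs = [(1, "migration")] * 5 + [(2, "notifications")] * 3

misprioritised = drain(jobs)
with_pause = drain(jobs, paused={"migration"})

# Number of jobs processed before the first notification is delivered.
print(misprioritised.index("notifications"))
print(with_pause.index("notifications"))
```

In this simplified model every migration job is processed before the first notification, and pausing the migration queue lets notifications run immediately, which mirrors the mitigation described above.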
Changes we're making
Currently our notifications queues are processed by two different worker pools: a dedicated pool that only processes notifications queues, and a shared pool that processes multiple queues, including notifications queues. To prevent this kind of incident from recurring, we plan to fully isolate the processing of notifications queues to a single worker pool and provision that pool with additional capacity.
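As a rough illustration of the isolation goal, the sketch below models worker pools as a mapping from pool name to the queues each pool processes, and checks whether a queue is served by exactly one pool that serves nothing else. The pool names, queue names, and helper functions are hypothetical and do not describe our actual configuration.

```python
def pools_serving(pools, queue):
    """Return the sorted names of pools whose queue list includes `queue`."""
    return sorted(name for name, queues in pools.items() if queue in queues)


def is_isolated(pools, queue):
    """True when exactly one pool serves `queue` and that pool serves nothing else."""
    serving = pools_serving(pools, queue)
    return len(serving) == 1 and set(pools[serving[0]]) == {queue}


# Hypothetical layouts: before and after the planned change.
current = {
    "dedicated": ["notifications"],
    "shared": ["notifications", "migration", "default"],
}
planned = {
    "dedicated": ["notifications"],
    "shared": ["migration", "default"],
}

print(pools_serving(current, "notifications"), is_isolated(current, "notifications"))
print(pools_serving(planned, "notifications"), is_isolated(planned, "notifications"))
```

A check of this shape could run against deployment configuration to catch a notifications queue being added back to a shared pool. The sketch does not model capacity, which would be provisioned separately.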