Delayed service notifications

Incident Report for Buildkite

Postmortem

During this incident, customers experienced delayed service notifications due to temporary mitigations put in place to prevent an earlier incident from recurring. In the previous incident, we manually scaled down our notifications capacity to protect against concurrent background workers overloading our primary database while we investigated the underlying cause of the contention.

We've now identified the cause of the database lock contention: the concurrency limit for sending GitHub commit statuses was too low. This limit applies only to customers using GitHub Enterprise Server, and exists to avoid overloading their servers with a high rate of commit status updates. Previously, we had understood the underlying problem to be that the concurrency of all GitHub status notifications (github.com and GitHub Enterprise Server) was too high. However, we now understand that because the GitHub Enterprise Server concurrency limit was too low, the Sidekiq worker responsible for delivering commit statuses for job-level notifications was re-enqueued immediately each time it hit the limit. This resulted in a sustained retry loop. These workers perform a database query to check that the notification is still relevant before delivering it and before being limited by the GitHub Enterprise Server concurrency limit, so each retry repeated that query.
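To illustrate why immediate re-enqueueing under a low concurrency limit multiplies database queries, here is a minimal simulation. It is a simplified model and not our production code: the function name, the tick-based scheduling and all parameters are hypothetical, and it assumes every queued job is attempted once per tick, with one relevance query per attempt.

```python
from collections import deque


def count_relevance_queries(pending_jobs, concurrency_limit, delivery_ticks):
    """Count relevance-check queries until every job has been delivered.

    Hypothetical model: a job that finds no free delivery slot is put
    straight back on the queue and attempted again on the next tick.
    """
    queue = deque(range(pending_jobs))
    in_flight = []  # remaining ticks for each delivery holding a slot
    queries = 0

    while queue or in_flight:
        # Deliveries that have finished release their slot.
        in_flight = [t - 1 for t in in_flight if t > 1]

        # Attempt every job that was queued at the start of this tick.
        for _ in range(len(queue)):
            job = queue.popleft()
            queries += 1  # the relevance check runs before the limit applies
            if len(in_flight) < concurrency_limit:
                in_flight.append(delivery_ticks)  # slot acquired, deliver
            else:
                queue.append(job)  # limit hit: immediate re-enqueue

    return queries


if __name__ == "__main__":
    print(count_relevance_queries(4, concurrency_limit=1, delivery_ticks=1))
    print(count_relevance_queries(4, concurrency_limit=4, delivery_ticks=1))
```

In this model, four jobs with a limit of one cost ten queries, while the same four jobs with a limit of four cost four. The gap widens as the backlog grows or deliveries take longer.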

The frequency of these retries was causing high database load and contention on the job rows associated with the commit statuses. Having identified this, we've modified the retry behaviour to back off exponentially, and are working to configure appropriate concurrency limits.
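As a rough sketch of what exponential backoff means for retry timing, the delay before each retry doubles with the attempt number, up to a cap. The function name, base delay, cap and optional jitter below are illustrative assumptions, not the values or code we deployed.

```python
import random


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=300.0, jitter=0.0):
    """Return the delay in seconds before retry number `attempt` (0-based).

    All defaults are hypothetical. The delay doubles on each attempt and is
    capped at `cap_seconds`. `jitter` is the maximum fraction of the delay
    that is added at random, so that retries do not all land together.
    """
    delay = min(cap_seconds, base_seconds * (2 ** attempt))
    return delay + random.uniform(0, jitter * delay)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, backoff_delay(attempt))
```

With these assumed defaults, a worker that keeps hitting the limit waits 1, 2, 4, 8 seconds and so on rather than retrying immediately, so the number of relevance queries it issues grows far more slowly over time.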

We would like to apologise to everyone who was impacted by this service interruption.

Posted Nov 22, 2021 - 04:24 UTC

Resolved

This incident has been resolved.

Posted Nov 18, 2021 - 00:55 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 18, 2021 - 00:38 UTC

Identified

The issue has been identified and a fix is being implemented.

We have added additional capacity to our notification workers and are working through existing notifications.

Posted Nov 18, 2021 - 00:32 UTC

Investigating

We are investigating an issue with delays to service notifications.

Posted Nov 18, 2021 - 00:19 UTC

This incident affected: Notifications (Slack Notifications, Webhook Notifications).