Missing Outbound Notifications (including webhooks and some GitHub commit statuses)
Incident Report for Buildkite
Postmortem

We've identified the cause of this incident as a recently introduced change to how we look up the job that a notification relates to. The change was intended to improve the performance of this lookup. However, it broke an interaction between the background jobs used for sending notifications (Webhooks, Slack notifications, Event Bridge) and commit statuses related to job and step events. We reverted the change and will be adding further regression testing to exercise this interaction more thoroughly. In addition, we are evaluating additional integration testing to fully exercise the multiple code paths involved in sending notifications related to job events.

Our automated monitoring for notification delivery was insufficient to detect this incident. We are exploring additional monitoring to detect notification delivery failures more quickly. In this case, we decided not to retroactively process webhooks and notifications, to avoid sending stale commit statuses. We apologise for the prolonged duration of this incident.

Posted Nov 04, 2022 - 03:33 UTC

Resolved
Starting at 7:09am UTC, Buildkite experienced a disruption lasting 3 hours and 36 minutes in sending outbound webhooks and some GitHub commit statuses on job events such as job.scheduled. The issue has been resolved.

We are continuing to investigate the root cause and will share more details when they are available.
Posted Nov 02, 2022 - 07:00 UTC