We've identified the cause of this incident as a recently introduced change to how we look up the job that a notification is related to. This change was intended to improve the performance of this lookup. However, it broke an interaction between the background jobs used for sending notifications (Webhooks, Slack notifications, Event Bridge) and commit statuses related to job and step events. We reverted the change and will be adding further regression testing to exercise this interaction more thoroughly. We are also evaluating additional integration testing to fully exercise the multiple code paths involved in sending notifications related to job events.
Our automated monitoring for notification delivery was insufficient to detect this incident. We are exploring additional monitoring to more quickly detect failures to deliver notifications. In this case, we decided not to retroactively process webhooks and notifications, to avoid sending stale commit statuses. We apologise for the prolonged duration of this incident.