Elevated Processing Times

Incident Report for Buildkite

Resolved

The system has stabilised, and we continue to work on longer-term mitigations for the latency issues we've been experiencing over the past few days.
Posted Feb 02, 2023 - 04:33 UTC

Monitoring

Our systems are operating at normal levels, and we continue to monitor performance.
Posted Feb 02, 2023 - 01:53 UTC

Update

Increased load has caused higher notification latency. Our mitigations continue to keep Job Dispatch and Web UI latency within acceptable levels.

We are continuing to work on ways to improve notification latency.
Posted Feb 02, 2023 - 01:06 UTC

Identified

Notifications are delayed and we're investigating this.
Posted Feb 02, 2023 - 00:33 UTC

Monitoring

Job dispatch and the web interface are back to normal. While we continue to see delays to outbound notifications of up to 6 minutes, latency is improving. We understand this has some impact on customers, and we continue to work on longer-term mitigations for this ongoing issue.
Posted Feb 02, 2023 - 00:13 UTC

Update

Job dispatch has stabilised to within our SLA, but notification latency remains higher than our SLA.

We continue to investigate how to improve notification latency.
Posted Feb 01, 2023 - 23:27 UTC

Update

We have deployed a change to prioritize job dispatch over notifications (e.g. commit statuses). As a result, customers will see commit statuses delayed by up to 30 minutes. This is a once-off impact, and notification latency is expected to return to normal after the initial backlog has been processed.
Posted Feb 01, 2023 - 23:03 UTC

Update

We continue to take steps to stabilize Job Dispatch and hope to have those changes implemented in the next 90 minutes.

Notifications (including commit statuses) will continue to be delayed, and the delays may get worse as we limit their load on the system in order to prioritize job dispatch. We continue to work on stabilizing system load and will provide an ETA when available.
Posted Feb 01, 2023 - 22:14 UTC

Update

We continue to investigate database load and work on ways to reduce its impact, primarily on Job Dispatch.
We have multiple streams of work under way to improve Job Dispatch as our first priority.
Posted Feb 01, 2023 - 21:55 UTC

Identified

We are investigating latency spikes across many of our asynchronous processing queues, which are causing slowness in notifications and job assignments. We have identified an issue with database load, and we continue to investigate while taking steps to mitigate that load and keep the system stable.
Posted Feb 01, 2023 - 21:30 UTC

Investigating

We are investigating reports of a sluggish UI and latency in assigning jobs to agents.
Posted Feb 01, 2023 - 20:50 UTC
This incident affected: Web, Agent API and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).