Degraded Agent Dispatch and API performance

Incident Report for Buildkite

Postmortem

Service Impact

On 2022-01-17 between 2:06 UTC and 2:59 UTC, Pipelines notification workers, job dispatch, and the Agent API had degraded performance.

Incident Timeline

This incident began at 1:52 UTC, when an error in the Test Analytics application caused a large number of Sidekiq retries that flooded our Redis server. Our investigation found that the resulting Redis timeouts were cross-impacting the Pipelines product, and that job dispatch was delayed by up to 20 seconds.

The deployment that caused the error was closely monitored and a fix was merged at 2:01 UTC.

Our monitoring system notified us of the incident at 2:06 UTC. No customer impact was reported.

Due to the degraded state of job dispatching, our fix failed to deploy through our usual deployment pipeline. In order to mitigate the impact and deploy the fix, Test Analytics was placed into maintenance mode at 2:14 UTC.

An incident was posted to statuspage at 2:20 UTC.

We attempted to terminate the broken Sidekiq jobs, but our production console was unavailable due to the same Redis timeout issue. As load began to decrease, we were able to gain access and terminate the failing jobs at 2:38 UTC. A manual deployment rollback was performed shortly afterwards.

System performance returned to normal at 2:55 UTC.

At around 3:00 UTC we performed a manual deployment of a fix.

The incident was marked as resolved at 3:12 UTC.

Changes we’re making

As part of our reliability review and the release of Test Analytics, we’re allocating additional dedicated infrastructure for Test Analytics in the form of Sidekiq, Redis, and ActionCable. This will reduce the chance of outages in Test Analytics cross-impacting Pipelines, and vice versa. Test Analytics already has a dedicated database server. Work on dedicated infrastructure commenced in December 2021, prior to this incident, and is due to complete before the General Access release of Test Analytics.
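
To illustrate why dedicated queues limit cross-impact, here is a minimal sketch of a toy model. It is not our production setup: the queue names, job names, and the figure of 1000 retries are all hypothetical, and each queue is assumed to have one worker that starts one job per tick in FIFO order.

```python
def wait_ticks(queues, target):
    """Return how many ticks pass before the job named `target` is started.

    `queues` maps a queue name to a list of job names. Each queue is assumed
    to have its own worker that starts one job per tick, in FIFO order, so a
    job's wait equals the number of jobs ahead of it in its own queue.
    """
    for jobs in queues.values():
        if target in jobs:
            return jobs.index(target)
    raise ValueError(f"job {target!r} was not enqueued")


# Hypothetical flood of retried jobs from one product.
retries = ["test_analytics_retry"] * 1000

# Shared infrastructure: the dispatch job queues behind every retry.
shared = {"default": retries + ["pipelines_dispatch"]}

# Dedicated infrastructure: the retries only delay their own queue.
dedicated = {
    "test_analytics": retries,
    "pipelines": ["pipelines_dispatch"],
}

shared_wait = wait_ticks(shared, "pipelines_dispatch")
dedicated_wait = wait_ticks(dedicated, "pipelines_dispatch")
print(shared_wait, dedicated_wait)
```

In this model the dispatch job waits behind every retry when the queue is shared, and does not wait at all when each product has its own queue and worker.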

Posted Jan 21, 2022 - 02:00 UTC

Resolved

This incident has been resolved.

Posted Jan 17, 2022 - 03:12 UTC

Update

System performance is returning to normal. We are continuing to monitor results.

Posted Jan 17, 2022 - 02:59 UTC

Monitoring

The fix has been deployed and we are monitoring the results.

Posted Jan 17, 2022 - 02:38 UTC

Identified

We have identified a problem and are deploying a fix.

Posted Jan 17, 2022 - 02:24 UTC

Investigating

We are currently investigating slow agent dispatch for running builds and API performance.

Posted Jan 17, 2022 - 02:20 UTC

This incident affected: Agent API, REST API, and Job Queue.