On 2022-01-17 between 2:06 UTC and 2:59 UTC, pipelines notification workers, job dispatch and Agent API had degraded performance.
This incident began at 1:52 UTC, when an error in the Test Analytics application caused a large number of sidekiq retries that flooded our Redis server. Our investigation of the incident uncovered Redis timeouts were cross-impacting the Pipelines product and job dispatch was delayed by up to 20 seconds.
The deployment that caused the error was closely monitored and a fix was merged at 2:01 UTC.
Our monitoring system notified us of the incident at 2:06 UTC. No customer impact was reported.
Due to the degraded state of job dispatching, our fix failed to deploy through our usual deployment pipeline. In order to mitigate the impact and deploy the fix, Test Analytics was placed into maintenance mode at 2:14 UTC.
An incident was posted to statuspage at 2:20 UTC.
An attempt was made to terminate the broken sidekiq jobs however our production console was unvailable due to the same Redis timeout issue. As load began to decrease we were able to gain access and terminate the failing jobs at 2:38 UTC. A manual deployment rollback was performed shortly afterwards.
System performance returned to normal at 2:55 UTC.
At around 3:00 UTC we performed a manual deployment of a fix.
The incident was marked as resolved at 3:12 UTC.
Changes we’re making
As part of our reliability review and the release of Test Analytics, we’re allocating additional dedicated infrastructure for Test Analytics in the form of sidekiq, Redis, and ActionCable. This will mitigate the chance of cross-impacting outages of Test Analytics and Pipelines. Test Analytics already has a dedicated database server. Work on dedicated infrastructure had commenced in December 2021 prior to this incident. This work is due to complete before the General Access release of Test Analytics.