On 2021-11-11, between 04:25 UTC and 05:43 UTC, Buildkite was unable to dispatch or update builds, or perform other asynchronous tasks. The incident was triggered by memory exhaustion of the Redis node backing our Sidekiq queues, which we use to process updates to builds and jobs, upload pipeline definitions, process webhooks, deliver notifications, and perform other background operations.
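For context, memory exhaustion on a Redis node shows up in the `INFO memory` section as used memory approaching the configured limit. Here's a minimal sketch of that check in Python with redis-py; the endpoint and threshold are illustrative only, not our production configuration:

```python
# Minimal sketch: spotting memory exhaustion on a Redis node via INFO memory.
# The endpoint and 90% threshold below are illustrative, not our actual setup.
import redis

r = redis.Redis(host="sidekiq-redis.example.internal", port=6379)

mem = r.info("memory")
used = mem["used_memory"]
limit = mem["maxmemory"]  # 0 means no explicit memory limit is configured

if limit and used / limit > 0.9:
    print(f"Redis memory nearly exhausted: {used} of {limit} bytes used")
```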
At 04:25 UTC we were alerted to the incident by a spike in application errors indicating that Sidekiq workers were unable to connect to Redis. On further investigation we found we were also unable to manually connect to the primary Redis cluster used to store Sidekiq jobs. Having identified memory exhaustion as the cause of the connection failures, we attempted to increase the size of the existing AWS ElastiCache instance via the AWS API. Resizing a single-node ElastiCache Redis cluster creates a new, larger node and copies the existing cache data from the old node to the new one, so this approach would retain the state of the existing cluster. After waiting 27 minutes for the copy operation to complete, we started on a secondary plan to restore service availability: provisioning a new Redis cluster via ElastiCache.
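For illustration, a resize like the one we attempted can be issued through the AWS API. This is a sketch using boto3; the cluster ID, node type and region are hypothetical, not our real configuration:

```python
# Sketch of resizing a single-node ElastiCache Redis cluster to a larger node
# type. The cluster ID, node type and region are hypothetical examples.
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.modify_cache_cluster(
    CacheClusterId="sidekiq-redis",       # hypothetical cluster ID
    CacheNodeType="cache.r6g.2xlarge",    # scale up to a larger node type
    ApplyImmediately=True,                # apply now rather than in the maintenance window
)
```

Behind this call, ElastiCache provisions the new node and copies the existing data across, which is why the operation can take a long time on a node under memory pressure.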
Unfortunately, when connecting our Sidekiq workers to the new ElastiCache Redis node, we discovered that Sidekiq doesn't support Redis in clustered mode, which this cluster was configured to use. We provisioned another ElastiCache Redis cluster, this time in non-clustered mode. Once we connected the new Redis node to the Sidekiq workers and our other API servers, we began processing new Sidekiq jobs. This restored service for dispatching jobs, starting new builds, uploading pipelines and processing incoming webhooks.
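Whether an endpoint is running in clustered mode can be verified up front from the `INFO cluster` section. A minimal sketch of that pre-flight check, again with redis-py and a hypothetical endpoint:

```python
# Sketch: verify a Redis endpoint has cluster mode disabled before pointing
# Sidekiq workers at it, since Sidekiq doesn't support Redis in clustered mode.
# The endpoint name is hypothetical.
import redis

def cluster_mode_enabled(host: str, port: int = 6379) -> bool:
    """Return True if the Redis endpoint is running in clustered mode."""
    info = redis.Redis(host=host, port=port).info("cluster")
    return info.get("cluster_enabled") == 1

if cluster_mode_enabled("new-sidekiq-redis.example.internal"):
    raise RuntimeError("Sidekiq does not support Redis cluster mode; provision a non-clustered node")
```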
We manually audited running builds that may have finished during the incident to ensure they were marked as finished. We also triggered dispatches for all organisations to ensure that jobs were assigned to agents.
We're so sorry for this major interruption. In response to this incident we've: