On 2021-11-11, between 04:25 UTC and 05:43 UTC, Buildkite was unable to dispatch or update builds, or perform other asynchronous tasks. The incident was triggered by memory exhaustion of the Redis node backing our Sidekiq queues, which we use to process updates to builds and jobs, upload pipeline definitions, process webhooks, deliver notifications, and perform other background operations.
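For context, memory exhaustion on a Redis node shows up in the `INFO memory` section as used memory approaching the configured limit. Here's a minimal sketch of that check in Python with redis-py; the endpoint and threshold are illustrative only, not our production configuration:

```python
# Minimal sketch: spotting memory exhaustion on a Redis node via INFO memory.
# The endpoint and 90% threshold below are illustrative, not our actual setup.
import redis

r = redis.Redis(host="sidekiq-redis.example.internal", port=6379)

mem = r.info("memory")
used = mem["used_memory"]
limit = mem["maxmemory"]  # 0 means no explicit memory limit is configured

if limit and used / limit > 0.9:
    print(f"Redis memory nearly exhausted: {used} of {limit} bytes used")
```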
At 04:25 UTC we were alerted to the incident by a spike in application errors indicating that Sidekiq workers were unable to connect to Redis. On further investigation we found we were also unable to manually connect to the primary Redis cluster used to store Sidekiq jobs. Having identified memory exhaustion as the cause of the connection failures, we attempted to increase the size of the existing AWS ElastiCache instance via the AWS API. Resizing a single-node ElastiCache Redis cluster creates a new, larger node and copies the existing cache data from the old node to the new one, so this approach would retain the state of the existing cluster. After waiting 27 minutes for the copy operation to complete, we started on a secondary plan to restore service availability: provisioning a new Redis cluster via ElastiCache.
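For illustration, a resize like the one we attempted can be issued through the AWS API. This is a sketch using boto3; the cluster ID, node type and region are hypothetical, not our real configuration:

```python
# Sketch of resizing a single-node ElastiCache Redis cluster to a larger node
# type. The cluster ID, node type and region are hypothetical examples.
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.modify_cache_cluster(
    CacheClusterId="sidekiq-redis",       # hypothetical cluster ID
    CacheNodeType="cache.r6g.2xlarge",    # scale up to a larger node type
    ApplyImmediately=True,                # apply now rather than in the maintenance window
)
```

Behind this call, ElastiCache provisions the new node and copies the existing data across, which is why the operation can take a long time on a node under memory pressure.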
Unfortunately, when connecting our Sidekiq workers to the new ElastiCache Redis node, we discovered that Sidekiq doesn't support Redis in clustered mode, which this cluster was configured to use. We provisioned another ElastiCache Redis cluster, this time in non-clustered mode. Once we connected the new Redis node to the Sidekiq workers and our other API servers, we began processing new Sidekiq jobs. This restored service for dispatching jobs, starting new builds, uploading pipelines and processing incoming webhooks.
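Whether an endpoint is running in clustered mode can be verified up front from the `INFO cluster` section. A minimal sketch of that pre-flight check, again with redis-py and a hypothetical endpoint:

```python
# Sketch: verify a Redis endpoint has cluster mode disabled before pointing
# Sidekiq workers at it, since Sidekiq doesn't support Redis in clustered mode.
# The endpoint name is hypothetical.
import redis

def cluster_mode_enabled(host: str, port: int = 6379) -> bool:
    """Return True if the Redis endpoint is running in clustered mode."""
    info = redis.Redis(host=host, port=port).info("cluster")
    return info.get("cluster_enabled") == 1

if cluster_mode_enabled("new-sidekiq-redis.example.internal"):
    raise RuntimeError("Sidekiq does not support Redis cluster mode; provision a non-clustered node")
```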
We manually audited running builds that may have finished during the incident to ensure they were marked as finished. We also triggered dispatches for all organisations to ensure that jobs were assigned to agents.
We're so sorry for this major interruption. In response to this incident we've: