Degraded performance on Scheduled Builds

Incident Report for Buildkite

Postmortem

At 2023-09-18 18:45 UTC, AWS announced “Network connectivity issues” impacting Elastic Container Service, Elastic Compute Cloud and ElastiCache (among others) in Buildkite’s primary AWS region.

At this exact time, one of our Sidekiq services—then serving as the elected leader for enqueueing the periodic jobs used to run scheduled builds—experienced a matching loss of network connectivity, manifesting as errors connecting to AWS and other external services.

From this point, this service no longer enqueued these periodic jobs, while also consuming 200% of its container’s CPU.

Sidekiq leadership renewal should happen every 20 seconds, and because no other Sidekiq process took leadership, we theorise that this one process was able to renew its leadership but was otherwise in a state preventing its ordinary operation.
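The theorised failure mode can be illustrated with a minimal lease-based election sketch. This is not Sidekiq's actual implementation: the `Lease` class, the `simulate` helper, the process names and the 30-second lease lifetime are all assumptions made for illustration. Only the 20-second renewal interval comes from the report. The sketch shows that when renewal does not depend on the leader doing useful work, a stuck leader keeps the lease and a standby never takes over.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_TTL = 30     # assumed lease lifetime in seconds (hypothetical value)
RENEW_EVERY = 20   # renewal interval in seconds, as stated in the report


@dataclass
class Lease:
    """Hypothetical shared lease record, e.g. a key with an expiry."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        # Succeeds if the lease is free, already held by the candidate
        # (a renewal), or has expired.
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + LEASE_TTL
            return True
        return False


def simulate(duration: int, leader_healthy: bool) -> dict:
    """Step through `duration` seconds in RENEW_EVERY-second ticks."""
    lease = Lease()
    enqueued = 0
    for now in range(0, duration, RENEW_EVERY):
        # Process "a" renews on every tick, regardless of its health.
        lease.try_acquire("a", now)
        # Standby "b" can only take over once the lease has lapsed.
        lease.try_acquire("b", now)
        # Periodic jobs are enqueued only by a leader that is working.
        if lease.holder == "b" or (lease.holder == "a" and leader_healthy):
            enqueued += 1
    return {"holder": lease.holder, "enqueued": enqueued}
```

Under these assumptions, `simulate(120, leader_healthy=False)` ends with process "a" still holding the lease and no jobs enqueued, which mirrors the behaviour described above. Tying renewal to evidence of recent useful work is one possible design response, though the report does not say whether that applies here.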

This meant that scheduled builds did not run over this period.

At 2023-09-18 21:56 UTC, a Buildkite deploy started new Sidekiq processes. This triggered a new leader election, a different Sidekiq process took over enqueueing the periodic jobs, and scheduled builds resumed successfully.

This is the first time we’ve seen a Sidekiq process retain its leadership while otherwise ceasing to function normally.

To ensure we can respond better to such an occurrence in future, we’re establishing a dedicated monitor for the periodic job used to run scheduled builds.
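One common shape for such a monitor is a staleness check on the time the periodic job last ran. The sketch below is an assumption about how a check like this could look, not a description of the monitor Buildkite built: the `is_stale` function, the one-minute expected interval and the grace period are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_run_at: datetime,
    now: datetime,
    expected_interval: timedelta = timedelta(minutes=1),  # hypothetical cadence
    grace: timedelta = timedelta(minutes=1),              # hypothetical slack
) -> bool:
    """Return True when the periodic job has been silent for too long."""
    return now - last_run_at > expected_interval + grace


# Illustrative timestamps only: last run at the start of the incident.
last_run = datetime(2023, 9, 18, 18, 45, tzinfo=timezone.utc)
shortly_after = last_run + timedelta(seconds=90)
five_minutes_later = last_run + timedelta(minutes=5)
```

With these values, `is_stale(last_run, shortly_after)` is false and `is_stale(last_run, five_minutes_later)` is true. A check like this alerts on the absence of the job's effect rather than on the health of any one process, so it would still fire when a leader retains its role but stops working.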

Posted Sep 22, 2023 - 06:20 UTC

Resolved

This incident has been resolved.

Posted Sep 18, 2023 - 22:33 UTC

Monitoring

Scheduled builds are functional again. We are currently monitoring the service.

Posted Sep 18, 2023 - 22:12 UTC

Investigating

We are currently investigating an issue with scheduled builds not being triggered.

Posted Sep 18, 2023 - 21:56 UTC

This incident affected: Agent API.