Elevated error rates
Incident Report for Buildkite

Service Impact

Around 2022-11-19 13:00 UTC, a small subset of agents disconnected and failed to reconnect again. From 2022-11-21 02:35 to 03:10 UTC, high Agent API error rates prevented agents from successfully processing jobs. After this point, newly connected agents could successfully execute jobs, but older agents still couldn't reconnect. From 2022-11-21 05:54 UTC, older agents were successfully reconnected.

Incident Summary

Last Wednesday we began backfilling a new table that provides better tracking of automatic retry rules for build jobs. A Redis set of IDs was persisted in one of our ElastiCache clusters to track which builds were backfilled. Initial calculations indicated that this set would fit within the memory constraints of the cluster, but as this backfill ran over the weekend it became clear that the calculations were incorrect.

Around 12:00 Saturday (UTC) the Redis shard holding this growing set ran low on memory, and our volatile-lru eviction policy began deleting some agent heartbeat state from that shard.

Agents are marked as "lost" after 10 minutes without a heartbeat. A heartbeat from a recently lost agent returns it to “connected”, but after 24 hours the session token is invalidated, resulting in HTTP 401 Unauthorized responses.

During this incident, the eviction of heartbeat state from one shard caused those agents to be considered “last seen” at their registration timestamp. This meant agents connected longer than 24 hours were considered permanently lost, and their tokens considered invalid, resulting in those agents receiving HTTP 401 responses to their API requests. Agents would ideally re-register at this point, but the HTTP 401 wasn’t treated as terminal, so they continued to wait for jobs to be assigned. At this point, the only solution for affected customers was to restart or replace those agents.

On Sunday evening (UTC) the lack of memory of the impacted Redis shard was detected, though it was not having an ongoing impact on our customers' jobs or agents. In order to clear space, we attempted to delete the large Redis set at 2022-11-21 02:35 UTC. The DEL command in Redis is O(N) relative to the collection size, and our N was large enough to make the shard unresponsive. This triggered a failover to the replica for that shard, but meanwhile resulted in higher load on our underlying database, which together resulted in the Agent API error rate peaking at 80%. This was the point at which builds stopped successfully executing for the majority of our customers.

We explored options such as dropping the shard or restarting the Redis cluster, but decided to failover to a new, empty Redis cluster. The switch to this new cluster took effect at 2022-11-21 03:10 UTC, and we saw Agent API error rates return to normal levels. From this point, any newly-connected agent could successfully pick up and complete jobs.

Agents were expected to reconnect, but our metrics indicated that wasn’t happening. We eventually found a data inconsistency — the agents were seeing a “connected” state from a Redis session cache, while the state in the database was “lost”. It was unclear at the time how we ended up in this split-brain situation, but until the session cache for these agents expired, they would be unable to reconnect.

In order to speed up their reconnection, we invalidated the entire session cache by incrementing the cache key version, which refreshed the session data for all agents and allowed them to come back online. This process was finished at 2022-11-21 05:54 UTC.

We now believe that the inconsistency between the session cache and the database was a result of the rolling update to the new Redis cluster. The Agent API became functional again once our web nodes were pointed at the cluster, and at this point, we believe that our customers' agents' next heartbeat signals would have resulted in them moving from "lost" to "connected" again – this would have been reflected in both the database and the new cache.

However, at this point, some instances of our background workers would still have been pointed at the old cache, which did not contain any heartbeat records. For the same reason as Saturday morning, the agents were marked as “lost”, which updated both the database and the *old* cache. This is how we ended up with a database that said the agents were lost, and a cache that said they were connected.

How we’re becoming more resilient to issues like this

Our monitoring should have alerted us to the high memory usage in Redis, but a misconfiguration prevented it triggering a notification/pager. This has been resolved.

Agent heartbeats from the past 24 hours should never be evicted from Redis, but by setting a 24 hour expiry they’re stored as “volatile”, meaning the configured volatile-lru policy (or any policy other than noeviction) may select them for eviction. Other data in the same cluster does require eviction. We’re exploring separating this into two separate Redis clusters for volatile vs non-volatile data, and configuring the latter with the noeviction policy.

The absence of heartbeat data in Redis, whether caused by eviction or some other event, should be treated as “data missing; unknown last heartbeat” rather than an authoritative “heartbeat never received”, and so should not cause agents to be considered “lost”. This change has been implemented and tested, and will likely ship after a thorough review is completed.

If we need to delete a very large Redis collection in future, we’ll do so incrementally, and we’re investigating automated safety checks to that effect, similar to how we safety-check database migrations.

Buildkite Agents were indefinitely retrying API calls when receiving HTTP 401 Unauthorized responses due to no heartbeat data for more than 24 hours, leaving them stuck in a running-but-not-connected state. These agents counted towards capacity in many customers' auto scaling logic, leading to insufficient connected agents to run builds. This will be fixed in a subsequent version of the Buildkite Agent, such that terminal HTTP 401 responses will result in re-registration or termination, self-healing that scenario.

Posted Nov 25, 2022 - 05:14 UTC

Error rates have remained at normal levels for some time now. Our monitoring shows that lost agents have been reconnected and are successfully running jobs.
Posted Nov 21, 2022 - 06:09 UTC
We’ve seen error rates return to normal levels for some time now. Our monitoring shows that lost agents have been reconnected and are successfully running jobs.
Posted Nov 21, 2022 - 05:54 UTC
Agents that were previously unresponsive should now be processing jobs again. Our advice to restart agent processes/instances no longer applies, as the issue has been fixed server-side.
Posted Nov 21, 2022 - 05:31 UTC
We are still in the process of restoring previously connected agents. We recommend cycling your agents by spinning up new ones and removing those experiencing issues in running jobs.
Posted Nov 21, 2022 - 04:58 UTC
We are in the process of trying to restore the connection to agents that were previously connected. As an interim measure, we recommend that you restart your agent instances
Posted Nov 21, 2022 - 04:33 UTC
We believe connected agents may be in a stuck state that is not operational and are continuing to investigate the cause and work on a fix. As an interim measure, we recommend cycling your agent instances to produce new agents
Posted Nov 21, 2022 - 04:09 UTC
We have promoted the new redis node to primary. We are now investigating agent connectivity and working to restore service.
Posted Nov 21, 2022 - 03:46 UTC
We are in the process of promoting the new redis cluster to primary.
Posted Nov 21, 2022 - 03:25 UTC
We have identified an issue with our redis cluster and are provisioning a new redis cluster and preparing to roll over.
Posted Nov 21, 2022 - 03:11 UTC
We are currently investigating an issue causing elevated error rates and high latency.
Posted Nov 21, 2022 - 02:54 UTC
This incident affected: Web, Agent API, and REST API.