Elevated error rate on agent registration
Incident Report for Buildkite

On Monday 13 June, between 07:01 and 07:48 UTC, Buildkite Agents using the %n template variable in their name experienced elevated latency and error rates when registering.

The %n template variable guarantees a unique agent name, which requires mutually exclusive locking. Our database guarantees that mutual exclusion, but our application layer additionally limits concurrency so that time spent waiting for the lock stays out of the database.
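
As an illustration only, and not Buildkite's actual implementation, the layering described above might be sketched as follows. The names ConcurrencyLimiter and NumberAllocator are hypothetical, an in-process semaphore stands in for the Redis-backed limiter, a plain lock stands in for the database's mutual exclusion, and handing out the lowest free number is an assumption about how a unique number could be chosen.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical application-layer limiter: callers queue here, not in the database."""

    def __init__(self, limit, max_wait_seconds):
        self._slots = threading.BoundedSemaphore(limit)
        self._max_wait = max_wait_seconds

    def run(self, critical_section):
        # Wait for a slot, but give up after max_wait_seconds.
        if not self._slots.acquire(timeout=self._max_wait):
            raise TimeoutError("concurrency limit wait exceeded")
        try:
            return critical_section()
        finally:
            self._slots.release()


class NumberAllocator:
    """Stand-in for the database: a lock guarantees each number is handed out once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._taken = set()

    def allocate(self):
        with self._lock:
            n = 1
            while n in self._taken:  # assumption: lowest free number wins
                n += 1
            self._taken.add(n)
            return n


# At most 2 callers contend for the lock at once; others wait up to 10 s for a slot.
limiter = ConcurrencyLimiter(limit=2, max_wait_seconds=10.0)
allocator = NumberAllocator()
number = limiter.run(allocator.allocate)
```

The point of the outer limiter in this sketch is that excess callers wait in the application, with a bounded wait time, while only a small number ever contend for the lock itself.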

The application-layer concurrency and rate limiting subsystem previously shared a Redis instance with other workloads, but the day before this incident it was moved to a dedicated Redis instance for future capacity planning reasons.

The new connection pool providing these Redis connections to our multi-threaded web application server was configured smaller than the previous shared-purpose pool. When weekday traffic scaled up, the connection pool became saturated, and requests failed with 1,000 ms timeout errors instead of waiting up to 10,000 ms on the %n concurrency limit.
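
The failure mode above can be reproduced in miniature. The following is a minimal sketch, not the production code: BoundedPool and simulate are hypothetical names, strings stand in for real Redis connections, and the timings are scaled down from the 1,000 ms and 10,000 ms figures so the example runs quickly. When more threads need a connection than the pool holds, and each holder keeps its connection longer than the checkout timeout, the surplus threads fail at the pool before they ever reach the limiter.

```python
import queue
import threading
import time


class BoundedPool:
    """Fixed-size pool; checkout raises TimeoutError if nothing frees up in time."""

    def __init__(self, size, checkout_timeout):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")  # stand-ins for real Redis connections
        self._timeout = checkout_timeout

    def checkout(self):
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no free connection in pool") from None

    def checkin(self, conn):
        self._free.put(conn)


def simulate(pool_size, threads, hold_seconds, checkout_timeout):
    """Return how many of `threads` workers fail to check out a connection."""
    pool = BoundedPool(pool_size, checkout_timeout)
    start = threading.Barrier(threads)  # release all workers at the same moment
    failures = []

    def worker():
        start.wait()
        try:
            conn = pool.checkout()
        except TimeoutError:
            failures.append(1)
            return
        try:
            time.sleep(hold_seconds)  # e.g. waiting on the concurrency limit
        finally:
            pool.checkin(conn)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return len(failures)


undersized = simulate(pool_size=2, threads=4, hold_seconds=0.2, checkout_timeout=0.05)
matched = simulate(pool_size=4, threads=4, hold_seconds=0.2, checkout_timeout=0.05)
```

In this sketch, sizing the pool to at least the number of threads that can hold a connection at once removes the pool as a point of contention, which is the intent behind matching the new pool to the old one.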

To prevent this from happening again, we’re going to resume our gradual deprecation of %n, preferring %spawn, %hostname, or other agent naming schemes instead. We have also reviewed our connection pool sizes, including enlarging the new connection pool to match the old one.

Apologies to those affected for the agent registration disruption.

Posted Jun 14, 2022 - 07:56 UTC

This incident has been resolved.
Posted Jun 13, 2022 - 08:12 UTC
We have identified the cause of these errors as a new Redis configuration introduced over the weekend and have reverted the configuration changes. We’ll continue monitoring Agent API performance.
Posted Jun 13, 2022 - 08:00 UTC
We’ve identified the problem and are working on a mitigation.
Posted Jun 13, 2022 - 07:44 UTC
We are continuing to investigate this issue.
Posted Jun 13, 2022 - 07:28 UTC
We are investigating elevated error rates registering new agents.
Posted Jun 13, 2022 - 07:27 UTC
This incident affected: Agent API.