On Monday 13 Jun, between 07:01 and 07:48 UTC, Buildkite Agents using the %n template variable in their name experienced elevated latency and error rates.
The %n template variable in an agent name guarantees uniqueness, which requires mutually exclusive locking. That mutual exclusion is enforced in our database, but our application layer additionally limits concurrency to keep the lock wait time out of the database.
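As an illustration of the application-layer limit described above, here is a minimal sketch. It uses an in-process semaphore as a stand-in for the Redis-backed limiter, and the names `ConcurrencyLimiter`, `limit`, and `max_wait` are hypothetical, not taken from our codebase.

```python
import threading


class ConcurrencyLimiter:
    """Allow at most `limit` callers at once; others wait up to `max_wait` seconds.

    Hypothetical stand-in: the real limiter is shared across processes via Redis.
    """

    def __init__(self, limit, max_wait):
        self._slots = threading.BoundedSemaphore(limit)
        self._max_wait = max_wait

    def run(self, fn):
        # Wait for a free slot here, in the application, so callers do not
        # queue up on the database lock instead.
        if not self._slots.acquire(timeout=self._max_wait):
            raise TimeoutError("concurrency limit wait exceeded")
        try:
            return fn()
        finally:
            self._slots.release()
```

A caller would wrap the work that needs the lock, for example `limiter.run(register_agent)`, where `register_agent` is a placeholder for the registration step.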
The application-layer concurrency and rate limiting subsystem previously shared a Redis instance with other workloads, but the day before this incident we moved it to a dedicated Redis instance for future capacity planning reasons.
The new connection pool providing these Redis connections to our multi-threaded web application server was configured with fewer connections than the previous shared-purpose pool. When weekday traffic scaled up, the connection pool became saturated, and requests failed with 1,000 ms timeout errors rather than waiting up to 10,000 ms on the %n concurrency limit.
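The failure mode can be reproduced in miniature. The sketch below is an assumption-laden model, not our production code: `ConnectionPool` and `simulate` are hypothetical names, connections are plain strings, and the timings are scaled down from the real 1,000 ms checkout timeout. It shows that when more threads want a connection than the pool holds, the surplus threads fail at the checkout timeout before they ever reach the longer concurrency-limit wait.

```python
import queue
import threading
import time


class ConnectionPool:
    """Fixed-size pool; checkout fails after `checkout_timeout` seconds."""

    def __init__(self, size, checkout_timeout):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")  # placeholder for a real connection
        self._checkout_timeout = checkout_timeout

    def with_connection(self, fn):
        try:
            conn = self._free.get(timeout=self._checkout_timeout)
        except queue.Empty:
            raise TimeoutError("no free connection within checkout timeout") from None
        try:
            return fn(conn)
        finally:
            self._free.put(conn)  # always return the connection to the pool


def simulate(pool_size, threads, hold_seconds, checkout_timeout):
    """Run `threads` workers that each hold a connection for `hold_seconds`.

    Returns a tuple (succeeded, timed_out).
    """
    pool = ConnectionPool(pool_size, checkout_timeout)
    results = []
    results_lock = threading.Lock()

    def worker():
        try:
            pool.with_connection(lambda conn: time.sleep(hold_seconds))
            outcome = "ok"
        except TimeoutError:
            outcome = "timeout"
        with results_lock:
            results.append(outcome)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results.count("ok"), results.count("timeout")
```

With a pool as large as the thread count, every worker succeeds; shrinking the pool while keeping the same thread count turns the surplus workers into timeouts, which is why matching pool size to server concurrency matters.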
To prevent this happening again, we’re going to resume our gradual deprecation of %n and recommend %hostname or other agent naming schemes instead. We have also reviewed our connection pool sizes, including enlarging the new connection pool to match the old one.
Apologies to those affected for the agent registration disruption.