Elevated Agent connectivity issues

Incident Report for Buildkite

Postmortem

Service Impact

Database performance degradation (all customers):

Between August 22nd and August 29th we saw periods of degraded database performance due to increased lock contention. This resulted in:

  • An increase in our API error rate (up to 0.6%) for brief periods of time.
  • A small number of jobs (fewer than 50) taking up to 5 minutes to be dispatched to an agent.
  • A small number of pipeline uploads failing, which caused their builds to fail.

Agent lost bug (some customers):

Customers running agent version v3.76.0 to v3.82.0 (inclusive) on Linux were impacted by a bug in the way HTTP/2 connections handle connection timeouts. This was fixed in v3.82.1 of the agent. We estimate 131,500 jobs failed due to agents being unable to communicate with the Buildkite backend. 93% of these jobs were automatically retried.

Incident Summary

On the 31st of July 2024 we released a new version of the Buildkite Agent (v3.76.0) which fixed a bug in how we configured our HTTP communications to the Buildkite API. Previously, almost all agent communications with Buildkite used HTTP/1.1; after this change they used HTTP/2. This unmasked a bug in the Go standard library in how HTTP/2 connections are reused, resulting in an intermittent issue where agents lost connectivity to the Buildkite backend for many minutes.

This issue only manifested when a network connection stopped receiving any packets from our API, which is known to occur when stateful network appliances such as NAT gateways and firewalls drop a connection without notifying the client. As a result, it went unnoticed for some time.
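
As an illustration of this kind of change (this is not the agent’s actual code): Go’s net/http enables HTTP/2 automatically for HTTPS requests, but a Transport configured with its own TLS config or dialer stays on HTTP/1.1 unless ForceAttemptHTTP2 is set, so a small configuration change can move all traffic onto long-lived HTTP/2 connections. The sketch below uses only standard library fields; the endpoint URL is illustrative.

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "time"
)

func main() {
    // A Transport with a custom TLS config only negotiates HTTP/2 when
    // ForceAttemptHTTP2 is true.
    transport := &http.Transport{
        TLSClientConfig:   &tls.Config{MinVersion: tls.VersionTLS12},
        ForceAttemptHTTP2: true,
    }
    client := &http.Client{Transport: transport, Timeout: 60 * time.Second}

    resp, err := client.Get("https://agent.buildkite.com/v3/ping") // illustrative URL; real agent requests are authenticated
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("negotiated protocol:", resp.Proto) // "HTTP/2.0" when HTTP/2 was used
}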

Database performance degradation

The impact of this bug first manifested as increased locking on one of our databases, due to the increase in agents transitioning to a “lost” state and back to “connected”, and the way our database schema handles those transitions. PostgreSQL has a global limit on the number of locks it can track, which is a multiple of the maximum number of connections; on this database the limit was around 100,000. When that limit is reached the server returns an out-of-memory (“out of shared memory”) error to each running transaction, then returns to normal behaviour. This resulted in increased errors on our API and a small number of builds being delayed or failing.
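
As a rough illustration of that ceiling, the sketch below compares the number of locks currently held (from pg_locks) with the approximate size of the shared lock table. It is a monitoring example only, not part of our stack; it assumes a reachable database, the github.com/lib/pq driver, and an illustrative connection string.

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // PostgreSQL driver for database/sql
)

func main() {
    db, err := sql.Open("postgres", "postgres://localhost/buildkite?sslmode=disable") // illustrative DSN
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    var locksPerTxn, maxConns, currentLocks int
    if err := db.QueryRow(`SHOW max_locks_per_transaction`).Scan(&locksPerTxn); err != nil {
        log.Fatal(err)
    }
    if err := db.QueryRow(`SHOW max_connections`).Scan(&maxConns); err != nil {
        log.Fatal(err)
    }
    if err := db.QueryRow(`SELECT count(*) FROM pg_locks`).Scan(&currentLocks); err != nil {
        log.Fatal(err)
    }

    // The shared lock table holds roughly max_locks_per_transaction * max_connections
    // entries; once it is full, transactions fail with "out of shared memory".
    fmt.Printf("locks held: %d of roughly %d\n", currentLocks, locksPerTxn*maxConns)
}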

We resolved this issue on August 29th by decreasing the number of locks required by some queries. In particular, this query was behaving unexpectedly:

UPDATE builds_partitioned SET state = $1, started_at = $2 WHERE id IN (
  SELECT id FROM builds_partitioned WHERE id = $3 AND state = $4 FOR UPDATE
) RETURNING state

Because the table we’re updating is partitioned, queries need to filter on the partition key so the planner can prune partitions, which keeps the query efficient and lock contention minimal. This query does filter on the id column, which is the partition key for this table, but because of the nested query the PostgreSQL planner can’t work out which partitions to prune for the outer UPDATE, so it plans against every partition. That requires locks to be taken on each partition and its indexes, and a single execution of this query could take out more than 1,500 locks.

Because this query only needed to update a single row, we could change it to:

UPDATE builds_partitioned SET state = $1, started_at = $2 WHERE id = $3 RETURNING state

This dramatically reduced the number of locks the query required. Because the lock limit is global to the entire database, the reduction in locks from this one query was enough to mitigate the out-of-memory errors, even though the underlying increase in lock contention was triggered by the HTTP/2 change described above.

Agent lost investigation

In late August we started to receive customer reports of job failures due to lost agents. When our backend doesn’t receive any communication from an agent for more than 3 minutes, the agent is marked as lost and any jobs it is running are cancelled.

This can happen for a variety of reasons, including:

  • Network partitions caused by third-party connectivity issues
  • NAT port exhaustion in the Virtual Private Cloud where the agents are running
  • The agent process being terminated by the Linux OOM killer

After eliminating these possibilities, we attempted (unsuccessfully) to replicate the issue locally. To gather more information, we released a new version of the agent that emitted logs about connection timings and connection state whenever a timeout occurred.
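
The sketch below shows the kind of instrumentation this involves; it is not the agent’s actual logging code. Go’s net/http/httptrace reports whether a request reused an existing connection and how long that connection had been idle, which is the information needed to correlate request timeouts with stale HTTP/2 connections. The URL is illustrative.

package main

import (
    "context"
    "log"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    client := &http.Client{Timeout: 30 * time.Second}

    // Log connection reuse and idle time for every request.
    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            log.Printf("got connection: reused=%v was_idle=%v idle_time=%s",
                info.Reused, info.WasIdle, info.IdleTime)
        },
    }

    ctx := httptrace.WithClientTrace(context.Background(), trace)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://agent.buildkite.com/v3/ping", nil) // illustrative URL
    if err != nil {
        log.Fatal(err)
    }

    resp, err := client.Do(req)
    if err != nil {
        // The connection details logged above show whether the failure happened on a
        // connection that had been reused after sitting idle for a long time.
        log.Printf("request failed: %v", err)
        return
    }
    defer resp.Body.Close()
    log.Printf("status: %s via %s", resp.Status, resp.Proto)
}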

While waiting for this information, one of our engineers found a blog post that described the behaviour we had observed. Despite following the reproduction steps with the test code provided, we couldn’t replicate the behaviour on our macOS development environments; only once we switched to Linux were we able to reproduce the problem. On the 23rd of September we released a new agent version (v3.82.1) which mitigated the issue by applying the recommended workaround.
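
For reference, the sketch below shows one widely recommended mitigation for stale HTTP/2 connections in Go; it illustrates the general technique rather than reproducing the agent’s exact change, and the timeout values are illustrative. The golang.org/x/net/http2 package exposes ReadIdleTimeout and PingTimeout: when no frames have been read for ReadIdleTimeout, the client sends a health-check PING and closes the connection if no reply arrives within PingTimeout, rather than reusing a dead connection indefinitely.

package main

import (
    "log"
    "net/http"
    "time"

    "golang.org/x/net/http2"
)

func newHTTPClient() (*http.Client, error) {
    transport := &http.Transport{}

    // ConfigureTransports enables HTTP/2 on the transport and returns the
    // underlying http2.Transport so its health-check settings can be tuned.
    h2, err := http2.ConfigureTransports(transport)
    if err != nil {
        return nil, err
    }
    h2.ReadIdleTimeout = 30 * time.Second // send a PING if nothing has been read for 30s
    h2.PingTimeout = 15 * time.Second     // close the connection if the PING gets no reply within 15s

    return &http.Client{Transport: transport}, nil
}

func main() {
    client, err := newHTTPClient()
    if err != nil {
        log.Fatal(err)
    }
    _ = client // used like any other *http.Client
}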

After further validation, on September 25th we began notifying customers running the impacted versions that they should upgrade their agents.

Two weeks after the new agent version was released, more than 20% of agents were using v3.82.1, compared to 7.2% using the impacted versions. The number of reconnection events per agent was lower for agents on v3.82.1, confirming the bug fix had the desired effect.

Changes we're making

We’re continuing to reduce the size of our databases via horizontal sharding, to further decrease the risk of the kind of lock contention we saw in this incident.

We have improved our reporting and visibility into the number of lost agents, so we can identify and resolve future regressions faster.

Posted Oct 09, 2024 - 10:07 UTC

Resolved

Customers running agent version v3.76.0 to v3.82.0 (inclusive) on Linux were impacted by a bug in the way HTTP/2 connections handle connection timeouts. This was fixed in v3.82.1 of the agent. We estimate 131,500 jobs failed due to agents being unable to communicate with the Buildkite backend. 93% of these jobs were automatically retried.
Posted Jul 31, 2024 - 14:00 UTC