Degraded Agent API performance - network connectivity
Incident Report for Buildkite
Postmortem

Service Impact

From 2023-07-12 23:48 to 2023-07-13 01:14 (UTC), customers experienced elevated latency and error rates on the Agent API. This impacted agents trying to accept and update jobs, upload artifacts and job logs, and request OIDC tokens.

Incident Summary

At 23:43, some file uploads to a third-party provider started to experience significant latency (up to 15 seconds before timing out). Because these uploads occurred inside a database transaction, each slow upload held its transaction open, and over the next few minutes the number of open transactions began to climb.

By 23:49, the database connection limits we enforce via PgBouncer were reached, and requests from Agent API servers to our primary pipelines database became significantly degraded. These limits helped contain the impact to Agent API requests, so other systems (Web, REST and GraphQL) continued to behave as normal.
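To illustrate why a modest number of slow uploads can exhaust a connection limit within minutes, the rough arithmetic below applies Little's law (average connections held = request rate × time each request holds a connection). The function names and every figure used with them are hypothetical; the report does not state the actual request rate or PgBouncer limit.

```python
def connections_held(requests_per_second: float, seconds_in_transaction: float) -> float:
    """Little's law: average number of connections held at once.

    Each request occupies one connection for as long as its
    transaction stays open, so concurrency grows linearly with
    the time spent inside the transaction.
    """
    return requests_per_second * seconds_in_transaction


def pool_is_exhausted(
    requests_per_second: float,
    seconds_in_transaction: float,
    pool_limit: int,
) -> bool:
    """True when steady-state demand meets or exceeds the connection limit."""
    return connections_held(requests_per_second, seconds_in_transaction) >= pool_limit


# Hypothetical figures, chosen only to show the shape of the problem:
# 40 uploads/s against a limit of 100 connections.
healthy = pool_is_exhausted(40, 0.25, pool_limit=100)   # fast uploads
degraded = pool_is_exhausted(40, 15.0, pool_limit=100)  # uploads stalling for 15 s
```

Under these assumed numbers, fast uploads hold about 10 connections, while uploads that stall for 15 seconds would need about 600, far beyond the limit.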

At 23:59 our engineers received a page and began investigating.

At 00:20 we identified that file uploads were much slower than normal.

At 00:46 we identified the root cause as a networking issue.

At 01:14 we rolled out a change to route traffic for uploads via a different network endpoint, which resulted in an almost immediate recovery.

Changes we've made

  • Due to an out-of-date configuration in our monitoring system, alerts about increased latency took 10 minutes to be escalated to our on-call engineer. We have since made significant improvements to this alert so that it notifies us faster.
  • We decoupled file uploads from database transactions so that a small number of slow uploads don’t cause a cascading failure across all requests.
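The second change can be sketched in general terms as follows. This is a minimal illustration of one common way to keep slow network calls out of a transaction, not a description of our actual implementation: the table, the function names, and the stand-in upload are all hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3
import time


def upload_to_storage(payload: bytes) -> str:
    """Hypothetical stand-in for an upload to a third-party provider.

    The sleep represents network latency, which may be large.
    """
    time.sleep(0.01)
    return f"stored/{len(payload)}-bytes"


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical artifacts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE artifacts (job_id TEXT, location TEXT)")
    return conn


def record_artifact(conn: sqlite3.Connection, job_id: str, payload: bytes) -> str:
    # The slow network call runs before any transaction is opened,
    # so a stalled upload does not hold a database connection busy.
    location = upload_to_storage(payload)

    # The transaction covers only the quick local write.
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO artifacts (job_id, location) VALUES (?, ?)",
            (job_id, location),
        )
    return location
```

With this ordering, an upload that stalls or times out fails before the database is touched, so it delays only its own request rather than holding a transaction open.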
Posted Jul 26, 2023 - 03:27 UTC

Resolved
The routing mitigations have resolved the aforementioned connectivity issues to S3, so the incident is resolved. We will continue to investigate the root cause.
Posted Jul 13, 2023 - 01:40 UTC
Monitoring
We have identified network connectivity issues between our production systems and a subset of S3 IP addresses. We have applied mitigations to our routing, which appear to have resolved the issue. We’ll continue to monitor and investigate the root cause before reverting the mitigations.
Posted Jul 13, 2023 - 01:22 UTC
Update
We are continuing to investigate reports of elevated latency affecting the performance of the Agent API.
Posted Jul 13, 2023 - 00:49 UTC
Investigating
We are currently investigating some reports of degraded performance affecting artifact uploads.
Posted Jul 13, 2023 - 00:19 UTC
This incident affected: Agent API.