On May 31st between 5:38 AM and 8:53 AM AEST our Agent API was non-responsive because it was unable to connect to the underlying database. Agents were unable to accept or process jobs, and received 502 errors when attempting to connect. Our monitoring picked the issue up immediately and we alerted customers 8 minutes later via the status page and Slack. The on-call engineer investigated, escalated first to a second engineer, and finally to an "all hands on deck" response. Several distinct issues were identified and fixes were put in place to mitigate them. At 8:53 AM AEST the issue was resolved and Agents were able to function correctly again.
Agents were unable to connect to the Agent API, meaning they could neither receive new work nor report the completion of work already in progress.
Several of our large Postgres tables that are critical for agent communication use row-level locking for strong consistency. As these tables have grown, the auto-vacuuming process has gradually gotten slower. The outage was caused by a cascading failure: several long-running processes that update job statuses ran unusually slowly (5+ minutes) because of the auto-vacuuming, and turned out to hold long-running row-level locks on large numbers of jobs. Requests to the Agent API blocked on these locks, which exhausted the connections in our database connection pools. With no database connections available, the Agent API EC2 instances failed their status checks, were marked as failed, and became unavailable over HTTPS.
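If you want to check your own database for this kind of lock pile-up, a query along these lines will surface it. This is an illustrative sketch rather than our monitoring tooling: it assumes Postgres 9.6+ (for `pg_blocking_pids()`) and psycopg2, and the connection DSN is a placeholder.

```python
# Sketch: list sessions blocked on locks, alongside the sessions blocking
# them and how long the blocker has been running. Not our production
# tooling; the DSN below is a placeholder.
import psycopg2

BLOCKED_QUERY = """
SELECT blocked.pid                    AS blocked_pid,
       blocked.query                  AS blocked_query,
       blocking.pid                   AS blocking_pid,
       blocking.query                 AS blocking_query,
       now() - blocking.query_start   AS blocking_duration
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0
ORDER BY blocking_duration DESC;
"""

with psycopg2.connect("dbname=example") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(BLOCKED_QUERY)
        for row in cur.fetchall():
            print(row)
```

In an incident like ours, the top of that list would show a handful of long-running status-update queries, each blocking many Agent API requests that are in turn holding pool connections open.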
5:38 AM AEST We received an alert from PagerDuty that there was a spike in exceptions being raised from the Buildkite Agent API. These exceptions indicated that the API could not connect to our main transactional database.
5:46 AM AEST Customers notified of incident via status page and Slack.
6:30 AM AEST The rest of the Buildkite engineering team were notified as the root cause had not been identified. This is when the incident became an "all hands on deck" scenario.
6:50 AM AEST The root cause of the issue was identified. Several issues had cascaded into a "perfect storm" of performance problems: slow auto-vacuuming on our large job tables, long-running status-update processes holding row-level locks on large numbers of jobs, and the resulting exhaustion of our database connection pools.
7:00 AM AEST Once we understood what was happening, we started manually killing long-running queries that could safely be retried later, restarted PgBouncer to reclaim connections, and introduced a hard time limit on SQL queries executed by the Agent API to stop further long-running queries.
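For illustration, the two Postgres mechanisms involved look roughly like the sketch below. This is not our actual runbook: the 5-minute threshold, the 30-second timeout, the `agent_api` role name, and the DSN are all assumptions, and real tooling would also filter on query text so only safely retryable queries get killed.

```python
# Sketch of the two mitigations: terminate long-running queries, then
# cap future query time for the API's database role. All thresholds,
# names, and the DSN are illustrative only.
import psycopg2

TERMINATE_LONG_RUNNING = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'  -- assumed threshold
  AND pid <> pg_backend_pid();                    -- don't kill ourselves
"""

# A hard upper bound on query time, so a slow query fails fast (and can
# be retried) instead of holding locks and pool connections for minutes.
SET_STATEMENT_TIMEOUT = """
ALTER ROLE agent_api SET statement_timeout = '30s';  -- role/value assumed
"""

with psycopg2.connect("dbname=example") as conn:  # placeholder DSN
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(TERMINATE_LONG_RUNNING)
        cur.execute(SET_STATEMENT_TIMEOUT)
```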
7:22 AM AEST Mitigations were applied and the Agent API returned to normal.
7:52 AM AEST We were alerted that webhooks and GitHub status updates were not being delivered. The Agent API's performance problems had a flow-on effect: a backlog of around 9 million webhooks and GitHub status updates had built up and was not being processed. We increased our background job processor capacity to work through the backlog.
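At that scale, working through the backlog is mostly a matter of delivery concurrency. The sketch below is a hypothetical illustration of the idea, not our actual job processor: the backlog queue, the `deliver()` function, and the worker count are all stand-ins.

```python
# Hypothetical sketch: drain a delivery backlog by raising worker
# concurrency. The queue and deliver() are stand-ins, not Buildkite's
# actual background job processor.
from concurrent.futures import ThreadPoolExecutor
import queue

def deliver(notification):
    # Placeholder for an HTTP POST to a webhook or GitHub status endpoint.
    ...

def drain(backlog: "queue.Queue", workers: int) -> None:
    def worker():
        while True:
            try:
                notification = backlog.get_nowait()
            except queue.Empty:
                return  # backlog drained
            deliver(notification)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

# "Increasing capacity" is then just a larger worker count,
# e.g. drain(backlog, workers=64) instead of workers=8.
```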
8:53 AM AEST We finished working through the backlog, and systems were fully operational again.