Database Connectivity Issues for Agent API
Incident Report for Buildkite
Postmortem
Overview

On May 31st between 5:38 AM and 8:53 AM AEST our Agent API was unresponsive because it was unable to connect to the underlying database. Agents could not accept or process jobs and received 502 errors when attempting to connect. Our monitoring picked the issue up immediately and we alerted customers 8 minutes later via the status page and Slack. The on-call engineer investigated and escalated, initially to a second engineer and finally to an "all hands on deck". Several different issues were identified and fixes were put in place to mitigate them. At 8:53 AM AEST the issue was resolved and Agents were able to function correctly again.

What was Affected

Agents were unable to connect to the Agent API, meaning they were unable to receive new work or report the completion of work they had previously been working on.

Root Cause

Several of our large Postgres tables that are critical for agent communication use row-level locking for strong consistency. As these tables have grown, the auto-vacuuming process has gradually become slower. The outage was caused by a cascading issue: several long-running processes that update job statuses ran unusually slowly (5+ minutes) because of the auto-vacuuming, and they held row-level locks on large numbers of jobs while they ran. Requests to the Agent API blocked on those locks, which exhausted the connections in our database connection pools. With no database connections available, the Agent API EC2 instances failed their status checks, were marked as failed, and became unavailable over HTTPS.
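
As an illustration of this failure mode (not a query taken from the incident itself), the sessions stuck waiting on a row-level lock, and the sessions blocking them, can be surfaced in Postgres 9.6+ with a query along these lines:

    -- Illustrative only: list sessions waiting on a lock, the session blocking
    -- each of them, and how long the blocking query has been running.
    SELECT blocked.pid                  AS blocked_pid,
           blocked.query                AS blocked_query,
           blocking.pid                 AS blocking_pid,
           blocking.query               AS blocking_query,
           now() - blocking.query_start AS blocking_for
      FROM pg_stat_activity AS blocked
      JOIN pg_stat_activity AS blocking
        ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
     WHERE blocked.wait_event_type = 'Lock';

When the blocking sessions are the slow status-update processes described above, every Agent API request touching the same job rows queues behind them, and the connection pool drains quickly.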

Timeline
  • 5:38 AM AEST We received an alert from PagerDuty that there was a spike in exceptions being raised from the Buildkite Agent API. These exceptions indicated that the API could not connect to our main transactional database.

  • 5:46 AM AEST Customers notified of incident via status page and Slack.

  • 6:30 AM AEST The rest of the Buildkite engineering team were notified as the root cause had not been identified. This is when the incident became an "all hands on deck" scenario.

  • 6:50 AM AEST The root cause of the issue was identified. There were several issues that all cascaded into a "perfect storm" of performance problems. Specifically:

    • An auto-vacuum was kicked off by our AWS RDS database on the main jobs table, causing query performance to degrade.
    • Queries that were generally fast started to slow down significantly. Job and build records ended up being locked for long periods of time (up to several minutes).
    • Agents tried to finish jobs and builds, but started hanging as they couldn't get locks on the records they needed.
    • All of our PgBouncer connections filled up with long-running SQL queries waiting on locks. This is what started raising the exceptions that triggered the PagerDuty alert.
    • All of our Puma workers filled up with hanging agent requests waiting on those SQL queries.
    • The health check endpoints within the Agent API couldn't be reached by our ALB, which caused servers to be marked unhealthy and shut down.
  • 7:00 AM AEST Once we figured out what was going on, we started manually killing off long-running queries that could be safely retried later, restarted PgBouncer to reclaim connections, and introduced a hard limit on how long SQL queries issued by the Agent API could run, to stop further long-running queries (see the sketch after this timeline).

  • 7:22 AM AEST Mitigations were applied and the Agent API returned to normal.

  • 7:52 AM AEST We were alerted that webhooks and GitHub status updates were not being delivered. The Agent API's performance problems had a flow-on effect: a backlog of these external notifications had built up and gone unprocessed. At this stage, there were around 9 million webhooks and GitHub status updates that needed to be delivered. We increased our background job processor capacity to work through the backlog.

  • 8:53 AM AEST We finished our backlog of work, and systems were fully operational again.
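
The sketch below illustrates the two database-side mitigations from the 7:00 AM entry. The five-minute threshold, the agent_api role name, and the 15 second timeout are illustrative values, not the exact ones used during the incident:

    -- Illustrative only: terminate active backends whose current query has been
    -- running for more than five minutes, freeing their locks and pool connections.
    SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
     WHERE state = 'active'
       AND now() - query_start > interval '5 minutes'
       AND pid <> pg_backend_pid();

    -- Illustrative only: cap how long any single statement issued under the Agent
    -- API's database role may run (role name and timeout value are hypothetical).
    ALTER ROLE agent_api SET statement_timeout = '15s';

Only queries that could be safely retried later were terminated, as noted in the timeline.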

Changes We've Made
  • Review and optimise job and build locks to only hold locks on small sets of jobs and process incrementally.
  • Roll out timeouts on long-running SQL queries across performance-critical parts of our system.
  • Partition our jobs table to increase query performance and reduce the performance impact of an auto-vacuum.
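
As an illustration of the first change, job status updates can be applied in small batches so that no single transaction holds row-level locks on a large set of jobs. The table and column names below are hypothetical rather than our actual schema:

    -- Illustrative only: claim and update at most 100 jobs per transaction,
    -- skipping rows that another process has already locked.
    WITH batch AS (
      SELECT id
        FROM jobs
       WHERE state = 'finishing'
       LIMIT 100
         FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
       SET state = 'finished'
     WHERE id IN (SELECT id FROM batch);

Running this in a loop until no rows remain keeps each transaction short, so agent-facing queries never queue behind a multi-minute lock. Partitioning the jobs table similarly reduces the amount of work each auto-vacuum run has to do.
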
Posted Jun 03, 2018 - 07:11 UTC

Resolved
Everything looks good on our end. We're marking as resolved and we'll publish an incident report shortly.
Posted May 30, 2018 - 23:20 UTC
Monitoring
We finished our backlog of around 9 million external webhooks and GitHub SCM notifications. Everything is going smoothly on our end now, and we'll keep monitoring to make sure everything is fully operational again.
Posted May 30, 2018 - 22:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted May 30, 2018 - 22:36 UTC
Identified
We've also identified outgoing webhooks, live page updates and GitHub commit statuses are heavily backlogged. Build jobs still should be running as usual. We're increasing capacity on our end now to work through the backlog and we expect things to return to normal shortly.
Posted May 30, 2018 - 21:52 UTC
Monitoring
We've identified the issue and put a fix in place. A slow-running query was causing lock contention that led to connection exhaustion in our Agent API. We've got some fixes planned and will follow up with a postmortem; in the meantime, we're keeping a close eye on things.
Posted May 30, 2018 - 21:22 UTC
Identified
We've identified the database issue and have a temporary fix in place whilst we keep investigating.
Posted May 30, 2018 - 20:50 UTC
Update
We are continuing to investigate; the job queue is also now unavailable.
Posted May 30, 2018 - 20:02 UTC
Investigating
We're seeing an issue with our Agent API being able to connect to its database again. We're investigating.
Posted May 30, 2018 - 19:46 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).