Agent API Degradation and Downtime
Incident Report for Buildkite
Postmortem
Overview

On July 20th between 6:38 AM and 7:38 AM AEST the Agent API experienced critically high error rates, which prevented agents from finishing jobs or starting new ones. The underlying cause was elevated load, which triggered a failover of our job logs database. From 7:21 AM agents were correctly finishing outstanding jobs and processing new ones. The increased activity generated a backlog of webhook deliveries, which was fully processed by 8:42 AM AEST.

What was affected

Between 6:38 AM and 7:38 AM AEST the Agent API was returning errors for ~80% of requests. The majority of critical agent operations during this time failed, primarily showing up as jobs not finishing and agents being unable to pick up new jobs.

During this time, agents connected and disconnected rapidly as they attempted to self-heal. This compounded the issue and caused a large backlog of webhook deliveries for agent status changes.

Root Cause

Our agents collect job output ("logs") and submit them in chunks to our log processing database. Jobs cannot be marked as complete, and new work cannot be assigned, until these chunks have been stored in that database, ahead of their later archival in Amazon S3.
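
A minimal sketch of this kind of chunked log upload, written in Go. The endpoint path, payload shape, chunk size, and names below are illustrative assumptions, not the actual Agent API contract.

    // Illustrative sketch only: the endpoint, chunk size, and payload shape are assumptions.
    package logchunks

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    const chunkSize = 100 * 1024 // assumed chunk size: 100 KB

    // uploadLogChunks splits a job's output into fixed-size chunks and submits
    // each one in order. The job can't be marked complete until every chunk
    // has been accepted.
    func uploadLogChunks(client *http.Client, baseURL, jobID string, output []byte) error {
        for seq, off := 0, 0; off < len(output); seq, off = seq+1, off+chunkSize {
            end := off + chunkSize
            if end > len(output) {
                end = len(output)
            }
            url := fmt.Sprintf("%s/jobs/%s/chunks?sequence=%d", baseURL, jobID, seq)
            resp, err := client.Post(url, "application/octet-stream", bytes.NewReader(output[off:end]))
            if err != nil {
                return fmt.Errorf("chunk %d failed: %w", seq, err)
            }
            resp.Body.Close()
            if resp.StatusCode >= 400 {
                return fmt.Errorf("chunk %d rejected: %s", seq, resp.Status)
            }
        }
        return nil
    }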

The major outage was caused by a failover event in our underlying log database, following a period of high load combined with an autovacuum process. During the failover to the replica, the logs database responded slowly or not at all, which cascaded into failures in the Agent API. Because the log database was responding slowly, agents were unable to pick up new jobs quickly enough, so our systems marked those agents as unresponsive and re-assigned their work elsewhere. This compounded the situation by generating additional database load and webhook events for each agent state change and job re-assignment.

The elevated error rates prior to the failover were caused by unusually high workload, which manifested as statement timeouts due to lock contention; without the underlying log database failure, these errors would have self-healed.
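
For context on the statement timeouts: Postgres's statement_timeout setting turns long lock waits into errors the caller can retry, rather than letting writes queue up indefinitely behind a contended lock. The sketch below shows how that can look from the application side; the table name, timeout value, and retry policy are illustrative assumptions.

    // Illustrative sketch: table name, timeout value, and retry policy are assumptions.
    package logdb

    import (
        "context"
        "database/sql"
        "time"
    )

    // insertChunkWithTimeout writes a log chunk with a 5 second statement timeout.
    // Under lock contention the INSERT fails quickly with a statement-timeout error
    // instead of piling up behind other writers, and can then be retried.
    func insertChunkWithTimeout(ctx context.Context, db *sql.DB, jobID string, seq int, data []byte) error {
        conn, err := db.Conn(ctx)
        if err != nil {
            return err
        }
        defer conn.Close()

        // statement_timeout is a standard Postgres setting; 5s is an assumed value.
        if _, err := conn.ExecContext(ctx, "SET statement_timeout = '5s'"); err != nil {
            return err
        }

        var lastErr error
        for attempt := 0; attempt < 3; attempt++ {
            _, lastErr = conn.ExecContext(ctx,
                "INSERT INTO job_log_chunks (job_id, sequence, data) VALUES ($1, $2, $3)",
                jobID, seq, data)
            if lastErr == nil {
                return nil
            }
            if attempt < 2 {
                time.Sleep(time.Duration(attempt+1) * 500 * time.Millisecond) // simple backoff between retries
            }
        }
        return lastErr
    }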

We're continuing to work on identifying both the cause of the unusually high load and the points of lock contention that produce errors under these conditions.

Changes we've made

We've added an additional log database, and are modifying the structure and architecture of these databases — sharding the tables — to better cope with the performance profile of log ingestion.
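
As a rough illustration of what sharding writes across multiple log databases can look like, the sketch below routes each job's chunks to a shard chosen by hashing the job ID, so all chunks for a job land in the same database. This is a generic approach shown for illustration, not a description of the actual implementation.

    // Generic sharding sketch, not the actual implementation.
    package logdb

    import (
        "database/sql"
        "hash/fnv"
    )

    // ShardedLogStore holds a connection pool for each log database shard.
    type ShardedLogStore struct {
        shards []*sql.DB
    }

    // shardFor picks a shard deterministically from the job ID, so every chunk
    // for a given job is written to the same database.
    func (s *ShardedLogStore) shardFor(jobID string) *sql.DB {
        h := fnv.New32a()
        h.Write([]byte(jobID))
        idx := int(h.Sum32() % uint32(len(s.shards)))
        return s.shards[idx]
    }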

We're adding more visibility into our databases to help us identify and resolve congestion points earlier. We'll be doing scheduled maintenance this weekend to enable this visibility and to upgrade to the next major version of Postgres, which has stronger support for partitioning and sharding our high-contention tables, as well as generally better query performance.
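
For illustration, newer Postgres versions support declarative partitioning, which splits a high-contention table into partitions that spread out insert contention and keep maintenance tasks such as autovacuum working on smaller tables. The sketch below uses hash partitioning, which requires Postgres 11 or later; the table and column names are assumed, not the actual schema.

    // Illustrative DDL only: table and column names are assumptions.
    package logdb

    import (
        "context"
        "database/sql"
        "fmt"
    )

    // createPartitionedChunksTable creates a hash-partitioned parent table and a
    // fixed number of partitions, spreading insert contention across them.
    func createPartitionedChunksTable(ctx context.Context, db *sql.DB, partitions int) error {
        statements := []string{
            `CREATE TABLE IF NOT EXISTS job_log_chunks (
                job_id   uuid  NOT NULL,
                sequence int   NOT NULL,
                data     bytea NOT NULL,
                PRIMARY KEY (job_id, sequence)
            ) PARTITION BY HASH (job_id)`,
        }
        for i := 0; i < partitions; i++ {
            statements = append(statements, fmt.Sprintf(
                `CREATE TABLE IF NOT EXISTS job_log_chunks_p%d
                    PARTITION OF job_log_chunks
                    FOR VALUES WITH (MODULUS %d, REMAINDER %d)`, i, partitions, i))
        }
        for _, stmt := range statements {
            if _, err := db.ExecContext(ctx, stmt); err != nil {
                return err
            }
        }
        return nil
    }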

We're also working on preventing agent connection and job assignment oscillation during failures.
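
One common way to damp that kind of oscillation is exponential backoff with jitter on reconnect attempts, so a fleet of agents doesn't retry in lockstep while the API is degraded. The sketch below is illustrative only; the limits and jitter strategy are assumptions, not the agent's actual behaviour.

    // Illustrative backoff sketch: the limits and jitter strategy are assumptions.
    package agent

    import (
        "math/rand"
        "time"
    )

    // reconnectDelay returns how long an agent should wait before its next
    // connection attempt: the delay doubles with each failed attempt up to a
    // cap, with random jitter so agents don't all reconnect at the same moment.
    func reconnectDelay(attempt int) time.Duration {
        base := time.Second
        maxDelay := 2 * time.Minute

        delay := base << uint(attempt) // 1s, 2s, 4s, ...
        if delay <= 0 || delay > maxDelay {
            delay = maxDelay
        }
        jitter := time.Duration(rand.Int63n(int64(delay) / 2))
        return delay/2 + jitter
    }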

Posted Jul 26, 2018 - 07:39 UTC

Resolved
Jobs are running normally, and the webhook notification backlog has been fully processed.
Posted Jul 19, 2018 - 22:42 UTC
Update
Jobs which were stuck appear to have resumed processing normally. We’re waiting on the built-up webhook notification backlog to process, and keeping an eye on things.
Posted Jul 19, 2018 - 22:07 UTC
Monitoring
Jobs and agents appear to be returning to normal as the backlog is processed.
Posted Jul 19, 2018 - 21:45 UTC
Update
Jobs are failing to dispatch and agents are not properly coming online or going offline.
Posted Jul 19, 2018 - 21:27 UTC
Investigating
Agents appear to be taking longer than usual to change state; we’re investigating.
Posted Jul 19, 2018 - 21:10 UTC
This incident affected: Agent API and Job Queue.