On July 20th, between 6:38 AM and 7:38 AM AEST, the Agent API experienced critical error rates that prevented agents from finishing jobs or starting new ones. The underlying cause was elevated load, which triggered a failover of our job logs database. From 7:21 AM agents were correctly finishing outstanding jobs and processing new jobs. The increased activity generated a backlog of webhook deliveries, which was cleared by 8:42 AM AEST.
Between 6:38 AM and 7:38 AM AEST the Agent API was returning errors for ~80% of requests. The majority of critical agent operations during this time failed, primarily showing up as jobs not finishing and agents unable to pick up new jobs.
During this time, agents connected and disconnected rapidly as they attempted to self-heal. This compounded the issue and caused a large backlog of webhook deliveries for agent status changes.
Our agents collect job output ("logs") and submit it in chunks to our log processing database. A job cannot be marked as complete, and new work cannot be assigned to its agent, until these chunks have been stored; the chunks are later archived in Amazon S3.
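To illustrate the chunking step, here is a minimal sketch. The function name and chunk size are hypothetical and chosen for illustration, not taken from our actual agent:

```python
# Hypothetical sketch of splitting job output into numbered chunks for
# submission to a log store. CHUNK_SIZE and chunk_output are illustrative.

CHUNK_SIZE = 64 * 1024  # assumed chunk size in bytes


def chunk_output(output: bytes, size: int = CHUNK_SIZE):
    """Split raw job output into (sequence_number, data) chunks.

    Sequence numbers let the server reassemble chunks in order before
    marking the job complete and archiving the log.
    """
    return [
        (seq, output[offset:offset + size])
        for seq, offset in enumerate(range(0, len(output), size))
    ]
```

Each chunk carries a sequence number so the log store can detect gaps and reassemble output in order before archival.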
The major outage was caused by a failover event in our underlying log database after a period of high load combined with an autovacuum process. During the failover to the replica, responses from the logs database were slow or failed entirely, which cascaded into failures in the Agent API. Because the log database was responding slowly, agents were unable to pick up new jobs quickly enough, so our systems marked them as unresponsive and re-assigned their work elsewhere. This compounded the situation by generating additional database load and webhook events for each agent state change and job re-assignment.
The elevated error rates prior to the failover event were caused by high workload, which surfaced as statement timeouts due to lock contention. Without the underlying log database failure, these would have self-healed.
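Statement timeouts from lock contention are typically transient, which is why they self-heal once load drops: the client retries and eventually acquires the lock. A minimal sketch of that retry pattern, using a stand-in exception type rather than any real database driver's error class:

```python
import time


class StatementTimeout(Exception):
    """Stand-in for a database statement-timeout error (illustrative only)."""


def run_with_retries(operation, attempts: int = 3, delay: float = 0.1):
    """Retry a transient statement timeout with a growing delay.

    Under brief lock contention a later attempt usually succeeds; if the
    database itself is failing, all attempts time out and the error
    propagates to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except StatementTimeout:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))
```

The key distinction for this incident: retries absorb short-lived contention, but during the failover every attempt failed, so the errors cascaded instead of self-healing.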
We're continuing to work on identifying the cause of both the unusually high load and the points of lock contention that cause errors in these conditions.
We've added an additional log database, and are modifying the structure and architecture of these databases — sharding the tables — to better cope with the performance profile of log ingestion.
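Sharding spreads log ingestion across several tables or databases so that a hot spot on one shard doesn't stall the rest. As an illustration only, here is one common approach, hash-based routing by job ID; the shard count and hashing scheme are assumptions, not our actual architecture:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count


def shard_for_job(job_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a job's ID to a log table shard.

    Hashing keeps the mapping stable (all chunks for a job land on the
    same shard) while distributing unrelated jobs across shards.
    """
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping is deterministic, every chunk for a given job is written to, and read from, the same shard without any lookup table.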
We're adding more visibility to our database to help us identify and solve congestion points earlier. We'll be doing some scheduled maintenance this weekend both to enable this visibility and to upgrade to the next major version of Postgres, which has stronger support for partitioning and sharding our high-contention tables, and generally better query performance.
We're also working on preventing agent connection and job assignment oscillation during failures.
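A common technique for damping this kind of oscillation is jittered exponential backoff on reconnection and job-pickup attempts, so a fleet of agents doesn't hammer a recovering API in lockstep. A minimal sketch using "full jitter"; the function name and parameters are illustrative, not our actual implementation:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff.

    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds.
    The randomness spreads reconnection attempts out in time, so agents
    don't retry simultaneously and re-trigger the overload.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

An agent would sleep for `backoff_delay(attempt)` between reconnection attempts, incrementing `attempt` on each failure and resetting it to zero once a connection succeeds.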