On August 3rd at 1:06am AEST we were notified of slow job dispatches for a large-volume customer and of agents being slow to complete work. Load on our primary PostgreSQL database was high (>80%) and there was a large backlog (12 million) of queued background jobs, which customers experienced as slow webhooks and commit statuses.
While we investigated and implemented various potential fixes, database performance gradually got worse. At 6:00am we identified that one of our routine auto-vacuum processes was blocked by zombie processes related to a known PostgreSQL bug [1] introduced by our recent database upgrade to PostgreSQL 10.4.
Based on advice from AWS, we decided to restart the primary database to clear those processes. We also increased the database capacity to help handle the thundering herd that we anticipated when it came back online. Buildkite was in maintenance mode for around 60 minutes whilst we rebooted the database, verified that the zombie processes had gone and that auto-vacuum (and auto-analyze) could proceed on the key tables. We also performed a manual vacuum and analyze, which radically improved performance. At 7:26am the site was back online and performing well.
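For readers unfamiliar with the procedure, the manual pass looks roughly like the sketch below. It assumes a psycopg2 connection and hypothetical table names, and isn't the exact set of commands we ran:

```python
# Minimal sketch of a manual VACUUM/ANALYZE pass over a few key tables.
# Connection string and table names are hypothetical, not our production values.
import psycopg2

conn = psycopg2.connect("dbname=buildkite host=primary.example.internal")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

key_tables = ["jobs", "builds", "agents"]  # hypothetical table names

with conn.cursor() as cur:
    for table in key_tables:
        # VERBOSE surfaces progress in the logs; ANALYZE refreshes the
        # planner statistics that auto-analyze had fallen behind on.
        # The table names come from a fixed list, so string formatting is safe here.
        cur.execute(f"VACUUM (VERBOSE, ANALYZE) {table}")
```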
We are optimistic that we’ve uncovered and fixed the underlying issue that caused this incident and the previous few incidents. Since the fix and capacity improvements, our infrastructure has been running as well as it ever has.
The database performance issues had a wide variety of impacts, but the key affected systems were our queue system and the Agent API. Both were partially degraded between 1:06am and 7:26am.
The Agent API was running slowly and returning a high percentage of errors due to statement timeouts and lock contention. This presented in a very similar way to the outages the week before [2]: jobs not finishing and agents unable to pick up new jobs.
The backlog in the queue system manifested as asynchronous tasks that Buildkite performs taking much longer to execute. We break up our tasks into a set of queues based on urgency. During the incident, the high and standard priority queues were being adequately serviced with a latency of under 60 seconds, but at the expense of the low priority queues, which were at a standstill despite radically increased queue worker capacity. This meant that webhooks and commit statuses were severely delayed for the duration of the incident. The backlog of these tasks was processed fully once the site returned from maintenance mode at 7:26am.
This outage was preceded by a major outage on July 20th [2] where the root cause was also high database load. As part of the follow-on steps from that outage, we upgraded the database in question to PostgreSQL 10.4, primarily for its performance improvements and better performance analytics, to help further isolate the aspects of the workload causing lock contention under load. We performed that upgrade successfully during a planned maintenance window on July 29th.
It turns out that PostgreSQL 10.4 has an open issue where a backend process can become unresponsive during a parallel index scan [1]. We discovered this shortly after the upgrade and reached out to AWS for advice on how best to handle it. They suggested disabling parallel index scanning until PostgreSQL 10.5, which we did, but the only way to kill the zombie processes was to restart the database. We decided to defer the restart to a later maintenance window to avoid interruption, as we believed the processes would have no impact on the system.
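The report doesn't name the exact parameter involved, but one common way to switch off parallel query (and with it parallel index scans) is to set `max_parallel_workers_per_gather` to 0. On RDS the equivalent change is made through a parameter group; the sketch below only shows the per-session form, with a hypothetical connection string:

```python
# Sketch: disable parallel query for the current session and confirm the setting.
# This illustrates the general mechanism, not the specific change AWS recommended.
import psycopg2

conn = psycopg2.connect("dbname=buildkite host=primary.example.internal")

with conn.cursor() as cur:
    cur.execute("SET max_parallel_workers_per_gather = 0")
    cur.execute("SHOW max_parallel_workers_per_gather")
    print(cur.fetchone()[0])  # expect '0' for this session
```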
During the outage, we noticed that the zombie processes were holding open locks on key tables and preventing auto-vacuums from running correctly. As time progressed, performance on those tables got worse, because the planner statistics normally refreshed during auto-vacuum (by auto-analyze) weren't being updated. When another period of high load began, these underlying performance issues combined with the effect we saw in the previous outage [2]: lock contention on our primary database caused agents to thrash state, compounding the problem by creating extra table rows and generating large numbers of queued tasks.
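A query along these lines can surface auto-vacuum workers that are stuck waiting behind other sessions, together with the processes blocking them. This is a sketch of the kind of check involved, not our actual tooling:

```python
# Sketch: find autovacuum workers blocked by other sessions, using
# pg_blocking_pids() (available since PostgreSQL 9.6). Connection string is hypothetical.
import psycopg2

QUERY = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       now() - xact_start   AS running_for,
       query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
  AND cardinality(pg_blocking_pids(pid)) > 0;
"""

conn = psycopg2.connect("dbname=buildkite host=primary.example.internal")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
        print(row)  # each row: blocked autovacuum worker and the pids blocking it
```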
On further investigation of our auto-vacuum configuration, we discovered a number of areas of concern. Our largest table has close to 1 billion rows, so long-running auto-vacuums are normal, but with the default settings they weren’t keeping up with the rate of change in the table, even when they weren’t blocked by the zombie processes. The end result was that auto-analyze wasn’t being run regularly on the larger tables, and dead tuples weren’t being cleaned up correctly.
We believe that inadequate vacuuming is the root cause of many of the performance-related outages we've seen over the past few weeks. It resulted in unusually high database load, approaching (and occasionally exceeding) the provisioned IOPS of the underlying host.
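To make the scale concrete: with the stock defaults (autovacuum_vacuum_scale_factor = 0.2, autovacuum_analyze_scale_factor = 0.1), a ~1 billion row table accumulates on the order of 200 million dead tuples before a vacuum even triggers. One remedy is per-table thresholds based on absolute row counts; the table name and values below are hypothetical, not our production settings:

```python
# Sketch: per-table autovacuum tuning so vacuum/analyze trigger on absolute
# row counts rather than a fraction of a very large table. Hypothetical values.
import psycopg2

conn = psycopg2.connect("dbname=buildkite host=primary.example.internal")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        ALTER TABLE jobs SET (
            autovacuum_vacuum_scale_factor  = 0.0,
            autovacuum_vacuum_threshold     = 500000,
            autovacuum_analyze_scale_factor = 0.0,
            autovacuum_analyze_threshold    = 250000
        )
    """)
```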
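One way to watch for this kind of drift is to track dead tuple counts and the last successful auto-vacuum and auto-analyze per table. Again, this is a sketch of the kind of query that exposes tables falling behind, not our actual alerting:

```python
# Sketch: list the tables with the most dead tuples and when they were last
# autovacuumed/autoanalyzed, via pg_stat_user_tables. Connection string is hypothetical.
import psycopg2

QUERY = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
"""

conn = psycopg2.connect("dbname=buildkite host=primary.example.internal")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for relname, live, dead, last_vac, last_an in cur.fetchall():
        print(f"{relname}: {dead} dead / {live} live, "
              f"last autovacuum {last_vac}, last autoanalyze {last_an}")
```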
We've already done a number of things to address this:
With both this outage and the previous one, we were too slow to pick up on the incidents, often being notified of slow dispatches or webhook delays by customers. We are planning a series of changes to our monitoring and alerting so that we can respond to these situations more proactively. Some of the monitoring changes we have planned:
Beyond this, we'd like to better automate how our Statuspage components are marked during outages. Currently this is a manual process, which means the timeframe of an outage is often not indicated accurately.
Outside of monitoring changes, we are planning the following systems changes: