Database Performance Issues
Incident Report for Buildkite
Postmortem

Overview

On August 3rd at 1:06am AEST we were notified of slow job dispatches for a large-volume customer, and of agents being slow to complete work. Load on our primary PostgreSQL database was high (>80%) and there was a large backlog (around 12 million) of queued background jobs, which presented to customers as slow webhooks and commit statuses.

While we investigated and implemented various potential fixes, database performance gradually got worse. At 6:00am we identified that one of our routine auto-vacuum processes was blocked by zombie processes related to a known PostgreSQL bug [1] introduced with our recent database upgrade to PostgreSQL 10.4.

Based on advice from AWS, we decided to restart the primary database to clear those processes. We also increased the database capacity to help handle the thundering herd we anticipated when it came back online. Buildkite was in maintenance mode for around 60 minutes whilst we rebooted the database, verified that the zombie processes had gone, and confirmed that auto-vacuum (and auto-analyze) could proceed on the key tables. We also performed a manual vacuum and analyze, which radically improved performance. At 7:26am the site was back online and performing well.
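
For context, the manual pass was along the lines of the sketch below. This is illustrative only: the table name (jobs) is a placeholder rather than our actual schema, and the follow-up query simply confirms that the vacuum and analyze timestamps are advancing again.

    -- Manual vacuum and analyze of a key table (table name is a placeholder)
    VACUUM (VERBOSE, ANALYZE) jobs;

    -- Confirm that vacuum/analyze activity is being recorded again
    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'jobs';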

We are optimistic that we've uncovered and fixed the underlying issue that caused this incident and the last few incidents. Since the fix and the capacity improvements, our infrastructure has been running as well as it ever has.

What was affected

The database performance issues had a wide variety of impacts, but the key affected systems were our queue system and the Agent API. Both of these systems were partially degraded between 1:06am and 7:26am AEST.

The Agent API was running slowly and returning a high percentage of errors due to statement timeouts and lock contention. This presented in a very similar manner to the previous week's outages [2], namely jobs not finishing and agents being unable to pick up new jobs.
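
As an illustration of the kind of diagnostics involved, a query along these lines shows which backends are waiting on locks and which processes are blocking them (pg_blocking_pids is available from PostgreSQL 9.6 onwards). The exact queries we ran during the incident varied, so treat this as a sketch rather than a transcript.

    -- Backends currently waiting on locks, and the PIDs blocking them
    SELECT pid,
           state,
           wait_event_type,
           wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           now() - query_start   AS waiting_for,
           left(query, 80)       AS query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
    ORDER BY query_start;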

The backlog in the queue system manifested as the asynchronous tasks that Buildkite performs taking much longer to execute. We break up our tasks into a set of queues based on urgency. During the incident, the high and standard priority queues were being adequately serviced with a latency of under 60 seconds, but at the expense of the low priority queues, which were at a standstill despite radically increased queue worker capacity. This meant that webhooks and commit statuses were severely delayed for the duration. The backlog of these tasks was processed fully once the site returned from maintenance mode at 7:26am.

Root cause

This outage was preceded by a major outage on July 20th [2] where the root cause was also high database load. As part of the follow-on steps from that outage, we upgraded the database in question to PostgreSQL 10.4, primarily for its performance improvements and better performance analytics, to help further isolate the aspects of the workload that were causing lock contention under load. We performed that upgrade successfully during a planned maintenance window on July 29th.

It turns out that PostgreSQL 10.4 has an open issue where a PostgreSQL process can become non-responsive during a parallel index scan [1]. We discovered this shortly after the upgrade and reached out to AWS for advice on how best to handle it. They suggested disabling parallel index scanning until PostgreSQL 10.5, which we did, but the only way to kill the zombie processes was to restart the database. We decided to defer the restart to a later maintenance window to avoid interruption, as we believed the processes would have no impact on the system.
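
Disabling parallel query is a standard PostgreSQL setting change. On RDS it is applied via the DB parameter group, but on a self-managed instance it would look roughly like the sketch below (setting max_parallel_workers_per_gather to 0 disables parallel plans, including parallel index scans); this is illustrative rather than the exact change we made.

    -- Disable parallel query plans cluster-wide (self-managed PostgreSQL;
    -- on RDS the same parameter is changed via the DB parameter group)
    ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
    SELECT pg_reload_conf();

    -- Or per-session, while verifying the effect on a specific query plan
    SET max_parallel_workers_per_gather = 0;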

During the outage, we noticed that the zombie processes were holding open locks on the tables and preventing auto-vacuums from running correctly. As time progressed, performance on the relevant tables got worse, because the query statistics that are normally refreshed during auto-vacuum (and auto-analyze) weren't being updated. When another period of high load began, these underlying performance issues combined with the effect we saw in the previous outage [2]: lock contention on our primary database caused agents to thrash state, which compounded the issue by creating extra table rows and generating large numbers of queued tasks.
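
Autovacuum workers are visible in pg_stat_activity with a query text beginning with 'autovacuum:', so a check along these lines (illustrative, not a transcript of the incident) shows whether any of them are stuck waiting and which backends hold the locks they need.

    -- Autovacuum workers that are currently waiting, and what is blocking them
    SELECT pid,
           wait_event_type,
           wait_event,
           pg_blocking_pids(pid) AS blocked_by,
           now() - xact_start    AS running_for,
           query
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';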

On further investigation of our auto-vacuum configuration, we discovered a number of areas of concern. Our largest table has close to 1 billion rows, so long-running auto-vacuums are normal, but with the default settings they weren't keeping up with the rate of change in the table, even when they weren't blocked by the zombie processes. The end result was that auto-analyze wasn't being run regularly on the larger tables and dead tuples weren't being cleaned up correctly.
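
The scale of the problem follows from how autovacuum decides when to run: a vacuum is triggered once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples. With the defaults (50 and 0.2), a table of roughly 1 billion rows accumulates on the order of 200 million dead tuples before a vacuum even starts, and auto-analyze (default scale factor 0.1) waits for around 100 million row changes. A query like the following (illustrative) computes those trigger points from the live settings.

    -- Per-table autovacuum trigger points under the current settings
    SELECT c.relname,
           c.reltuples::bigint AS estimated_rows,
           s.n_dead_tup,
           current_setting('autovacuum_vacuum_threshold')::bigint
             + current_setting('autovacuum_vacuum_scale_factor')::float * c.reltuples
             AS vacuum_trigger_point
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    ORDER BY c.reltuples DESC
    LIMIT 10;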

We believe that inadequate vacuuming is the root cause of many of the performance-related outages we've seen over the past few weeks: it resulted in unusually high database load, approaching (and occasionally exceeding) the provisioned IOPS on the underlying host.

Changes we've made

We've already made a number of changes:

  • Increased the provisioned IOPS for the database by a factor of five. We try to provision around double the capacity we require for our peak load.
  • Tuned auto-vacuum settings for shorter and more frequent auto-vacuums of the relevant tables (an illustrative example of per-table settings is sketched after this list).
  • Reduced the scope of a variety of long-running queries that were holding long-running locks. This includes the agent reaper that marks non-responsive agents as lost, which was a major contributor to the outage.
  • Radically reduced the number of tasks enqueued for service notifications.
  • Added improved connection pooling for our Redis instances (we believed at one point that this was a factor, but later ruled it out).
  • Split out webhooks and commit statuses into their own Sidekiq queues for better visibility and prioritization. As part of this we will be prioritizing commit statuses over webhooks.
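
As referenced above, per-table autovacuum overrides look roughly like the following. The table name and the specific values here are placeholders, not our production settings; the idea is to trigger vacuum and analyze on a fixed row count rather than on a fraction of a ~1 billion row table.

    -- Illustrative per-table autovacuum overrides (values are placeholders)
    ALTER TABLE jobs SET (
      autovacuum_vacuum_scale_factor  = 0,
      autovacuum_vacuum_threshold     = 100000,
      autovacuum_analyze_scale_factor = 0,
      autovacuum_analyze_threshold    = 100000
    );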

Changes we will make

With both this outage and the previous one, we've been too slow to pick up on the incidents, often being notified by customers of slow dispatches or webhook delays. We are planning a series of changes to our monitoring and alerting so that we can respond to these situations more proactively. Some of the monitoring changes we have planned:

  • Alerting around database IOPS to assist with better capacity planning
  • Alerting around auto-vacuum and auto-analyze so that we can be sure these processes are running correctly (one possible check is sketched after this list)
  • Alerting around slow dispatches for high-volume customers
  • Alerting around lock contention and queries that are terminated for exceeding statement timeouts
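
The check referenced above could look something like this: flag tables whose planner statistics look stale or whose dead-tuple count is climbing. The thresholds here are placeholders that would need tuning per table.

    -- Tables with stale statistics or a growing dead-tuple backlog
    -- (thresholds are placeholders)
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_autovacuum,
           last_autoanalyze
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 0.05 * greatest(n_live_tup, 1)
       OR last_autoanalyze IS NULL
       OR last_autoanalyze < now() - interval '1 day'
    ORDER BY n_dead_tup DESC;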

Beyond this, we'd like to better automate how our Statuspage components are marked during outages. Currently this is a manual process, which leaves a lot to be desired in terms of correctly indicating the timeframe of the outage.

Outside of the monitoring changes, we are planning the following systems changes:

  • Partitioning our larger tables to allow us to handle future scaling needs (a minimal sketch of what this could look like in PostgreSQL 10 follows)
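
A minimal sketch of PostgreSQL 10 declarative range partitioning, using hypothetical table and column names rather than our actual schema:

    -- Parent table partitioned by creation time (names are illustrative)
    CREATE TABLE jobs_partitioned (
      id         bigint      NOT NULL,
      created_at timestamptz NOT NULL,
      state      text        NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- One partition per month
    CREATE TABLE jobs_2018_08 PARTITION OF jobs_partitioned
      FOR VALUES FROM ('2018-08-01') TO ('2018-09-01');
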
Posted Aug 07, 2018 - 07:17 UTC

Resolved
The incident has been resolved and the system is running as normal.
Posted Aug 02, 2018 - 22:43 UTC
Monitoring
All systems are functioning as normal at this stage. Thanks for your patience everyone and we'll absolutely be following up with more analysis and a post-mortem.
Posted Aug 02, 2018 - 22:11 UTC
Update
We're seeing performance return to normal levels and we've turned off maintenance mode. The background queues are also returning to normal levels.
Posted Aug 02, 2018 - 22:04 UTC
Update
We're continuing to investigate the cause of the ongoing database issues with Amazon. We'll provide updates as soon as we know more.
Posted Aug 02, 2018 - 21:19 UTC
Update
The affected database has been restarted, but we're investigating timeouts that are preventing the web nodes from returning to service.
Posted Aug 02, 2018 - 20:44 UTC
Update
We've been advised by our upstream provider that we are likely being affected by a Postgres bug that is preventing autovacuums from completing. We will be performing a restart on the affected database, which will cause 3-5 minutes of outage.
Posted Aug 02, 2018 - 20:19 UTC
Update
We've rolled out a new queue cluster, which means the commit statuses and webhook events from now on will be processed immediately whilst we deal with the backlog from the past few hours.
Posted Aug 02, 2018 - 19:56 UTC
Update
We're still working through the webhook and commit status backlog. We're spinning up extra infrastructure to assist.
Posted Aug 02, 2018 - 19:17 UTC
Identified
We've isolated the initial problem as database lock contention caused by unusually high load. The database performance issue resolved quickly, but caused a cascade of follow-on problems as traffic dog-piled back onto the system. We've resolved the issues with the Agent API and agents should be functioning normally, but we are still processing a large backlog of background tasks which will cause things like web-hooks to be delayed whilst we process them.
Posted Aug 02, 2018 - 17:54 UTC
Update
We're also seeing elevated response times on the Agent API, which is related to the same underlying database issues.
Posted Aug 02, 2018 - 16:07 UTC
Investigating
We're seeing high delays in service notifications including webhooks.
Posted Aug 02, 2018 - 15:56 UTC
This incident affected: Web, Agent API, REST API, Job Queue and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Hipchat Notifications, Webhook Notifications).