Reported slow response times and increased error rates

Incident Report for Buildkite

Postmortem

Overview

On Thursday, 14th March 2019, from around 4:00am AEDT, Buildkite experienced performance degradation which led to a partial outage. Engineers were paged for high load at 4:32am AEDT, investigated, and intervened to manage performance. Between 5:00am and 7:15am AEDT, job dispatch became slow and the Agent API experienced high error rates, leading to erroneous build failures. Performance returned to normal by 8:30am AEDT.

Root Cause

Before the outage there was an unusually high number of queries generating excessive temporary files on the disk of our main transactional database. This temp file activity caused significant write latency and slowed down regular database traffic, including scheduling jobs to agents. Some large customers' automatic agent scaling systems interpreted the reduced job throughput as a lack of capacity on their end and massively scaled out their number of agents, which in turn slowed Buildkite down even further. We later determined that the high number of temp files also interfered with write-ahead log and checkpointing activity, which became a choke point for writes.

What We’re Doing

1. We've added monitoring to identify increases in temp file usage and checkpoint latency.

2. We’ve adjusted our database configuration settings to increase the capacity of our write-ahead log and to checkpoint less frequently, even under high load.

3. We’re optimizing the queries that generated the large volume of temporary files.

4. We’re continuing work to optimize our write load for regular activity to prevent these issues more generally.

5. We’re engaging with affected customers to discuss ways we can offer better autoscaling metrics.

6. We’re continuing our review of Postgres usage patterns and parameter settings, including a third-party expert review.
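
As an illustration of the temp file monitoring in item 1, here is a minimal sketch of how an increase in temp file usage could be detected. It is not our actual monitoring code. It assumes the cumulative `temp_files` and `temp_bytes` counters that Postgres exposes in `pg_stat_database` are sampled periodically; the names `TempFileSnapshot`, `temp_bytes_per_second`, and `should_alert`, and any threshold value, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TempFileSnapshot:
    """One sample of Postgres' cumulative temp file counters.

    The values could come from a query such as:
        SELECT temp_files, temp_bytes
        FROM pg_stat_database
        WHERE datname = current_database();
    """
    taken_at: float   # sample time, in seconds since the epoch
    temp_files: int   # cumulative number of temp files created
    temp_bytes: int   # cumulative bytes written to temp files


def temp_bytes_per_second(prev: TempFileSnapshot, curr: TempFileSnapshot) -> float:
    """Return the average temp file write rate between two samples."""
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    delta = curr.temp_bytes - prev.temp_bytes
    if delta < 0:
        # The counters went backwards, e.g. after a statistics reset.
        # Report no activity rather than a negative rate.
        return 0.0
    return delta / elapsed


def should_alert(prev: TempFileSnapshot, curr: TempFileSnapshot,
                 threshold_bytes_per_second: float) -> bool:
    """Hypothetical alert rule: fire when the rate exceeds the threshold."""
    return temp_bytes_per_second(prev, curr) > threshold_bytes_per_second
```

Working from the difference between two samples matters because the counters are cumulative: a large absolute value only says the database has been running for a while, whereas a sharp rise in the rate is the signal of interest. A suitable threshold depends on the workload and would need to be tuned against normal traffic.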

Posted Mar 19, 2019 - 03:40 UTC

Resolved

The database performance incident is resolved. We'll follow up with a post-mortem.

Posted Mar 13, 2019 - 22:00 UTC

Monitoring

We've identified and fixed the issue causing database load. Latency should be returning to normal on all fronts.

Posted Mar 13, 2019 - 21:12 UTC

Update

We're continuing to investigate the issue that is causing degraded performance.

Posted Mar 13, 2019 - 20:40 UTC

Identified

We’ve identified some performance problems with our main transactional database that are causing slow load times for dashboard requests and causing some API calls to time out. We’re working on addressing the issues now.

Posted Mar 13, 2019 - 18:58 UTC

This incident affected: Web, Agent API, and REST API.