Increased timeouts and errors

Incident Report for Buildkite

Postmortem

Timeline

04:17:45am UTC a routine but long-running reporting SQL query on our primary database completed. The long-running read query was followed by an expected high volume of writes to the Postgres write-ahead log (WAL), which consumed a significant share of the available IOPS on the instance and starved many queries serving customer requests of IO capacity.

04:22:30am UTC the WAL writes completed.

04:25:00am UTC IO capacity for responding to customer queries was fully restored and database throughput was back at normal levels.

Impact

The peak error rate for customer requests over the nearly 8 minutes of the incident was 41%. Most errors were timeouts, and latencies on requests that didn't time out were significantly higher than our targets.

Future Work

To avoid a repeat of this incident, we are exploring options for moving more of our reporting tasks to infrastructure that is dedicated to reporting. In the interim, reports such as the one involved in this incident will not be run on our primary instance.
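
As an illustration only, the interim rule of keeping reports off the primary instance could be expressed in application code along the following lines. This is a minimal sketch and not a description of our actual implementation: the connection strings, the workload labels, and the `choose_dsn` helper are all hypothetical names made up for the example.

```python
from dataclasses import dataclass

# Hypothetical connection targets; the real hosts are not part of this report.
PRIMARY_DSN = "postgresql://primary.example.internal/app"
REPORTING_DSN = "postgresql://reporting.example.internal/app"


@dataclass(frozen=True)
class Query:
    sql: str
    workload: str  # assumed labels: "customer" or "reporting"


def choose_dsn(query: Query) -> str:
    """Return the connection string a query should run against.

    Reporting queries go to the dedicated reporting instance so that their
    IO cost cannot starve customer-facing queries on the primary.
    Everything else stays on the primary.
    """
    if query.workload == "reporting":
        return REPORTING_DSN
    return PRIMARY_DSN


if __name__ == "__main__":
    report = Query("SELECT count(*) FROM builds", "reporting")
    request = Query("SELECT * FROM builds WHERE id = 1", "customer")
    print(choose_dsn(report))
    print(choose_dsn(request))
```

Routing by an explicit workload label, rather than inspecting the SQL text, keeps the decision simple and auditable: a report can only reach the primary if it is mislabeled.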

Posted Jul 30, 2021 - 06:49 UTC

Resolved

This incident has been resolved.

Posted Jul 27, 2021 - 05:15 UTC

Monitoring

We have identified the cause of the elevated errors and latency. The spike in errors and latency appears to have resolved, and we are currently monitoring the system status.

Posted Jul 27, 2021 - 04:51 UTC

Investigating

We are investigating reports of an increase in request timeouts and errors.

Posted Jul 27, 2021 - 04:29 UTC

This incident affected: Web, Agent API, and REST API.