04:17:45am UTC a routine but long-running reporting SQL query on our primary database completed. The long-running read query was followed by an expected high volume of writes to the Postgres Write Ahead Log (WAL). These writes consumed a significant share of the available IOPS on the instance and starved many queries serving customer requests of IO capacity.
04:22:30am UTC the WAL writes completed.
04:25:00am UTC IO capacity for responding to customer queries was fully restored and database throughput returned to normal levels.
The peak error rate for customer requests over the nearly 8 minutes was 41%. Most errors were timeouts, and latencies on requests that didn't time out were significantly higher than our targets.
To avoid a repeat of this incident, we are exploring options for moving more of our reporting tasks to infrastructure that is dedicated to reporting. In the interim, reports such as the one involved in this incident will not be run on our primary instance.
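
As an illustration of the interim rule, the sketch below shows one way a job runner could refuse to send reporting work to the primary. It is a minimal example under stated assumptions, not a description of our actual tooling: the connection strings, the `Job` type, and the `dsn_for` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical connection strings, used only for illustration.
PRIMARY_DSN = "postgresql://primary.internal.example/app"
REPORTING_DSN = "postgresql://reporting.internal.example/app"


@dataclass(frozen=True)
class Job:
    name: str
    kind: str  # either "customer" or "reporting"


def dsn_for(job: Job) -> str:
    """Return the database a job may connect to.

    Reporting jobs are always sent to the dedicated reporting
    database, so they cannot compete with customer queries for
    IO on the primary.
    """
    if job.kind == "reporting":
        return REPORTING_DSN
    if job.kind == "customer":
        return PRIMARY_DSN
    # Fail loudly on unknown kinds instead of defaulting to the primary.
    raise ValueError(f"unknown job kind: {job.kind!r}")
```

Rejecting unknown job kinds, rather than falling back to the primary, keeps a mislabeled report from quietly running against the primary again.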