Overview
On Thursday, 14th March 2019, from around 4:00am AEDT, Buildkite experienced performance degradation that led to a partial outage. Engineers were paged at 4:32am AEDT because of the high load; they investigated and intervened to manage performance. Between 5:00am and 7:15am AEDT, job dispatch was slow and the Agent API returned a high rate of errors, which caused erroneous build failures. Performance returned to normal by 8:30am AEDT.
Root Cause
Before the outage, an unusually high number of queries were generating excessive temporary files on the disk of our main transactional database. This temp file activity caused significant write latency and slowed down regular database traffic, including scheduling jobs to agents. Some large customers' automatic agent scaling systems interpreted the reduced job throughput as a lack of capacity on their end and massively scaled out their number of agents, which in turn slowed Buildkite down even further. We later determined that the high number of temp files also interfered with write-ahead log and checkpointing activity, which became a choke point for writes.
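The feedback loop described above can be illustrated with a small sketch. It is not how any customer's scaling system actually works; the function name, its parameters, and the step limit are all hypothetical. The idea is to size the agent pool from the number of jobs that exist (waiting plus running) rather than from dispatch throughput, and to cap how far a single evaluation can scale out, so that a slow dispatcher is not mistaken for a shortage of agents.

```python
def desired_agent_count(current_agents: int,
                        waiting_jobs: int,
                        running_jobs: int,
                        max_agents: int,
                        max_step: int) -> int:
    """Return a target agent count (hypothetical scaling rule).

    Demand is measured as jobs that exist, not as how quickly jobs are
    being dispatched, so reduced throughput alone never raises the target.
    """
    demand = waiting_jobs + running_jobs
    target = min(demand, max_agents)

    # Limit scale-out per evaluation so one noisy reading cannot
    # multiply the number of agents connecting at once.
    if target > current_agents:
        target = min(target, current_agents + max_step)

    return max(target, 0)
```

With 10 agents, a hard limit of 100, and a step of 20, a sudden backlog of 500 waiting jobs raises the target to 30 rather than 100 in a single evaluation.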
What We’re Doing
1. We’ve added monitoring to identify increases in temp file usage and checkpoint latency.
2. We’ve adjusted our database configuration settings to increase the capacity of our write-ahead log and to checkpoint less frequently, even under high load.
3. We’re optimizing the queries that generated the large volume of temporary files.
4. We’re continuing work to optimize our write load for regular activity to prevent these issues generally.
5. We’re engaging with affected customers to discuss ways we can offer better autoscaling metrics.
6. We’re continuing our review of Postgres usage patterns and parameter settings, including a third-party expert review.
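As a rough illustration of the temp file monitoring in item 1, the sketch below turns two readings of PostgreSQL's cumulative temp file counters into a rate and compares it with a threshold. The `pg_stat_database` view and its `temp_files` and `temp_bytes` columns are standard PostgreSQL; the class, the function names, and the threshold are assumptions made for this example and do not describe our actual monitoring.

```python
from dataclasses import dataclass

# Query for the cumulative counters. Running it is left to whichever
# database client is already in use.
TEMP_STATS_SQL = """
SELECT temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database()
"""


@dataclass(frozen=True)
class TempSnapshot:
    """One reading of the counters (hypothetical helper type)."""
    taken_at: float   # seconds, for example from time.time()
    temp_files: int   # cumulative temp files created
    temp_bytes: int   # cumulative bytes written to temp files


def temp_bytes_per_second(prev: TempSnapshot, curr: TempSnapshot) -> float:
    """Average temp file write rate between two snapshots."""
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("snapshots must be in chronological order")
    delta = curr.temp_bytes - prev.temp_bytes
    if delta < 0:
        # Counters went backwards, so statistics were reset in between.
        # Count only what has accumulated since the reset.
        delta = curr.temp_bytes
    return delta / elapsed


def should_alert(prev: TempSnapshot, curr: TempSnapshot,
                 threshold_bytes_per_second: float) -> bool:
    """True when temp file writes exceed the chosen (assumed) threshold."""
    return temp_bytes_per_second(prev, curr) > threshold_bytes_per_second
```

A scheduled job could take a snapshot every minute and call `should_alert` with the previous one. The threshold would need tuning against normal traffic for the database in question.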