On 18 January 2019 at 12:16 PM AEDT (UTC+11) we started receiving reports from customers that they were seeing errors when loading the Buildkite dashboard. We began investigating right away, and our automated monitoring soon confirmed that the majority of our systems were down.
We quickly identified a storage problem with our main transactional database. We increased database storage capacity and waited for recovery to complete.
At 12:32 PM AEDT, Buildkite was back online and started serving requests as usual.
During this period, we had a major outage across all of our components. Agents continued to run any in-progress work, but no new work was scheduled. Our webhook intake system was also down, which means we didn’t process any GitHub webhooks during the outage.
As part of regular database maintenance, we occasionally use pg_repack to optimise our database storage and performance. It works by creating a new table, re-writing the data from the existing table into it (using triggers to keep the two tables in sync), and then switching the tables when finished. It does all this without any downtime, and is generally not noticeable. pg_repack is also a great alternative to VACUUM FULL, which takes an exclusive lock on the table until it completes and can run for hours depending on the size of the table. Because pg_repack creates a copy of the table, it requires at least as much free storage space as the size of the table being repacked. We never repack a table unless we know it will fit well within our free storage space, with a large margin left over for regular operations.
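For readers unfamiliar with the tool, a typical invocation looks something like this (the database and table names here are illustrative, not our actual schema):

```shell
# Repack a single table online. pg_repack copies the rows into a new table,
# keeps the copy in sync via triggers, then swaps the two during a brief lock.
# --no-order reclaims bloated space without re-clustering the rows.
pg_repack --dbname=buildkite_production --table=public.builds --no-order
```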
This morning we kicked off a process to repack one of our tables. Due to human error, instead of the intended table, which we knew would fit comfortably within our free storage space, we repacked a much larger, similarly named table that exceeded our storage capacity. This caused a storage error and pushed our database into recovery until we increased the storage space.
Our database free storage space dropped from over half a terabyte to zero within a matter of minutes. Our monitoring was designed to alert on regular application activity causing storage exhaustion, and did not trigger soon enough because the drop was so fast. When free storage hit zero our database went into recovery, Buildkite started throwing errors, and we had a major system outage.
1. We’re going to update our database maintenance scripts to verify there is enough free storage space before running any repack, removing the opportunity for this kind of human error.
2. We’ve tightened up our database monitoring to alert us sooner if free storage drops suddenly.
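As a sketch of the first remediation (the function, margin, and numbers below are illustrative assumptions, not our actual tooling), the pre-flight check might look like:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: only proceed with a repack when free disk
# space comfortably exceeds the on-disk size of the target table, with a
# configurable safety margin for regular operations.
enough_space() {
  local table_bytes=$1 free_bytes=$2 margin=${3:-2}
  # succeed only when free space >= margin x table size
  (( free_bytes >= table_bytes * margin ))
}

# In a real script these inputs would come from Postgres and the filesystem,
# e.g. pg_total_relation_size() for the table and df for the disk.

# Illustrative numbers (in GB): ~550 free, accidental 600 GB target table.
if enough_space 600 550; then
  echo "ok to repack"
else
  echo "refusing to repack: not enough free space"
fi
```

A failed check would abort the maintenance script before pg_repack ever starts copying data, regardless of which table was named on the command line.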