On 2023-09-22, from 17:14 to 20:46 UTC (3h32m), customers experienced sustained elevated latency, particularly on agent network requests. For the following 1h32m until 22:18 UTC, there were sporadic spikes in latency, before system performance returned to expected levels.
During the incident, Buildkite engineers were automatically paged to respond and investigate. While the root cause could not be identified during the incident itself, responding engineers shed load from the database by reducing the number of database connections available to our application, and by pausing low-priority background tasks. These changes improved latency overall, but spikes remained until load generated by customer activity decreased.
In line with our recent efforts to improve Buildkite’s reliability, particularly by reducing the data in our primary database, our engineers have rebuilt the automated tasks that delete build data older than the retention periods. The most recent change altered the order in which builds were deleted – where previously builds were processed in no guaranteed order, the new task processed all builds in ascending order of ID. This task was running with low concurrency over the weekend so we could monitor its performance impact, and was one of the background tasks paused during incident response.
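The ordering change can be sketched as follows. This is an illustrative model only, not Buildkite’s actual deletion code: the builds, the `expired_build_ids` helper, and the batch size are all invented for the example.

```python
# Hypothetical sketch of ordered retention deletion (not the actual task).
# Builds past the retention cutoff are deleted in ascending ID order, in
# small batches, so each batch touches the lowest remaining IDs first.

def expired_build_ids(builds, retention_cutoff):
    """IDs of builds older than the retention cutoff, in ascending order."""
    return sorted(b["id"] for b in builds if b["created_at"] < retention_cutoff)

def delete_in_ascending_batches(builds, retention_cutoff, batch_size=2):
    """Delete expired builds batch by batch, lowest IDs first."""
    ids = expired_build_ids(builds, retention_cutoff)
    deleted = []
    for i in range(0, len(ids), batch_size):
        batch = set(ids[i:i + batch_size])
        builds[:] = [b for b in builds if b["id"] not in batch]
        deleted.extend(sorted(batch))
    return deleted

builds = [
    {"id": 5, "created_at": 10},
    {"id": 1, "created_at": 1},
    {"id": 3, "created_at": 2},
    {"id": 8, "created_at": 20},
]
print(delete_in_ascending_batches(builds, retention_cutoff=15))  # [1, 3, 5]
```

The consequence of this ordering, as described below, is that all the resulting dead rows cluster at the low end of the primary key index rather than being scattered through it.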
During the incident, it was observed that the cost of a query associated with our agent’s ping cycle had increased drastically, with index buffer hits in the planning phase blowing out. Investigation after the incident led to a theory that this was related to the `get_actual_variable_range()` function in PostgreSQL, which attempts “to identify the current actual minimum and/or maximum of the specified variable, by looking for a suitable b-tree index and fetching its low and/or high values”.
As the new build deletion task was deleting in ascending order of ID, it left dead tuples on the left-hand side of the B-tree for the offending table’s primary key index. As we have since confirmed through reproduction, this increased the index buffer hits caused by `get_actual_variable_range()`, and therefore the planning cost of the query overall. The cost remained high until a vacuum of the table’s indexes had completed.
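The mechanism can be illustrated with a toy model. This is not PostgreSQL internals, just a simulation of the left-edge scan: finding the actual minimum of the ID column means walking the low end of the index past every dead entry until a live tuple appears, and each entry visited costs a buffer hit.

```python
# Toy model of why ascending-order deletion inflated planning cost.
# Each index entry is (id, is_live); the scan starts at the low end of the
# sorted index and stops at the first live tuple.

def actual_minimum(index_entries):
    """Return (minimum live ID, entries visited) scanning from the left edge."""
    visited = 0
    for entry_id, is_live in index_entries:
        visited += 1
        if is_live:
            return entry_id, visited
    return None, visited

# Healthy index: the lowest entry is live, so one visit suffices.
fresh = [(1, True), (2, True), (3, True)]

# After deleting IDs 1..100000 in ascending order, and before vacuum, the
# left edge of the index is one long run of dead entries.
after_delete = [(i, False) for i in range(1, 100_001)] + [(100_001, True)]

print(actual_minimum(fresh))         # (1, 1)
print(actual_minimum(after_delete))  # (100001, 100001)
```

In the real system each visited entry translates into index buffer hits during query planning, which is why the cost fell back to normal only once vacuum had removed the dead entries.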
This high query planning cost combined with high traffic from agent pings resulted in database load that impacted all agent requests to our system.
A thorough offline reproduction of the issue above has allowed us to make improvements to the build deletion task, which will maintain synthetic records as the lowest values in the primary key index. We have since resumed this background process under close observation.
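A minimal sketch of the idea behind the mitigation, under our reading of it: a synthetic record with an ID below any real build keeps the left edge of the primary key index anchored at a live tuple, so the minimum lookup never has to walk a run of dead entries. The sentinel ID and cost function here are invented for illustration.

```python
# Hypothetical sentinel-record mitigation (illustrative, not the actual code).
SENTINEL_ID = 0  # synthetic record, always live, below all real build IDs

def min_lookup_cost(index_entries):
    """Entries visited from the low end until the first live tuple."""
    for visited, (_, is_live) in enumerate(index_entries, start=1):
        if is_live:
            return visited
    return len(index_entries)

dead_run = [(i, False) for i in range(1, 1001)]  # 1000 freshly deleted builds
without_sentinel = dead_run + [(1001, True)]
with_sentinel = [(SENTINEL_ID, True)] + dead_run + [(1001, True)]

print(min_lookup_cost(without_sentinel))  # 1001
print(min_lookup_cost(with_sentinel))     # 1
```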
In addition, we have identified further levers we can use in future incidents, such as shedding load by dynamically increasing agent ping intervals.
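One way such a lever might work, purely as an illustration: under database pressure, the server could return a larger ping interval to agents, scaled by a load factor and capped. The function, parameters, and numbers here are all hypothetical.

```python
# Hypothetical ping-interval backoff (illustrative only, not Buildkite's API).

def next_ping_interval(base_seconds, load_factor, backoff=2.0, cap=300):
    """Ping interval agents should use given current load.

    load_factor: 0 = healthy; each unit above 0 multiplies the interval
    by `backoff`, up to `cap` seconds.
    """
    interval = base_seconds * (backoff ** max(load_factor, 0))
    return min(interval, cap)

print(next_ping_interval(5, load_factor=0))   # 5.0 (healthy: no backoff)
print(next_ping_interval(5, load_factor=3))   # 40.0
print(next_ping_interval(5, load_factor=10))  # 300 (capped)
```

Raising the interval multiplicatively trades slower job dispatch for an immediate, proportional drop in agent request volume, which is the point of a load-shedding lever.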
We apologize for any impact this period of increased latency may have had on your workloads, and thank you for your patience as we work towards larger improvements to the reliability of our systems.