On 2023-09-22, from 17:14 to 20:46 UTC (3h32m), customers experienced sustained elevated latency, particularly on agent network requests. For the following 1h32m until 22:18 UTC, there were sporadic spikes in latency, before system performance returned to expected levels.
During the incident, Buildkite engineers were automatically paged to respond and investigate. While the root cause could not be identified during the incident itself, responding engineers shed load from the database by reducing the number of database connections available to our application, and by pausing low-priority background tasks. These changes improved latency overall, but spikes remained until load generated by customer activity decreased.
In line with our recent efforts to improve Buildkite’s reliability, particularly by reducing the data in our primary database, our engineers have rebuilt the automated tasks that delete build data older than the retention periods. The most recent change altered the order in which builds were deleted – where previously builds were processed in no guaranteed order, the new task processed all builds in ascending order of ID. This task was running with low concurrency over the weekend so we could monitor its performance impact, and was one of the background tasks paused during incident response.
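The ordering change can be sketched as follows. This is an illustrative model only, not Buildkite’s actual deletion code: the builds, the `expired_build_ids` helper, and the batch size are all invented for the example.

```python
# Hypothetical sketch of ordered retention deletion (not the actual task).
# Builds past the retention cutoff are deleted in ascending ID order, in
# small batches, so each batch touches the lowest remaining IDs first.

def expired_build_ids(builds, retention_cutoff):
    """IDs of builds older than the retention cutoff, in ascending order."""
    return sorted(b["id"] for b in builds if b["created_at"] < retention_cutoff)

def delete_in_ascending_batches(builds, retention_cutoff, batch_size=2):
    """Delete expired builds batch by batch, lowest IDs first."""
    ids = expired_build_ids(builds, retention_cutoff)
    deleted = []
    for i in range(0, len(ids), batch_size):
        batch = set(ids[i:i + batch_size])
        builds[:] = [b for b in builds if b["id"] not in batch]
        deleted.extend(sorted(batch))
    return deleted

builds = [
    {"id": 5, "created_at": 10},
    {"id": 1, "created_at": 1},
    {"id": 3, "created_at": 2},
    {"id": 8, "created_at": 20},
]
print(delete_in_ascending_batches(builds, retention_cutoff=15))  # [1, 3, 5]
```

The consequence of this ordering, as described below, is that all the resulting dead rows cluster at the low end of the primary key index rather than being scattered through it.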
During the incident, it was observed that the cost of a query associated with our agent’s ping cycle had increased drastically, with index buffer hits in the planning phase blowing out. Investigation after the incident led to a theory that this was related to the `get_actual_variable_range()` function in PostgreSQL, which attempts “to identify the current actual minimum and/or maximum of the specified variable, by looking for a suitable b-tree index and fetching its low and/or high values”.
As the new build deletion task was deleting in ascending order of ID, it left dead tuples on the left-hand side of the B-tree for the offending table’s primary key index. As we have since confirmed through reproduction, this increased the index buffer hits caused by `get_actual_variable_range()`, and therefore the planning cost of the query overall. The cost remained high until a vacuum of the table’s indexes had completed.
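The mechanism can be illustrated with a toy model. This is not PostgreSQL internals, just a simulation of the left-edge scan: finding the actual minimum of the ID column means walking the low end of the index past every dead entry until a live tuple appears, and each entry visited costs a buffer hit.

```python
# Toy model of why ascending-order deletion inflated planning cost.
# Each index entry is (id, is_live); the scan starts at the low end of the
# sorted index and stops at the first live tuple.

def actual_minimum(index_entries):
    """Return (minimum live ID, entries visited) scanning from the left edge."""
    visited = 0
    for entry_id, is_live in index_entries:
        visited += 1
        if is_live:
            return entry_id, visited
    return None, visited

# Healthy index: the lowest entry is live, so one visit suffices.
fresh = [(1, True), (2, True), (3, True)]

# After deleting IDs 1..100000 in ascending order, and before vacuum, the
# left edge of the index is one long run of dead entries.
after_delete = [(i, False) for i in range(1, 100_001)] + [(100_001, True)]

print(actual_minimum(fresh))         # (1, 1)
print(actual_minimum(after_delete))  # (100001, 100001)
```

In the real system each visited entry translates into index buffer hits during query planning, which is why the cost fell back to normal only once vacuum had removed the dead entries.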
This high query planning cost combined with high traffic from agent pings resulted in database load that impacted all agent requests to our system.
A thorough offline reproduction of the issue above has allowed us to make improvements to the build deletion task, which will maintain synthetic records as the lowest values in the primary key index. We have since resumed this background process under close observation.
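A minimal sketch of the idea behind the mitigation, under our reading of it: a synthetic record with an ID below any real build keeps the left edge of the primary key index anchored at a live tuple, so the minimum lookup never has to walk a run of dead entries. The sentinel ID and cost function here are invented for illustration.

```python
# Hypothetical sentinel-record mitigation (illustrative, not the actual code).
SENTINEL_ID = 0  # synthetic record, always live, below all real build IDs

def min_lookup_cost(index_entries):
    """Entries visited from the low end until the first live tuple."""
    for visited, (_, is_live) in enumerate(index_entries, start=1):
        if is_live:
            return visited
    return len(index_entries)

dead_run = [(i, False) for i in range(1, 1001)]  # 1000 freshly deleted builds
without_sentinel = dead_run + [(1001, True)]
with_sentinel = [(SENTINEL_ID, True)] + dead_run + [(1001, True)]

print(min_lookup_cost(without_sentinel))  # 1001
print(min_lookup_cost(with_sentinel))     # 1
```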
In addition, we have identified further levers we can use in future incidents, such as shedding load by dynamically increasing agent ping intervals.
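One way such a lever might work, purely as an illustration: under database pressure, the server could return a larger ping interval to agents, scaled by a load factor and capped. The function, parameters, and numbers here are all hypothetical.

```python
# Hypothetical ping-interval backoff (illustrative only, not Buildkite's API).

def next_ping_interval(base_seconds, load_factor, backoff=2.0, cap=300):
    """Ping interval agents should use given current load.

    load_factor: 0 = healthy; each unit above 0 multiplies the interval
    by `backoff`, up to `cap` seconds.
    """
    interval = base_seconds * (backoff ** max(load_factor, 0))
    return min(interval, cap)

print(next_ping_interval(5, load_factor=0))   # 5.0 (healthy: no backoff)
print(next_ping_interval(5, load_factor=3))   # 40.0
print(next_ping_interval(5, load_factor=10))  # 300 (capped)
```

Raising the interval multiplicatively trades slower job dispatch for an immediate, proportional drop in agent request volume, which is the point of a load-shedding lever.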
We apologize for any impact this period of increased latency may have had on your workloads, and thank you for your patience as we work towards larger improvements to the reliability of our systems.