High Agent API latency

Incident Report for Buildkite

Postmortem

All times are in UTC

Service Impact

On May 13th from 21:30 until 23:40, a small percentage of customers (less than 10%) experienced delays of up to 5 minutes in starting builds.

On May 14th from 18:10 until 20:30, a different subset of customers (again less than 10%) experienced delays of up to 20 minutes in starting builds.

Incident Summary

On May 13th our engineers were paged at 22:17 (UTC) due to high database load. They soon identified that a sub-optimal plan was being generated for a key database query used by our backend to fetch the ID of a job assigned to an agent. Due to the high throughput of this query, we experienced performance degradation for all agents communicating with the Agent API on the impacted database. At 22:50 our on-call engineers started a manual ANALYZE, but soon concluded it would take more than an hour to complete, so they began investigating alternative workarounds. We deployed an emergency change, behind a feature flag, to hint to the query planner to prefer a more efficient index. This was enabled for all affected users at 23:34. At 23:46 the manual ANALYZE completed and service was fully restored.
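
For reference, a manual ANALYZE of the affected table is a single statement; the table name in this sketch is hypothetical, since the real schema is not shown in this report:

    -- Hypothetical table name. VERBOSE reports progress as statistics are rebuilt.
    ANALYZE VERBOSE jobs;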

Recent improvements made to the isolation of the Agent API meant this incident had no impact on customers on other databases.

On May 14th at 18:44 (UTC), our engineers were automatically paged for high latency on the Agent API. At 19:13 the on-call engineer enabled the previously deployed feature flag for all customers, which partially restored service, and started a manual ANALYZE. At 20:30 service was fully restored.

Technical Analysis

The Buildkite Pipelines sharded database contains a jobs table which stores, amongst other things, the state of a job (scheduled, assigned, running, passed, etc.). When an agent is assigned a job, our application queries the database for the job ID to return to the agent. Following a post-autovacuum ANALYZE, the plan chosen for this query became more expensive, and because of the query's high throughput this degraded performance for all agents communicating with the Agent API on the impacted database.
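
As an illustration only, the table, partial index, and lookup involved look roughly like the sketch below. Every name, and the exact shape of the query, is an assumption; the real schema is not shown in this report.

    -- Hypothetical schema: a jobs table whose state column is heavily skewed.
    CREATE TABLE jobs (
        id         uuid PRIMARY KEY,    -- backed by the default index jobs_pkey
        agent_id   uuid,
        state      text NOT NULL,       -- scheduled, assigned, running, passed, ...
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- A partial index covering only the (normally rare) assigned rows.
    CREATE INDEX jobs_assigned_idx ON jobs (agent_id) WHERE state = 'assigned';

    -- Sketch of the hot lookup, constrained by both the primary key and the
    -- state, so the planner can choose between the primary key index and the
    -- partial index above (matching the plan choice discussed below).
    SELECT id, agent_id, state
    FROM jobs
    WHERE id = '00000000-0000-0000-0000-000000000000'   -- placeholder job ID
      AND state = 'assigned';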

During an ANALYZE, Postgres takes a random sample of the table and estimates the frequency of each common value in a column. These frequencies form the most-common-values statistics the query planner uses to estimate the cost of the different plans it could execute. Since jobs typically move from scheduled to assigned and then to running very quickly, only a very small percentage of rows in this table are in the assigned state at any one time. This heavily skewed distribution reduces the accuracy of the ANALYZE statistics, since only a small subset of the table is sampled.
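
The statistics ANALYZE produces for a column can be inspected through the pg_stats view; a sketch using the hypothetical names from above:

    -- Frequency estimates for the state column. If 'assigned' is too rare to be
    -- captured well in the sample, row estimates for state = 'assigned'
    -- predicates become unreliable.
    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'jobs'
      AND attname = 'state';

    -- A standard knob (not necessarily the one used here): sample the column
    -- more heavily so rare values are estimated more accurately.
    ALTER TABLE jobs ALTER COLUMN state SET STATISTICS 1000;
    ANALYZE jobs;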

Over time the distribution of the state column became so skewed that there was a relatively high probability that the statistics produced by ANALYZE would be inaccurate. This caused the query planner to mis-estimate the number of rows returned by scans of a partial index and wrongly conclude that scanning it would be more efficient than using the primary key index. These queries had high enough throughput that the mis-estimation significantly increased database load and degraded the performance of all queries against the jobs table.
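
A mis-estimate of this kind is visible when the planner's row estimates are compared with actual row counts, for example (again using the hypothetical query shape sketched above):

    -- A large gap between the estimated rows ("rows=...") and the actual rows on
    -- the partial index scan is the signature of unrepresentative statistics.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, agent_id, state
    FROM jobs
    WHERE id = '00000000-0000-0000-0000-000000000000'
      AND state = 'assigned';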

Changes we're making

We have deployed changes to ensure all queries for jobs in the assigned state use the most efficient query plan.
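
As an illustration only (not a description of the change we shipped), one common way to express a planner hint on Aurora PostgreSQL is the pg_hint_plan extension, shown here with the hypothetical names from the earlier sketch:

    /*+ IndexScan(jobs jobs_pkey) */   -- pg_hint_plan hint: prefer the primary key index
    SELECT id, agent_id, state
    FROM jobs
    WHERE id = '00000000-0000-0000-0000-000000000000'
      AND state = 'assigned';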

We are investigating how aurora_stat_plans will enable us to detect sub-optimal query plans sooner.

Posted May 30, 2025 - 01:13 UTC

Resolved

This incident is now resolved. Database and Agent API performance has returned to normal.
Posted May 14, 2025 - 20:28 UTC

Monitoring

We've corrected the issue and are monitoring performance of the affected database shard.
Posted May 14, 2025 - 19:17 UTC

Identified

A bad database query plan is causing increased latency and timeouts for some organizations. We're working to correct the issue.
Posted May 14, 2025 - 19:12 UTC

Investigating

We're experiencing high Agent API latency for a subset of customers. We are working to identify the cause.
Posted May 14, 2025 - 19:08 UTC
This incident affected: Agent API.