All times are in UTC
On May 13th from 21:30 till 23:40, a small percentage of customers (less than 10%) experienced delays of up to 5 minutes when starting builds.
On May 14th from 18:10 till 20:30, a different subset of customers (again less than 10%) experienced delays of up to 20 minutes when starting builds.
On May 13th our engineers were paged at 22:17 (UTC) due to high database load. They soon identified that a sub-optimal plan was being generated for a key database query used by our backend to fetch the ID of the job assigned to an agent. Due to the high throughput of this query, performance degraded for all agents communicating with the Agent API on the impacted database. At 22:50 our on-call engineers started a manual analyze, but soon concluded it would take more than an hour to complete, so they began investigating alternative workarounds. We deployed an emergency change, behind a feature flag, to hint to the query planner to prefer a more efficient index. This was enabled for all affected users at 23:34. At 23:46 the manual analyze completed and service was fully restored.
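This write-up doesn't say which mechanism the feature flag used to hint the planner. As a rough sketch of this kind of remediation, assuming the pg_hint_plan extension plus a hypothetical jobs table, jobs_pkey index, and query shape (sketched in more detail in the analysis below), the two actions taken that night might look like:

    -- Illustrative sketch only: pg_hint_plan, the jobs table, and jobs_pkey are
    -- assumptions about the setup, not the actual change that shipped.

    -- Refresh planner statistics for the hot table (the "manual analyze").
    ANALYZE VERBOSE jobs;

    -- With the pg_hint_plan extension loaded, a leading comment hint steers the
    -- planner toward the primary-key index for this query ($1 is a bind parameter).
    /*+ IndexScan(jobs jobs_pkey) */
    SELECT id
    FROM jobs
    WHERE id = $1
      AND state = 'assigned';

pg_hint_plan is one widely used way to pin an index choice behind application code; adjusting planner statistics or rewriting the query are alternatives, and the post doesn't say which approach was actually taken.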
Recent improvements made to the isolation of the Agent API meant this incident had no impact on customers on other databases.
On May 14th at 18:44 (UTC), our engineers were automatically paged for high latency on the Agent API. At 19:13 the on-call engineer enabled the previously deployed feature flag for all customers, which partially restored service, and started a manual analyze. At 20:30 service was fully restored.
The Buildkite Pipelines sharded database contains a jobs table which stores, amongst other things, the state of a job (scheduled, assigned, running, passed, etc.). When an agent is assigned a job, our application queries this table for the job ID to return to the agent. Following an ANALYZE triggered after an autovacuum, the plan chosen for this query became more expensive, and because of the query's high throughput this degraded performance for all agents communicating with the Agent API on the impacted database.
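To make the shape of the problem concrete, here is a minimal sketch of such a table and query. The names, the text state column, and the partial index are illustrative assumptions, not Buildkite's actual schema:

    -- Hypothetical sketch; table, column, and index names are assumptions.
    CREATE TABLE jobs (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        agent_id bigint,
        state    text NOT NULL   -- 'scheduled', 'assigned', 'running', 'passed', ...
    );

    -- A partial index covering only the (rare) assigned rows.
    CREATE INDEX jobs_assigned_idx ON jobs (id) WHERE state = 'assigned';

    -- One plausible shape for the hot query: return the job's ID once it has
    -- been assigned to the agent. The planner can satisfy the id + state
    -- predicate with either the primary key index or the partial index.
    SELECT id
    FROM jobs
    WHERE id = $1
      AND state = 'assigned';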
During an analyze, Postgres takes a random sample of the table and calculates a freq statistic for each value of a column. These values are used to build a most common value histogram, which the query planner uses to estimate the cost of the different plans it could execute. Since jobs typically move from scheduled to assigned and then to running very quickly, only a very small percentage of rows in this table are in the assigned state at any one time. Because analyze only samples a small subset of the table, this heavily skewed distribution reduces the accuracy of the resulting statistics.
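These statistics can be inspected directly. A minimal sketch, again assuming a jobs table with a state column, that shows the most common values and frequencies analyze produced:

    -- Inspect the statistics analyze produced for jobs.state. most_common_vals
    -- and most_common_freqs back the planner's selectivity estimates for
    -- predicates like state = 'assigned'.
    SELECT most_common_vals,
           most_common_freqs,
           n_distinct,
           null_frac
    FROM pg_stats
    WHERE tablename = 'jobs'
      AND attname = 'state';

If a rare value like assigned barely appears in the sample, it can drop out of most_common_vals entirely, leaving the planner to estimate its frequency indirectly from the remaining statistics.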
Over time the distribution of the state column became so skewed that there was a relatively high probability that the statistics produced by analyze would be inaccurate. This caused the query planner to mis-estimate the number of rows returned by scans of a partial index and wrongly conclude that it would be more efficient than using the primary key index. The query's throughput was high enough that this mis-estimation had a significant impact on database load and degraded the performance of all queries against the jobs table.
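A mis-estimate like this shows up when the planner's row estimates are compared with reality. A minimal sketch, using the hypothetical query shape from above with a literal job ID:

    -- Compare estimated rows with actual rows for the hot query. A large gap
    -- on the chosen index scan node is the signature of stale or unlucky
    -- statistics, and the plan line shows which index the planner picked.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id
    FROM jobs
    WHERE id = 12345
      AND state = 'assigned';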
We have deployed changes to ensure all queries for jobs in the assigned state use the most efficient query plan.
We are investigating how aurora_stat_plans will enable us to detect sub-optimal query plans sooner.