Between 04:48 and 05:30 UTC on November 10th, 2025, the Agent API, which serves interactions with the Buildkite Agent, was severely degraded for the majority of customers. For some customers, this degradation continued until 08:43 UTC.
The REST API and GraphQL API were degraded from 04:48 until 08:43 UTC.
Some agents and jobs were left in an inconsistent state as a result of the degradation. Cases where the system could not self-heal required manual intervention, which was completed by 11:56 UTC.
At 04:48 UTC, a database migration was applied to remove an index on the table that stores annotations in the Pipelines product. A replacement index was already in place, but a small number of high-traffic queries were not using it as expected.
At 04:53 UTC, we detected the impact of the removed index: queries against the annotations table were timing out. In particular, the agent command to create a new annotation accounted for many of these timed-out queries. These queries substantially increased CPU load on the affected databases, which degraded the performance of all Pipelines features.
We immediately began recreating the removed database index. This was completed for roughly half of our database shards by 05:30 UTC and resulted in partial service restoration. For the remaining shards, load from ongoing queries against the annotations table prevented efficient recreation of the index.
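For context, recreating an index on a live Postgres table is typically done with `CREATE INDEX CONCURRENTLY`, which avoids blocking writes but has to wait out in-flight transactions, which is why heavy query load slows the build down. A minimal sketch is below; the index and column names are illustrative, not our actual schema.

```sql
-- Rebuild the dropped index without taking a lock that blocks writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- and it must wait for concurrent transactions touching the table, so
-- sustained query load makes the build much slower.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_annotations_job_id
    ON annotations (job_id);
```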
At this point, we split into teams in order to investigate multiple strategies to reduce that load, including disabling annotation functionality in the product, redirecting new annotations to a separate table, and disabling the annotations table through table-level permissions.
In the end, the cleanest and most efficient strategy proved to be temporarily revoking our application user's permission to read from and write to the annotations table. Combined with increasing max_parallel_maintenance_workers in Postgres, this significantly increased the speed of index creation.
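A rough sketch of that mitigation is shown below, assuming a Postgres role named `app_user` for the application; the role, table, index, and worker count are placeholders rather than our actual configuration.

```sql
-- Temporarily stop the application from reading or writing annotations,
-- removing the query load that was slowing the index build.
REVOKE SELECT, INSERT, UPDATE, DELETE ON annotations FROM app_user;

-- Allow more parallel workers for the index build in this session.
SET max_parallel_maintenance_workers = 8;

-- Rebuild the index. With application access revoked, a plain (blocking)
-- build is viable and fast; CONCURRENTLY would also work.
CREATE INDEX IF NOT EXISTS idx_annotations_job_id
    ON annotations (job_id);

-- Restore the application's access once the index is in place.
GRANT SELECT, INSERT, UPDATE, DELETE ON annotations TO app_user;
```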
Between 07:50 and 08:43 UTC, we used this method to finish recreating the index on the remaining database shards. By this point, latency and error rates across the Agent API, REST API, and GraphQL API had returned to normal levels.
As a result of the incident, a small number of agents and jobs were left in an inconsistent state. Over the following four hours, the affected agents were freed and job states were cleaned up.
This incident was primarily caused by a database migration that exposed gaps in our review process. We're implementing the following improvements to prevent similar incidents:
We've extended our internal tooling that flags risky migrations during code review to cover index removals. The tooling now instructs engineers to verify that an index is no longer required before deploying any migration that would drop it.
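As an illustration, one way to verify that an index is genuinely unused before dropping it is to check its scan count in Postgres's statistics views; the index name below is a placeholder.

```sql
-- Check how often an index has been used since statistics were last reset.
-- An idx_scan of 0 across all shards is a signal (not proof) that the
-- index may be safe to remove.
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_annotations_legacy';
```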
Database migrations currently apply to all shards simultaneously. We are implementing phased rollouts for high-risk migrations, which will limit blast radius and enable faster restoration if issues arise.
We're evaluating tools such as HypoPG to improve our analysis when creating or removing indexes. We're also working to surface index usage data from pganalyze more prominently in our development workflow, making it easier for engineers to make informed decisions about index changes.
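As a rough example of how HypoPG helps here, it lets you define a hypothetical index and check whether the planner would use it, without actually building anything; the table, column, and query below are illustrative only.

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Create a hypothetical (planner-only) index; nothing is written to disk.
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON annotations (job_id)'
);

-- EXPLAIN (without ANALYZE) shows whether the planner would choose the
-- hypothetical index for this query shape.
EXPLAIN SELECT * FROM annotations WHERE job_id = 123;

-- Discard the hypothetical indexes for this session.
SELECT hypopg_reset();
```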
Finally, we will automate the manual cleanup process used to resolve the inconsistent states that impacted some agents, to ensure the system can self-heal from similar issues in the future.