Increased error rate affecting Web UI and APIs

Incident Report for Buildkite

Postmortem

Service Impact

Between 04:48 and 05:30 UTC on November 10th, 2025, the Agent API, which serves interactions with the Buildkite Agent, was severely degraded for the majority of customers. This degradation continued until 08:43 UTC for some remaining customers.

The REST API and GraphQL API were degraded from 04:48 until 08:43 UTC.

Some agents and jobs were left in an inconsistent state as a result of the degradation. Cases where the system could not self-heal required manual intervention, which was completed by 11:56 UTC.

Incident Summary

At 04:48 UTC, a database migration was applied to remove an index on the table that stores annotations in the Pipelines product. Another index intended to replace the removed one was already in place, but a small number of high-traffic queries were not hitting the new index as expected.
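
One way to catch this ahead of a removal is to confirm both that the replacement index is actually being scanned and that representative hot queries plan onto it. The sketch below illustrates that check using standard Postgres statistics views; the column names and query shape are placeholders rather than our real schema:

    -- Sketch only: column names and the query predicate are hypothetical.
    -- 1. Is the replacement index actually being scanned?
    SELECT indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE relname = 'annotations';

    -- 2. Does a representative hot query plan onto the replacement index?
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, body
    FROM annotations
    WHERE job_id = '11111111-1111-1111-1111-111111111111'
    ORDER BY created_at DESC
    LIMIT 1;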

At 04:53 UTC, the impact of the removed index was detected and observed as queries timing out on the annotations table. In particular, the agent command to create a new annotation was behind many of these timed-out queries. These queries substantially increased CPU load on the impacted databases, which degraded performance of all Pipelines features.
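
During an event like this, the degradation shows up as long-running statements against the affected table. As a rough illustration (the view is standard Postgres; the text filter is only indicative, not how we attribute queries), a check like the following makes the pile-up visible:

    -- Sketch only: the ILIKE filter is an approximation.
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query ILIKE '%annotations%'
    ORDER BY runtime DESC;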

We immediately began recreating the removed database index. This was completed for roughly half of our database shards by 05:30 UTC and resulted in partial service restoration. For the remaining shards, load from ongoing queries to the annotations table prevented the efficient recreation of the index.
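
Rebuilds like this are typically done with Postgres' non-blocking CREATE INDEX CONCURRENTLY. A minimal sketch, assuming an illustrative index name and column rather than our real definition:

    -- Sketch only: the index name and column are illustrative.
    CREATE INDEX CONCURRENTLY IF NOT EXISTS index_annotations_on_build_id
        ON annotations (build_id);

A concurrent build scans the table and waits out concurrent transactions rather than taking a blocking lock, so heavy ongoing query load against the same table slows it down considerably, consistent with what we saw on the remaining shards.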

At this point, we split into teams to investigate multiple strategies for reducing that load, including disabling annotation functionality in the product, redirecting new annotations to a separate table, and blocking access to the annotations table through table-level permissions.

In the end, the cleanest and most efficient strategy proved to be temporarily revoking our application user's permissions to read from and write to the annotations table. Combined with increasing max_parallel_maintenance_workers in Postgres, this allowed us to significantly increase the speed of index creation.
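
A hedged sketch of that sequence, assuming a hypothetical application role name and an illustrative worker count:

    -- Sketch only: the role name and worker count are illustrative.
    -- 1. Temporarily cut application access so the table goes quiet.
    REVOKE SELECT, INSERT, UPDATE, DELETE ON annotations FROM buildkite_app;

    -- 2. Allow more parallel workers for the index build in this session.
    SET max_parallel_maintenance_workers = 8;

    -- 3. Re-run the concurrent index build from the earlier sketch, then restore access.
    GRANT SELECT, INSERT, UPDATE, DELETE ON annotations TO buildkite_app;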

Between 07:50 UTC and 08:43 UTC, we used this method to finish recreating the index on the remaining database shards. By this point, latency and error rates across the Agent API, REST API and GraphQL API had returned to normal levels.

As a result of the incident, a small number of agents and jobs were left in an inconsistent state. Over the following four hours, the agents were freed and job states were cleaned up.

Changes we're making

This incident was primarily caused by a database migration that exposed gaps in our review process. We're implementing the following improvements to prevent similar incidents:

We've extended the internal tooling that flags risky migrations during code review so that it now covers index removals. The tooling instructs engineers to verify that an index is no longer required before deploying any database migration that would delete it.

Database migrations currently apply to all shards simultaneously. We are implementing phased rollouts for high-risk migrations, which will limit blast radius and enable faster restoration if issues arise.

We're evaluating tools such as HypoPG to improve our analysis when creating or removing indexes. We're also working to surface index usage data from pganalyze more prominently in our development workflow, making it easier for engineers to make informed decisions about index changes.
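
As an illustration of the kind of HypoPG check we're evaluating (the proposed index and the test query below are placeholders, not our real schema):

    -- Sketch only: the proposed index and the test query are illustrative.
    CREATE EXTENSION IF NOT EXISTS hypopg;

    -- Register a hypothetical index; nothing is physically built.
    SELECT * FROM hypopg_create_index(
        'CREATE INDEX ON annotations (build_id, created_at)'
    );

    -- Plain EXPLAIN (without ANALYZE) considers hypothetical indexes,
    -- so the plan shows whether the index would actually be used.
    EXPLAIN SELECT id FROM annotations
    WHERE build_id = 'some-build-id'
    ORDER BY created_at DESC;

    -- Recent HypoPG releases can also hide an existing index from the planner,
    -- previewing the effect of dropping it, e.g.:
    --   SELECT hypopg_hide_index('index_annotations_on_build_id'::regclass);

    SELECT hypopg_reset();  -- discard the hypothetical indexes when done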

Finally, we will automate the manual cleanup process used to resolve the inconsistent states that impacted some agents, to ensure the system can self-heal from similar issues in the future.

Posted Nov 18, 2025 - 03:12 UTC

Resolved

This incident has been resolved. All services are operating normally and remaining jobs have been processed.

We apologise for the disruption and thank you for your patience.
Posted Nov 10, 2025 - 12:00 UTC

Update

All services are currently stable and we continue to monitor system status. We are working to unblock some remaining jobs that were impacted by the incident.

Thank you for your continued patience.
Posted Nov 10, 2025 - 10:46 UTC

Monitoring

The issue has been resolved and service has been restored to all remaining customers. We are continuing to monitor to ensure stability.

We apologize for the disruption and thank you for your patience.
Posted Nov 10, 2025 - 09:02 UTC

Update

We have restored service for additional customers and are continuing to work on a resolution for remaining affected customers.

We've identified the root cause of the issue: a database index on our annotations table was replaced, but the replacement index isn't being utilized as expected under load, which is causing performance degradation.

Thank you for your understanding and patience as we work to resolve this as quickly as possible.
Posted Nov 10, 2025 - 08:46 UTC

Update

We have restored service for additional customers and are continuing to work on a resolution for remaining affected customers.
Posted Nov 10, 2025 - 08:05 UTC

Update

We are actively working on a resolution for remaining customers. We do not have an estimated time for resolution at this time.
Posted Nov 10, 2025 - 07:24 UTC

Update

We have restored service for the majority of customers and are continuing to work on those that remain.

We are working through applying a fix but the exact time to fully restored service is not yet clear.
Posted Nov 10, 2025 - 06:36 UTC

Identified

The REST and GraphQL APIs, as well as the Web UI, are currently affected by increased error rates and slow performance.

We have identified the cause of the issue, and are working to address it.
Posted Nov 10, 2025 - 05:30 UTC

Investigating

We have noticed issues with displaying build annotations, which may affect viewing builds in the web UI. We are still investigating and will provide another update shortly.
Posted Nov 10, 2025 - 05:22 UTC
This incident affected: Web and REST API.