Buildkite service disruption

Incident Report for Buildkite

Postmortem

Service Impact

Between 02:29 and 02:48 UTC on May 1st 2026, the Buildkite dashboard, Agent API, REST API and GraphQL were severely degraded for the majority of customers.

During this period, customers were unable to reliably load pages in the dashboard, and agents experienced errors when communicating with the Agent API.

Service was restored for all customers by 02:48 UTC, with a small number of follow-up support cases handled in the hours afterward.

Incident Summary

At 02:29 UTC, a database migration was applied that renamed a column on the users table in our foundation database, which is a core, unsharded table referenced across the Buildkite platform.

Because our database migrations run before the new application code is deployed, the migration was applied while the application code running in production still expected the original column name. Queries that loaded a user record began failing immediately across the product, and continued to fail until the column was manually restored.

The impact was detected at 02:29 UTC by an automated alert, and several customers reached out to support within minutes.

Initial investigation focused on preparing an emergency rollback, but at approximately 02:41 UTC we identified that the change was an unsafe migration and that a rollback would not restore service.

We then moved to manually recreate the renamed column on the users table. This change was applied at approximately 02:48 UTC, at which point error rates returned to normal levels across the affected services.

Total customer-facing impact was 19 minutes.

What contributed to this happening

Three things came together to allow this change to ship without the safety pre-steps that would normally precede a destructive migration:

The migration was generated by an LLM-assisted workflow, which added the safety_assured annotation as part of the generated change. safety_assured is intended as an explicit engineer acknowledgement that the safety implications of a destructive migration have been considered and prepared for, not as a routine flag added by tooling.

We use Danger (https://danger.systems/ruby/) to call out risky database migrations, with specific guidance for certain operations such as index changes, but there was no specific guidance for rename_column. The generic migration warning that did exist had, after dozens of routine migrations, become background noise.

The migration was bundled into the same PR as the application code that depended on the renamed column. This made the schema change easier to overlook in review, and meant the test suite for the PR only ever exercised the new code against the new schema, never the previously-deployed code against the new schema.

This combination meant the unsafe sequence reached production with no automated test or human review specifically focused on whether the currently-running application could tolerate the schema change.

Changes we're making

This incident was caused by a database migration that was not adequately prepared for safe deployment, and exposed gaps in both our review process and our use of code generation tooling. We're making the following improvements to prevent similar incidents:

Database migrations that are unsafe to apply against the currently-running application code will be flagged and blocked before they can be merged. We are still working through the exact mechanism for this; one option under active discussion is requiring schema migrations to ship in separate pull requests from any application code that depends on them. Whatever the mechanism, the intent is the same: the test suite on a migration's PR exercises the post-migration schema against the application code already running in production, which is the exact intermediate state that exists between the migration running and any new code shipping. Backwards-incompatibility then shows up as a failing test on the migration PR itself, where it can be caught by branch protection rather than by a human reading a routine review comment.
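The compatibility check described above can be sketched in plain Ruby. This is an illustration only, not our actual tooling: it assumes the columns the deployed code depends on can be enumerated, and the helper names (missing_columns, apply_rename) and column names (display_name, full_name) are hypothetical.

```ruby
require "set"

# Hypothetical helper: columns the deployed code expects that are
# absent from the schema it would be running against.
def missing_columns(deployed_code_columns, schema_columns)
  (Set.new(deployed_code_columns) - Set.new(schema_columns)).to_a.sort
end

# Hypothetical model of a column rename: the old name disappears
# and the new name takes its place.
def apply_rename(schema_columns, from:, to:)
  schema_columns.map { |column| column == from ? to : column }
end

# Illustrative column names only; the real column is not named in this report.
schema_before = %w[id email display_name]
deployed_code_columns = %w[id email display_name]

# The intermediate state: the migration has run, the old code is still live.
schema_after = apply_rename(schema_before, from: "display_name", to: "full_name")

incompatible = missing_columns(deployed_code_columns, schema_after)
puts "Columns missing for deployed code: #{incompatible.inspect}"
```

A non-empty result is the condition that would fail the build on the migration PR.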

We're extending Danger, the automated review tool that comments on every pull request that touches a database migration, to specifically highlight rename_column alongside the other higher-risk operations it already highlights, so that these operations stand out from routine migration changes during code review.
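As a rough sketch of the detection logic such a rule could use, the plain-Ruby function below scans a migration's source text for operations to call out. It is not our actual Danger rule: the function name, the operation list, the message wording, and the sample migration are all assumptions for illustration.

```ruby
# Hypothetical list of operations to call out. Only rename_column is
# named in this report; a real list would likely be longer.
HIGH_RISK_OPERATIONS = %w[rename_column].freeze

# Returns an array of warning strings for one migration's source text.
def migration_warnings(path, source)
  warnings = []
  source.each_line.with_index(1) do |line, number|
    next if line.strip.start_with?("#") # skip comment-only lines

    HIGH_RISK_OPERATIONS.each do |operation|
      if line.match?(/\b#{Regexp.escape(operation)}\b/)
        warnings << "#{path}:#{number} uses #{operation}, which can break code that is already deployed"
      end
    end

    if line.match?(/\bsafety_assured\b/)
      warnings << "#{path}:#{number} uses safety_assured; confirm an engineer added it deliberately"
    end
  end
  warnings
end

# Illustrative migration source; the table and column names are made up.
sample = <<~MIGRATION
  class RenameUserColumn < ActiveRecord::Migration[7.1]
    def change
      safety_assured { rename_column :users, :display_name, :full_name }
    end
  end
MIGRATION

warnings = migration_warnings("db/migrate/example.rb", sample)
puts warnings
```

In a Dangerfile, a function like this could be applied to the migration files listed in git.added_files and git.modified_files, with each result passed to warn or fail depending on how strict the rule should be.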

We're updating our internal engineering guidance (AGENTS.md) to make clear that protective constructs such as safety_assured (from https://github.com/ankane/strong_migrations) are intended as engineer-acknowledged sign-offs and should not be added by code-generation tooling on an engineer's behalf.

Finally, we're reviewing how we surface and document the required pre-steps (such as adding ignored_columns) for breaking schema changes, so that the safe rollout sequence is harder to skip, particularly for changes to unsharded core tables where there is no opportunity for a phased rollout.
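To illustrate what a safe rollout sequence looks like in principle, the sketch below models a generic expand-and-contract rename as a series of steps and checks that the running code never depends on a column the schema lacks. It is not our documented procedure: the step list, the Step struct, and the column names are assumptions, and in a Rails application the "old column ignored" step would correspond to listing the column in the model's ignored_columns.

```ruby
# Each step pairs the schema after that step with the columns the
# running code reads or writes at that point. All names are illustrative.
Step = Struct.new(:description, :schema, :code_columns)

steps = [
  Step.new("starting state",
           %w[id display_name], %w[id display_name]),
  Step.new("migration adds the new column",
           %w[id display_name full_name], %w[id display_name]),
  Step.new("deploy code that writes both columns",
           %w[id display_name full_name], %w[id display_name full_name]),
  Step.new("backfill, then deploy code that uses only the new column (old column ignored)",
           %w[id display_name full_name], %w[id full_name]),
  Step.new("migration drops the old column",
           %w[id full_name], %w[id full_name])
]

# A step is safe when every column the running code needs exists.
def safe?(step)
  (step.code_columns - step.schema).empty?
end

steps.each do |step|
  puts "#{safe?(step) ? 'ok  ' : 'FAIL'} #{step.description}"
end

# For comparison: a direct rename while the old code is still running.
direct_rename = Step.new("direct rename with old code still running",
                         %w[id full_name], %w[id display_name])
puts "#{safe?(direct_rename) ? 'ok  ' : 'FAIL'} #{direct_rename.description}"
```

Every step in the phased sequence passes the check, while the direct rename fails it, which is the state production was in during this incident.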

The changes described above protect against the test-detectable class of schema/code mismatches. There remains a class of failure where a destructive migration to an unsharded core table affects every customer at once with no opportunity for a staged rollout. Our longer-term work on cell-based deployments, splitting customer data across multiple foundation databases, is what will give us the ability to roll out schema changes incrementally and detect issues against a small subset of customers before they reach the full fleet.

Posted May 08, 2026 - 06:09 UTC

Resolved

This incident has been resolved.

Posted May 01, 2026 - 03:30 UTC

Monitoring

We've corrected the issue that caused this disruption and normal service has been restored. We are monitoring the situation now.

Posted May 01, 2026 - 02:52 UTC

Identified

We've identified a service change that is causing a service disruption. We are reverting this change.

Posted May 01, 2026 - 02:41 UTC