Investigating high API error rates.
Incident Report for Buildkite
Postmortem

Between 04:35 and 04:49 UTC on 2021-12-13, Buildkite's Web, REST API, GraphQL API, Agent API and Job Dispatch suffered severely degraded performance. API routes that changed the state of builds were particularly impacted. A database migration applied at 04:35 UTC caused our application to throw errors when attempting to update records. These errors occurred because database columns were deleted before they had been marked as ignored in the corresponding application model. We restarted the application, which refreshed the application model and restored our services to normal operation. We subsequently discovered that some builds had not been processed. In response, we re-processed builds that were in progress around the time of the incident. In the future, we'll be implementing additional internal reviews of database migration changes to prevent a recurrence. We're incredibly sorry for the disruption this outage caused.
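
The failure mode described above can be illustrated with a minimal sketch. It assumes the application caches each table's column list at start-up, which is an assumption for illustration and not a description of Buildkite's code. The `CachedModel` class, the `drop_column` helper, the `builds` table and its column names are all hypothetical, and SQLite stands in for the production database.

```python
import sqlite3


class CachedModel:
    """Hypothetical model that caches a table's column list at start-up."""

    def __init__(self, conn, table, ignored_columns=()):
        self.conn = conn
        self.table = table
        self.ignored = set(ignored_columns)
        self.reload_schema()

    def reload_schema(self):
        # PRAGMA table_info yields one row per column; index 1 is the name.
        rows = self.conn.execute(f"PRAGMA table_info({self.table})").fetchall()
        self.columns = [
            r[1] for r in rows if r[1] != "id" and r[1] not in self.ignored
        ]

    def update(self, row_id, **changes):
        # Read-modify-write over the cached columns. A stale cache makes
        # the SQL reference columns that no longer exist.
        cols = ", ".join(self.columns)
        current = self.conn.execute(
            f"SELECT {cols} FROM {self.table} WHERE id = ?", (row_id,)
        ).fetchall()[0]
        merged = dict(zip(self.columns, current))
        merged.update(changes)
        assignments = ", ".join(f"{c} = ?" for c in self.columns)
        self.conn.execute(
            f"UPDATE {self.table} SET {assignments} WHERE id = ?",
            (*[merged[c] for c in self.columns], row_id),
        )


def drop_column(conn, table, column):
    """Drop a column by rebuilding the table (also works on older SQLite)."""
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    keep = ", ".join(r[1] for r in info if r[1] != column)
    conn.execute(f"CREATE TABLE {table}_new AS SELECT {keep} FROM {table}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {table}_new RENAME TO {table}")


def current_state(conn):
    return conn.execute("SELECT state FROM builds WHERE id = 1").fetchall()[0][0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE builds (id INTEGER PRIMARY KEY, state TEXT, legacy_flag TEXT)"
    )
    conn.execute("INSERT INTO builds (id, state, legacy_flag) VALUES (1, 'running', 'x')")
    results = {}

    # Unsafe order: the column is dropped while a running process still caches it.
    stale = CachedModel(conn, "builds")
    drop_column(conn, "builds", "legacy_flag")
    try:
        stale.update(1, state="passed")
        results["stale_update"] = "ok"
    except sqlite3.OperationalError:
        results["stale_update"] = "error"

    # A restart corresponds to reloading the cached schema.
    stale.reload_schema()
    stale.update(1, state="passed")
    results["after_reload"] = current_state(conn)

    # Safer order: ignore the column in the model first, then drop it.
    conn.execute("ALTER TABLE builds ADD COLUMN other_flag TEXT")
    safe = CachedModel(conn, "builds", ignored_columns=["other_flag"])
    drop_column(conn, "builds", "other_flag")
    safe.update(1, state="failed")
    results["safe_update"] = current_state(conn)
    return results


if __name__ == "__main__":
    print(demo())
```

In this sketch, the update issued with the stale column cache fails, the same update succeeds once the schema is reloaded, and the update against a model that already ignores the column is unaffected by the drop.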

Posted Dec 13, 2021 - 10:39 UTC

Resolved
This incident has been resolved.
Posted Dec 13, 2021 - 05:31 UTC
Monitoring
We have reprocessed builds that were in progress at the start of the incident. All systems should be operational.
Posted Dec 13, 2021 - 05:18 UTC
Identified
We've rolled back the deployment. We're able to process new builds and are investigating the state of in-progress builds to make sure they are being processed.
Posted Dec 13, 2021 - 04:55 UTC
Investigating
We're currently investigating a deployment that is causing high error rates and attempting to roll it back.
Posted Dec 13, 2021 - 04:44 UTC
This incident affected: Web, Agent API, REST API, Job Queue, and SCM Integrations.