Service unavailable
Incident Report for Buildkite
Postmortem

Service Impact

On 2022-07-04 between 07:36 UTC and 08:04 UTC the Web, API, REST API, Agent API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications) were unavailable or in a degraded state.

What happened

We are currently partitioning our large time-series tables (e.g. builds, jobs, artifacts) to improve the efficiency of caches, indexes, and VACUUM operations. Builds older than a few days are rarely accessed, so partitioning builds and their dependent data means old data can (mostly) stay on disk, leaving in-memory cache capacity for recent data.
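As an illustration of the approach (a minimal sketch, not Buildkite's actual schema — table and column names are hypothetical), Postgres declarative range partitioning of a builds table might look like:

```sql
-- Hypothetical sketch: a time-partitioned builds table.
-- Range partitioning on created_at keeps recent partitions hot in cache,
-- while older partitions can mostly stay on disk.
CREATE TABLE builds (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    state      text        NOT NULL,
    PRIMARY KEY (id, created_at)  -- the partition key must be part of the PK
) PARTITION BY RANGE (created_at);

-- One partition per month; recent partitions receive almost all reads.
CREATE TABLE builds_2022_07 PARTITION OF builds
    FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');

-- Indexes and VACUUM now operate per-partition, so they stay small.
```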

In preparation for table partitioning, a schema migration was applied at 07:34 UTC which caused performance degradation for less than 1% of requests. The migration was manually rolled back at 07:36 UTC. However, the rollback migration required an ACCESS EXCLUSIVE lock on a high-throughput table, which caused the transaction to wait for in-progress statements on that table to finish. Those in-progress statements were themselves slowed by the performance degradation introduced by the original migration.

While the rollback migration was waiting to run, subsequently enqueued transactions were unable to execute. This caused HTTP requests to block for 60 seconds until they received a statement timeout. These blocked requests consumed database connection pool capacity across our application servers, causing errors for any request that required a database connection. Manual remediation was performed to stop incoming traffic and cancel the slow queries blocking the migration. At 07:55 UTC the rollback migration was successfully applied, the application was resumed, and service was restored by 08:04 UTC.

How we’re becoming more resilient to issues like this

The rollback migration that required an ACCESS EXCLUSIVE lock on a critical, high-throughput table was able to block other queries for an extended period of time. In the future we will set the lock_timeout and statement_timeout parameters so that a migration cannot block other transactions for more than 10 seconds. We already use these parameters for automated migrations run via CI, and we'll now use them for manually initiated rollbacks as well.
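The safeguard above can be sketched as follows (a hypothetical example — the table and DDL statement are placeholders, not the actual migration):

```sql
-- Bound how long a migration may wait for, or hold, locks before failing fast.
BEGIN;
SET LOCAL lock_timeout = '10s';       -- abort if the ACCESS EXCLUSIVE lock
                                      -- cannot be acquired within 10 seconds
SET LOCAL statement_timeout = '10s';  -- abort any single statement that runs
                                      -- longer than 10 seconds
ALTER TABLE builds DROP COLUMN example_column;  -- placeholder DDL
COMMIT;
```

With SET LOCAL, both timeouts apply only to this transaction: if the lock cannot be acquired quickly, the migration fails and can be retried, instead of queueing behind slow statements and blocking all subsequent transactions on the table.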

We've implemented a new capability in our feature switch system that allows us to change the behaviour of a percentage of requests. In the future, where applicable, we will use this system to migrate our application across to new database schema formats without needing to take a database schema lock. This will allow us to roll back without waiting on an additional table lock. It also allows us to build confidence with a smaller percentage of our traffic before rolling out changes more widely.
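A percentage-based feature switch of this kind can be sketched as below (a minimal illustration assuming a stable hash of a request identifier; the function and switch names are hypothetical, not Buildkite's implementation):

```python
import hashlib

def switch_enabled(switch_name: str, request_id: str, percentage: float) -> bool:
    """Return True for roughly `percentage`% of request IDs, deterministically.

    Hashing the switch name together with the request ID gives each switch an
    independent, stable bucketing, so a given request sees consistent behaviour
    as the rollout percentage is ramped up or down.
    """
    digest = hashlib.sha256(f"{switch_name}:{request_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in [0, 10000)
    return bucket < percentage * 100      # e.g. 25.0% -> buckets 0..2499

# At 0% no request is switched over; at 100% every request is.
assert not switch_enabled("new-builds-schema", "req-123", 0.0)
assert switch_enabled("new-builds-schema", "req-123", 100.0)
```

Because the check is deterministic and requires no schema lock, ramping the percentage up (or back down to roll back) is a plain configuration change rather than a blocking DDL operation.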

Posted Jul 18, 2022 - 10:31 UTC

Resolved
This incident has been resolved.
Posted Jul 04, 2022 - 08:16 UTC
Monitoring
The service is fully available and we are continuing to investigate and monitor.
Posted Jul 04, 2022 - 08:04 UTC
Update
We are continuing to investigate this issue.
Posted Jul 04, 2022 - 07:44 UTC
Investigating
Our engineers are working to resolve a database issue.
Posted Jul 04, 2022 - 07:43 UTC
This incident affected: Web, Agent API, REST API, Job Queue, SCM Integrations, SCM Providers (GitHub, GitHub API Requests, Atlassian Bitbucket SSH, Atlassian Bitbucket Website and API, Atlassian Bitbucket Git via HTTPS), and Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, Webhook Notifications).