Increased Agent API error rate and latency

Incident Report for Buildkite

Postmortem

On 29th November 2021 at 05:06 UTC, a deployment degraded the performance of the agent artifact API endpoints. This degradation led to high latency and timeout errors across Buildkite. A revert was applied and regular performance was restored by 05:19 UTC.

The deployment included a database migration that changed the performance of some key queries powering the agent artifacts API. These changes had been tested extensively, including sampling in production, but still had unexpected side effects when applied to the whole production workload. We’ll be doing more extensive testing before re-deploying this in a weekend maintenance window to minimise the risk of any further disruption.

Posted Nov 29, 2021 - 06:20 UTC

Resolved

This incident has been resolved.
Posted Nov 29, 2021 - 05:23 UTC

Monitoring

We've identified the issue as an in-progress migration on one of our database tables and have reverted the migration.
Posted Nov 29, 2021 - 05:21 UTC

Investigating

We are currently investigating this issue.
Posted Nov 29, 2021 - 05:17 UTC

This incident affected: Agent API.