Elevated Error Rates on Agent API
Incident Report for Buildkite
Postmortem

On Monday, 3rd September at 01:37 UTC, the Buildkite Agent API experienced elevated error rates. This was due to high load on our backend RDS database, caused by data migrations and a vacuum process running on top of already high load. The migration and vacuum processes were cancelled and service was restored by 01:39 UTC. Agent retry behaviour should have handled these failures without data loss. Some build pipeline uploads may have failed and required a retry.
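For readers who call the Agent API from their own tooling, the kind of retry behaviour referred to above can be sketched roughly as follows. This is an illustrative example only and not the Buildkite agent's actual implementation: the `TransientAPIError` class, the `call_with_retry` helper, and all delay values are assumptions made for the sketch.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical error type standing in for a retryable API failure."""


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying transient failures with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientAPIError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the caller.
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter avoids many clients retrying at the same instant.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

A short error window like the one in this incident would typically be absorbed by the first one or two retries in a scheme like this, provided the request being retried is safe to repeat.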

The vacuums have been rescheduled to run on weekends during quiet periods, and the data migrations have been modified to pause more often, leaving room for normal operations.
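As a rough illustration of what a migration that pauses between units of work can look like, here is a minimal sketch. It is not Buildkite's migration code: the function names, the batch size, and the pause length are all hypothetical, and `migrate_batch` stands in for whatever database work a real migration would do.

```python
import time
from typing import Callable, Iterator, Sequence


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def run_throttled_migration(
    ids: Sequence[int],
    migrate_batch: Callable[[Sequence[int]], None],
    batch_size: int = 1000,
    pause_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Migrate rows in small batches, pausing after each batch.

    Returns the number of batches processed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = 0
    for batch in chunked(ids, batch_size):
        # Hypothetical callback that performs the real work for one batch.
        migrate_batch(batch)
        batches += 1
        # Pause after every batch so regular traffic gets database time.
        sleep(pause_seconds)
    return batches
```

Smaller batches and longer pauses make a migration take longer overall, but reduce the chance that it competes with regular traffic for database capacity.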

Posted Sep 03, 2018 - 06:39 UTC

Resolved
This incident has been resolved.
Posted Sep 03, 2018 - 03:38 UTC
Monitoring
Agent API error rates have returned to normal.
Posted Sep 03, 2018 - 02:06 UTC
Investigating
We're seeing elevated error rates on the Agent API and are investigating.
Posted Sep 03, 2018 - 01:55 UTC
This incident affected: Agent API.