Elevated Error Rates on Agent API
Incident Report for Buildkite
Postmortem

On Monday, 3rd September at 01:37 UTC the Buildkite Agent API experienced elevated error rates. This was caused by a combination of high load on our backend RDS database, data migrations, and a vacuum process. The migration and vacuum processes were cancelled and service was restored by 01:39 UTC. Agent retry behaviour should have handled these failures without data loss. Some build pipeline uploads may have failed and required a retry.

The vacuums have been rescheduled to run on weekends during quiet periods, and the data migrations have been modified to pause more, leaving room for normal operations.
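
For readers unfamiliar with the technique, the following is a minimal sketch of what pausing during a batched data migration can look like. It is an illustration under stated assumptions and not Buildkite's actual migration code: the function names, the batch size, and the pause length are all hypothetical, and `migrate_batch` stands in for whatever work a real migration does per batch.

```python
import time
from typing import Callable, Iterable, Sequence


def chunked(ids: Sequence[int], batch_size: int) -> Iterable[Sequence[int]]:
    """Yield consecutive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def run_throttled_migration(
    ids: Sequence[int],
    migrate_batch: Callable[[Sequence[int]], None],
    batch_size: int = 1000,        # hypothetical value
    pause_seconds: float = 0.5,    # hypothetical value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Migrate ids in small batches, pausing after every batch.

    Returns the number of batches processed.
    """
    batches = 0
    for batch in chunked(ids, batch_size):
        migrate_batch(batch)   # hypothetical callback: one short unit of work
        batches += 1
        sleep(pause_seconds)   # give regular queries room before the next batch
    return batches
```

Smaller batches and longer pauses make the migration take longer overall, but they reduce the chance that it competes with normal traffic for database capacity.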

Posted Sep 03, 2018 - 16:39 AEST

Resolved
This incident has been resolved.
Posted Sep 03, 2018 - 13:38 AEST
Monitoring
Agent API error rates have returned to normal.
Posted Sep 03, 2018 - 12:06 AEST
Investigating
We're seeing elevated error rates on the Agent API and are investigating.
Posted Sep 03, 2018 - 11:55 AEST
This incident affected: Agent API.