Increased API error rate and latency
Incident Report for Buildkite
Postmortem

We observed degraded performance on the Agent API on Nov 16, 2021, from 20:20 UTC to 21:35 UTC, caused by high database load. The load came from contention between normal operations (Job Dispatch, API, Dashboard UI) and a background notification worker that concurrently accesses the jobs and job event tables to generate GitHub commit status notifications. To reduce the database load, we temporarily reduced the capacity of the background notification worker queue, which limited the number of workers concurrently accessing the contended tables. This reduced the observed latency on the Agent API. We are implementing application-level concurrency limiting on the background worker to prevent this contention from recurring at the database layer. Unfortunately, the temporary reduction in notification worker queue capacity later resulted in increased latency for service notifications; see https://www.buildkitestatus.com/incidents/q4104hnbyvsx.
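
As an illustration of what application-level concurrency limiting can look like, here is a minimal sketch in Python. It is not the implementation used in production. The names `ConcurrencyLimiter`, `send_commit_status`, `process_notifications`, and `MAX_CONCURRENT_DB_JOBS` are hypothetical, and the limit of 2 is an arbitrary example value. The sketch also assumes all workers run as threads in one process; workers spread across processes or hosts would need a shared limiter instead of an in-memory semaphore.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on how many workers may touch the contended tables at once.
MAX_CONCURRENT_DB_JOBS = 2


class ConcurrencyLimiter:
    """Allow at most `limit` callers inside the guarded section at a time."""

    def __init__(self, limit: int) -> None:
        self._semaphore = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, kept for verification

    def run(self, fn, *args):
        # Block here until one of the `limit` slots is free.
        with self._semaphore:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1


def send_commit_status(job_id: int) -> int:
    # Stand-in for the queries against the jobs and job event tables.
    time.sleep(0.01)
    return job_id


def process_notifications(job_ids, limiter, pool_size=8):
    # The pool may hold many workers, but the limiter caps how many of them
    # run the database-heavy section simultaneously.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(limiter.run, send_commit_status, j) for j in job_ids]
        return [f.result() for f in futures]
```

With this approach the worker pool can keep its full capacity for other work while only the contended section is throttled, which avoids shrinking the whole queue to protect the database.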

Posted Nov 17, 2021 - 10:22 UTC

Resolved
This incident has been resolved.
Posted Nov 16, 2021 - 21:55 UTC
Monitoring
The issue has been identified and we are monitoring it.
Posted Nov 16, 2021 - 21:36 UTC
Update
We are continuing to investigate this issue.
Posted Nov 16, 2021 - 21:14 UTC
Investigating
We are currently investigating this issue.
Posted Nov 16, 2021 - 20:43 UTC
This incident affected: Agent API.