Elevated error rates on the Agent API

Incident Report for Buildkite

Postmortem

This incident was a follow-on from the performance problems we had earlier in the day:

https://www.buildkitestatus.com/incidents/xvkcty9n8y5m

We started getting PagerDuty alerts at around 21:50 AEST that resembled those we had already dealt with in the previous incident.

After some investigation, we discovered that our hard SQL query timeout fix hadn't been successfully rolled out to some parts of our Agent API. We rolled out the fix to the remaining parts of the API that needed it, performed a set of restarts similar to those from earlier in the day, and the Agent API was running again by 22:26 AEST.

We've since made some changes to our main transactional database to ensure that the SQL query timeout is the same across all parts of our infrastructure.

Posted Jun 03, 2018 - 07:18 UTC

Resolved

Everything is operating smoothly, and the issue has been resolved. We'll continue to monitor performance, and roll out further fixes to ensure this same problem doesn't reoccur.

Posted May 31, 2018 - 12:26 UTC

Monitoring

We've identified the issue and have rolled out a fix. The API response rates are now back to normal, and we'll continue to monitor performance and stability.

Posted May 31, 2018 - 12:03 UTC

Investigating

The Agent API is returning a large number of error responses, and we're currently investigating the issue.

Posted May 31, 2018 - 11:56 UTC

This incident affected: Agent API.