Elevated error rates on the Agent API
Incident Report for Buildkite

This incident was a follow-on from the performance problems we had earlier in the day.


We started getting PagerDuty alerts at around 21:50 AEST that resembled the ones we had already dealt with in the previous incident.

After some investigation, we discovered that our hard SQL query timeout fix hadn't been successfully rolled out to some parts of our Agent API. We rolled the fix out to the remaining parts of the API that needed it and performed a set of restarts similar to those from earlier in the day. The Agent API was running again by 22:26 AEST.

We've since made some changes to our main transactional database to ensure that the SQL query timeout is the same across all parts of our infrastructure.

Posted Jun 03, 2018 - 17:18 AEST

Everything is operating smoothly, and the issue has been resolved. We'll continue to monitor performance and roll out further fixes to ensure the same problem doesn't recur.
Posted May 31, 2018 - 22:26 AEST
We've identified the issue and have rolled out a fix. The API response rates are now back to normal, and we'll continue to monitor performance and stability.
Posted May 31, 2018 - 22:03 AEST
We are experiencing a large number of error responses being returned from the Agent API, and we're currently investigating the issue.
Posted May 31, 2018 - 21:56 AEST
This incident affected: Agent API.