This incident was a follow-on from the performance problems we had earlier in the day:
https://www.buildkitestatus.com/incidents/xvkcty9n8y5m
At around 21:50 AEST we started getting PagerDuty alerts resembling those we had already dealt with in the previous incident.
After some investigation, we discovered that our hard SQL query timeout fix hadn't been successfully rolled out to some parts of our Agent API. We rolled out the fix to the rest of the API that needed it and performed a set of restarts similar to those from earlier in the day. The Agent API was running again by 22:26 AEST.
We've since made some changes to our main transactional database to ensure that the SQL query timeout is the same across all parts of our infrastructure.
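
As an illustration of the kind of check that can catch a partial rollout like this one, here is a minimal Python sketch. It is not the actual change made to the database, and the component names, timeout values, and the `find_timeout_mismatches` helper are all hypothetical. It assumes the configured timeout of each component can be collected into a dictionary, and it treats the most common value as the expected one.

```python
from collections import Counter


def find_timeout_mismatches(timeouts_ms):
    """Return the components whose timeout differs from the most common value.

    `timeouts_ms` maps a component name to its configured SQL query timeout
    in milliseconds, or None when no timeout is configured.
    """
    if not timeouts_ms:
        return {}
    # Assumption: the value shared by most components is the intended one.
    expected, _count = Counter(timeouts_ms.values()).most_common(1)[0]
    return {
        name: value
        for name, value in timeouts_ms.items()
        if value != expected
    }


# Hypothetical component names and values, for illustration only.
configured = {
    "web": 30_000,
    "agent-api-a": 30_000,
    "agent-api-b": None,  # the timeout fix never reached this component
}

mismatches = find_timeout_mismatches(configured)
print(mismatches)
```

A check along these lines flags any component that was missed by a rollout. Setting the timeout in one place, such as at the database itself, removes the need to keep per-component settings in sync.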