This incident was a follow-on from the two previous, related incidents.
At around 04:26 AEST, we were alerted that some agents weren't being assigned work. The rest of the Buildkite engineering team was immediately notified.
We discovered that the SQL query timeout fix we had applied earlier in the day had some unforeseen consequences.
The background task responsible for expiring jobs that have been in a particular state for too long (for example, a job that isn't accepted by an agent within 30 seconds of being assigned is expired) was not running correctly. Under load, the SQL queries it produced exceeded the 60 second timeout and were cancelled.
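The expiry rule described above can be sketched in plain Ruby. This is a hypothetical illustration, not Buildkite's actual code: the `Job` struct, the `expire_stale_jobs` helper, and the `ACCEPT_TIMEOUT` constant are all invented names, and the real task does this work in SQL against the jobs table.

```ruby
require "time"

# Hypothetical: seconds an assigned job may wait before being expired,
# matching the 30-second rule described in the text.
ACCEPT_TIMEOUT = 30

Job = Struct.new(:id, :state, :assigned_at)

# Expire any job that was assigned to an agent but not accepted
# within ACCEPT_TIMEOUT seconds.
def expire_stale_jobs(jobs, now: Time.now)
  jobs.each do |job|
    next unless job.state == :assigned
    next unless now - job.assigned_at > ACCEPT_TIMEOUT
    job.state = :expired
  end
end
```

In production this runs as a scheduled Sidekiq task over the database rather than over in-memory structs, which is why slow queries under load could push it past the statement timeout.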
This task runs every minute in Sidekiq, and Sidekiq's default behaviour is to retry jobs that raise exceptions right away. As a result, multiple instances of this task were running at the same time, all trying to lock and expire the same jobs. This caused major performance problems and impacted job dispatch. An auto-vacuum was also running at the time, which made the queries take even longer than usual.
We rolled out a fix to correct the retry behaviour, killed any long-running SQL queries, and things returned to normal at 06:11 AEST.
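One common way to correct this kind of retry behaviour in Sidekiq is to turn off automatic retries for the worker, so a run that times out isn't re-enqueued alongside the next scheduled run. This is a sketch under that assumption; the worker name is hypothetical and this is not necessarily the exact fix Buildkite shipped.

```ruby
# Hypothetical worker name; illustrative only.
class ExpireJobsWorker
  include Sidekiq::Worker

  # Disable Sidekiq's automatic retries: since this task is scheduled
  # every minute anyway, a failed run can simply wait for the next tick
  # instead of stacking a retry on top of it.
  sidekiq_options retry: false

  def perform
    # ... expire stale jobs ...
  end
end
```

For tasks like this, a scheduled cadence plus `retry: false` avoids the pile-up of concurrent runs contending for the same rows.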
At this stage we're still working on rolling out our new job partitions to reduce the performance impact of auto-vacuums, and on refactoring the background task to run more efficiently.