This incident was a follow-on from the two previous, related incidents.
At around 04:26 AEST, we were alerted that some agents weren't being assigned work. The rest of the Buildkite engineering team was immediately notified.
We discovered that the SQL query timeout fix we had applied earlier in the day had some unforeseen consequences.
The background task responsible for expiring jobs that have been in a particular state for too long (for example, a job that isn't accepted by an agent within 30 seconds of being assigned is expired) was not running correctly. Under load, the SQL queries it produced exceeded the 60 second timeout and were cancelled.
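The expiry rule described above can be sketched in plain Ruby. This is a hypothetical illustration, not Buildkite's actual code: the `Job` struct, the `expire_stale_jobs` helper, and the `ACCEPT_TIMEOUT` constant are all invented names, and the real task does this work in SQL against the jobs table.

```ruby
require "time"

# Hypothetical: seconds an assigned job may wait before being expired,
# matching the 30-second rule described in the text.
ACCEPT_TIMEOUT = 30

Job = Struct.new(:id, :state, :assigned_at)

# Expire any job that was assigned to an agent but not accepted
# within ACCEPT_TIMEOUT seconds.
def expire_stale_jobs(jobs, now: Time.now)
  jobs.each do |job|
    next unless job.state == :assigned
    next unless now - job.assigned_at > ACCEPT_TIMEOUT
    job.state = :expired
  end
end
```

In production this runs as a scheduled Sidekiq task over the database rather than over in-memory structs, which is why slow queries under load could push it past the statement timeout.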
This task runs every minute in Sidekiq, and Sidekiq's default behaviour is to retry jobs that raise exceptions right away. As a result, multiple instances of this task were running at the same time, all trying to lock and expire the same jobs. This caused major performance problems and impacted job dispatch. An auto-vacuum was also running at the time, which made the queries take even longer than usual.
We rolled out a fix to correct the retry behaviour, killed any long-running SQL queries, and things returned to normal at 06:11 AEST.
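One common way to correct this kind of retry behaviour in Sidekiq is to turn off automatic retries for the worker, so a run that times out isn't re-enqueued alongside the next scheduled run. This is a sketch under that assumption; the worker name is hypothetical and this is not necessarily the exact fix Buildkite shipped.

```ruby
# Hypothetical worker name; illustrative only.
class ExpireJobsWorker
  include Sidekiq::Worker

  # Disable Sidekiq's automatic retries: since this task is scheduled
  # every minute anyway, a failed run can simply wait for the next tick
  # instead of stacking a retry on top of it.
  sidekiq_options retry: false

  def perform
    # ... expire stale jobs ...
  end
end
```

For tasks like this, a scheduled cadence plus `retry: false` avoids the pile-up of concurrent runs contending for the same rows.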
At this stage we're still working on rolling out our new job partitions to reduce the performance impact of auto-vacuums, and on refactoring the background task to run more efficiently.