On Saturday, 18th May at 4am AEST (UTC+10) we received a high database load alert. Our investigation revealed that dispatching, our process for assigning jobs to agents in batches, was being delayed by the elevated database load. The load was caused by some very particular pipeline operations triggering far more background work than was required, in a way we had not previously identified, and it delayed job dispatching system-wide.
We intervened to manage the background work, and once that work had completed, job dispatch became speedy again.
Since then, we’ve corrected the scope of the background work that caused the elevated load and have implemented some improvements to our monitoring based on the symptoms we saw.