On March 25, a latent bug in retry handling for Build Matrix led to an exponential increase in build job creation for one customer. As a result, dispatch latency became unacceptably high for all customers for 3 hours, and again 10 hours later for 30 minutes. We’d like to describe what happened and apologize for the disruption to our customers.
Our engineers were automatically paged at 01:53 UTC when latency on a critical queue exceeded 30 seconds. The root cause was not immediately obvious, and impact appeared low. Queue depth and latency gradually increased, and at 02:09 UTC the incident “Increased latency for job assigning” was published. Our engineers continued to search for a root cause.
By 03:15 UTC the queue latency had reached 250 seconds, with job dispatch latency heavily impacted for all customers, but the root cause remained unknown.
At 04:19 UTC a correlation was found between the observed latency and an anomalously large number of jobs in a small set of pipelines for one customer. Further investigation found that the affected pipeline contained a build matrix with failing steps and automatic retry, and a bug in retry handling was identified.
The matrix in question expanded to 8 jobs. With automatic retry configured, a failure in any of those jobs should have caused just that job to be retried. Instead, each failing job caused the entire 8-job matrix to be retried. As the retried jobs were assigned to agents and failed again, the number of jobs increased exponentially. A default limit of 2 retries (and a hard limit of 10 retries) should have applied, but was bypassed by this code path.
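The growth pattern described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual retry code: the function names are hypothetical, and it assumes the worst case where every job fails in every round. It compares the buggy behaviour (each failed job retries the whole 8-job matrix) with the intended behaviour (each failed job retries only itself, up to the default limit of 2 retries).

```python
def buggy_job_counts(matrix_size: int, rounds: int) -> list[int]:
    """Jobs created per round when every failed job retries the whole matrix.

    Hypothetical model: assumes every job fails in every round.
    """
    counts = [matrix_size]  # round 0: the initial matrix expansion
    for _ in range(rounds):
        # Each failed job from the previous round creates matrix_size new jobs.
        counts.append(counts[-1] * matrix_size)
    return counts


def intended_job_counts(matrix_size: int, retry_limit: int) -> list[int]:
    """Jobs created per round when each failed job retries only itself."""
    # The initial run plus retry_limit retries, each with matrix_size jobs.
    return [matrix_size] * (retry_limit + 1)


buggy = buggy_job_counts(matrix_size=8, rounds=4)
intended = intended_job_counts(matrix_size=8, retry_limit=2)

print(buggy, sum(buggy))        # [8, 64, 512, 4096, 32768] 37448
print(intended, sum(intended))  # [8, 8, 8] 24
```

In this toy model the job count multiplies by 8 each round, so four rounds of retries are enough to create tens of thousands of jobs, whereas the intended behaviour stays at 24 jobs in total. The real pipelines need not have followed this exact pattern.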
The problematic pipeline was mitigated by transitioning the ~40,000 jobs to an inactive status, preventing them from being dispatched.
By 04:50 UTC the queue latency was healthy and all systems were operational. Our engineers continued to monitor. This concluded the primary impact of the incident.
At 06:34 UTC a code change was prepared to prevent automatic retry for build matrices, pending deeper investigation. With the system stable, our engineers chose to continue monitoring while waiting for thorough code review, rather than rushing it into production. By 08:16 UTC the mitigation had been reviewed and deployed.
At 10:57 UTC a second code change was prepared to fix the underlying bug and re-allow automatic retries on build matrices. Systems remained operational and stable with the mitigation in place, so again we did not rush the deployment.
At 12:28 UTC our engineers were automatically paged once again when queue latency rose beyond 30 seconds. The cause was rapidly identified as another build, created before the mitigation that prevented automatic retry for build matrices was deployed, which had slowly looped into exponential growth. By 12:57 UTC the problem build had been deactivated, and engineers verified that no other problematic builds had been created prior to the mitigating code change.
By 13:40 UTC the code change prepared at 10:57 had been reviewed, deployed and verified. This fixed the underlying bug and re-enabled automatic retries for build matrices.
The following morning our team worked with the customer whose pipeline had been involved, shifting builds and jobs into suitable states to resume correct agent auto-scaling and build/job dispatch.
In a similar incident last year, we wrote “To avoid unbounded growth of jobs within a build we are implementing a limit on jobs per build based on real usage patterns, which will prevent unbounded job growth without affecting customers’ builds”. We have been analyzing and beginning work on these system limits, but have not yet implemented this one, which would have mitigated this incident. To prevent this happening again, we have boosted the priority of that limit and other limits relating to Build Matrix.
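As a rough illustration of how a jobs-per-build limit bounds this kind of growth, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Build` class, the `JobLimitExceeded` exception and the limit of 500 are hypothetical, and the actual limit is to be chosen from real usage patterns.

```python
class JobLimitExceeded(Exception):
    """Raised when adding jobs would exceed the per-build limit."""


class Build:
    """Hypothetical build that enforces a cap on its total job count."""

    def __init__(self, max_jobs: int) -> None:
        self.max_jobs = max_jobs
        self.job_count = 0

    def add_jobs(self, n: int) -> None:
        # Reject the whole batch if it would push the build over its limit.
        if self.job_count + n > self.max_jobs:
            raise JobLimitExceeded(
                f"adding {n} jobs would exceed the limit of {self.max_jobs}"
            )
        self.job_count += n


build = Build(max_jobs=500)  # 500 is an illustrative value only
pending = 8                  # initial matrix expansion
rejected = False
try:
    while True:
        build.add_jobs(pending)
        # Model the buggy behaviour: every failed job retries all 8 jobs.
        pending *= 8
except JobLimitExceeded:
    rejected = True

print(build.job_count, rejected)  # 72 True
```

Even with the retry bug still modelled, the build stops at 72 jobs here because the next batch of 512 would cross the cap. A limit like this does not fix the underlying bug, but it keeps one build from growing without bound.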
We’re sorry for the disruption this caused — our mission is to unblock developers, and this incident did just the opposite, which isn’t acceptable. We continue to invest in reliability and performance alongside powerful new features to deliver the best CI/CD product available.