A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.
We are in the process of migrating our underlying compute platform from AWS Fargate to AWS EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.
Between 15:42 and 17:33, our EKS Prometheus server needed more memory than was available on the host where it was running. Autoscaling operations had increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, and the Kubernetes control plane restarted it shortly afterwards. In the interim, the metrics used for application autoscaling were unavailable, so the affected services were not triggered to scale up, which resulted in the observed delays. Prometheus exceeded the host's available memory again soon after restarting, which caused the cycle to repeat.
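The feedback loop described above can be illustrated with a small simulation. This is a simplified sketch, not a model of our actual systems: the linear memory model (a fixed base plus a per-pod cost), the function name `simulate`, and every number in it are hypothetical and chosen only to make the behaviour visible.

```python
import math


def simulate(demand_per_tick, host_memory_mb=1000.0, base_mb=400.0,
             mb_per_pod=10.0, per_pod_capacity=10, start_pods=20):
    """Toy model: metrics server memory grows with pod count (assumed linear)."""
    pods = start_pods
    backlog = 0
    history = []
    for demand in demand_per_tick:
        # Hypothetical memory requirement for tracking the current pods.
        required_mb = base_mb + mb_per_pod * pods
        # If the requirement exceeds host memory, the process is killed
        # and autoscaling metrics are unavailable for this tick.
        metrics_up = required_mb <= host_memory_mb
        if metrics_up:
            # With metrics, the autoscaler adds pods to cover the load.
            pods = max(pods, math.ceil((demand + backlog) / per_pod_capacity))
        # Without metrics the pod count stays frozen, so work queues up.
        pending = demand + backlog
        processed = min(pending, pods * per_pod_capacity)
        backlog = pending - processed
        history.append({"pods": pods, "metrics_up": metrics_up,
                        "backlog": backlog})
    return history


if __name__ == "__main__":
    for tick, state in enumerate(simulate([200, 400, 700, 900, 900]), start=1):
        print(tick, state)
```

In this sketch, scaling up to 70 pods pushes the assumed memory requirement past the host limit. From that point on the metrics stay unavailable after every restart, the pod count stops growing, and the backlog increases on each tick, which mirrors the repeating cycle described above.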
The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.
We have already made the following changes to our rollout of EKS for production workloads:
We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.