A subset of customers experienced elevated latency in notification delivery.
While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The issue itself did not affect performance or availability, but it would have impaired our ability to detect such problems had they occurred.
Out of an abundance of caution, we decided to revert the migration and moved those services back to the original infrastructure on AWS Fargate.
When migrating a service to EKS, we scale down its Fargate deployment and disable automatic scaling on Fargate. This allows us to migrate back quickly by scaling Fargate up again. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work.
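The rollback steps described above can be illustrated with a minimal sketch. It is not our actual tooling: the `FargateService` model, the function names, and the choice to scale down to zero are all assumptions made for illustration, and no real AWS API is called. The idea is to snapshot the Fargate settings before migrating and to compare against that snapshot after a rollback, so that a skipped step is reported instead of going unnoticed.

```python
from dataclasses import dataclass


@dataclass
class FargateService:
    """Hypothetical in-memory model of one service's Fargate settings."""
    name: str
    desired_count: int
    autoscaling_enabled: bool


def migrate_to_eks(svc):
    """Scale Fargate down and disable autoscaling; return the prior settings."""
    snapshot = {
        "desired_count": svc.desired_count,
        "autoscaling_enabled": svc.autoscaling_enabled,
    }
    svc.desired_count = 0  # assumption: "scale down" means zero tasks here
    svc.autoscaling_enabled = False
    return snapshot


def rollback_to_fargate(svc, snapshot):
    """Restore both settings captured before the migration."""
    svc.desired_count = snapshot["desired_count"]
    svc.autoscaling_enabled = snapshot["autoscaling_enabled"]


def rollback_problems(svc, snapshot):
    """List every setting that still differs from the pre-migration snapshot."""
    problems = []
    if svc.desired_count < snapshot["desired_count"]:
        problems.append("capacity not restored")
    if svc.autoscaling_enabled != snapshot["autoscaling_enabled"]:
        problems.append("autoscaling not re-enabled")
    return problems
```

A check like `rollback_problems` would run as the last step of any rollback. In the scenario described above, where capacity is scaled back up but autoscaling is left disabled, it would report "autoscaling not re-enabled".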
We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.
Between 09:17 and 10:17 UTC, a small subset of our customers was impacted. Affected customers experienced a limited outage of notification services lasting between 35 and 58 minutes within this window. Because the migration is performed in small batches, not all customers experienced this incident.