This incident was caused by an issue with our upstream DNS service on Amazon Web Services (AWS). AWS posted the following incident notes on its status page:
11:35 PM PDT We are investigating elevated failures when querying Route 53 Resolver in a single Availability Zone (use1-az4) of the US-EAST-1 Region.
11:58 PM PDT We are seeing recovery, but are still working to confirm root cause at this time.
Apr 2, 12:13 AM PDT Between 9:39 PM and 11:36 PM PDT, customers experienced elevated DNS failures when querying Route 53 Resolver for certain hostnames in a single Availability Zone (use1-az4) in the US-EAST-1 Region. The issue has been resolved and all queries are now being answered normally.
Between 04:40 UTC and 06:40 UTC on April 2, inbound requests to some of our nodes located in the us-east-1d Availability Zone were affected, preventing DNS resolution between agents and the Buildkite agent API. This interrupted service to customers running Buildkite agents in
Outbound requests to any addresses that Route 53 resolved to use1-az4 were also affected, delaying some status notifications.
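As a cross-check, the impact window AWS reported in PDT can be converted to UTC and compared with the window we observed. The following is a minimal sketch in Python, not part of our tooling. It assumes the AWS window fell on the evening of April 1 PDT (inferred from the final note being posted at 12:13 AM on Apr 2), and it uses a placeholder year because the notes do not state one. The helper name `pdt_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# PDT is a fixed UTC-7 offset, so no time zone database is needed.
PDT = timezone(timedelta(hours=-7), "PDT")

# Assumption: placeholder year only; the incident notes give no year.
YEAR = 2000


def pdt_to_utc(month, day, hour, minute):
    """Convert a PDT wall-clock time to an aware UTC datetime."""
    local = datetime(YEAR, month, day, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)


# AWS reported impact between 9:39 PM and 11:36 PM PDT.
# Assumption: that evening was April 1 PDT.
start_utc = pdt_to_utc(4, 1, 21, 39)
end_utc = pdt_to_utc(4, 1, 23, 36)

print(start_utc.strftime("%m-%d %H:%M UTC"))
print(end_utc.strftime("%m-%d %H:%M UTC"))
```

Under those assumptions the AWS window converts to 04:39 to 06:36 UTC on April 2, which lines up with the roughly 04:40 to 06:40 UTC window we saw.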
We apologise for the inconvenience. We are already improving our availability zone coverage so that a smaller percentage of requests fail during upstream issues of this kind, and we expect to be operating in an additional availability zone in the very near future. We will also investigate options for rerouting affected traffic if one of the availability zones we operate in is degraded like this in future.