Reported delays to jobs
Incident Report for Buildkite
Postmortem

This incident was due to an issue in our upstream DNS service on Amazon Web Services. AWS reported the following incident notes on the AWS status page:

11:35 PM PDT We are investigating elevated failures when querying Route 53 Resolver in a single Availability Zone (use1-az4) of the US-EAST-1 Region.
11:58 PM PDT We are seeing recovery, but are still working to confirm root cause at this time.
Apr 2, 12:13 AM PDT Between 9:39 PM and 11:36 PM PDT, customers experienced elevated DNS failures when querying Route 53 Resolver for certain hostnames in a single Availability Zone (use1-az4) in the US-EAST-1 Region. The issue has been resolved and all queries are now being answered normally.

Between 04:40 UTC and 06:40 UTC on April 2, inbound requests to some of our nodes located in the us-east-1d availability zone were affected, preventing DNS resolution between agents and the Buildkite agent API. This interrupted service to customers running Buildkite agents in us-east-1.
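The AWS notes above are in PDT, while our timestamps are in UTC. As a rough cross-check, the sketch below converts the AWS-reported impact window to UTC. It assumes PDT can be modelled as a fixed UTC-7 offset and that the window began on April 1, 2021 local time (inferred from the "Apr 2, 12:13 AM PDT" note and the posting dates on this report); the helper name `pdt_to_utc` is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT is treated as a fixed UTC-7 offset, which holds for early April.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(year, month, day, hour, minute):
    """Convert a PDT wall-clock time to the equivalent UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)


# AWS-reported impact window: 9:39 PM to 11:36 PM PDT (assumed to be April 1, 2021).
start_utc = pdt_to_utc(2021, 4, 1, 21, 39)
end_utc = pdt_to_utc(2021, 4, 1, 23, 36)

print(start_utc.strftime("%Y-%m-%d %H:%M UTC"))  # 2021-04-02 04:39 UTC
print(end_utc.strftime("%Y-%m-%d %H:%M UTC"))    # 2021-04-02 06:36 UTC
```

Under these assumptions the AWS window falls within the 04:40 to 06:40 UTC range given above, to within a few minutes at either end.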

Outbound requests to any addresses resolved by Route 53 to use1-az4 were also affected, delaying the delivery of some status notifications.

We apologise for the inconvenience and can confirm that we are already in the process of improving our availability zone coverage such that a lower percentage of requests fail due to these types of upstream issues. We expect to be operating on an additional availability zone in the very near future. We will also investigate options to better relocate affected traffic if one of our operating availability zones is degraded like this in future.

Posted Apr 02, 2021 - 08:06 UTC

Resolved
The incident affecting DNS resolution in use1-az4 has been resolved upstream by AWS.
Posted Apr 02, 2021 - 07:46 UTC
Update
We are continuing to monitor for any further issues.
Posted Apr 02, 2021 - 07:44 UTC
Monitoring
The issue affecting us-east-1d / use1-az4 appears to be resolved for our outgoing requests, and customers are reporting that DNS resolution is now working properly for the affected agent API endpoints. We will continue to monitor the situation as AWS resolves the issue: https://status.aws.amazon.com
Posted Apr 02, 2021 - 07:04 UTC
Identified
The issue has been identified as an upstream issue. AWS is currently investigating Route 53 DNS query failures in the N. Virginia region (https://status.aws.amazon.com). We are continuing to investigate ways to mitigate this issue on our end.
Posted Apr 02, 2021 - 06:46 UTC
Update
We’re seeing name resolution errors when delivering service notifications to some customer endpoints. Some customers have also reported errors resolving Buildkite endpoints, so agent operations may be degraded. We are still confirming the source of these DNS issues.
Posted Apr 02, 2021 - 05:37 UTC
Investigating
We're currently investigating customer-reported issues with dispatching jobs for pipelines.
Posted Apr 02, 2021 - 05:20 UTC
This incident affected: Notifications (GitHub Commit Status Notifications, Email Notifications, Slack Notifications, HipChat Notifications, Webhook Notifications) and Agent API, Job Queue.