Intermittent DNS resolution
Incident Report for Buildkite
Postmortem

Service Impact

Between 2023-06-02 00:53 and 04:12 (UTC) a very limited number of customers experienced DNS resolution failures when browsing to buildkite.com and related sub-domains (api.buildkite.com, agent.buildkite.com and graphql.buildkite.com). Most customers, including agents running in AWS and GCP, were not impacted, because the DNS changes propagated inconsistently and because resolvers that do not verify DNSSEC signatures by default were unaffected.

Incident Summary

Over several months we had been working to improve our edge capabilities, including the introduction of a Content Delivery Network (CDN) and Web Application Firewall (WAF) to provide improved speed, reliability and security. As part of this work we decided to consolidate our Domain Name System (DNS) to the same provider as our new CDN, enabling us to make the cutover incrementally and reduce the risk of future migrations.

Our preparation for the DNS cutover involved ensuring that the records in both systems matched, and testing (using manual and automated verification) that DNS records resolved correctly from the new system.
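As an illustration of the record comparison described above, here is a minimal sketch, not the tooling actually used. It assumes each zone has already been exported into a dictionary keyed by (name, type); the `diff_zones` helper and the example.com records are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# (record name, record type), e.g. ("api.example.com.", "A")
RecordKey = Tuple[str, str]
# Maps each record key to the set of values published for it.
Zone = Dict[RecordKey, FrozenSet[str]]


def diff_zones(old: Zone, new: Zone) -> Dict[str, List[RecordKey]]:
    """Report record sets that differ between the old and new provider."""
    return {
        # Present in the old zone but absent from the new one.
        "missing": sorted(k for k in old if k not in new),
        # Present only in the new zone.
        "extra": sorted(k for k in new if k not in old),
        # Present in both, but with different values.
        "changed": sorted(k for k in old if k in new and old[k] != new[k]),
    }


# Hypothetical exports from the two providers.
old_zone: Zone = {
    ("example.com.", "A"): frozenset({"192.0.2.10"}),
    ("api.example.com.", "CNAME"): frozenset({"lb.example.net."}),
}
new_zone: Zone = {
    ("example.com.", "A"): frozenset({"192.0.2.11"}),
    ("www.example.com.", "A"): frozenset({"192.0.2.10"}),
}

report = diff_zones(old_zone, new_zone)
```

A comparison like this only covers ordinary record data. It says nothing about DNSSEC state held at the registrar, which needs its own check.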

At 00:53 we migrated our name servers to a new provider by updating our registrar. All testing by the team indicated no problems.

02:17 A customer reports that buildkite.com is failing to load in a browser, but their agents are working correctly

02:45 We complete rolling back the migration

03:25 We investigate a possible failure to disable DNSSEC on our old zone as the underlying cause

03:50 A second customer reports DNS resolution issues

03:58 We manually flush NS, DS and DNSKEY records on Cloudflare and Google's public resolvers

04:05 We confirm our suspicion that the issue relates to misconfigured DNSSEC

04:12 We manually flush the cache at OpenDNS's public resolvers

04:14 We set up DNS monitoring from multiple locations and confirm the issue is resolved

04:28 We confirm that customers running with AWS VPC defaults (including the Elastic Stacks) aren’t impacted by these changes, as they disable DNSSEC signature verification
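To illustrate why DNSSEC state matters during a name server move: a validating resolver fails a lookup when the DS records at the parent zone reference keys that the answering name servers do not publish. The sketch below is a generic pre-cutover check under that assumption, not a reconstruction of what went wrong here. `key_tag` follows the key tag algorithm in RFC 4034 Appendix B; `orphaned_ds_tags` and its inputs are hypothetical names, and a matching key tag is necessary but not sufficient (the DS digest must also match).

```python
from typing import Iterable, List


def key_tag(dnskey_rdata: bytes) -> int:
    """Compute a DNSKEY key tag per RFC 4034 Appendix B.

    Applies to every algorithm except the obsolete algorithm 1 (RSA/MD5).
    """
    acc = 0
    for i, byte in enumerate(dnskey_rdata):
        # Even-indexed bytes are the high octet of each 16-bit word.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def orphaned_ds_tags(
    parent_ds_tags: Iterable[int],
    new_dnskey_rdatas: Iterable[bytes],
) -> List[int]:
    """Return DS key tags at the parent with no matching DNSKEY in the new zone.

    A non-empty result means validating resolvers would be expected to
    reject answers from the new name servers.
    """
    published = {key_tag(rdata) for rdata in new_dnskey_rdatas}
    return sorted(tag for tag in parent_ds_tags if tag not in published)


# Tiny made-up DNSKEY RDATA: flags=257, protocol=3, algorithm=8, no key bytes.
sample_rdata = bytes([0x01, 0x01, 0x03, 0x08])
```

If the new zone is unsigned (no DNSKEY records at all), every DS key tag still present at the parent is reported as orphaned.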

Changes we've made

  • We have created DNS monitoring from multiple locations and cloud providers around the world, to catch issues like this faster in the future
  • Future changes of this type will follow a standard process including:

    • Collating all steps in one document, such as:

      • Preparation
      • Pre and Post Validation
      • Execution
      • Rollback
    • Seeking out critical review from subject matter experts in the technologies involved
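As a rough sketch of what a resolver-level probe for the monitoring item above could look like (an illustration only, not the monitoring we run), the following uses just the Python standard library to send a plain DNS query to a given resolver and report the response code. The resolver addresses are the well-known public ones for the providers named in the timeline; the function names are hypothetical. Run from a single host it covers multiple resolvers rather than multiple locations, so it would need to be deployed in several regions to match the description above.

```python
import socket
import struct

# Well-known public resolver addresses.
RESOLVERS = {
    "cloudflare": "1.1.1.1",
    "google": "8.8.8.8",
    "opendns": "208.67.222.222",
}

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_rcode(response: bytes) -> str:
    """Extract the response code from the 12-byte DNS header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    (flags,) = struct.unpack("!H", response[2:4])
    rcode = flags & 0x000F
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


def check(name: str, resolver_ip: str, timeout: float = 3.0) -> str:
    """Query one resolver over UDP and return its response code as a string."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and network errors alike.
            return "NO_RESPONSE"
    return parse_rcode(response)
```

Calling `check("buildkite.com", ip)` for each address in `RESOLVERS` yields one status string per resolver. A `SERVFAIL` from a validating resolver while others return `NOERROR` would be consistent with the kind of DNSSEC problem described in this report.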

Posted Aug 03, 2023 - 22:44 UTC

Resolved
This incident has been resolved.
Posted Jun 03, 2023 - 00:05 UTC
Monitoring
We have flushed caches for both Google and Cloudflare public resolvers, however some DNS servers may have caches that are yet to expire. We are continuing to monitor the situation.
Posted Jun 02, 2023 - 04:50 UTC
Identified
The rollback completed about 90 minutes ago, however some DNS servers may have caches that are yet to expire. We have manually flushed caches for Google and Cloudflare public resolvers.
Posted Jun 02, 2023 - 04:18 UTC
Monitoring
We have rolled back the DNS change and are monitoring a report that resolving buildkite.com (possibly via IPv6) may not be working in some parts of the world.
Posted Jun 02, 2023 - 03:39 UTC
Investigating
Following a DNS change we have received a report that resolving buildkite.com may not be working in some parts of the world, possibly via IPv6. We are investigating further.
Posted Jun 02, 2023 - 02:43 UTC
This incident affected: Web.