Service Impacted

Incident Report for Buildkite

Postmortem

Notifications for this incident were sent in error when we published the postmortem for the incident on the 4th of July.

Service Impact

On 2022-06-30 between 15:21 UTC and 15:56 UTC the Web, API, REST API, Agent API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications) were unavailable or in a degraded state.

What happened

Our primary RDS Postgres instance failed over. Failovers are rare but expected, and in the past our services have recovered automatically within minutes. This failover behaved differently to what we have seen in the past, and our connection pooler (PgBouncer) did not reconnect properly and required manual intervention.

How we’re becoming more resilient to issues like this

We have reduced our TCP keepalive and user timeout settings so that PgBouncer will forcefully close its upstream connection and retry more quickly, reducing our mean time to recovery. This change will roll out gradually over the next week. Unfortunately, because we are unable to simulate the kind of failover we experienced in this incident, we cannot be certain it will be effective.
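
To illustrate why lowering keepalive settings shortens recovery, here is a minimal sketch of how long a Linux host can take to notice a dead idle connection: roughly the idle time before the first probe, plus the probe interval multiplied by the number of unanswered probes. The `KeepaliveSettings` class, the function name and the "tuned" values are hypothetical and chosen only for illustration; this report does not state the values actually deployed. The defaults shown are the stock Linux kernel values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepaliveSettings:
    """TCP keepalive parameters, all in seconds except `probes` (a count)."""
    idle_s: int      # idle time before the first keepalive probe is sent
    interval_s: int  # time between successive unanswered probes
    probes: int      # unanswered probes before the connection is dropped


def worst_case_detection_s(settings: KeepaliveSettings) -> int:
    """Approximate time for an idle, silently dead connection to be closed."""
    return settings.idle_s + settings.interval_s * settings.probes


# Stock Linux kernel defaults (tcp_keepalive_time, _intvl, _probes).
LINUX_DEFAULTS = KeepaliveSettings(idle_s=7200, interval_s=75, probes=9)

# Hypothetical reduced values, NOT the values used in this incident's fix.
HYPOTHETICAL_TUNED = KeepaliveSettings(idle_s=30, interval_s=10, probes=3)

if __name__ == "__main__":
    for name, s in [("defaults", LINUX_DEFAULTS), ("tuned", HYPOTHETICAL_TUNED)]:
        print(f"{name}: ~{worst_case_detection_s(s)} s to detect a dead peer")
```

Keepalive probes only cover idle connections. The TCP user timeout complements them by bounding how long sent data may remain unacknowledged before the connection is closed. In PgBouncer the corresponding settings are `tcp_keepidle`, `tcp_keepintvl`, `tcp_keepcnt` and `tcp_user_timeout`, though which of these were changed is not stated here.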

Posted Jul 18, 2022 - 10:28 UTC

Resolved

Our performance is back to normal and this incident has been resolved. We'll continue investigating this issue.

Posted Jun 30, 2022 - 16:12 UTC

Monitoring

We identified the issue and manually intervened. API latency is returning to normal.

Posted Jun 30, 2022 - 15:57 UTC

Update

We are continuing to investigate this issue.

Posted Jun 30, 2022 - 15:49 UTC

Investigating

We are currently investigating this issue.

Posted Jun 30, 2022 - 15:32 UTC

This incident affected: Web and Agent API.