Notifications for this incident were sent in error when we published the postmortem for the incident on the 4th of July.
On 2022-06-30 between 15:21 UTC and 15:56 UTC the Web, API, REST API, Agent API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications, Email Notifications) were unavailable or in a degraded state.
Our primary RDS Postgres instance failed over. Failovers are rare but expected, and in the past our services have recovered automatically within minutes. This failover behaved differently to what we have seen in the past, and our connection pooler (PgBouncer) did not reconnect properly and required manual intervention.
We have reduced our TCP keepalive and user timeout settings so that PgBouncer will forcefully close its upstream connection and retry more quickly, reducing our mean time to recovery. This change will roll out gradually over the next week. Unfortunately, because we are unable to simulate the kind of failover we experienced in this incident, we cannot be certain it will be effective.
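To illustrate why lowering these settings shortens recovery, here is a minimal Python sketch of how TCP keepalive parameters bound the time it takes to notice a dead peer on an idle connection. It is not our PgBouncer configuration: the function names and the reduced values (30, 10, 3, and a 60 second user timeout) are hypothetical and chosen only for the arithmetic. The Linux defaults shown (7200 s idle, 75 s interval, 9 probes) are the stock kernel values.

```python
import socket


def dead_peer_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate worst-case time for keepalive probes to declare an
    idle connection dead: wait `idle` seconds before the first probe,
    then send `count` probes spaced `interval` seconds apart."""
    return idle + interval * count


def apply_keepalive(sock: socket.socket, idle: int, interval: int,
                    count: int, user_timeout_ms: int) -> None:
    """Enable keepalive on a TCP socket and apply the tuning options
    that this platform exposes (hypothetical helper for illustration)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = (
        ("TCP_KEEPIDLE", idle),               # seconds idle before first probe
        ("TCP_KEEPINTVL", interval),          # seconds between probes
        ("TCP_KEEPCNT", count),               # unanswered probes before giving up
        ("TCP_USER_TIMEOUT", user_timeout_ms) # ms that sent data may stay unacknowledged
    )
    for name, value in options:
        # Not every platform defines every option, so skip missing ones.
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


# Stock Linux defaults: more than two hours before a dead peer is noticed.
linux_default = dead_peer_detection_seconds(7200, 75, 9)  # 7875 seconds

# Hypothetical reduced values: about one minute.
reduced = dead_peer_detection_seconds(30, 10, 3)  # 60 seconds
```

Keepalive probes only cover connections that are idle. The user timeout covers the other case, where data has been sent but never acknowledged, which is why both settings were reduced together.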