Between 2022-10-17 20:20 and 2022-10-18 01:00 we experienced increased latency and associated timeout errors on some endpoints and processes. Builds were generally running correctly during that period, but with increased wait times in some cases.
The latency was due to database connection pool contention, which was caused by row locking triggered by an unusual pattern of customer workload.
We are in the process of rearchitecting our database connection pooling solution (based on PgBouncer), and this issue occurred a few hours after a related configuration change. We don’t believe that change caused the issue, but we have rolled back this migration phase for further investigation.
We have mitigated the scenario that led to this connection pool exhaustion by adding a concurrency limiter that serializes the process before it acquires any database locks. We are also eliminating the need for those database locks, which will allow greater concurrency and more efficient use of connections.
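To illustrate the idea behind the mitigation, here is a minimal sketch of a concurrency limiter placed in front of a lock-taking operation. It is not our production code: the names `ConcurrencyLimiter`, `slot`, and `run_demo` are hypothetical, the `time.sleep` call stands in for the work that would hold a pooled connection and a row lock, and an in-process semaphore only limits threads inside a single process. The point it shows is that callers queue at the limiter, before they hold a connection, rather than queuing on a database lock while occupying one.

```python
import threading
import time
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Caps how many callers may run a guarded section at once (hypothetical helper)."""

    def __init__(self, max_concurrent: int = 1):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self, timeout=None):
        # Wait here, holding no database resources, until a slot is free.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("no limiter slot available")
        try:
            yield
        finally:
            self._sem.release()


def run_demo(workers: int = 8, max_concurrent: int = 1) -> int:
    """Run `workers` threads through the limiter and return the peak
    number that were inside the guarded section at the same time."""
    limiter = ConcurrencyLimiter(max_concurrent)
    state_lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        with limiter.slot():
            with state_lock:
                active += 1
                peak = max(peak, active)
            # Placeholder for: take a pooled connection, lock rows, do work.
            time.sleep(0.01)
            with state_lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With `max_concurrent=1` the guarded section is fully serialized, so at most one caller at a time would hold a connection for this process, however many are waiting. Raising the limit trades some of that protection for throughput.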
We will continue to explore and implement these engineering solutions, and review other limits to ensure the resilience of our product.
We apologize for the disruption this may have caused to builds running during this time.