Elevated latency and timeouts
Incident Report for Buildkite
Postmortem

Between 2022-10-17 20:20 and 2022-10-18 01:00 UTC we experienced increased latency and associated timeout errors on some endpoints and processes. Builds were generally running correctly during that period, but with increased wait times in some cases.

The latency was due to database connection pool contention, which was caused by row locking triggered by an unusual pattern of customer workload.
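
To illustrate the failure mode, here is a minimal sketch (hypothetical code and numbers, not our production system): when many workers each check out a pooled connection and then block waiting on the same row lock, the pool drains and unrelated requests time out waiting for a connection. A semaphore stands in for the connection pool and a lock stands in for the contended row.

    # Hypothetical illustration only: a BoundedSemaphore stands in for the
    # connection pool and a Lock stands in for a contended database row.
    import threading
    import time

    POOL_SIZE = 5
    connection_pool = threading.BoundedSemaphore(POOL_SIZE)  # pooled DB connections
    contended_row = threading.Lock()                          # one hot row

    def locking_worker():
        # Each worker checks out a connection, then blocks on the row lock.
        # Blocked workers keep their connections, so the pool drains.
        with connection_pool:
            with contended_row:
                time.sleep(0.2)  # work performed while the row lock is held

    def unrelated_request():
        # Any other endpoint needing a connection now waits behind the pile-up.
        if connection_pool.acquire(timeout=0.2):
            connection_pool.release()
            print("got a connection")
        else:
            print("timed out waiting for a connection")

    workers = [threading.Thread(target=locking_worker) for _ in range(10)]
    for w in workers:
        w.start()
    time.sleep(0.1)
    unrelated_request()  # likely times out: every pooled connection is held
    for w in workers:
        w.join()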

We are in the process of rearchitecting our database connection pooling solution (based on PgBouncer), and this issue occurred a few hours after a related configuration change. We don’t believe this change to be the cause of the issue; however, we have rolled back this phase of the migration for further investigation.

The scenario that led to this connection pool exhaustion has been mitigated with a concurrency limiter that serializes the workload before it acquires any database locks. We are also working to eliminate the need for those database locks entirely, enabling greater concurrency and more efficient use of connections.
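
As a rough sketch of the mitigation (again hypothetical, continuing the illustration above): the limiter admits the contended workload one at a time before any connection is checked out, so only one worker holds a connection while waiting on the hot row and the rest of the pool stays available for other requests.

    # Hypothetical illustration only: the limiter serializes the workload
    # before any connection is checked out or row lock is taken.
    import threading
    import time

    POOL_SIZE = 5
    connection_pool = threading.BoundedSemaphore(POOL_SIZE)
    contended_row = threading.Lock()
    limiter = threading.BoundedSemaphore(1)  # the concurrency limiter

    def locking_worker():
        with limiter:                 # queue here, holding no connection
            with connection_pool:     # check out a connection only when admitted
                with contended_row:   # the row lock is now effectively uncontended
                    time.sleep(0.2)

    def unrelated_request():
        # Other endpoints can still get a connection promptly.
        if connection_pool.acquire(timeout=0.2):
            connection_pool.release()
            print("got a connection")
        else:
            print("timed out waiting for a connection")

    workers = [threading.Thread(target=locking_worker) for _ in range(10)]
    for w in workers:
        w.start()
    time.sleep(0.1)
    unrelated_request()  # succeeds: at most one pooled connection is tied up
    for w in workers:
        w.join()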

We will continue to explore and implement these engineering solutions, and review other limits to ensure the resilience of our product.

We apologize for the disruption this may have caused to builds running during this time.

Posted Oct 19, 2022 - 06:50 UTC

Resolved
System performance remains at acceptable levels. We’ll continue to monitor system performance, investigate the underlying cause of the incident, and share our findings in a post-incident review.
Posted Oct 18, 2022 - 00:43 UTC
Update
System performance continues at acceptable levels. Job dispatch latency remains stable but at elevated levels. The underlying cause is still being investigated.
Posted Oct 18, 2022 - 00:27 UTC
Monitoring
System performance has returned to acceptable levels. Job dispatch latency remains stable but at elevated levels. Our incident team continues to investigate the underlying cause.
Posted Oct 18, 2022 - 00:00 UTC
Update
Job dispatch times remain stable and overall throughput of the system is healthy. The database load is trending towards expected levels, as we continue to investigate the root cause.
Posted Oct 17, 2022 - 23:22 UTC
Update
Job dispatch times remain stable and overall throughput of the system is healthy. We continue to investigate the cause of high database load.
Posted Oct 17, 2022 - 22:59 UTC
Update
We continue to see a reduction in database load, with a corresponding improvement in agent job dispatch latency.
Posted Oct 17, 2022 - 22:26 UTC
Update
We continue to see a reduction in database load and are working to further reduce contention.
Posted Oct 17, 2022 - 22:05 UTC
Identified
A database was under higher-than-normal load, and we are working to reduce contention within our fleet.
Posted Oct 17, 2022 - 21:44 UTC
Update
We are continuing to investigate the issue.
Posted Oct 17, 2022 - 21:19 UTC
Update
We are continuing to investigate the issue.
Posted Oct 17, 2022 - 21:04 UTC
Investigating
We are currently investigating an issue causing elevated latency and error rates.
Posted Oct 17, 2022 - 20:49 UTC
This incident affected: Web and Agent API.