Agent API degraded
Incident Report for Buildkite
Postmortem

Service Impact

Performance of the Agent API and Buildkite web UI was degraded between 2023-11-21 19:19 and 21:11 UTC, then again between 22:22 and 22:50 UTC.

Elevated error rates and request latency meant that customers were unable to run many of their builds.

Incident Summary

When investigating the degraded performance, we identified a number of customer builds caught in runaway loops, each writing agent metadata repeatedly. This spike in write activity put high load on our primary pipelines database and degraded the performance of all other Buildkite agent interactions.
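
To illustrate the failure mode, the sketch below shows a build step retrying on a condition that never clears, issuing another agent metadata write on every pass. This is purely illustrative; the key names and loop structure are hypothetical, not taken from the affected customer's builds.

    // runaway-step.ts - a purely illustrative sketch of the failure mode.
    // A build step keeps retrying with a broken exit condition, and every
    // pass writes another piece of agent metadata via the agent CLI.
    import { execFileSync } from "node:child_process";

    let attempt = 0;
    for (;;) {
      attempt += 1;
      // Each iteration is one more write against the pipelines database.
      execFileSync("buildkite-agent", [
        "meta-data",
        "set",
        `retry-attempt-${attempt}`, // hypothetical key name
        "pending",
      ]);
      // The condition that was meant to stop the loop never becomes true,
      // so metadata writes pile up at whatever rate the step can sustain.
    }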

After we interrupted the runaway loops, database performance remained poor. We found that the extreme volume of metadata writes had left two partitions of our builds table in an unhealthy state, degrading the performance of any queries that touched them. We manually vacuumed these partitions, which brought database load back to normal levels and allowed normal Buildkite operations to resume.
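
This pattern is consistent with how PostgreSQL behaves under heavy write churn: updates leave behind dead row versions that bloat a table and skew planner statistics until a vacuum reclaims them. The sketch below shows what the manual recovery step might look like, assuming a Postgres database; the connection string and partition names are placeholders, not our production values.

    // vacuum-partitions.ts - a minimal sketch of the manual recovery step,
    // assuming a PostgreSQL database. The connection string and partition
    // names are placeholders.
    import { Client } from "pg";

    const client = new Client({
      connectionString: "postgres://localhost/pipelines",
    });
    await client.connect();

    // VACUUM reclaims dead row versions left behind by the write storm,
    // and ANALYZE refreshes planner statistics so queries against these
    // partitions stop using stale row estimates. Table names cannot be
    // bound as query parameters, so they come from a fixed list.
    for (const partition of ["builds_part_07", "builds_part_08"]) {
      await client.query(`VACUUM (VERBOSE, ANALYZE) ${partition}`);
      console.log(`vacuumed ${partition}`);
    }

    await client.end();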

At 22:22 UTC, an hour after first resolving this incident, we noted the same degraded performance pattern emerging, and identified the same runaway metadata write loop as the cause. We interrupted these runaway loops again and worked with the customer to find and fix the cause of this issue within their own builds. Ordinary Buildkite operations fully resumed by 22:50 UTC.

Changes we're making

We’ve implemented a rate limit on per-build agent metadata writes, which ensures pathological builds cannot create the widespread impact we saw earlier today. The limit has been in place since 2023-11-22 03:00 UTC. Our default limit (1,000 writes/build/minute) should comfortably cover all ordinary usage of agent metadata, but if you experience any issues, please contact Buildkite support.
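
A fixed-window counter is one simple way to enforce a limit like this. The sketch below is illustrative only: it assumes an in-memory counter keyed by build ID, and says nothing about how the production limiter is actually implemented.

    // metadata-rate-limit.ts - a minimal sketch of a fixed-window limiter
    // matching the default of 1,000 writes/build/minute. The in-memory Map
    // and function names are illustrative, not our production implementation.
    const MAX_WRITES_PER_MINUTE = 1000;
    const WINDOW_MS = 60_000;

    interface WriteWindow {
      start: number; // timestamp (ms) when this window opened
      count: number; // writes seen in the current window
    }

    const windows = new Map<string, WriteWindow>();

    // Returns true if another metadata write is permitted for this build.
    export function allowMetadataWrite(buildId: string): boolean {
      const now = Date.now();
      const current = windows.get(buildId);

      // No window yet, or the old window has expired: start a fresh one.
      if (!current || now - current.start >= WINDOW_MS) {
        windows.set(buildId, { start: now, count: 1 });
        return true;
      }

      if (current.count >= MAX_WRITES_PER_MINUTE) {
        return false; // over the limit; the API should reject this write
      }

      current.count += 1;
      return true;
    }

With a guard like this at the API boundary, a runaway build exhausts its budget within the first minute of its loop, and every subsequent write is rejected cheaply before it reaches the database.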

We’re also adjusting the data fetching mechanism for the builds listing pages on our web interface. These pages update themselves in real time in response to new builds or updates to your existing builds, and they will now fetch data more selectively, ensuring that they do not exacerbate any issues during periods of degraded performance. This change will be live within the next 24 hours.
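
As a sketch of the idea, a listing page can collect the IDs of builds reported as changed and fetch only those, coalescing bursts of events into a single request instead of re-fetching the whole listing on each update. The event source, endpoint path, and debounce interval below are all hypothetical.

    // selective-fetch.ts - a hedged sketch of more selective data fetching.
    // The event shape, endpoint path, and debounce interval are hypothetical;
    // they illustrate the approach rather than our exact implementation.
    const changedBuilds = new Set<string>();
    let flushTimer: ReturnType<typeof setTimeout> | null = null;

    // Called whenever a real-time event reports that a build changed.
    export function onBuildChanged(buildId: string): void {
      changedBuilds.add(buildId);
      // Debounce: coalesce a burst of events into one fetch.
      if (!flushTimer) {
        flushTimer = setTimeout(flush, 1_000);
      }
    }

    async function flush(): Promise<void> {
      flushTimer = null;
      const ids = [...changedBuilds];
      changedBuilds.clear();
      // Fetch only the builds that changed, not the whole listing, so a
      // burst of build activity translates into one small request.
      await fetch(`/builds?ids=${ids.join(",")}`);
    }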

Posted Nov 22, 2023 - 05:30 UTC

Resolved
This incident has been resolved.
Posted Nov 22, 2023 - 00:01 UTC
Monitoring
We've identified and resolved the cause of the high levels of agent API activity. Agent activity looks normal again, and we're currently monitoring.
Posted Nov 21, 2023 - 23:11 UTC
Identified
We are seeing a recurrence of the same unusually high level of agent API activity and are working to mitigate it.
Posted Nov 21, 2023 - 22:41 UTC
Monitoring
We’ve identified and resolved the cause of the high database load. Agent activity looks normal again, and we’re currently monitoring.
Posted Nov 21, 2023 - 21:20 UTC
Update
We're pinpointing the source of the high database load and working to relieve it.
Posted Nov 21, 2023 - 21:04 UTC
Update
We are seeing high database load and are working to bring the system back to normal operation.
Posted Nov 21, 2023 - 20:43 UTC
Update
We are continuing to investigate this issue.
Posted Nov 21, 2023 - 20:25 UTC
Update
We are continuing to investigate this issue.
Posted Nov 21, 2023 - 20:20 UTC
Update
Performance is degraded across the system. We’ve identified some unusually high usage as a likely cause and are continuing to investigate.
Posted Nov 21, 2023 - 20:00 UTC
Update
Performance is degraded across the system, and we've identified some unusually high usage. We are continuing to investigate.
Posted Nov 21, 2023 - 19:54 UTC
Investigating
We are currently investigating this issue.
Posted Nov 21, 2023 - 19:24 UTC
This incident affected: Agent API, REST API, and Job Queue.