Elevated Latency
Incident Report for Buildkite
Postmortem

Service Impact

On 2024-01-26 between 17:09 and 19:31 UTC, our Agent APIs experienced increased latency. Two small spikes in error rates were also seen at 17:32 and 19:15 UTC.

Customers may have experienced delayed job dispatch during this period. A small number of customers using legacy versions of buildkite-agent-metrics also experienced timeouts when fetching metrics. Because of recent database reliability improvements, the impact was contained to a subset of customers.

Incident Summary

Our monitoring detected elevated response times on the Agent API, and investigation revealed that a REST API endpoint used by legacy versions of buildkite-agent-metrics was also experiencing timeouts.

We eventually concluded that requests from legacy versions of buildkite-agent-metrics were driving the high load on the database, and we temporarily blocked these requests. This immediately reduced load on the database, and Agent API latency returned to normal levels.

Further investigation identified that a VACUUM on one table partition led the PostgreSQL query planner to use a more expensive lookup algorithm. The planner then chose a very expensive alternative plan for our legacy metrics queries, and those queries overwhelmed the database. Subsequent testing has confirmed that an ANALYZE on the affected partition resolves the performance degradation immediately.
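
As an illustration of how a plan change like this can be spotted, the sketch below summarises the output of PostgreSQL's EXPLAIN (FORMAT JSON) and compares it with a known-good baseline. It is a minimal sketch and not the tooling used during this incident: the function names, the cost threshold, and both sample plans are made up for the example.

```python
import json


def plan_summary(explain_json: str) -> dict:
    """Summarise EXPLAIN (FORMAT JSON) output: root cost and node types used."""
    # EXPLAIN (FORMAT JSON) returns a one-element list holding a "Plan" object.
    root = json.loads(explain_json)[0]["Plan"]
    node_types = set()
    stack = [root]
    while stack:
        node = stack.pop()
        node_types.add(node["Node Type"])
        # Child nodes, when present, are listed under the "Plans" key.
        stack.extend(node.get("Plans", []))
    return {"total_cost": root["Total Cost"], "node_types": sorted(node_types)}


def plan_changed(baseline: dict, current: dict, cost_ratio: float = 10.0) -> bool:
    """Flag a plan whose shape differs or whose estimated cost grew sharply.

    The 10x threshold is an arbitrary assumption for this sketch.
    """
    if current["node_types"] != baseline["node_types"]:
        return True
    return current["total_cost"] > cost_ratio * baseline["total_cost"]


# Made-up sample plans, not taken from the incident.
GOOD_PLAN = '[{"Plan": {"Node Type": "Index Scan", "Total Cost": 8.45}}]'
BAD_PLAN = (
    '[{"Plan": {"Node Type": "Append", "Total Cost": 250000.0, '
    '"Plans": [{"Node Type": "Seq Scan", "Total Cost": 250000.0}]}}]'
)
```

Comparing plan_summary(GOOD_PLAN) with plan_summary(BAD_PLAN) flags the change, because both the node types and the estimated cost differ. The costs are planner estimates, so a check like this shows that the planner changed its mind, not how long the query actually ran.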

Changes we’re making

Update: As of 2024-01-30 11:00 UTC we have eliminated the bloat in the problem partition and it is no longer an ongoing concern.

Previously: The table partition in question is one of our oldest and suffers from significant bloat. This partition is almost empty and we are working towards eliminating it completely as soon as possible. Doing so will avoid the risk of the bad query planning re-emerging.

In the meantime, our on-call engineers are prepared to run an ANALYZE on the partition should it go bad again, and we are looking at changes to our monitoring to try to detect this condition earlier.
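
One possible shape for such a check, sketched under assumptions: the report does not say what the monitoring will look at, so the heuristic here (a partition whose most recent VACUUM is newer than its most recent ANALYZE) is only a guess at a useful signal. The timestamp fields match columns of PostgreSQL's pg_stat_user_tables view; the function names and the partition name in the usage note are hypothetical.

```python
from datetime import datetime
from typing import Optional


def latest(*stamps: Optional[datetime]) -> Optional[datetime]:
    """Return the most recent non-null timestamp, or None if all are null."""
    present = [s for s in stamps if s is not None]
    return max(present) if present else None


def stats_may_be_stale(row: dict) -> bool:
    """Heuristic (an assumption): vacuumed more recently than analyzed.

    `row` is one row of pg_stat_user_tables as a dict, with manual and
    automatic runs both taken into account.
    """
    vacuumed = latest(row.get("last_vacuum"), row.get("last_autovacuum"))
    analyzed = latest(row.get("last_analyze"), row.get("last_autoanalyze"))
    if vacuumed is None:
        return False
    return analyzed is None or analyzed < vacuumed


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def analyze_statement(schema: str, table: str) -> str:
    """Build the ANALYZE statement an engineer would run for one partition."""
    return f"ANALYZE {quote_ident(schema)}.{quote_ident(table)};"
```

For a hypothetical partition named jobs_p0 in the public schema, analyze_statement("public", "jobs_p0") produces ANALYZE "public"."jobs_p0"; which can then be run through whatever database client is already in use.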

We will also be reaching out to the few remaining customers who use legacy versions of buildkite-agent-metrics (versions earlier than v3.0.0) to encourage them to upgrade to newer versions. We will be dropping support for these versions in the near future.
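
For anyone checking their own fleet, the v3.0.0 cutoff can be expressed as a small version comparison. This is a minimal sketch: the helper names are hypothetical, and how each installation reports its buildkite-agent-metrics version is left to the reader.

```python
import re
from typing import Optional, Tuple

# Versions earlier than v3.0.0 are the legacy versions named above.
LEGACY_CUTOFF = (3, 0, 0)


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Parse 'v2.4.1' or '2.4.1' into (2, 4, 1); return None if unparseable.

    Any pre-release or build suffix (for example '-beta') is ignored,
    which is a simplification made for this sketch.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:[-+].*)?", text.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_legacy(text: str) -> bool:
    """True only when the version parses and is earlier than v3.0.0."""
    version = parse_version(text)
    return version is not None and version < LEGACY_CUTOFF
```

Unparseable strings are treated as not legacy so that the check never flags a version it cannot read; a stricter policy could treat them as unknown and report them separately.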

Posted Jan 30, 2024 - 05:19 UTC

Resolved
This incident has been resolved.
Posted Jan 26, 2024 - 19:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 26, 2024 - 19:43 UTC
Identified
We have identified elevated load on a couple of endpoints and are actively working on mitigation.
Posted Jan 26, 2024 - 19:27 UTC
Update
We are continuing to investigate the latency issue with the Agent and REST APIs.
Posted Jan 26, 2024 - 18:30 UTC
Update
We are continuing to investigate the latency issue with the Agent and REST APIs.
Posted Jan 26, 2024 - 18:01 UTC
Investigating
We are investigating increased latency in our Agent API.
Posted Jan 26, 2024 - 17:44 UTC
This incident affected: Agent API and REST API.