Elevated error rate and latency in Agent API

Incident Report for Buildkite

Postmortem

Service Impact

Elevated error rates and latency when creating, retrieving, or updating artifacts through the Agent API, the REST and GraphQL APIs, or the web interface.

Incident Summary

We store metadata for all uploaded artifacts in a managed RDS PostgreSQL database.

Beginning around 18:55 UTC, the performance of queries to that database degraded due to a hardware failure. The database started to fail over to its replica at 19:02 UTC and finished at 19:06 UTC. It then took until 19:14 UTC to catch up on the transaction log.

Hardware issues and failovers are expected, but performance remained unexpectedly poor after the failover completed. We discovered that, due to an unclean shutdown, all statistics counters had been reset, resulting in very inefficient query plans. At 19:47 UTC we manually ran an ANALYZE command, and query performance was restored to normal levels.
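
As a rough illustration of this kind of check, the sketch below picks out tables that have not been analyzed since a failover and builds the matching ANALYZE statements. It assumes input rows shaped like the schemaname, relname, last_analyze, and last_autoanalyze columns of PostgreSQL's pg_stat_user_tables view. The TableStats type, the function names, and the sample table names are hypothetical, and this is not the exact tooling used during the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TableStats:
    """One row of pg_stat_user_tables (only the columns used here)."""
    schemaname: str
    relname: str
    last_analyze: Optional[datetime]
    last_autoanalyze: Optional[datetime]


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def needs_analyze(row: TableStats, failover_at: datetime) -> bool:
    """True if the table has no recorded analyze at or after the failover."""
    stamps = [t for t in (row.last_analyze, row.last_autoanalyze) if t is not None]
    return not stamps or max(stamps) < failover_at


def analyze_statements(rows: Iterable[TableStats], failover_at: datetime) -> List[str]:
    """Build one ANALYZE statement per table that needs fresh statistics."""
    return [
        f"ANALYZE {quote_ident(r.schemaname)}.{quote_ident(r.relname)};"
        for r in rows
        if needs_analyze(r, failover_at)
    ]


if __name__ == "__main__":
    # Failover completion time taken from the summary above.
    failover = datetime(2024, 6, 26, 19, 6, tzinfo=timezone.utc)
    # Hypothetical sample rows; a table with no timestamps is always selected.
    rows = [
        TableStats("public", "artifacts", None, None),
        TableStats("public", "builds",
                   datetime(2024, 6, 26, 19, 30, tzinfo=timezone.utc), None),
    ]
    for stmt in analyze_statements(rows, failover):
        print(stmt)
```

Generating the statements separately from running them keeps the check easy to review before anything is executed against the database.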

Changes we're making

We have since switched artifact metadata storage to a partitioned table, so that each partition stores only a few days' worth of data. As well as improving day-to-day query performance, we expect this to improve time to recovery after any future failover.
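
To make the partitioning idea concrete, here is a minimal sketch that generates PostgreSQL declarative range-partitioning DDL with one partition per few-day window. The table name artifacts, the created_at partition key, the column list, and the three-day window are all assumptions for illustration; this report does not describe the actual schema or partition size.

```python
from datetime import date, timedelta
from typing import List


def parent_table_ddl(table: str = "artifacts", key: str = "created_at") -> str:
    """DDL for a range-partitioned parent table (hypothetical columns)."""
    return (
        f"CREATE TABLE {table} (\n"
        f"    id uuid NOT NULL,\n"
        f"    {key} timestamptz NOT NULL,\n"
        # A primary key on a partitioned table must include the partition key.
        f"    PRIMARY KEY (id, {key})\n"
        f") PARTITION BY RANGE ({key});"
    )


def partition_ddl(table: str, start: date, days: int) -> str:
    """DDL for one partition covering [start, start + days)."""
    end = start + timedelta(days=days)
    name = f"{table}_{start:%Y%m%d}"
    # Range bounds are inclusive at the lower end and exclusive at the upper end.
    return (
        f"CREATE TABLE {name} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


def partitions_covering(table: str, first: date, last: date, days: int = 3) -> List[str]:
    """DDL for consecutive partitions that together cover first..last."""
    stmts = []
    start = first
    while start <= last:
        stmts.append(partition_ddl(table, start, days))
        start += timedelta(days=days)
    return stmts


if __name__ == "__main__":
    print(parent_table_ddl())
    for stmt in partitions_covering("artifacts", date(2024, 7, 1), date(2024, 7, 7)):
        print(stmt)
```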

We have also updated our runsheet for a database failover to ensure statistics are regenerated.

Posted Jul 17, 2024 - 05:00 UTC

Resolved

Latency and error rates of the Agent API have recovered after a database failover.

Posted Jun 26, 2024 - 20:23 UTC

Monitoring

Latency of the Agent API has recovered after a database failover. We are continuing to monitor performance after this change.

Posted Jun 26, 2024 - 20:03 UTC

Update

We are continuing to investigate the root cause of the latency spikes in the Agent API.

Posted Jun 26, 2024 - 19:44 UTC

Update

Latency of the Agent API is back to normal. We are continuing to investigate the root cause of the issue.

Posted Jun 26, 2024 - 19:29 UTC

Investigating

We've detected higher-than-normal latency and error rates in our Agent API and are investigating.

Posted Jun 26, 2024 - 19:15 UTC

This incident affected: Agent API and Job Queue.