Elevated error rate and elevated latency when creating, retrieving, or updating artifacts in the Agent API, REST or GraphQL APIs, or web interface.
We store metadata for all uploaded artifacts in a managed RDS PostgreSQL database.
Beginning from around 18:55 UTC, the performance of queries to that database degraded due to a hardware failure, and it started to fail over to its replica at 19:02 UTC, finishing at 19:06 UTC. It then took until 19:14 to catch up on the transaction log.
Hardware issues and failovers are expected, however performance was still unexpectedly poor. We discovered that due to an unclean shutdown, all statistics counters were reset, resulting in very inefficient query plans. At 19:47 we manually ran an ANALYZE command and query performance was restored to normal levels.
We have since switched artifact metadata storage to a partitioned table, such that each partition only stores a few days worth of data. As well as improving day-to-day query performance, we expect this will improve time to recovery after any future failover.
We have also updated our runsheet for a database failover to ensure statistics are regenerated.