Test Analytics availability
Incident Report for Buildkite
Postmortem

Summary

On September 6th, 2022, Test Analytics was impacted by a database outage lasting 6 hours and 33 minutes, which prevented access to the Test Analytics application and stopped our ingestion of test executions.

We have been working closely with our partners at AWS to fully understand the root cause of this issue and to put in place multiple layers of mitigation to prevent incidents like this from happening in the future.

What happened

The Test Analytics Aurora cluster consists of a single writer instance and multiple reader instances, which provide load balancing and failover targets in the event of an incident.

At 3:59 am UTC, the team was paged due to a “Database Restarted” event. Upon further inspection, the team found that after the restart, the database servers, including our main writer instance, hit errors while fetching critical configuration information, which prevented them from restarting cleanly.

We attempted to get the application back online as quickly as possible by promoting reader replicas to writers, but each promoted instance exhibited the same behavior, leaving the entire database unavailable and telling us that whatever had caused the issue had replicated across instances.

After noticing this, the team performed a Point-In-Time Restore (PITR) to roughly 20 seconds before the incident occurred, to minimize data loss. Creating an entirely new cluster and restoring the database to a specific point in time is a much more time-consuming process than promoting a reader instance, but at the time we believed it was the fastest way to get back online.

Unfortunately, the new database cluster entered the same restart loop as the malfunctioning cluster. This forced us, at the request of AWS Support, to restore from a much earlier backup (at around 1:18 am UTC) to ensure that the new cluster would not exhibit the same issues. Once the restore was complete, we were able to connect to the database, drop the corrupted index, recreate a healthy one, and bring the application back online.
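For illustration only, the recovery on the restored cluster amounted to rebuilding the affected unique index. The table and index names below are hypothetical rather than our actual schema, and the commands are a sketch of the general approach rather than exactly what we ran:

```sql
-- Hypothetical names; a sketch of rebuilding a corrupted unique B-tree index.
-- The CONCURRENTLY variants avoid blocking reads and writes, at the cost of
-- taking longer, and cannot be run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS index_test_executions_on_key;

CREATE UNIQUE INDEX CONCURRENTLY index_test_executions_on_key
  ON test_executions (key);
-- Note: the unique index build fails if duplicate rows are present, which is
-- why restoring to a point before the duplicates were inserted mattered.
```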

However, restoring from the earlier backup also meant that data ingested from 1:18 am UTC until around 3:59 am UTC was not included in the restored database. This data is currently not accessible through the Test Analytics web UI, but we have recovered it and will attempt to add it back into our archives in the future.

Root cause analysis

The Test Analytics application supports multibyte UTF-8 characters (such as Emoji or Kanji). PostgreSQL uses a system-level library for collation, that is, for sorting and comparing strings, including multibyte ones. Before the incident, an automated OS-level update to the GNU C Library (glibc) changed the collation order of a small set of multibyte characters, including some kanji. Because a B-tree index stores its entries in collation order, an index built before the upgrade could be traversed incorrectly after it whenever the affected characters were indexed.
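Recent PostgreSQL releases record the collation version that collations were created with and can compare it against what the OS library currently reports. As an illustration of the failure mode rather than the exact check we or AWS ran, a version mismatch can be surfaced with a catalog query like the following:

```sql
-- List collations whose recorded version no longer matches the version the
-- OS library reports. A mismatch means indexes built under the old ordering
-- may no longer be consistent with the new one.
SELECT collname,
       collversion                      AS recorded_version,
       pg_collation_actual_version(oid) AS current_os_version
FROM pg_collation
WHERE collversion IS NOT NULL
  AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);
```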

The factors above combined to create a situation where INSERT operations that would normally have been rejected by the database engine for violating a uniqueness constraint succeeded: because the index was now being traversed under a different collation order, lookups could miss existing entries. Multiple duplicate records were inserted, corrupting the unique B-tree index on both the writer instance and all of the readers. This also meant that, rather than restoring from a healthy snapshot, the team was effectively recreating instances and clusters whose indices were already corrupted.
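A visible symptom of this failure mode is duplicate rows under a key that a unique index was supposed to protect. As a sketch, again with hypothetical table and column names, such duplicates can be found by aggregating over the table:

```sql
-- Hypothetical names; keys that appear more than once even though a unique
-- index on (key) should have prevented it.
SELECT key, count(*) AS copies
FROM test_executions
GROUP BY key
HAVING count(*) > 1;
```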

How we’re becoming more resilient to issues like this

With the support of AWS, we have upgraded the underlying PostgreSQL version to one that prevents inserting data into a B-tree index when doing so would corrupt it. In the open-source PostgreSQL codebase, this logic exists only as an assertion that is active in debug builds.

Related to this, AWS is working with the PostgreSQL community to decouple index collation and sorting from OS-level libraries; there are already conversations about this on the PostgreSQL mailing lists.

Finally, we’re using pg_amcheck to check all of our indices across Buildkite for corruption, and we have corrective mechanisms in place for any index it flags as at risk.
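pg_amcheck is a command-line wrapper around PostgreSQL’s amcheck extension. As a rough sketch of the kind of verification involved, with a hypothetical index name, the same checks can be run directly in SQL:

```sql
-- amcheck ships with PostgreSQL as a contrib extension.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Verify the B-tree structure and, with the second argument set to true,
-- confirm that every heap tuple has a matching index entry. Only takes a
-- light lock, so it can run alongside normal traffic.
SELECT bt_index_check('index_test_executions_on_key', true);

-- A stricter variant that also checks parent/child page relationships.
-- It takes a ShareLock, which blocks concurrent writes to the table.
SELECT bt_index_parent_check('index_test_executions_on_key', true);
```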

Conclusion

We’re beyond sorry for the disruption this may have caused. We want Test Analytics to be a tool that is reliably integrated into your workflow and your day-to-day engineering, which is why we take situations like this incredibly seriously. We would also like to take this opportunity to thank AWS for their support during this incident, and for their diligent involvement with us and the community to improve resilience.

Posted Sep 29, 2022 - 22:52 UTC

Resolved
We’ve restored Test Analytics availability, and test executions are now being ingested as usual. We apologise for today's incident and will be publishing a public post-incident review once we have had a chance to review the associated data and timelines.
Posted Sep 06, 2022 - 11:24 UTC
Monitoring
The restored Test Analytics database is now operational. We’ve restored service availability and are now ingesting new executions. We will continue working with AWS to attempt to restore test executions ingested from 01:17 UTC through 03:55 UTC. Test executions sent to Test Analytics between 03:55 and 10:32 UTC were not ingested and are not recoverable.
Posted Sep 06, 2022 - 10:57 UTC
Update
We have restored our database onto a new AWS Aurora cluster and are attempting to restore the connection.
Posted Sep 06, 2022 - 10:11 UTC
Update
We are still working with our upstream provider on the Test Analytics database recovery.
Posted Sep 06, 2022 - 09:18 UTC
Update
We are continuing to work with our upstream provider on the Test Analytics database recovery.
Posted Sep 06, 2022 - 08:32 UTC
Identified
We are continuing to work with our upstream provider on our database recovery.
We have also shipped a fix for team administration in the Buildkite UI.
Posted Sep 06, 2022 - 07:57 UTC
Update
We are still working with our upstream provider, and are preparing a point-in-time recovery as a contingency. We are also still working on remediating team administration.
Posted Sep 06, 2022 - 07:21 UTC
Update
Our upstream provider continues to investigate, and the point-in-time recovery contingency is nearly ready, should we need it.
Posted Sep 06, 2022 - 06:40 UTC
Update
While we investigate the problem with our upstream provider, we are preparing a point-in-time recovery as a contingency. We have also identified that administering team permissions is impacted.
Posted Sep 06, 2022 - 06:06 UTC
Update
We are continuing to investigate this issue with our upstream provider.
Posted Sep 06, 2022 - 05:29 UTC
Update
Availability of Test Analytics is currently impacted. We have escalated to our upstream provider and are continuing to investigate the cause of the problem.
Posted Sep 06, 2022 - 05:11 UTC
Investigating
We're investigating some database issues that are currently impacting availability of Test Analytics.
Posted Sep 06, 2022 - 04:29 UTC
This incident affected: Test Analytics (Web, Ingestion).