Some job logs unavailable

Incident Report for Buildkite

Resolved

The issue has been resolved and the underlying instance has been completely replaced. There was an underlying hardware issue which was not detected by the usual hardware failure detection mechanisms; it should no longer be affecting the instances.

The issue only affected some instances in our cluster. Agents whose log submissions failed should have retried, and dashboard log display may have been delayed or malfunctioned, but no data has been lost and operations are now back to normal.

Posted Apr 08, 2019 - 11:49 UTC

Monitoring

We've failed over and the new primary is doing nicely. We're recycling the old one so we're not affected by the underlying hardware fault in the event of a failover back.

Posted Apr 08, 2019 - 11:37 UTC

Identified

We've been speaking to AWS, who have identified an underlying hardware issue and are performing a failover.

Posted Apr 08, 2019 - 11:24 UTC

Monitoring

The issue seems to have been resolved. Some network connections were dropped between our connection pooler and an upstream RDS instance, in what appears to be a transient networking issue. We're keeping an eye on things.

Posted Apr 08, 2019 - 10:31 UTC

Investigating

One of our job log databases is having some trouble. We're looking into it.

Posted Apr 08, 2019 - 10:20 UTC

This incident affected: Job Queue.