The issue has been resolved and the underlying instance has been completely replaced. An underlying hardware issue went undetected by the usual hardware failure mechanisms; it should no longer be affecting the instances.
The issue affected only some instances in our cluster. Agents whose log submissions failed should have retried, and dashboard log display may have been delayed or may have malfunctioned. No data has been lost, and operations are now back to normal.
Posted 15 days ago. Apr 08, 2019 - 21:49 AEST
We've failed over and the new primary is doing nicely. We're recycling the old one so we're not affected by the underlying hardware fault in the event of a failover back.
Posted 15 days ago. Apr 08, 2019 - 21:37 AEST
We've been speaking to AWS, who have identified an underlying hardware issue and are performing a failover.
Posted 15 days ago. Apr 08, 2019 - 21:24 AEST
The issue seems to have been resolved. Some network connections were dropped between our connection pooler and an upstream RDS instance, which appears to have been a transient networking issue. We're keeping an eye on things.
Posted 15 days ago. Apr 08, 2019 - 20:31 AEST
One of our job log databases is having some trouble. We're looking into it.