From 20:32 to 21:10 (UTC) on 2023-01-26, our Agent API experienced elevated error rates. The issue was mostly resolved by 20:44, with a smaller spike in errors recurring between 21:00 and 21:10.
Between 20:27 and 21:07 our provider experienced elevated error rates for their Logs API.
Our AWS Fargate containers were configured to send logs in blocking mode. This meant that when logs couldn't be delivered, or delivery was delayed, the application hung until either log delivery recovered or the health check failed and the container was replaced.
As discussed in this Amazon blog post, you can instead configure your containers in non-blocking mode, which stores logs in a ring buffer and discards them when the buffer is full. The trade-off is that we may lose some logs when there are issues with the underlying log delivery, but in exchange the application stays available. We believe this trade-off makes sense, so we have now configured all our containers to use non-blocking mode.
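For readers who want to make the same change, here is a minimal sketch of what the log configuration of an ECS container definition might look like in non-blocking mode. It assumes the awslogs log driver, which may not match your setup. The helper name, log group, region, stream prefix, and the 25m buffer size are illustrative placeholders, not the values we use.

```python
import json


def non_blocking_log_configuration(log_group, region, stream_prefix,
                                   max_buffer_size="25m"):
    """Build a logConfiguration fragment for an ECS container definition.

    Hypothetical helper for illustration. Assumes the awslogs driver;
    all argument values are placeholders chosen by the caller.
    """
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
            # Buffer logs in memory instead of blocking the application
            # when the log destination is slow or unavailable.
            "mode": "non-blocking",
            # Size of the in-memory buffer; logs are dropped once it fills.
            "max-buffer-size": max_buffer_size,
        },
    }


if __name__ == "__main__":
    # Placeholder names, printed as the JSON fragment a task definition expects.
    config = non_blocking_log_configuration(
        "/ecs/example-service", "us-east-1", "example-service"
    )
    print(json.dumps(config, indent=2))
```

A larger buffer lowers the chance of dropped logs during a short delivery outage at the cost of more memory per container, so the right size depends on your log volume.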