Elevated error rates

Incident Report for Buildkite

Postmortem

Service Impact

From 20:32 until 21:10 (UTC) on 2023-01-26, our Agent API experienced elevated error rates. The issue was mostly resolved by 20:44, with a small spike in errors recurring between 21:00 and 21:10.

Incident Summary

Between 20:27 and 21:07 (UTC), our provider experienced elevated error rates on its Logs API.

Our AWS Fargate containers were configured to send logs in blocking mode. This meant that when logs could not be delivered or were delayed, the application hung until log delivery recovered, or until the health check failed and the container was replaced.

Changes we're making

As discussed in this Amazon blog post, containers can be configured in non-blocking mode, which stores logs in a ring buffer and discards them when the buffer is full. The trade-off is that we may lose logs when there are issues with the underlying log delivery, but in exchange the system stays available. We believe this trade-off makes sense, so we have now configured all our containers to use non-blocking mode.
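
The trade-off can be illustrated with a minimal Python sketch of a bounded log buffer. This is not our production code or the log driver's implementation: the class name `NonBlockingLogBuffer`, its methods, and the choice to drop the newest message when the buffer is full are all assumptions made for illustration. The point it shows is that a write never waits on log delivery; when the buffer is full the message is counted as dropped and the caller carries on.

```python
import queue


class NonBlockingLogBuffer:
    """Hypothetical bounded in-memory log buffer that never blocks the caller."""

    def __init__(self, max_messages: int) -> None:
        # Fixed capacity: once full, further writes are discarded.
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=max_messages)
        self.dropped = 0

    def write(self, message: str) -> bool:
        """Try to buffer a message; return False if it was discarded."""
        try:
            # put_nowait raises queue.Full instead of waiting for space.
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            # Assumption: the new message is the one that is dropped.
            self.dropped += 1
            return False

    def drain(self) -> list[str]:
        """Remove and return everything currently buffered, oldest first."""
        messages = []
        while True:
            try:
                messages.append(self._queue.get_nowait())
            except queue.Empty:
                return messages


if __name__ == "__main__":
    # Simulate a delivery outage: nothing drains the buffer while we write.
    buf = NonBlockingLogBuffer(max_messages=3)
    for i in range(5):
        buf.write(f"line {i}")
    print(f"buffered={len(buf.drain())} dropped={buf.dropped}")
```

In blocking mode the equivalent write would wait for space in the buffer, which is how a stalled log destination can stall the application itself.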

Posted Apr 21, 2023 - 05:48 UTC

Resolved

System performance remains stable. We’ll continue to monitor system performance and investigate the underlying cause of the incident.

Posted Jan 26, 2023 - 21:45 UTC

Monitoring

System performance has returned to acceptable levels. We are currently monitoring.

Posted Jan 26, 2023 - 21:13 UTC

Identified

We have identified an issue causing high latency. We are currently investigating the root cause.

Posted Jan 26, 2023 - 20:56 UTC

This incident affected: Agent API.