Elevated error rates
Incident Report for Buildkite
Postmortem

Service Impact

From 2023-01-26 20:32 (UTC) until 21:10, our Agent API experienced an increase in error rates. The issue was mostly resolved by 20:44, with a small spike in errors recurring between 21:00 and 21:10.

Incident Summary

Between 20:27 and 21:07 our provider experienced elevated error rates for their Logs API.

Our AWS Fargate containers were configured to send logs in blocking mode. This meant that when logs could not be delivered or were delayed, the application hung until log delivery recovered, or until the health check failed and the container was replaced.
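
As a minimal illustration of this failure mode (a sketch, not the actual log driver), the Python snippet below models the log buffer as a small bounded queue that nothing drains, simulating a delivery outage. The names log_buffer and write_log_blocking are hypothetical, and the timeout is an assumption standing in for a health check deadline; a true blocking write would wait indefinitely.

```python
import queue
import time

# Hypothetical stand-in for the log driver's buffer: a small bounded queue.
# Nothing consumes from it, which simulates a log delivery outage.
log_buffer = queue.Queue(maxsize=2)


def write_log_blocking(line, give_up_after):
    """Write a log line in blocking style.

    Returns True once the line is buffered, or False if the buffer stayed
    full for `give_up_after` seconds. The timeout is an assumption that
    stands in for a health check deadline.
    """
    try:
        log_buffer.put(line, block=True, timeout=give_up_after)
        return True
    except queue.Full:
        return False


# The first two lines fit in the buffer, so the application keeps running.
accepted = [
    write_log_blocking(f"request {i} handled", give_up_after=0.05)
    for i in range(2)
]

# The third write finds the buffer full and stalls the caller.
started = time.monotonic()
third_accepted = write_log_blocking("request 2 handled", give_up_after=0.05)
stalled_for = time.monotonic() - started

print(accepted, third_accepted, f"stalled for {stalled_for:.2f}s")
```

In this sketch the application thread makes no progress while the write waits, which is the behaviour that turned a logging outage into request errors.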

Changes we're making

As discussed in this Amazon blog post, containers can be configured to send logs in non-blocking mode, which stores logs in a ring buffer and discards them when the buffer is full. This risks losing logs when there are issues with the underlying log delivery, but gives us better availability of the system. We believe this trade-off makes sense, so we have now configured all our containers to use non-blocking mode.
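
For readers who want to make a similar change, the sketch below builds the kind of container log configuration that enables non-blocking mode. It assumes the awslogs log driver, which this report does not confirm; the function name, log group, region, stream prefix and buffer size are all hypothetical placeholders. The "mode" and "max-buffer-size" options are the standard log driver settings for this behaviour.

```python
import json


def non_blocking_log_configuration(log_group, region, stream_prefix,
                                   max_buffer_size="25m"):
    """Return a logConfiguration fragment for a container definition.

    Assumes the awslogs driver. The buffer size is illustrative only and
    should be sized for the container's log volume.
    """
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
            # Buffer logs in memory instead of blocking the application.
            "mode": "non-blocking",
            # Size of the in-memory buffer; logs may be lost once it fills.
            "max-buffer-size": max_buffer_size,
        },
    }


# Hypothetical values, for illustration only.
log_configuration = non_blocking_log_configuration(
    log_group="/ecs/example-service",
    region="us-east-1",
    stream_prefix="example",
)

print(json.dumps(log_configuration, indent=2))
```

A larger buffer reduces the chance of dropped logs during a short delivery outage, at the cost of more container memory.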

Posted Apr 21, 2023 - 05:48 UTC

Resolved
System performance remains stable. We’ll continue to monitor it and investigate the underlying cause of the incident.
Posted Jan 26, 2023 - 21:45 UTC
Monitoring
System performance has returned to acceptable levels. We are currently monitoring.
Posted Jan 26, 2023 - 21:13 UTC
Identified
We have identified an issue causing high latency. We are currently investigating the root cause.
Posted Jan 26, 2023 - 20:56 UTC
This incident affected: Agent API.