Increased latency and error rates

Incident Report for Buildkite

Postmortem

Posted Oct 24, 2025 - 21:25 UTC

Resolved

Our services have been fully recovered for the past hour, so we are marking this incident as resolved.

Our engineers will continue to monitor AWS and will keep services scaled up to prevent impact from any additional failures.
Posted Oct 20, 2025 - 22:02 UTC

Update

Latency and error rates have all returned to baseline levels. We have seen full recovery of our services.

We continue to actively monitor our services and the AWS reports on us-east-1 impact to ensure stability is maintained.
Posted Oct 20, 2025 - 21:04 UTC

Update

We're seeing signs of recovery across the board. Error rates have reduced to baseline levels. Latency is trending towards baseline.

We continue to actively monitor our services and the AWS reports on us-east-1 impact.
Posted Oct 20, 2025 - 20:33 UTC

Monitoring

We're seeing slow recovery of all our services. Latency and error rates are decreasing across the board. We are continuing to monitor the situation.
Posted Oct 20, 2025 - 19:29 UTC

Update

Our mitigations improved latency for the Agent API, although elevated latency and error rates persist across other services. AWS is reporting some recovery in us-east-1, and we are seeing further improvements in our services. We are actively monitoring the situation and implementing mitigations where possible.
Posted Oct 20, 2025 - 18:54 UTC

Update

We have implemented mitigations and see an improvement in latency for the Agent API. Latency and error rates remain elevated across the REST, GraphQL, and Web services, and notifications are delayed.

We are continuing to work through mitigations and will provide an update in 1 hour.
Posted Oct 20, 2025 - 17:36 UTC

Identified

We're continuing to see increased latency across many of our subsystems due to an ongoing AWS outage. We are unable to launch new tasks in us-east-1 and are investigating potential mitigations to restore service.
Posted Oct 20, 2025 - 17:04 UTC

Update

We're currently working on mitigations for scaling up, but at this stage service is degraded, with increased latency across the API, notifications, and build starts.
Posted Oct 20, 2025 - 15:17 UTC

Investigating

We're observing increased latency and error rates due to an inability to scale up. We're currently investigating and will provide status updates as they become available.
Posted Oct 20, 2025 - 14:18 UTC
This incident affected: Web, Agent API, REST API, Job Queue, SCM Integrations and Notifications (GitHub Commit Status Notifications).