Agent unexpectedly disconnected

Incident Report for Buildkite

Postmortem

Service Impact

Agents that were in a connected or stopping state before January 23 at 06:50 UTC, and that remained in that state through January 25 at 08:05 UTC, were incorrectly transitioned to lost, causing them to disconnect.

Recovery: Agents running on Buildkite Elastic CI Stack for AWS automatically recovered within ~10 minutes. For other orchestration methods, restarting the agent service (systemctl restart buildkite-agent or equivalent) will restore connectivity.

Incident Summary

On January 23 at 06:50 UTC, we deployed a change to migrate agent heartbeat data to a new storage format. On January 25 at 08:05 UTC, a background maintenance process that monitors agent health could not locate heartbeat records for agents still using the previous format. These agents were incorrectly identified as unresponsive and marked as lost.

Only agents in connected or stopping states were affected. Agents actively running jobs were unaffected.
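
For illustration only, the failure mode described above can be sketched as a heartbeat lookup that reads just one storage format. This is not Buildkite's code: the function name, the two dictionary-backed stores, and the optional fallback read are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the two heartbeat storage formats.
HeartbeatStore = Dict[str, datetime]


def find_heartbeat(
    agent_id: str,
    new_store: HeartbeatStore,
    legacy_store: Optional[HeartbeatStore] = None,
) -> Optional[datetime]:
    """Return the last heartbeat for an agent, or None if no record is found.

    With legacy_store omitted, agents whose records still live in the
    previous format look as if they never sent a heartbeat.
    """
    last_seen = new_store.get(agent_id)
    if last_seen is None and legacy_store is not None:
        # Assumed fallback: also read the previous format during a migration.
        last_seen = legacy_store.get(agent_id)
    return last_seen
```

A health check that calls the lookup without the legacy store receives None for a long-lived agent and can misread a missing record as an unresponsive agent.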

Changes we're making

We are adding an immediate safeguard to the maintenance process so that agents in a connected state are never automatically marked as lost. Instead, mismatches will trigger an internal alert for investigation, so that we catch anomalies without customer impact.
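
As a rough sketch of what such a safeguard could look like (not the actual implementation), the sweep below alerts instead of transitioning when a connected agent has a missing or stale heartbeat. The Agent class, the sweep_agents function, the alert callback, and the five-minute timeout are all hypothetical. The report does not say how other states are handled, so the sketch simply leaves them on the existing stale-heartbeat path.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

# Hypothetical timeout; the report does not state the real value.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


@dataclass
class Agent:
    id: str
    state: str  # for example "connected", "stopping", or "lost"


def sweep_agents(
    agents: List[Agent],
    heartbeats: Dict[str, datetime],
    now: datetime,
    alert: Callable[[str], None],
) -> List[str]:
    """Mark stale agents as lost and return the IDs that were transitioned."""
    marked_lost = []
    for agent in agents:
        last_seen = heartbeats.get(agent.id)
        stale = last_seen is None or now - last_seen > HEARTBEAT_TIMEOUT
        if not stale:
            continue
        if agent.state == "connected":
            # Safeguard: never transition a connected agent automatically.
            alert(f"heartbeat mismatch for connected agent {agent.id}")
            continue
        agent.state = "lost"
        marked_lost.append(agent.id)
    return marked_lost
```

Routing the mismatch to an alert keeps the decision with a human when the monitoring data itself may be at fault, which is the situation this incident describes.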

We are restructuring our internal agent state storage to support staged rollouts, limiting the blast radius of future maintenance operations to a small subset of agents before wider deployment.
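
One common way to stage a rollout, shown here only as a sketch under assumptions, is to hash each agent ID into a stable bucket and enable a change for a growing percentage of buckets. The report does not describe the mechanism Buildkite will use; the function names and the 100-bucket scheme below are hypothetical.

```python
import hashlib


def rollout_bucket(agent_id: str) -> int:
    """Map an agent ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(agent_id: str, percent: int) -> bool:
    """Return True if the agent falls inside the first `percent` buckets."""
    return rollout_bucket(agent_id) < percent
```

Because the bucket depends only on the agent ID, an agent included at 5 percent stays included at 50 percent, so a maintenance change can be widened step by step without reshuffling which agents are exposed.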

Customer outreach: We are proactively contacting affected customers whose agents have not automatically recovered to confirm whether manual intervention is needed.

Posted Jan 26, 2026 - 07:26 UTC

Resolved

A maintenance process incorrectly marked some connected agents as lost between January 23–25 UTC. Agents on Elastic CI Stack recovered automatically; others may need a service restart. We've deployed a fix to prevent this from recurring.

Posted Jan 25, 2026 - 08:00 UTC