On May 7th, between 08:14 and 09:25 UTC, customers using hosted agents, the Kubernetes stack, and the Buildkite agent acquire job feature experienced failures when starting a job, resulting in the error message "Missing agent. See: buildkite-agent bootstrap --help".
As part of our efforts to improve the performance of our platform, we shipped a change to how database commits were grouped together. This change inadvertently caused certain job environment variables to be omitted when a job was assigned via the acquire method. These variables are key to our integration with hosted agents, as well as to customers using our Kubernetes stack, and their omission caused any job launched via this method to fail.
While we had test coverage that ensured these variables were populated, these tests did not exercise the code path used by acquire job thoroughly enough to reveal this problem before it was deployed. Furthermore, while parts of the change were placed behind a feature flag, the refactor of the code that caused this bug was not.
Attempts to restore service by rolling back this change were delayed because the current revision was selected for deployment instead of the previous one. The initial rollback was triggered at around 08:40 UTC, but it wasn't until 09:05 UTC that we realised the mistake and began the rollback to the correct revision. That rollback started deploying at 09:16 UTC, and service was fully restored at 09:25 UTC.
We have added additional rollback gates, making it easier to identify when an incorrect revision has been selected.
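To illustrate the kind of check a rollback gate can perform, here is a minimal sketch in Python. It is not our actual deploy tooling: the `Revision` type, its `sha` and `deployed_at` fields, and `check_rollback_target` are all hypothetical names chosen for this example. The idea is simply to refuse a rollback whose target is the revision that is already live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """A deployable revision. Both fields are hypothetical, for illustration only."""
    sha: str
    deployed_at: int  # Unix timestamp of when this revision went live


class RollbackGateError(Exception):
    """Raised when the selected rollback target looks wrong."""


def check_rollback_target(current: Revision, target: Revision) -> None:
    """Reject rollback targets that would not actually roll anything back."""
    if target.sha == current.sha:
        raise RollbackGateError(
            f"target {target.sha} is the revision currently live; pick an earlier one"
        )
    if target.deployed_at >= current.deployed_at:
        raise RollbackGateError(
            f"target {target.sha} is not older than the live revision {current.sha}"
        )


if __name__ == "__main__":
    live = Revision(sha="bbb222", deployed_at=200)
    previous = Revision(sha="aaa111", deployed_at=100)
    check_rollback_target(live, previous)  # passes silently
    try:
        check_rollback_target(live, live)  # same mistake as in this incident
    except RollbackGateError as err:
        print(f"rollback blocked: {err}")
```

A gate like this fails loudly at the moment the wrong revision is selected, rather than after the deploy has completed.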
Our test suite will be expanded to include contract tests for APIs used by Hosted Agent and the Kubernetes stack.
Additionally, we are configuring Hosted Agent synthetic tests to automatically page on-call engineers when failures occur, improving our response times.
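As a rough sketch of what a contract test for job environment variables could look like, the Python example below runs the same assertion against every job assignment path, so the acquire path cannot be skipped. Everything in it is an assumption made for illustration: the variable names, the path names, and `build_job_env` are hypothetical stand-ins and do not reflect the real variables or code involved in this incident.

```python
from typing import Callable, Dict, Set

# Hypothetical variable names; this report does not list the ones that were omitted.
REQUIRED_JOB_ENV = frozenset({"EXAMPLE_AGENT_ENDPOINT", "EXAMPLE_AGENT_TOKEN"})

# Hypothetical names for the ways a job can reach an agent.
ASSIGNMENT_PATHS = ("dispatch", "acquire")


def build_job_env(path: str) -> Dict[str, str]:
    """Stand-in for the real environment builder; returns the same env for every path."""
    return {
        "EXAMPLE_JOB_ID": "job-123",
        "EXAMPLE_AGENT_ENDPOINT": "https://agent.example.invalid",
        "EXAMPLE_AGENT_TOKEN": "redacted",
    }


def missing_required_env(env: Dict[str, str]) -> Set[str]:
    """Return the required variable names that are absent from env."""
    return set(REQUIRED_JOB_ENV.difference(env))


def check_contract(
    builder: Callable[[str], Dict[str, str]] = build_job_env,
) -> Dict[str, Set[str]]:
    """Map each assignment path to its missing variables; empty means the contract holds."""
    failures: Dict[str, Set[str]] = {}
    for path in ASSIGNMENT_PATHS:
        missing = missing_required_env(builder(path))
        if missing:
            failures[path] = missing
    return failures


if __name__ == "__main__":
    assert check_contract() == {}, "every path must populate the required variables"
```

Iterating over all assignment paths in one test is the design point here: a regression that affects only one path, as in this incident, shows up as a failure keyed by that path.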