Newly connected agents unable to run jobs

Incident Report for Buildkite

Postmortem

On Sat, 22 July 2023 10:06:20 UTC, our internal primary key for connected agents exceeded the maximum value for the PostgreSQL `integer` type (signed 32-bit integer; 2³¹ − 1 = 2,147,483,647). This was neither unexpected nor directly problematic; the primary key was stored as `bigint` (signed 64-bit integer) with ample capacity. However, an associated foreign key mapping build jobs to agents was stored as the smaller `integer` type. As a result, jobs could not be assigned to agents that connected after that moment.
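To illustrate the boundary described above, here is a minimal Python sketch of the range check. It only models the arithmetic; the helper name `fits_signed` and the sample IDs are hypothetical and not taken from our codebase.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the PostgreSQL `integer` maximum


def fits_signed(value: int, bits: int) -> bool:
    """Return True if `value` is representable as a signed integer of `bits` bits."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


# Hypothetical agent IDs on either side of the boundary.
last_ok_id = INT32_MAX
first_bad_id = INT32_MAX + 1

print(fits_signed(last_ok_id, 32))    # still fits in an `integer` foreign key
print(fits_signed(first_bad_id, 64))  # fits in the `bigint` primary key
print(fits_signed(first_bad_id, 32))  # does not fit in the `integer` foreign key
```

The first ID past the boundary is valid as a primary key but cannot be written to a 32-bit foreign key column, which matches the failure mode described above.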

Within 4 minutes our automated monitoring had alerted us to the situation. Our team began work on migrating that foreign key to `bigint`, while also investigating faster remediations. Neither the SQL standard nor PostgreSQL supports unsigned integer types. However, we discovered that PostgreSQL sequences do support negative start and increment values, and that Rails/ActiveRecord is okay with this, so we would be able to use the negative half of the signed 32-bit integer type as a temporary workaround. These identifiers are internal only (Buildkite uses separate UUIDs for public reference), so there was no risk of external systems rejecting the unusual negative IDs.

At 11:40 UTC, after verifying the solution in non-production environments and reconfiguring our table partitioning, we altered this ID sequence to start at -1 and “increment” by -1. Agents that connected after that moment worked correctly. By 12:08 UTC we had disconnected any remaining agents that had connected during the 10:06–11:40 window, so that they would reconnect with new IDs.
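The effect of a sequence that starts at -1 and steps by -1 can be sketched in Python. This is a model of the idea only, not the database change itself; `descending_ids` is a hypothetical helper, and the generator stands in for the PostgreSQL sequence.

```python
from itertools import count, islice

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def descending_ids(start: int = -1, step: int = -1):
    """Model a sequence that starts at -1 and 'increments' by -1."""
    return count(start, step)


# The first few IDs handed out after the change.
new_ids = list(islice(descending_ids(), 5))
print(new_ids)

# Every new ID fits in a signed 32-bit column and is negative,
# so it cannot collide with any previously issued positive ID.
assert all(INT32_MIN <= i <= INT32_MAX for i in new_ids)
assert all(i < 0 for i in new_ids)

# Number of IDs available in the negative half: -1 down to INT32_MIN.
negative_capacity = -1 - INT32_MIN + 1
print(negative_capacity)
```

The negative half offers roughly the same number of values as the positive half that had just been exhausted, which is why it works as a temporary measure rather than a permanent fix.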

Alongside our ongoing database sharding project, we have been using time-ordered UUID primary keys for new tables. However, we will retain existing numeric primary keys for some time, and this incident highlighted a blind spot in our monitoring. We have taken several actions to prevent it happening again:

Set up monitoring and alerting when identifiers exceed 50% of their maximum value,
Audited all existing primary and foreign keys for correctness and runway,
Added “ratchet tests” to prevent the introduction of any new 32-bit `integer` identifiers.
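As a rough illustration of the first and third actions, the sketch below shows one way a 50% threshold check and a ratchet test could be expressed in Python. It is a sketch under stated assumptions, not our actual implementation: the function names, the column names such as `jobs.agent_id`, and the sample values are all hypothetical.

```python
from __future__ import annotations

INT32_MAX = 2**31 - 1


def over_threshold(
    current_ids: dict[str, int],
    max_value: int = INT32_MAX,
    threshold: float = 0.5,
) -> list[str]:
    """Return the columns whose largest ID has used more than `threshold` of the range.

    abs() is used so that descending (negative) sequences are measured too.
    """
    return sorted(
        name for name, value in current_ids.items()
        if abs(value) > threshold * max_value
    )


def ratchet_violations(int32_columns: set[str], allowlist: set[str]) -> set[str]:
    """Return 32-bit identifier columns that are not on the legacy allowlist."""
    return int32_columns - allowlist


def stale_allowlist_entries(int32_columns: set[str], allowlist: set[str]) -> set[str]:
    """Return allowlist entries that no longer exist, so the list can only shrink."""
    return allowlist - int32_columns


# Hypothetical inputs: largest ID per column, and 32-bit columns found in the schema.
largest_ids = {"jobs.agent_id": 1_500_000_000, "artifacts.job_id": 40_000_000}
found_int32 = {"jobs.agent_id", "artifacts.job_id", "new_table.owner_id"}
legacy_allowlist = {"jobs.agent_id", "artifacts.job_id"}

print(over_threshold(largest_ids))
print(ratchet_violations(found_int32, legacy_allowlist))
```

A test suite would fail the build whenever `ratchet_violations` returns a non-empty set, so existing 32-bit columns are tolerated while new ones are rejected.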

Our team would like to apologize to the customers that were impacted by this incident during the weekend. We are continuing to invest heavily in the scalability, reliability and resilience of our systems.

Posted Jul 25, 2023 - 06:29 UTC

Resolved

Agents that connected between 2023-07-22 10:06:20 UTC and 2023-07-22 11:40:09 UTC were unable to pick up new jobs. Those agents were stopped at around 2023-07-22 12:08:20 UTC so that they would reconnect with fresh IDs. Everything should be working now. If you still see agents not picking up jobs, please contact us. A detailed postmortem will be posted later.

Posted Jul 22, 2023 - 12:32 UTC

Update

The fix has resolved the issue for all agents except those that connected between 2023-07-22 10:06:20 and 2023-07-22 11:44.

We are currently manually disconnecting these agents, which should resolve the issue for all customers.

Posted Jul 22, 2023 - 12:03 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 22, 2023 - 11:44 UTC

Update

We have shipped a fix and are monitoring the system.

Posted Jul 22, 2023 - 11:43 UTC

Update

We are investigating a potential fix to the existing 32-bit Agent ID column that will resolve the issue.

We will have another update in the next 30 minutes.

Posted Jul 22, 2023 - 11:19 UTC

Update

This issue is being caused by some tables attempting to reference 64-bit agent IDs using 32-bit foreign key columns.
Agent IDs created since 2023-07-22 10:06:20 are too large to fit in a 32-bit column.

We're working on switching these foreign keys to 64-bit. Meanwhile, agents connected before 2023-07-22 10:06:20 will continue to operate, but agents connected after that will not be able to accept jobs until the solution is completed.

We will provide regular updates.

Posted Jul 22, 2023 - 10:47 UTC

Identified

Newly connected agents are currently unable to run jobs.
The issue has been identified and is being worked on. We will post regular updates.

Posted Jul 22, 2023 - 10:42 UTC

This incident affected: Agent API and Job Queue.