Between 01:33 and 04:49 UTC on February 12, 2026, a subset of Sidekiq background workers experienced database connection failures, affecting several types of background operations over a period of 3 hours and 16 minutes.
For builds where pipeline steps had already been uploaded, the impact was limited to delays. Job dispatching, notifications, and job completion processing were delayed but retried successfully. All jobs that had already been uploaded were eventually dispatched, and no jobs were lost.
Across all shards, approximately 0.35% of job dispatches and 0.5% of notifications needed retries. However, the impact was not evenly distributed. The underlying issue affected some worker containers and not others, and because containers are assigned to process work for specific database shards, some shards experienced significantly higher error rates while others were completely unaffected. Because each customer's data is assigned to a specific shard, customers on the worst-affected shards would have seen a much higher failure rate than these fleet-wide figures suggest.
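To make the shard-concentration effect concrete, here is a purely illustrative calculation. The shard split is hypothetical and not from the incident; only the 0.35% fleet-wide figure comes from the text above.

```ruby
# Illustrative arithmetic only: the shard split below is hypothetical,
# not taken from the incident. If the affected containers served, say,
# 2 of 10 equal-traffic shards, the fleet-wide retry rate understates
# the rate those customers saw by the inverse of that traffic share.
fleet_retry_pct = 0.35        # fleet-wide job dispatch retry rate (%)
affected_share  = 2.0 / 10.0  # hypothetical fraction of traffic on affected shards

per_shard_pct = (fleet_retry_pct / affected_share).round(2)
puts per_shard_pct            # => 1.75 (% on the affected shards)
```

Under those assumed numbers, a customer on an affected shard would see a retry rate five times the fleet-wide figure.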
For builds where a pipeline upload was still in progress, the upload itself could have failed entirely, resulting in a failed build. In these cases, customers would have needed to retry the build manually.
We were in the process of migrating how environment variables are provided to our Sidekiq background workers running on AWS Fargate. The migration involved two changes: an application-level change to support loading environment configuration from S3, and an infrastructure change to have the Sidekiq services rely on that new loading mechanism.
The application change was deployed on February 10 without issue. At this point we verified, across all Sidekiq services, that the environment was correct when loaded from S3. This was done by invoking the S3 environment loading mechanism on a single container per Sidekiq service and comparing the result with existing Sidekiq containers. On February 12 at approximately 01:30 UTC, the infrastructure change was applied, and a subsequent application deployment caused new worker containers to launch using the S3 configuration.
Under the new configuration, environment files are downloaded from S3 at boot time and written to a local path. However, some of our Sidekiq services use sidekiqswarm, which forks multiple child processes. Gem and application preloading is disabled for these services, so each child process independently boots the full Rails application (including the S3 download step) after the fork. When multiple child processes booted simultaneously, they each attempted to download the same S3 files and write them to the same local path concurrently. This introduced a race condition: one process could read a file at the moment another process truncated it in preparation for writing, leaving the reading process with missing environment variables. Among the variables that could go missing were the database URLs that tell the application how to connect to PostgreSQL over the network. Without them, the PostgreSQL client fell back to attempting a local Unix socket connection, which does not exist in the containerized environment.
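A common fix pattern for this class of race is to write the downloaded file to a process-unique temporary path and then rename it into place, since `rename(2)` is atomic on POSIX filesystems: a concurrent reader sees either the old file or the new one, never a truncated ("zeroed out") file. The sketch below is illustrative, not the actual code involved; all names are hypothetical.

```ruby
require "tmpdir"

DEST = File.join(Dir.tmpdir, "env.sidekiq") # hypothetical destination path

# Unsafe: File.open(dest, "w") truncates the destination immediately,
# opening a window where another freshly forked child reads an empty file.
def write_env_unsafe(contents)
  File.open(DEST, "w") { |f| f.write(contents) }
end

# Safe: write to a temp file in the same directory (rename must not
# cross filesystems), with a process-unique name so sibling forks
# never collide, then atomically rename it over the visible path.
def write_env_atomic(contents)
  tmp = "#{DEST}.#{Process.pid}.#{rand(1_000_000)}"
  File.open(tmp, "w") do |f|
    f.write(contents)
    f.fsync                # ensure bytes are on disk before the rename
  end
  File.rename(tmp, DEST)   # atomic replacement of the visible path
end

write_env_atomic("DATABASE_URL=postgres://db.internal:5432/app\n")
```

With this pattern, N forked children racing to refresh the same file simply perform N atomic replacements; no reader ever observes a partially written file.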
Sidekiq workers are responsible for a range of asynchronous operations including job dispatching, notifications, and job completion processing. Workers that could not connect to the database would fail to perform any of these operations and would automatically retry, either succeeding when the retry was performed on an unaffected worker, or after the configuration was rolled back. However, pipeline upload operations that timed out during the incident window could fail outright, causing the associated build to fail.
The issue was not immediately detected for two reasons. First, the race condition meant it was effectively random whether a given container was affected, so the overall error rate stayed low: approximately 0.35% of job dispatches and 0.5% of notifications needed retries across the incident window.
Second, the failures did not trigger our existing alerts. When a worker hit a database connection failure, an internal error handling layer caught the error and silently rescheduled the job for retry instead of raising an exception. Our metrics layer saw no raised exception from these attempts, so they were indistinguishable from successful executions. An existing monitor that tracks P99 job dispatch duration saw no increase, because the failures happened almost instantly and looked identical to fast successes. The monitor is effective at detecting slow job processing, but this failure mode (fast, silent, and rescheduled) fell outside its coverage. A high volume of database connection errors was present in application logs, which is ultimately how responders identified the root cause during investigation.
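The monitoring blind spot described above can be sketched as follows. This is a hypothetical reconstruction of the layering, not the real code: a metrics wrapper that only counts raised exceptions, sitting outside an error handler that swallows connection failures and reschedules the job.

```ruby
ConnectionError = Class.new(StandardError) # stand-in for a DB connection failure

# Metrics layer: counts an execution as an error only if an exception
# escapes the job. This is the only signal the monitoring sees.
def with_metrics(stats)
  yield
  stats[:success] += 1
rescue StandardError
  stats[:error] += 1
  raise
end

# Inside the worker, an error-handling layer catches the connection
# failure and reschedules the job instead of letting it propagate,
# so with_metrics records the attempt as a fast success.
def run_job(stats, retries)
  with_metrics(stats) do
    begin
      raise ConnectionError, "could not connect" # simulate an affected container
    rescue ConnectionError
      retries << :rescheduled # silently retried; invisible to the metrics layer
    end
  end
end

stats = Hash.new(0)
retries = []
run_job(stats, retries)
p stats   # {:success=>1} — the failed attempt is counted as a success
p retries # [:rescheduled]
```

Because the failure both completes quickly and raises nothing past the inner rescue, neither an exception-rate monitor nor a P99 duration monitor has anything to latch onto; only a metric emitted at the reschedule point itself would surface it.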
These factors — a low overall error rate, silent rescheduling, and a duration-based monitor that couldn't detect fast failures — contributed to the delay in detection.
Contributing factors:

- Disabled gem and application preloading in sidekiqswarm services meant each forked child process independently downloaded environment files to the same local path, creating the race condition.
- Pre-deployment verification invoked the S3 loading mechanism on a single container per service, which could not surface a concurrency issue between simultaneously booting processes.
- An internal error handling layer silently rescheduled jobs on database connection failures instead of raising, hiding the failures from exception-based metrics.
- The existing P99 job dispatch duration monitor could not detect failures that were fast, silent, and rescheduled.
Key timestamps (UTC):
| Time | Event |
|---|---|
| 01:33 | Impact began — first database connection errors appeared |
| 03:57 | Incident declared; investigation began |
| 04:19 | Root cause identified; revert of infrastructure change initiated |
| 04:37 | Revert deployed via application deployment |
| 04:49 | Impact ended — error rates returned to zero |