Some jobs not dispatching

Incident Report for Buildkite

Postmortem

Service Impact

Between 01:33 UTC and 04:49 UTC on February 12, 2026, a subset of Sidekiq background workers experienced database connection failures, affecting several types of operations over approximately 3 hours and 16 minutes.

For builds where pipeline steps had already been uploaded, the impact was limited to delays. Job dispatching, notifications, and job completion processing were delayed but retried successfully. All jobs that had already been uploaded were eventually dispatched, and no jobs were lost.

Across all shards, approximately 0.35% of job dispatches and 0.5% of notifications needed retries. However, the impact was not evenly distributed. The underlying issue affected some worker containers and not others, and because containers are assigned to process work for specific database shards, some shards experienced significantly higher error rates while others were completely unaffected. Because each customer's data is assigned to a specific shard, customers on the worst-affected shards would have seen a much higher failure rate than these fleet-wide figures suggest.

For builds where a pipeline upload was still in progress, the upload itself could have failed entirely, resulting in a failed build. In these cases, customers would have needed to retry the build manually.

Incident Summary

We were in the process of migrating how environment variables are provided to our Sidekiq background workers running on AWS Fargate. The migration involved two changes: an application-level change to support loading environment configuration from S3, and an infrastructure change to have the Sidekiq services rely on that new loading mechanism.

The application change was deployed on February 10 without issue. At this point we verified, across all Sidekiq services, that the environment was correct when loaded from S3. This was done by invoking the S3 environment loading mechanism on a single container per Sidekiq service and comparing the result with existing Sidekiq containers. On February 12 at approximately 01:30 UTC, the infrastructure change was applied, and a subsequent application deployment caused new worker containers to launch using the S3 configuration.

Under the new configuration, environment files are downloaded from S3 at boot time and written to a local path. However, some of our Sidekiq services use sidekiqswarm, which forks multiple child processes. Gem and application preloading is disabled for these services, meaning each child process independently boots the full Rails application (including the S3 download step) after the fork. When multiple child processes booted simultaneously, they each attempted to download the same S3 files and write them to the same local path concurrently. This introduced a race condition: one process could read a file at the moment another process had truncated it to zero bytes but not yet written the new contents, so the reading process booted without some of its environment variables. In particular, the database URL environment variables, which tell the application how to connect to PostgreSQL over the network, were absent in some workers. Without these variables, the PostgreSQL client fell back to attempting a local Unix socket connection, which does not exist in the containerized environment.
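The truncate-then-write race can be avoided by writing to a temporary file and renaming it into place, which is the approach behind the atomic-download fix described later in this report. A minimal Ruby sketch of the pattern (the helper name and paths are illustrative, not Buildkite's actual code):

```ruby
# Naive approach: File.write(path, contents) opens the target with
# O_TRUNC, so a concurrent reader can briefly observe a zero-byte file.
#
# Atomic approach: write to a unique temporary file on the same
# filesystem, then rename(2) it over the target. rename is atomic on
# POSIX filesystems, so readers see either the old complete file or
# the new complete file, never a partially written one.
def atomic_write(path, contents)
  tmp = "#{path}.tmp.#{Process.pid}.#{rand(1_000_000)}"
  File.open(tmp, "w") do |f|
    f.write(contents)
    f.fsync # flush to disk before the rename makes the file visible
  end
  File.rename(tmp, path)
end
```

Because each writer uses a unique temporary name, concurrent downloads no longer interfere: the last rename wins, and every reader sees a complete file.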

Sidekiq workers are responsible for a range of asynchronous operations including job dispatching, notifications, and job completion processing. Workers that could not connect to the database would fail to perform any of these operations and would automatically retry, either succeeding when the retry was performed on an unaffected worker, or after the configuration was rolled back. However, pipeline upload operations that timed out during the incident window could fail outright, causing the associated build to fail.

The issue was not immediately detected for two reasons. First, because the race condition made it effectively random whether a given container was affected, the overall error rate stayed low: approximately 0.35% of job dispatches and 0.5% of notifications needed retries across the incident window.

Second, the failures did not trigger our existing alerts. When a worker hit a database connection failure, an internal error handling layer caught the error and silently rescheduled the job for retry instead of raising an exception. Our metrics layer saw no raised exception from these attempts, so they were indistinguishable from successful executions. An existing monitor that tracks P99 job dispatch duration saw no increase, because the failures happened almost instantly and looked identical to fast successes. The monitor is effective at detecting slow job processing, but this failure mode (fast, silent, and rescheduled) fell outside its coverage. A high volume of database connection errors was present in application logs, which is ultimately how responders identified the root cause during investigation.

These factors — a low overall error rate, silent rescheduling, and a duration-based monitor that couldn't detect fast failures — contributed to the delay in detection.

Contributing factors:

  • A race condition in the multi-process Sidekiq setup caused concurrent S3 environment file downloads to write to the same local path, allowing processes to read zero-byte files and boot without critical environment variables.
  • There was no startup assertion to fail fast if critical environment variables such as database configuration were missing or empty after the environment loading step.
  • Database connection errors in affected workers were caught by an internal error handling layer and silently rescheduled rather than raised as exceptions, preventing them from appearing in error metrics or alerting.
  • Existing monitoring was designed to detect slow job processing rather than fast failures, so this failure mode fell outside its coverage.
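The missing startup assertion noted above is the kind of check that turns a silent misconfiguration into an immediate, visible boot failure. A minimal sketch of such a guard, using illustrative variable names rather than Buildkite's actual configuration:

```ruby
# Boot-time guard: fail fast if critical configuration is missing or
# empty after environment loading. The variable names below are
# illustrative stand-ins, not Buildkite's real configuration keys.
REQUIRED_ENV_VARS = %w[DATABASE_URL REDIS_URL].freeze

def assert_critical_env!(env = ENV)
  missing = REQUIRED_ENV_VARS.select { |k| env[k].nil? || env[k].strip.empty? }
  return if missing.empty?

  # Raising here aborts boot before the worker starts pulling jobs,
  # so a misconfigured container crash-loops loudly instead of
  # running with a broken database connection.
  raise "Missing or empty critical environment variables: #{missing.join(', ')}"
end
```

A process that read a zero-byte environment file would fail this check at boot rather than falling back to a nonexistent Unix socket at query time.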

Key timestamps (UTC):

Time Event
01:33 Impact began — first database connection errors appeared
03:57 Incident declared; investigation began
04:19 Root cause identified; revert of infrastructure change initiated
04:37 Revert deployed via application deployment
04:49 Impact ended — error rates returned to zero
  • Total duration of impact: 3 hours and 16 minutes.
  • Time to detection: 2 hours and 24 minutes.
  • Time from detection to resolution: 52 minutes.

Changes we're making

  • Startup validation for critical environment variables (deployed): We’ve added a fail-fast assertion during application boot to verify that database URLs and other critical configuration values are present and non-empty after environment loading completes.
  • Atomic S3 file downloads (deployed): We’ve updated the S3 environment file download process to avoid the possibility of processes reading incomplete environment files.
  • Improved job processing monitoring (in progress): We are adding error-rate monitoring for job processing to complement our existing latency-based monitors, ensuring that workers which are picking up jobs but failing to complete them are detected regardless of how quickly the failure occurs.
  • Incremental rollout for worker configuration changes (in progress): Future changes to how workers load critical configuration will be deployed to a subset of services first, allowing us to detect problems before they affect the full fleet.
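The error-rate monitoring described above could be wired in as, for example, a Sidekiq server middleware that counts failures explicitly, so a job that fails fast still moves a metric regardless of latency. A hypothetical sketch (the metrics sink here is a plain hash standing in for a real StatsD/Datadog client):

```ruby
# Sidekiq server middleware that counts successes and failures.
# Unlike a latency-based monitor, an error-rate counter fires even
# when failures are near-instant. The metrics object is a stand-in;
# a production setup would emit to a real metrics client.
class JobErrorRateMiddleware
  def initialize(metrics)
    @metrics = metrics
  end

  # Sidekiq server middleware interface: call(worker, job, queue) { ... }
  def call(worker, job, queue)
    yield
    @metrics[:success] += 1
  rescue StandardError
    # Count the failure even if a retry layer will reschedule the job,
    # then re-raise so normal error handling still runs.
    @metrics[:failure] += 1
    raise
  end
end
```

An alert on the ratio of `:failure` to total executions would have caught this incident's fast, silently rescheduled failures.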
Posted Feb 19, 2026 - 02:04 UTC

Resolved

Between 2026-02-12 01:33 and 04:49 UTC, an elevated error rate in job dispatch was observed. This affected a subset of jobs. Affected jobs could not be dispatched during the disruption window but were eventually dispatched via automatic retries. A subset of pipeline step uploads were also affected; the agent would check for successful step upload over a 5-minute period before timing out.

This occurred due to a bug in a configuration change that resulted in some Sidekiq instances not receiving all required database connection configuration. The configuration change was subsequently reverted, with further investigation underway.
Posted Feb 12, 2026 - 01:33 UTC