May 21, 11:20 UTC
We have been closely monitoring performance since last week, and consider the issue fully resolved. Stay tuned for a full incident review in the coming days.
May 12, 04:40 UTC
To resolve the ongoing performance issues covered by this incident, we performed an urgent database version upgrade and subsequent maintenance operations. We did this outside business hours for the vast majority of our customers, expecting around 15 minutes of downtime. There were some complications which forced us back into maintenance mode for substantially longer than that. This is now fully resolved.
We've been closely monitoring the system since then, and we believe the maintenance has had the intended outcome. We don't anticipate any more issues with latency or request timeouts. We are continuing to monitor, but things are looking great so far.
We will be publishing a full incident report as soon as we can, with much more detail about the original issue and the actions we've taken to resolve it.
May 7, 13:28 UTC
We finished processing the webhook queue 30 minutes ago, so all systems are back to normal. We'll continue to monitor closely, but everything is looking good now.
May 7, 12:14 UTC
Service has been restored! Thanks for your patience, and sorry for holding up your builds for so long. All webhooks from GitHub etc. were safely queued during the incident, and we're processing that backlog now to trigger your builds. Until that completes, new builds won't trigger in real time; if you need a build in a hurry, you can trigger it manually from the web dashboard.
May 7, 11:22 UTC
Database performance is starting to recover and buildkite.com is back online. Performance may continue to be sluggish as various caches warm up.
Incoming webhooks continue to be buffered and are not being processed in real time - we'll begin draining that queue shortly. Manual builds via the web interface should start quickly if you need to bypass the webhook queue.
May 7, 10:27 UTC
We're experiencing degraded performance while our database caches warm up and the backlog of work is processed. We're placing the site under maintenance mode so we can ease the load back in a controlled manner.
May 7, 09:33 UTC
We've finished the database maintenance. We're keeping an eye on performance as systems come online again.
May 7, 08:44 UTC
We're performing urgent database maintenance and we're expecting 15-30 minutes of downtime over the next hour. During this period the Buildkite dashboard and API will be unavailable and no new work will be scheduled. Agents which already have work can run it to completion, but agents will not be assigned new work and will not be able to submit results until the maintenance is completed. Webhooks will be received but not processed until maintenance is completed.
May 7, 08:16 UTC
Since Wednesday 22nd April, customers have experienced intermittent periods of increased latency and request timeouts. These periods are most common during North American business hours, and are most commonly seen as HTTP 502 errors in builds that upload or download artifacts.
Database resource contention is at the root of the problem, with two overlapping issues:
1. We use a per-build lock around many agent operations (like pipeline uploads, job accepting, and job starting). This has ensured correct behavior, but as we've grown and our customers have built novel workloads, the lock contention has become a bottleneck that can't be solved by throwing more CPUs at the problem (we tried)
2. The locking happens in a database table with a high update rate and the indexes have become large enough to impact general query performance
So far we've upgraded the size of our database, worked with a few customers to adjust their workloads to a form that uses fewer resources, and profiled locking hotspots. Our next step is to introduce finer grained locking around some operations and adjust our database schema to ensure querying data related to active builds is faster.
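To illustrate the finer-grained locking step (a hypothetical sketch, not Buildkite's actual implementation): instead of one lock per build that every agent operation contends on, each (build, operation) pair gets its own lock, so pipeline uploads no longer serialize behind job accepts on the same build:

```python
import threading
from collections import defaultdict

class CoarseLocks:
    """One lock per build: every operation on that build serializes."""
    def __init__(self):
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, build_id, operation):
        # `operation` is ignored: uploads, accepts, and starts all
        # contend on the same per-build lock.
        return self._locks[build_id]

class FineGrainedLocks:
    """One lock per (build, operation): independent operations on a
    hot build can proceed in parallel."""
    def __init__(self):
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, build_id, operation):
        return self._locks[(build_id, operation)]

# A pipeline upload and a job accept on the same build no longer
# share a lock under the finer-grained scheme.
fine = FineGrainedLocks()
upload = fine.lock_for("build-123", "pipeline_upload")
accept = fine.lock_for("build-123", "job_accept")
assert upload is not accept
```

The names `CoarseLocks` and `FineGrainedLocks` are illustrative only; the point is that narrowing the lock key reduces contention without giving up correctness for operations that genuinely conflict.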
We'll post updates here as we make progress, and share a post-mortem when the issue is resolved.
Our agent has robust back-off and retry behavior when communicating with our APIs, which means that even in the face of API timeouts or errors, these database issues generally manifest as slightly delayed builds. However, we've identified one interaction that wasn't retrying, resulting in build errors on API timeout. We've released a new version of the agent (https://github.com/buildkite/agent/releases/tag/v3.21.0) with improved retry behavior for artifact uploads and downloads. Customers may be able to mitigate some of the impact by upgrading.
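The back-off-and-retry pattern described above can be sketched as follows. This is an illustrative Python version under assumed parameters, not the actual Go implementation in buildkite/agent; the function and its defaults are hypothetical:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Retry `call` with exponential back-off plus jitter.

    Any exception from `call` is treated as a retryable API timeout
    or error; the last attempt re-raises. (Parameters are assumptions
    for illustration, not the agent's real settings.)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            # Double the delay each attempt, capped, with random jitter
            # so many agents don't retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))

# Example: an artifact download that times out twice, then succeeds.
attempts = {"n": 0}
def flaky_download():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("HTTP 502 from artifact API")
    return "artifact bytes"

result = with_retries(flaky_download, sleep=lambda _: None)
```

With retries in place, a transient 502 during an artifact upload or download surfaces as a short delay rather than a failed build.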
May 5, 14:12 UTC