All Systems Operational

About This Site

Status updates for Buildkite’s services and components. You can also follow @buildkitestatus on Twitter for updates.

Web Operational
90 days ago
99.91 % uptime
Today
Agent API Operational
90 days ago
99.91 % uptime
Today
REST API Operational
90 days ago
99.91 % uptime
Today
Job Queue Operational
90 days ago
99.9 % uptime
Today
SCM Integrations Operational
90 days ago
99.95 % uptime
Today
Notifications Operational
90 days ago
100.0 % uptime
Today
Github Commit Status Notifications Operational
Email Notifications Operational
Slack Notifications Operational
Hipchat Notifications Operational
Webhook Notifications Operational
90 days ago
100.0 % uptime
Today
SCM Providers Operational
GitHub Operational
GitHub API Requests Operational
Atlassian Bitbucket SSH Operational
Atlassian Bitbucket Website and API Operational
Atlassian Bitbucket Git via HTTPS Operational
Third Party Services Operational
AWS ec2-us-east-1 Operational
AWS elasticache-us-east-1 Operational
AWS elb-us-east-1 Operational
AWS rds-us-east-1 Operational
PagerDuty Notification Delivery Operational
Past Incidents
May 26, 2020

No incidents reported today.

May 25, 2020

No incidents reported.

May 24, 2020

No incidents reported.

May 23, 2020

No incidents reported.

May 22, 2020

No incidents reported.

May 21, 2020

No incidents reported.

May 20, 2020

No incidents reported.

May 19, 2020

No incidents reported.

May 18, 2020

No incidents reported.

May 17, 2020

No incidents reported.

May 16, 2020

No incidents reported.

May 15, 2020

No incidents reported.

May 14, 2020

No incidents reported.

May 13, 2020

No incidents reported.

May 12, 2020
Resolved - We have seen no further errors in the past few hours, and this incident is now resolved.
May 12, 23:11 UTC
Monitoring - We've identified and mitigated a problem with webhook ingestion on our load balancer that occurred over the past few hours. We will continue to monitor this closely. Lost webhooks can be retried by pushing a no-op commit to your source control or by starting a new build from the Buildkite dashboard.
May 12, 19:36 UTC
Investigating - We're receiving more reports that webhooks from source control systems are again being lost in transit, and we have started investigating other possible causes.

To trigger builds for any lost webhooks, we recommend pushing a no-op commit or scheduling a manual build.
May 12, 18:51 UTC
Monitoring - We've identified an unusual load pattern that caused errors in processing some webhooks. It has now passed, and we've also enforced limits to prevent it recurring.
May 12, 16:23 UTC
Investigating - We're investigating reports of intermittent issues or delays in processing incoming webhooks from source control systems over the past hour.

We recommend retrying affected builds while we investigate.
May 12, 15:13 UTC
Postmortem - Read details
May 21, 11:20 UTC
Resolved - We have been closely monitoring performance since last week, and consider the issue fully resolved. Stay tuned for a full incident review in the coming days.
May 12, 04:40 UTC
Update - To resolve the ongoing performance issues covered by this incident, we performed an urgent database version upgrade and subsequent maintenance operations. We did this outside business hours for the vast majority of our customers, expecting around 15 minutes of downtime. There were some complications which forced us back into maintenance mode for substantially longer than that. This is now fully resolved.

We've been closely monitoring the system since then, and we believe the maintenance has had the intended outcome. We don't anticipate any more issues with latency or request timeouts. We are continuing to monitor, but things are looking great so far.

We will be publishing a full incident report as soon as we can, with much more detail about the original issue and the actions we've taken to resolve it.
May 7, 13:28 UTC
Update - We finished processing the webhook queue 30 minutes ago, so all systems are back to normal. We'll continue to monitor closely, but everything is looking good now.
May 7, 12:14 UTC
Monitoring - Service has been restored! Thanks for your patience, and sorry to have held up your builds for so long. All webhooks from GitHub etc. were safely queued during the incident, and we're processing that backlog now to trigger your builds. Until that completes, new builds will not trigger in real time; if you need a build in a hurry, you can trigger it manually from the web dashboard.
May 7, 11:22 UTC
Update - Database performance is starting to recover and buildkite.com is back online. Performance may continue to be sluggish as various caches warm up.

Incoming webhooks continue to be buffered and are not being processed in real time. We'll begin draining that queue shortly. Manual builds via the web interface should start quickly if you need to bypass the webhook queues.
May 7, 10:27 UTC
Update - We're experiencing degraded performance while our database caches warm up and the backlog of work is processed. We're placing the site under maintenance mode so we can ease the load back in a controlled manner.
May 7, 09:33 UTC
Update - We've finished the database maintenance. We're keeping an eye on performance as systems come online again.
May 7, 08:44 UTC
Update - We're performing urgent database maintenance and we're expecting 15-30 minutes of downtime over the next hour. During this period the Buildkite dashboard and API will be unavailable and no new work will be scheduled. Agents which already have work can run the work to completion, but agents will not be assigned new work and will not be able to submit the results of work until the maintenance is completed. Webhooks will be received but not processed until maintenance is completed.
May 7, 08:16 UTC
Identified - Since Wednesday 22nd April, customers have experienced intermittent periods of increased latency and request timeouts. These periods are most common during North American business hours, and are most commonly seen as HTTP 502 errors in builds that upload or download artifacts.

Database resource contention is at the root of the problem, with two overlapping issues:

1. We use a per-build lock around many agent operations (like pipeline uploads, job accepting, and job starting). This has ensured correct behavior, but as we've grown and our customers have built novel workloads, the lock contention has become a bottleneck that can't be solved by throwing more CPUs at the problem (we tried).
2. The locking happens in a database table with a high update rate, and the indexes have become large enough to impact general query performance.

So far we've upgraded the size of our database, worked with a few customers to adjust their workloads to a form that uses fewer resources, and profiled locking hotspots. Our next step is to introduce finer-grained locking around some operations and adjust our database schema to ensure querying data related to active builds is faster.
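To illustrate the difference between a per-build lock and finer-grained locking, here is a minimal in-process sketch. It is an analogy only: the locking described above happens in a database table, and the names used here (`LockRegistry`, `start_job_coarse`, `start_job_fine`) are hypothetical, not part of Buildkite's code.

```python
import threading
from collections import defaultdict


class LockRegistry:
    """Hands out one lock per key, creating it on first use (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dictionary itself
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, key):
        with self._guard:
            return self._locks[key]


registry = LockRegistry()


def start_job_coarse(build_id, job_id, state):
    # Coarse: every operation on any job in the build serializes on one lock.
    with registry.lock_for(("build", build_id)):
        state[job_id] = "running"


def start_job_fine(build_id, job_id, state):
    # Finer-grained: only operations on the same job contend with each other.
    with registry.lock_for(("job", build_id, job_id)):
        state[job_id] = "running"
```

With the coarse version, a build containing many jobs funnels all of its operations through a single lock, so contention grows with the number of concurrent jobs. With the finer-grained version, operations on different jobs in the same build can proceed in parallel, at the cost of more care around any state shared across jobs.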

We'll post updates here as we make progress, and share a post-mortem when the issue is resolved.

Our agent has robust back-off and retry behavior when communicating with our APIs, which means even in the face of API timeouts or errors, these database issues generally manifest as slightly delayed builds. However, we've identified one interaction that wasn't retrying, resulting in build errors on API timeout. We've released a new version of the agent (https://github.com/buildkite/agent/releases/tag/v3.21.0) with improved retry behavior for artifact uploads and downloads. Customers may be able to mitigate some of the impact by upgrading.
May 5, 14:12 UTC
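The back-off and retry behavior mentioned in the update above can be sketched roughly as follows. This is a generic illustration of exponential back-off with jitter, not the agent's actual implementation (the agent is a separate codebase); `TransientError`, `retry_with_backoff`, and all parameter values are assumptions made for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for an API timeout or an HTTP 502 response."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential back-off."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Upper bound doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random wait up to the ceiling spreads out retries
            # from many clients so they do not all hit the API at once.
            sleep(random.uniform(0, ceiling))
```

A caller would wrap a single API interaction, for example `retry_with_backoff(lambda: upload(artifact))`, where `upload` is whatever function performs the request and raises `TransientError` on a timeout. The `sleep` parameter exists only so the waiting can be stubbed out in tests.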