Increased errors and slow responses across Buildkite
Incident Report for Buildkite
Resolved
We rolled out a configuration change a few hours ago and have not seen elevated response times or errors since. We're happy that things are back to normal. The entire team is still on deck and monitoring activity.
Posted Apr 22, 2020 - 23:18 UTC
Update
We are continuing to monitor for any further issues.
Posted Apr 22, 2020 - 11:10 UTC
Update
The database configuration change completed successfully, with under a minute of downtime. Performance will be degraded for around 10 minutes as buffers and caches warm up.
Posted Apr 22, 2020 - 10:50 UTC
Update
We are now applying a configuration change to our database to avoid further issues. This will cause a brief outage of a few minutes.
Posted Apr 22, 2020 - 09:52 UTC
Monitoring
The performance of our APIs has been back to normal for the past couple of hours after a period of heavy load. We'll continue to monitor and take steps to prevent a recurrence. We'll follow up with a post-incident report in the coming days.
Posted Apr 22, 2020 - 04:50 UTC
Identified
We continue to see increased load on our database, which is leading to an overall decrease in performance and intermittent timeouts across our API and website.

In many cases the agent will auto-retry after a request failure; however, some errors may have an impact on builds.

Webhook processing should be near real-time.

Our investigation into the root cause of the increased load is continuing.
Posted Apr 22, 2020 - 03:10 UTC
Monitoring
We've processed the webhook queue and expect new webhooks to now be processed in near real-time.

The agent API continues to be under load, which is resulting in intermittent errors during agent registration, log uploads and artifact uploads. We're implementing a fix to improve this and expect the error rate to drop soon.
Posted Apr 22, 2020 - 00:42 UTC
Update
We are continuing to process the backlog of webhooks. The current delay on received webhooks is about 45 minutes, but that delay is shrinking, and at the current rate processing will catch up within an hour.
Posted Apr 21, 2020 - 23:13 UTC
Update
We are currently processing the backlog of webhooks in batches to ramp load back onto the system, and we hope to restore normal function soon. Thank you for your patience.
Posted Apr 21, 2020 - 21:18 UTC
Update
We are continuing to investigate the issues with Buildkite. We will be delaying further webhook processing to give the system a chance to catch up and error rates a chance to abate. Webhooks will still be received, but will be processed in the order they were received once the system has recovered.
Posted Apr 21, 2020 - 20:03 UTC
Investigating
We are currently investigating slower-than-usual response times across Buildkite and the API.
Posted Apr 21, 2020 - 18:48 UTC
This incident affected: Web, Agent API, REST API, Job Queue, and SCM Integrations.