Intermittent periods of increased latency and request timeouts
Incident Report for Buildkite
Postmortem

Summary

Over the course of two months, Buildkite experienced three interrelated database issues, resulting in periods of degraded performance and some downtime. We deeply regret the impact this had on our customers, and would like to share what happened and what we’ve done about it.

The three database issues were:

  • Unexpected storage limitation
  • Query performance degradation
  • Risk of transaction ID wraparound

Unexpected storage limitation

On 19 March 2020, we discovered that our AWS RDS PostgreSQL database storage would not scale beyond 16 TiB. This contradicted documentation and announcements stating a 64 TiB limit. Our database monitoring and forecasting showed approximately eight weeks of runway before we exceeded 16 TiB.

We contacted AWS support, and within a few days they explained the issue and proposed a solution. Our database had been provisioned in 2015, at which time RDS used a 32-bit filesystem limited to 16 TiB. Amongst several theoretical solutions, the one viable option was for us to create a read replica of the database, which the RDS team would convert to a 64-bit filesystem, ready to be verified and promoted to primary.

On 26 March 2020 we handed the read replica to the RDS team; within a few days it was ready to scale storage up to a maximum of 64 TiB. We performed integrity testing based on PostgreSQL’s data checksums while preparing for a weekend scheduled maintenance window to promote the database to primary.

On 18 April 2020 we promoted the new database to primary, buffering inbound webhooks to AWS S3 during maintenance so that builds triggered during that time would run afterwards. We published this as https://www.buildkitestatus.com/incidents/vklxjkn52fpf.
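
The buffering mechanism itself is straightforward in principle: while the database is in maintenance, each inbound webhook is written to S3 instead of being processed, and the backlog is replayed once the database is back. A minimal sketch of that idea follows; the bucket name, key scheme, and helper functions are hypothetical, not our production pipeline.

    # Minimal sketch of buffering inbound webhooks to S3 during a maintenance
    # window and replaying them afterwards. Bucket, key scheme, and the
    # process() callback are hypothetical.
    import json
    import time
    import uuid

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-webhook-buffer"  # hypothetical bucket name

    def buffer_webhook(headers: dict, body: bytes) -> str:
        """Persist an inbound webhook so it can be replayed after maintenance."""
        key = f"buffered/{int(time.time())}-{uuid.uuid4()}.json"
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps({"headers": headers, "body": body.decode("utf-8")}),
        )
        return key

    def replay_webhooks(process) -> None:
        """Feed every buffered payload back through normal webhook processing."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix="buffered/"):
            for obj in page.get("Contents", []):
                payload = json.loads(
                    s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                )
                process(payload)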

The replica promotion went to plan, and the storage limitation was resolved.

Query performance degradation

On 21 April 2020, our monitoring system paged us due to high database CPU utilization and degraded query/API performance. We posted an incident at https://www.buildkitestatus.com/incidents/0yywdz6b6kqs.

Our first theory was a thundering herd caused by a coinciding GitHub incident (https://www.githubstatus.com/incidents/516zy0q1ns06). During their incident GitHub stopped sending notifications, so our customers' builds weren't triggered and their buildkite-agent fleets scaled in; when GitHub later processed its notification backlog, agent and build activity surged.

Our second theory was some kind of hardware issue, causing our usual workload to manifest as much higher CPU utilization. This was our first weekly peak load on the new database instance, and the previous database instance had handled the same load without issue. We had experienced an instance-specific performance issue a few months prior, when a planned database failover resulted in degraded performance that was ultimately traced back to a bug in the RDS/EBS storage system.

Neither of these theories ended up explaining the issue, but they did cloud our subsequent investigation.

On 22 April 2020 (the following day) we were still seeing degraded performance and had not found a root cause. To reduce customer impact we doubled our database instance class, from db.r5.12xlarge (48 vCPU, 384 GiB RAM) to db.r5.24xlarge (96 vCPU, 768 GiB RAM). This maintenance was done outside business hours for most customers, and was published as https://www.buildkitestatus.com/incidents/0yywdz6b6kqs.

The extra 48 CPU cores resolved the customer-impacting performance issues, but CPU utilization was still higher than expected. Over the following week we identified some expensive API usage patterns from particular customers, and worked with them to satisfy those use cases in more efficient ways. Based on metrics and past performance, we believed we could revert to the original database instance class.

On 29 April 2020 we reverted the database to db.r5.12xlarge. This was again done outside business hours for most customers, with webhooks buffered to S3 for subsequent processing, and was tracked in https://www.buildkitestatus.com/incidents/f6492w22g0rf.

The following peak traffic period resulted in degraded API and job dispatch performance, which we published in https://www.buildkitestatus.com/incidents/2b64j3hkqv45. This confirmed that the underlying problem had not been resolved, and that it wasn’t specific to one database instance.

On 30 April 2020 we upgraded the database instance type once again. We discovered that a db.r5.16xlarge instance class (64 vCPU, 512 GiB RAM) had recently been introduced, but it required a PostgreSQL minor version upgrade before we could use it. That two-step upgrade meant a few extra minutes of downtime, once again performed outside business hours for most customers and with webhooks buffered for subsequent processing, which we published in https://www.buildkitestatus.com/incidents/gxncbp2rmxqq.

Performance on the db.r5.16xlarge started looking more acceptable, but not ideal. Over the following week we shipped various mitigations to the application and schema to reduce database load and improve API performance. We also ran various experiments using clones of the production database restored from backup. However, CPU utilization remained higher than expected, and we still had some periods of degraded performance, leading us to publish the problem as an ongoing intermittent incident at https://www.buildkitestatus.com/incidents/h0vnx4gp7djx.

We continued searching for a root cause.

Risk of transaction ID wraparound

Meanwhile, we had needed to interrupt several VACUUM operations to do database maintenance. The main purpose of vacuuming is to remove dead tuples from index and table data. However, VACUUM serves another, more critical purpose: facilitating transaction ID re-use. PostgreSQL's MVCC model breaks down if more than 2^32 / 2 (~2 billion) transaction IDs are used between vacuums. The resulting transaction ID wraparound is a catastrophic event, resulting in a forced shutdown and prolonged downtime for a large database.

We closely monitor our unvacuumed transaction ID usage, and given our usual VACUUM duration we were alert but not alarmed.
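
That monitoring boils down to watching the age of the oldest unfrozen transaction ID in each database. A minimal sketch of the kind of check involved, using psycopg2 with a placeholder connection string and an illustrative alert threshold:

    # Sketch of monitoring unvacuumed transaction ID usage. The DSN and alert
    # threshold are placeholders, not our production configuration.
    import psycopg2

    WRAPAROUND_LIMIT = 2**31          # ~2.1 billion transaction IDs
    ALERT_THRESHOLD = 1_000_000_000   # illustrative: alert at ~1 billion

    def check_xid_age(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            # age(datfrozenxid) is how many transaction IDs have been consumed
            # since the database was last fully frozen by VACUUM.
            cur.execute(
                "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC"
            )
            for datname, xid_age in cur.fetchall():
                remaining = WRAPAROUND_LIMIT - xid_age
                print(f"{datname}: {xid_age:,} consumed, ~{remaining:,} until wraparound")
                if xid_age > ALERT_THRESHOLD:
                    print(f"ALERT: {datname} is approaching transaction ID wraparound")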

On 6 May 2020 we became actively concerned about the possibility of a PostgreSQL transaction ID wraparound.

Each previously interrupted VACUUM had failed to remove some of the dead tuples, making each subsequent VACUUM slower than the last. Due to an unresolved limitation, PostgreSQL's VACUUM can only track ~180 million dead tuples per pass, after which it must run an additional index vacuum phase. By this stage our indexes had become extremely bloated, causing each index vacuum phase to take two days. With ~500 million dead tuples, we estimated a total duration of 10 days. That put completion a few days before our estimated transaction ID wraparound, but much too close for comfort.
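
Both numbers we were watching, dead tuples per table and the progress of the running vacuum (including how many index vacuum phases it has performed), are exposed by PostgreSQL's statistics views. A rough sketch of those queries, with a placeholder connection string:

    # Sketch of observing dead tuple counts and vacuum progress via
    # PostgreSQL's statistics views. The DSN is a placeholder.
    import psycopg2

    def vacuum_health(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            # Dead tuples per table: with ~500 million dead tuples and a
            # ~180 million per-pass limit, several index vacuum phases are needed.
            cur.execute("""
                SELECT relname, n_dead_tup, last_autovacuum
                FROM pg_stat_user_tables
                ORDER BY n_dead_tup DESC
                LIMIT 10
            """)
            for relname, n_dead_tup, last_autovacuum in cur.fetchall():
                print(f"{relname}: {n_dead_tup:,} dead tuples (last autovacuum {last_autovacuum})")

            # Progress of any running VACUUM, including how many index vacuum
            # phases it has completed so far (available since PostgreSQL 9.6).
            cur.execute("""
                SELECT relid::regclass, phase, heap_blks_scanned, heap_blks_total,
                       index_vacuum_count, num_dead_tuples, max_dead_tuples
                FROM pg_stat_progress_vacuum
            """)
            for row in cur.fetchall():
                print(row)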

We immediately put the whole Engineering team to work on three fronts:

1. Last resort (plan for transaction wraparound)

We remained confident that we wouldn’t reach this point, but wanted options and a solid plan. We designed a much softer landing than sudden database shutdown, and agreed on a threshold of remaining transaction IDs before we would execute this last resort plan.

Thankfully we didn’t need to explore too far down any of these paths.

2. Postpone (reduce transaction ID usage)

We searched for transaction ID usage hot-spots which could be changed to consume fewer IDs.

  • We deleted some unused database-writing code paths.
  • We grouped some write queries into existing transactions where they belonged.
  • We migrated some low-join, write-heavy tables into a separate database, using Foreign Data Wrappers for a zero-downtime transition.
  • We rebuilt a locking system to persist its state in Redis instead of a PostgreSQL table (sketched below).
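
For the Redis-backed locking, the general shape is a per-build key acquired with SET NX and an expiry, and released only by the holder. A simplified sketch follows; the key names, timeouts, and client setup are illustrative rather than our production implementation.

    # Simplified sketch of a per-build lock held in Redis rather than a
    # PostgreSQL table. Key names, timeouts, and client setup are illustrative.
    import uuid
    from typing import Optional

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Release only succeeds if we still hold the lock (compare-and-delete),
    # so a lock that expired and was taken by another worker can't be deleted by us.
    RELEASE_SCRIPT = r.register_script("""
        if redis.call('get', KEYS[1]) == ARGV[1] then
            return redis.call('del', KEYS[1])
        else
            return 0
        end
    """)

    def acquire_build_lock(build_id: str, ttl_seconds: int = 30) -> Optional[str]:
        """Try to take the lock for one build; returns a token on success."""
        token = str(uuid.uuid4())
        # SET key value NX EX ttl: only set if not already held, with an expiry
        # so a crashed worker cannot hold the lock forever.
        if r.set(f"build-lock:{build_id}", token, nx=True, ex=ttl_seconds):
            return token
        return None

    def release_build_lock(build_id: str, token: str) -> bool:
        return bool(RELEASE_SCRIPT(keys=[f"build-lock:{build_id}"], args=[token]))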

These activities reduced our transaction ID usage and also reduced database CPU utilization.

3. Avoid (finish vacuuming sooner)

The ideal outcome would be some way to confidently finish vacuuming much sooner. We ran several experiments on production database clones. Completion time and customer impact were evaluated for various combinations of vacuum parameters, database instance classes, and PostgreSQL versions.

We also discovered a new VACUUM parameter introduced in PostgreSQL 12 called INDEX_CLEANUP; setting it to false skips most of the slow vacuuming behaviour and focuses on transaction ID management. We were running PostgreSQL 11, but had plans to upgrade to 12 in the coming weeks. We could bring those plans forward if needed.
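
On PostgreSQL 12 this is a single command. A minimal sketch of invoking it, with a placeholder connection string and table name (VACUUM cannot run inside a transaction block, hence autocommit in the sketch):

    # Sketch of PostgreSQL 12's VACUUM with INDEX_CLEANUP disabled: skip the
    # slow index vacuuming and focus on freezing tuples so transaction IDs
    # can be reclaimed. DSN and table name are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder DSN
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    with conn.cursor() as cur:
        cur.execute("VACUUM (INDEX_CLEANUP FALSE, VERBOSE) example_big_table")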

The results of the experiments showed that upgrading to PostgreSQL 12 and using the new INDEX_CLEANUP parameter was the only option we were confident would finish before transaction ID wraparound. This option required 15-30 minutes of downtime, which we decided was regrettable but necessary.

PostgreSQL 12 Upgrade

On 7 May 2020 we planned, scheduled, and executed the upgrade from PostgreSQL 11 to 12 that we had intended to do in the coming weeks. This was done outside business hours for most of our customers, with inbound webhooks buffered to S3 for subsequent processing. It was published as part of https://www.buildkitestatus.com/incidents/h0vnx4gp7djx.

The upgrade itself went to plan, matching our expectations from the trial upgrades we had run on production clones. We expected degraded cold-start performance for up to five minutes after starting the upgraded instance, but it extended well beyond that. After 15 minutes, we stopped background workers and put the site back into maintenance mode to alleviate extremely high database load. We drilled into metrics and explored hypotheses such as PostgreSQL 12's changes to CTE query planning. It took us longer than it should have to realise that we had forgotten to run ANALYZE after the major version upgrade, so the query planner was not able to use indexes effectively.
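
A major version upgrade does not carry planner statistics across, so ANALYZE (or the equivalent vacuumdb --all --analyze-in-stages) has to be run before the planner can make good use of indexes again. A minimal sketch of the step we had missed, with a placeholder connection string:

    # Sketch of the post-upgrade step we missed: regenerate planner statistics.
    # The DSN is a placeholder.
    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder DSN
    conn.autocommit = True
    with conn.cursor() as cur:
        # Without fresh statistics the planner falls back to rough guesses
        # and can stop choosing the most selective indexes.
        cur.execute("ANALYZE")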

After analyzing all tables, performance on the PostgreSQL 12 instance was excellent. The new emergency vacuum mode ran in just 12 hours, rather than the 10-day forecast for our earlier autovacuum. We were then able to use PostgreSQL 12's new REINDEX CONCURRENTLY feature to quickly reduce index bloat. After that, a regular vacuum of our largest table took a very reasonable 21 hours.
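
REINDEX CONCURRENTLY rebuilds an index without blocking writes to the table, which is what made it practical to run against bloated indexes on a live database. A sketch of the approach, with placeholder connection string, table, and index names:

    # Sketch of rebuilding bloated indexes online with PostgreSQL 12's
    # REINDEX CONCURRENTLY. DSN and table name are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder DSN
    conn.autocommit = True  # REINDEX CONCURRENTLY cannot run inside a transaction
    with conn.cursor() as cur:
        # Largest indexes on the table we care about, biggest first.
        cur.execute("""
            SELECT indexrelid::regclass, pg_size_pretty(pg_relation_size(indexrelid))
            FROM pg_index
            WHERE indrelid = 'example_big_table'::regclass
            ORDER BY pg_relation_size(indexrelid) DESC
        """)
        for index_name, size in cur.fetchall():
            print(f"rebuilding {index_name} ({size})")
            # Builds a replacement index without blocking writes, then swaps it in.
            cur.execute(f"REINDEX INDEX CONCURRENTLY {index_name}")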

Query performance degradation revisited

Ever since the initial alert of degraded performance, we had been constantly testing theories and searching for a root cause. We understood the unexpected storage limitation, and we understood the risk of transaction ID wraparound, but we still did not understand the query performance degradation.

We had previously seen correlations with an upstream outage and with a new database instance, but neither turned out to be the cause.

Eventually we found a third correlation.

Our logs showed that an ANALYZE of our largest table had finished just before the CPU utilization increased. That also correlated with a huge increase in pages read per second for that table, although from RAM rather than disk, so it didn't show up in IO stats. This new finding prompted us to take yet another look at some key query plans.

We finally discovered that ever since that fateful ANALYZE, one of our primary-key-scoped queries had stopped using the primary key index and was instead scanning a far less efficient index. The ANALYZE operations that should have happened in the intervening weeks had been prevented by long-running vacuums holding table locks. The ANALYZE we belatedly ran after the PostgreSQL 12 upgrade fully resolved the root cause of the performance issue.
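
A query scoped by primary key should plan as an index scan on the primary key index, so this class of regression is visible directly in EXPLAIN output. A sketch of the kind of check involved; the table, query, and index names are illustrative, not the actual query in question:

    # Sketch of checking a query plan for this class of regression.
    # The DSN, table, and index names are illustrative placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM example_builds WHERE id = %s",
            (12345,),
        )
        plan = "\n".join(row[0] for row in cur.fetchall())
        print(plan)
        # Healthy: "Index Scan using example_builds_pkey on example_builds".
        # Regressed (what we saw): a scan of a far less selective secondary index.
        assert "example_builds_pkey" in plan, "query is not using the primary key index"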

This won’t happen again

We deeply regret the impact this chain of events had intermittently throughout April and May. It has taught us many things about PostgreSQL and our internal processes. We have made a number of changes to ensure it won’t happen again, and we’re commencing a project to make further improvements to our database resilience.

We have a number of notable changes in place already:

  • We have reduced the rate we consume transaction IDs by shifting multiple subsystems out of the primary database and by optimizing some key transactions.
  • Our existing automation to proactively run non-automatic VACUUM during weekends has been reconfigured to run faster.
  • We have optimised our automatic VACUUM cost limiting and thresholds so that under regular operations vacuums run more frequently, and faster (see the sketch after this list).
  • PostgreSQL 12 has given us REINDEX CONCURRENTLY, making it easier to manage index bloat.
  • PostgreSQL 12 has given us the VACUUM INDEX_CLEANUP parameter. We hope to never need it again, but it’s a valuable tool against impending transaction ID wraparound.
  • As a team we have learned a huge amount about PostgreSQL operations and failure modes.
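
The autovacuum tuning mentioned in the list above comes down to per-table storage parameters. A sketch of the kind of settings involved; the table name and values are illustrative, not our production configuration:

    # Sketch of per-table autovacuum tuning so vacuums start earlier and are
    # allowed to do more work per second. Table name and values are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            ALTER TABLE example_jobs SET (
                autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are dead
                autovacuum_vacuum_cost_limit   = 2000,  -- more work per cost-limited round
                autovacuum_vacuum_cost_delay   = 2      -- shorter pause between rounds
            )
        """)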

Here’s what we will keep working on:

  • We will set up automated reindexing and associated index bloat monitoring, to keep vacuum times stable.
  • We will document and/or automate the ANALYZE operations required after certain types of database maintenance, so that future version upgrades won’t incur additional downtime.
  • We are working on new partitioning schemes for our large tables which, amongst other benefits, will make VACUUM faster and ANALYZE more reliable.

Thank you to Buildkite’s amazing customer community for being patient, understanding, and supportive while we got to the bottom of these issues. We now fully understand the unexpected storage limitation, query performance degradation, and risk of transaction ID wraparound that we experienced.

Posted May 21, 2020 - 11:20 UTC

Resolved
We have been closely monitoring performance since last week, and consider the issue fully resolved. Stay tuned for a full incident review in the coming days.
Posted May 12, 2020 - 04:40 UTC
Update
To resolve the ongoing performance issues covered by this incident, we performed an urgent database version upgrade and subsequent maintenance operations. We did this outside business hours for the vast majority of our customers, expecting around 15 minutes of downtime. There were some complications which forced us back into maintenance mode for substantially longer than that. This is now fully resolved.

We've been closely monitoring the system since then, and we believe the maintenance has had the intended outcome. We don't anticipate any more issues with latency or request timeouts. We are continuing to monitor, but things are looking great so far.

We will be publishing a full incident report as soon as we can, with much more detail about the original issue and the actions we've taken to resolve it.
Posted May 07, 2020 - 13:28 UTC
Update
We finished processing the webhook queue 30 minutes ago, so all systems are back to normal. We'll continue to monitor closely, but everything is looking good now.
Posted May 07, 2020 - 12:14 UTC
Monitoring
Service has been restored! Thanks for your patience, and sorry to have held up your builds from running for so long. All webhooks from GitHub etc were safely queued during the incident, and we're processing that backlog now to trigger your builds. Until that completes, new builds will not trigger in real-time; if you need a build in a hurry you can trigger it manually from the web dashboard.
Posted May 07, 2020 - 11:22 UTC
Update
Database performance is starting to recover and buildkite.com is back online. Performance may continue to be sluggish as various caches warm up.

Incoming webhooks continue to be buffered and are not being processed in realtime - we'll begin draining that queue shortly. Manual builds via the web interface should start quickly if you need to bypass the webhook queues.
Posted May 07, 2020 - 10:27 UTC
Update
We're experiencing degraded performance while our database caches warm up and the backlog of work is processed. We're placing the site under maintenance mode so we can ease the load back in a controlled manner.
Posted May 07, 2020 - 09:33 UTC
Update
We've finished the database maintenance. We're keeping an eye on performance as systems come online again.
Posted May 07, 2020 - 08:44 UTC
Update
We're performing urgent database maintenance and we're expecting 15-30 minutes of downtime over the next hour. During this period the Buildkite dashboard and API will be unavailable and no new work will be scheduled. Agents which already have work can run the work to completion, but agents will not be assigned new work and will not be able to submit the results of work until the maintenance is completed. Webhooks will be received but not processed until maintenance is completed.
Posted May 07, 2020 - 08:16 UTC
Identified
Since Wednesday 22nd April, customers have experienced intermittent periods of increased latency and request timeouts. These periods are most common during North American business hours, and are most commonly seen as HTTP 502 errors in builds that upload or download artifacts.

Database resource contention is at the root of the problem, with two overlapping issues:

1. We use a per-build lock around many agent operations (like pipeline uploads, job accepting, and job starting). This has ensured correct behavior, but as we've grown and our customers have built novel workloads, the lock contention has become a bottleneck that can't be solved by throwing more CPUs at the problem (we tried)
2. The locking happens in a database table with a high update rate and the indexes have become large enough to impact general query performance

So far we've upgraded the size of our database, worked with a few customers to adjust their workloads to a form that uses fewer resources, and profiled locking hotspots. Our next step is to introduce finer-grained locking around some operations and adjust our database schema to ensure querying data related to active builds is faster.

We'll post updates here as we make progress, and share a post-mortem when the issue is resolved.

Our agent has robust back-off and retry behavior when communicating with our APIs, which means even in the face of API timeouts or errors, these database issues generally manifest as slightly delayed builds. However we've identified one interaction that wasn't retrying, resulting in build errors on API timeout. We've released a new version of the agent (https://github.com/buildkite/agent/releases/tag/v3.21.0) with improved retry behavior for artifact uploads and downloads. Customers may be able to mitigate some of the impact by upgrading.
Posted May 05, 2020 - 14:12 UTC
This incident affected: Web, Agent API, REST API, and Job Queue.