I've recently dumped a lot of time into making our CircleCI pipeline faster. I'd love to hear why you switched to BuildKite because we're looking at doing the same thing. My understanding is BuildKite has the following advantages:
- You can have persistent workers since the build machines run in your cluster.
- Much better Docker layer cache hit rates. We use custom Docker images and I've never seen a Docker image cache hit on our CircleCI machines.
- Better download throughput. CircleCI appears to be bandwidth limited to about 15MB per second.
- It's much more straightforward to programmatically generate workflows instead of hard-coding them in a multi-thousand-line YAML file.
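For context on that last point, this is roughly what a Buildkite "dynamic pipeline" looks like: a script prints pipeline YAML at runtime, and a single static bootstrap step pipes it to `buildkite-agent pipeline upload`. A sketch only; the package names below are invented for illustration:

```shell
# Sketch of programmatic pipeline generation for Buildkite: emit one test
# step per package, then upload the generated YAML from a bootstrap step.
generate_pipeline() {
  echo "steps:"
  for pkg in api web worker; do
    printf '  - label: "test %s"\n' "$pkg"
    printf '    command: "make test-%s"\n' "$pkg"
  done
}

generate_pipeline   # in CI: generate_pipeline | buildkite-agent pipeline upload
```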
Some low-hanging optimizations I've used to reduce build time on CircleCI:
- Using a shallow git checkout. Reduced checkout time of our 500MB repo from 30 seconds to 2 seconds.
- save_cache into /dev/shm. This cut our 1GB node_modules (yeah, it's huge, I know) save_cache and restore_cache from 30 seconds to 10 seconds. The HUGE downside is that /dev/shm is mounted noexec, so you can't run anything in node_modules/.bin without some hacks.
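The shallow-checkout trick can be sketched as a run step that replaces CircleCI's built-in `checkout`. This is a sketch assuming the standard `CIRCLE_*` environment variables CircleCI sets; adjust for your auth setup (SSH keys, etc.):

```shell
# Replacement for CircleCI's built-in `checkout` step: fetch only the tip
# of the branch instead of full history. CIRCLE_REPOSITORY_URL,
# CIRCLE_BRANCH, and CIRCLE_SHA1 are provided by CircleCI at runtime.
shallow_checkout() {
  git init -q .
  git remote add origin "$CIRCLE_REPOSITORY_URL"
  git fetch -q --depth 1 origin "$CIRCLE_BRANCH"
  git checkout -q "$CIRCLE_SHA1"   # the fetched tip, in the common case
}
```

One caveat: if commits land on the branch between the build being scheduled and the fetch, `CIRCLE_SHA1` may not be in the shallow fetch, so you'd want a fallback that fetches the SHA directly.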
We never really "switched" to Buildkite; it's what we started with, and it's been great for us. But I'm always experimenting with other options, which has recently included GHA and Circle.
Buildkite has pros and cons, like any CI platform. Managing your own agents is nice in the sense that you get really fine-grained control over their performance and cost. That said, it would take some engineering to build an agent pool that can match the cost of something like Circle.
If you're running an agent 24/7 on, say, an EC2 instance, but you're not actually using it 24/7, there's serious cost overhead where you'd see some nice savings on Circle. The great thing about BK is the flexibility; there's infinite potential to do some really smart things, like build an autoscaling agent pool. In fact, they have an AWS CloudFormation stack which does exactly that (we don't run it, but I've looked into it and it's really cool: it essentially has a lambda cron which calls the BK API to determine the number of outstanding jobs, dumps that into a CloudWatch metric, then sets up an ASG which can scale to zero based on that metric). As far as I can tell, it's a "one-click deploy", so management should be easy.
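The scale-to-zero decision that lambda feeds into the ASG boils down to something like this. A toy sketch only; the function name and parameters are mine, not the stack's:

```shell
# Toy sketch of the scaling decision: map outstanding Buildkite jobs to a
# desired agent count, scaling to zero when idle. Names are illustrative.
desired_agent_count() {
  jobs=$1
  per_agent=${2:-1}    # queued jobs each agent should absorb
  max_agents=${3:-10}  # hard cap on the pool size
  if [ "$jobs" -le 0 ]; then
    echo 0
    return
  fi
  needed=$(( (jobs + per_agent - 1) / per_agent ))  # ceiling division
  if [ "$needed" -gt "$max_agents" ]; then
    echo "$max_agents"
  else
    echo "$needed"
  fi
}
```

The resulting number becomes the CloudWatch metric that drives the ASG's desired capacity.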
We run our agents inside Kubernetes on GKE. We've had some minor issues with this. Docker-in-Docker is never fun. So we'll probably look at switching to that stack (and maybe set it up so that, instead of the automated ASG, we just have a timeboxed ASG running 9-5 M-F and scaling to 1 instance outside those hours, as we're pretty timezone-centralized). The advantage of GCP/GKE is that we're using preemptible instances; ridiculously easy to set up on GKE, and you get that 70% cost savings. The only downside is that, every once in a while, a build just fails because the underlying instance restarts. At our scale, not a huge deal; just re-run it. Maybe happens once a week.
Again, it's trade-offs. You can go with more persistent workers and get amazing Docker layer caching, or you can go the auto-scaling route and save a ton on cost. It shouldn't be impossible to find a balance in there which works for your company, and I like that about Buildkite.
I don't think we'd ever switch to Circle, just because I don't see a huge benefit in doing so. We may, one day, switch to GHA. We already use GitHub, so that integration is nice. If the performance looks good and the cost is acceptable, then we'd do it; CI is so fungible that it really comes down to minutiae like this.
I've been working on a hosted tool that uses some kernel trickery to speed up big CI jobs like this - would you be willing to email me more details about your use case? It would really help me prioritize features for my beta!