I've recently dumped a lot of time into making our CircleCI pipeline faster. I'd love to hear why you switched to BuildKite because we're looking at doing the same thing. My understanding is BuildKite has the following advantages:
- You can have persistent workers since the build machines run in your cluster.
- Much better Docker layer cache hit rates. We use custom Docker images and I've never seen a Docker image cache hit on our CircleCI machines.
- Better download throughput. CircleCI appears to be bandwidth limited to about 15MB per second.
- It's much more straightforward to programmatically generate workflows instead of hard-coding them in a multi-thousand-line YAML file.
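For context on that last point, this is roughly what a Buildkite "dynamic pipeline" looks like: a script prints pipeline YAML at runtime, and a single static bootstrap step pipes it to `buildkite-agent pipeline upload`. A sketch only; the package names below are invented for illustration:

```shell
# Sketch of programmatic pipeline generation for Buildkite: emit one test
# step per package, then upload the generated YAML from a bootstrap step.
generate_pipeline() {
  echo "steps:"
  for pkg in api web worker; do
    printf '  - label: "test %s"\n' "$pkg"
    printf '    command: "make test-%s"\n' "$pkg"
  done
}

generate_pipeline   # in CI: generate_pipeline | buildkite-agent pipeline upload
```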
Some low-hanging optimizations I've used to reduce build time on CircleCI:
- Using a shallow git checkout. Reduced checkout time of our 500MB repo from 30 seconds to 2 seconds.
- save_cache into /dev/shm. This cut our 1GB node_modules (yeah, it's huge, I know) save_cache and restore_cache from 30 seconds to 10 seconds. The HUGE downside is that /dev/shm is mounted noexec, so you can't run anything in node_modules/.bin without some hacks.
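The shallow-checkout trick can be sketched as a run step that replaces CircleCI's built-in `checkout`. This is a sketch assuming the standard `CIRCLE_*` environment variables CircleCI sets; adjust for your auth setup (SSH keys, etc.):

```shell
# Replacement for CircleCI's built-in `checkout` step: fetch only the tip
# of the branch instead of full history. CIRCLE_REPOSITORY_URL,
# CIRCLE_BRANCH, and CIRCLE_SHA1 are provided by CircleCI at runtime.
shallow_checkout() {
  git init -q .
  git remote add origin "$CIRCLE_REPOSITORY_URL"
  git fetch -q --depth 1 origin "$CIRCLE_BRANCH"
  git checkout -q "$CIRCLE_SHA1"   # the fetched tip, in the common case
}
```

One caveat: if commits land on the branch between the build being scheduled and the fetch, `CIRCLE_SHA1` may not be in the shallow fetch, so you'd want a fallback that fetches the SHA directly.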
We never really "switched" to Buildkite; it's what we started with, and it's been great for us. But I'm always experimenting with other options, which has recently included GHA and Circle.
Buildkite has pros and cons, like any CI platform. Managing your own agents is nice in the sense that you get really fine-grained control over their performance and cost. That said, it would take some engineering to build an agent pool that can match the cost of something like Circle.
If you're running an agent 24/7 on, say, an EC2 instance, but you're not actually using it 24/7, there's serious cost overhead where you'd see some nice savings on Circle. The great thing about BK is the flexibility; there's infinite potential to do some really smart things, like build an autoscaling agent pool. In fact, they have an AWS CloudFormation stack which does exactly that (we don't run it, but I've looked into it and it's really cool: it essentially has a lambda cron which calls the BK API to determine the number of outstanding jobs, dumps that into a CloudWatch metric, then sets up an ASG which can scale to zero based on that metric). As far as I can tell, it's a "one-click deploy", so management should be easy.
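The scale-to-zero decision that lambda feeds into the ASG boils down to something like this. A toy sketch only; the function name and parameters are mine, not the stack's:

```shell
# Toy sketch of the scaling decision: map outstanding Buildkite jobs to a
# desired agent count, scaling to zero when idle. Names are illustrative.
desired_agent_count() {
  jobs=$1
  per_agent=${2:-1}    # queued jobs each agent should absorb
  max_agents=${3:-10}  # hard cap on the pool size
  if [ "$jobs" -le 0 ]; then
    echo 0
    return
  fi
  needed=$(( (jobs + per_agent - 1) / per_agent ))  # ceiling division
  if [ "$needed" -gt "$max_agents" ]; then
    echo "$max_agents"
  else
    echo "$needed"
  fi
}
```

The resulting number becomes the CloudWatch metric that drives the ASG's desired capacity.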
We run our agents inside Kubernetes on GKE. We've had some minor issues with this. Docker-in-Docker is never fun. So we'll probably look at switching to that stack (and maybe set it up so that, instead of the automated ASG, we just have a timeboxed ASG running 9-5 M-F and scaling to 1 instance outside those hours, as we're pretty timezone-centralized). The advantage of GCP/GKE is that we're using preemptible instances; ridiculously easy to set up on GKE, and you get that 70% cost savings. The only downside is that, every once in a while, a build just fails because the underlying instance restarts. At our scale, not a huge deal; just re-run it. Maybe happens once a week.
Again, it's trade-offs. You can go with more persistent workers and get amazing Docker layer caching, or you can go the auto-scaling route and save a ton on cost. It shouldn't be impossible to find a balance in there which works for your company, and I like that about Buildkite.
I don't think we'd ever switch to Circle, just because I don't see a huge benefit in doing so. We may, one day, switch to GHA. We already use GitHub, so that integration is nice. If the performance looks good and the cost is acceptable, then we'd do it; CI is so fungible that it really comes down to minutiae like this.
I've been working on a hosted tool that uses some kernel trickery to speed up big CI jobs like this - would you be willing to email me more details about your use case? It would really help me prioritize features for my beta!