It may not be true everywhere, but at my company we 100% had more SEVs after two rounds of RIFs. We are talking simple statistics of SEVs per month plotted against RIFs.
Data centers take time to build. The capital investment to build these DCs is needed now in expectation that future revenue streams will pay for that capital.
That's not what OpenAI announced. They said initial spend would be $100B, and I'm sure that until the ink is dry on each contract they can change their mind at any time. No business is going to place ironclad commitments four years out.
Lifestyle creep, and tech hubs have expensive real estate. I am super lucky in that I have a FAANG job in a LCOL area and will most likely be FIRE by the time I'm 40 with a fully paid-off single detached home, but you cannot do that in the bay area, where almost all your income will go to paying your mortgage.
I also know a co-worker who lives in a LCOL area and just bought a Porsche 911 GT3…
What's less well known is that as the ionosphere heats up the upper atmosphere, it bulges out into space like a tyre sidewall bulge. This has the effect of putting atmosphere in the path of LEO satellites, which increases drag and causes them to fall back to earth, because they are not designed to travel through an atmosphere.
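Back of the envelope (every number here is an assumption, just to show the scaling): drag is F = ½ρv²CdA, linear in density, so if the thermosphere puffs up and density at your altitude goes up a few times, drag goes up a few times and the orbit decays that much faster.

    // Back-of-envelope drag on a small LEO satellite: F = 0.5 * rho * v^2 * Cd * A.
    // Every number below is an assumption for illustration, not a measurement.
    package main

    import "fmt"

    func drag(rho, v, cd, area float64) float64 {
        return 0.5 * rho * v * v * cd * area
    }

    func main() {
        v := 7600.0    // orbital speed a few hundred km up, m/s
        cd := 2.2      // typical drag coefficient for a boxy satellite
        area := 10.0   // assumed cross-sectional area, m^2
        quiet := 1e-13 // rough air density around 500 km on a quiet day, kg/m^3
        storm := 5e-13 // assumed density after the thermosphere bulges in a storm

        fmt.Printf("quiet: %.1e N, storm: %.1e N\n",
            drag(quiet, v, cd, area), drag(storm, v, cd, area))
    }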
Joule heating is the most important one which can alter the thermospheric dynamics quite significantly.[1]
Do you have a source showing that you can measurably affect drag on satellites using a ground-based ionospheric heater? How much is the atmosphere actually going to heat up from a few megawatts?
Github is migrating from their old infra to Azure. Doing migrations like that is hard, no matter who you are. And from a business and engineering perspective I think it makes sense to leverage the economies of scale of Azure instead of GitHub running their own boxes.
Anyone being forced to use Azure has, at least temporarily until they can find a new job, lost at life, not necessarily through any fault of their own. The poor souls probably also have to use Teams.
The engineers at github are getting paid $300k/year at SWE3 to do their job. I don’t think they lost at life.
Why bring people down so hard? That is really solid money and you can provide for a family, retire in your 40s, and it is work that does not destroy your body.
Spending your life working on making things worse (and knowing it) is pretty demoralizing. I know many people who have made the decision to take a pay cut or just quit when they realize that’s their job.
Sometimes those people aren't realizing that they're making things worse, they're just in a depressive spiral and can't see the other end, or see how much good is still being generated while other things are temporarily worse, or see that different tradeoffs have been made to make things worse in some ways and better in others. Just as people can delude themselves that they're always making a positive impact, people can delude themselves that they're making a negative one. The latter tends to be more costly, though, which can sure be annoying to those with a bias for a more cynical or pessimistic outlook...
Trying to ascribe positive/negative impact to strangers isn't usually a useful exercise, even if you have enough data to make a solid case. It can be cathartic -- imagine a different world where programmers making things worse would screw off and go do something else that's not programming! (I have a similar imagining, of a world where programming is done by those who love it even outside of work -- even though I've worked with and helped hire excellent engineers who only treated programming as a job, they weren't my favorite to work with, and some were very much not excellent.) The best you can hope for is to trigger some self-reflection, and I do think that's important on an individual level. It's better not to make the world uglier; if you notice yourself doing so, and it's not just a distortion of your thinking, then you should probably stop, do something else, or figure out whether it's at a level you can compensate for. A Richard Stallman quote I like:
"The straightforward and easy path was to join the proprietary software world, signing nondisclosure agreements and promising not to help my fellow hacker....I could have made money this way, and perhaps had fun programming (if I closed my eyes to how I was treating other people). But I knew that when my career was over, I would look back on years of building walls to divide people, and feel I had made the world ugly."
When I was involved about a year ago, cilium fell apart at around a few thousand nodes.
One of the main issues with cilium is that the bpf maps scale with the number of nodes/pods in the cluster, so each agent's memory footprint keeps growing as you add more nodes with the cilium agent on them (the cluster-wide total grows roughly quadratically).
https://docs.cilium.io/en/stable/operations/performance/scal...
The k8s scheduler lets you tweak how many nodes to look at when scheduling a pod (percentage of nodes to score) so you can change how big “global state” is according to the scheduler algorithm.
It makes me sad that getting these scalability numbers requires some secret sauce on top of spanner, which nobody else in the k8s community can benefit from. Etcd is the main bottleneck in upstream k8s and it seems like there is no real steam to build an upstream replacement for etcd/boltdb.
I did poke around a while ago to see what interfaces etcd has calling into boltdb, but the interface doesn't seem super clean right now, so the first step in getting off boltdb would be creating a clean interface that could be implemented by another db.
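Roughly the kind of shape I have in mind, with made-up names (etcd's real backend package is in this neighborhood but messier):

    // Hypothetical storage interface the rest of etcd could program against,
    // so boltdb becomes just one implementation. Names are invented for this
    // sketch; etcd's actual backend package differs in the details.
    package backend

    type Backend interface {
        ReadTx() ReadTx   // read-only snapshot transaction
        BatchTx() BatchTx // read-write transaction, durable on Commit
        Size() int64      // on-disk size, for quota enforcement
        Defrag() error
        Close() error
    }

    type ReadTx interface {
        Range(bucket, key, end []byte, limit int64) (keys, values [][]byte)
    }

    type BatchTx interface {
        ReadTx
        Put(bucket, key, value []byte)
        Delete(bucket, key []byte)
        Commit() error
    }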
It's possible I'm talking out of my ass and totally wrong because I'm basing this on principles, not benchmarking, but I'm pretty sure the problem is more etcd itself than boltdb. Specifically, the Raft protocol requires that the cluster leader's log be replicated to a quorum of voting members, who need to write to disk, including a flush, and then respond to the leader, before a write is considered committed. That's floor(n/2) + 1 disk flushes, plus a network round trip to each follower in that quorum, to write any value. When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle, it's hard for that not to become a bottleneck. Other limitations include the 8GiB disk limit another comment mentions and etcd's default 1.5 MiB request size limit that prevents you from writing large object collections in a single bundle.
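A toy model of that commit path, with made-up numbers, just to show where the latency goes (this is not etcd's actual code):

    // Toy model of a Raft commit: the leader fsyncs locally and sends the
    // entry to all followers in parallel; the entry is committed once a
    // quorum (leader included) has it durably on disk.
    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // commitLatency estimates how long a write takes to commit given the
    // leader's fsync time and, per follower, its round trip plus fsync time.
    func commitLatency(leaderFsync time.Duration, followerAck []time.Duration) time.Duration {
        n := len(followerAck) + 1 // cluster size, counting the leader
        quorum := n/2 + 1
        sort.Slice(followerAck, func(i, j int) bool { return followerAck[i] < followerAck[j] })
        // The leader is one vote, so we wait for the (quorum-1) fastest followers.
        waited := followerAck[quorum-2]
        if leaderFsync > waited {
            return leaderFsync
        }
        return waited
    }

    func main() {
        // Made-up numbers: 5-member cluster, ~1ms leader fsync, followers at
        // 0.5-2ms of network plus their own fsync.
        followers := []time.Duration{
            1500 * time.Microsecond,
            1700 * time.Microsecond,
            2500 * time.Microsecond,
            3000 * time.Microsecond,
        }
        fmt.Println(commitLatency(time.Millisecond, followers)) // 1.7ms here
    }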
etcd is fine for what it is, but that's a system meant to be reliable and simple to implement. Those are important qualities, but it wasn't built for scale or for speed. Ironically, etcd recommends 5 as the ideal number of cluster members and 7 as a maximum, based on Google's findings from running chubby that between-member latency gets too big otherwise. And since every member keeps a full copy of the data, adding members buys you reliability, not capacity: you're still capped at roughly 8GiB total. I have no idea what a typical ratio of cluster nodes to total data is, but that only gives you about 64KiB per node for 130,000 nodes, which doesn't seem like very much.
There are other options. k3s made kine, which acts as a shim intercepting the etcd API calls made by the apiserver and translating them into calls to some other DBMS. Originally this was to make a really small Kubernetes that used an embedded sqlite as its datastore, but you could do the same thing for any arbitrary backend by just changing one side of the shim.
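Conceptually the shim is just something like this (table layout and names invented here, Postgres-flavored SQL; kine's real schema and code do a lot more, e.g. watches and compaction):

    // Sketch of the kine idea: answer etcd-style Put/Range calls from the
    // apiserver by translating them into SQL. Schema and names are made up;
    // this is not kine's actual code.
    package shim

    import (
        "context"
        "database/sql"
    )

    type KV struct{ DB *sql.DB }

    // Put appends a new row; the auto-incrementing id doubles as the revision.
    func (s *KV) Put(ctx context.Context, key string, value []byte) (rev int64, err error) {
        err = s.DB.QueryRowContext(ctx,
            `INSERT INTO kv (name, value) VALUES ($1, $2) RETURNING id`,
            key, value).Scan(&rev)
        return rev, err
    }

    // Range returns the latest value for every key under a prefix.
    func (s *KV) Range(ctx context.Context, prefix string) (map[string][]byte, error) {
        rows, err := s.DB.QueryContext(ctx,
            `SELECT DISTINCT ON (name) name, value
               FROM kv
              WHERE name LIKE $1 || '%'
              ORDER BY name, id DESC`, prefix)
        if err != nil {
            return nil, err
        }
        defer rows.Close()
        out := map[string][]byte{}
        for rows.Next() {
            var name string
            var value []byte
            if err := rows.Scan(&name, &value); err != nil {
                return nil, err
            }
            out[name] = value
        }
        return out, rows.Err()
    }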
I run several clusters of a bit over 10k nodes each, and the etcd db size is about 30-50GiB depending on how long ago defragmentation was run.
It is kind of sad, as these nodes are doing around 2k IOPS to the disk and are mostly sitting idle at the hardware level, but etcd still regularly chokes.
I did look into kine in the past, but I have no idea if it is suitable for running a high performance data store.
> When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle
The trick is you deploy your k8s clusters in multiple datacenters in the same region (think AZs in AWS terms). The control plane can span multiple AZs, which are in separate buildings but close in geography. In the setups I work on, the latency between datacenters in the same region is only about 500 microseconds.
> It makes me sad that getting these scalability numbers requires some secret sauce on top of spanner, which nobody else in the k8s community can benefit from.
I'm not so sure. I mean, everything has tradeoffs, and what you need to do to put together the largest cluster known to man is not necessarily what you want to have to put together a mundane cluster.
For those not aware, if you create too many resources you can easily hit etcd's backend quota (8GB is the recommended maximum), at which point the cluster raises a NOSPACE alarm and effectively stops accepting writes. With compaction and maintenance this risk is mitigated somewhat, but it just takes one misbehaving operator or integration (e.g. hundreds of thousands of dex session resources created for pingdom/crawlers) to mess everything up. Backups of etcd are critical. That dex example is why I stopped using it for my IDP.
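If you want to keep an eye on it, the etcd Go client can report per-endpoint DB size; a minimal sketch (endpoints and the quota value are placeholders for whatever you actually run):

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"https://etcd-1:2379"}, // your endpoints here
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        const quota = 8 << 30 // whatever you set --quota-backend-bytes to
        for _, ep := range cli.Endpoints() {
            st, err := cli.Status(ctx, ep)
            if err != nil {
                panic(err)
            }
            // DbSize is the file on disk; DbSizeInUse is what remains after
            // free pages (reclaimable by defrag) are excluded.
            fmt.Printf("%s: db=%dMiB in-use=%dMiB (%.0f%% of quota)\n",
                ep, st.DbSize>>20, st.DbSizeInUse>>20,
                100*float64(st.DbSize)/float64(quota))
        }
    }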
This is why I’ve always thought Tekton was a strange project. It feels inevitable that if you buy into Tekton CI/CD you will hit issues with etcd scaling due to the sheer number of resources you can wind up with.
What boundaries does this 8GB etcd limit cut across? We've been using Tekton for years now but each pipeline exists in its own namespace and that namespace is deleted after each build. Presumably that kind of wholesale cleanup process keeps the DB size in check, because we've never had a problem with Etcd size...
We have multiple hundreds of resources allocated for each build and do hundreds of builds a day. The current cluster has been doing this for a couple of years now.
Yeah I mean if you’re deleting namespaces after each run then sure, that may solve it. They have a pruner now that you can enable too to set up retention periods for pipeline runs.
There are also some issues with large Results, though I think you have to manually enable that. From their site:
> CAUTION: the larger you make the size, more likely will the CRD reach its max limit enforced by the etcd server leading to bad user experience.
And then if you use Chains you’re opening up a whole other can of worms.
I contracted with a large institution that was moving all of their cicd to Tekton and they hit scaling issues with etcd pretty early in the process and had to get Red Hat to address some of them. If they couldn’t get them addressed by RH they were going to scrap the whole project.
Yeah, quite unfortunate. But maybe there is hope. Apparently k3s uses kine, which is an etcd translation layer for relational databases, and there is another project called Netsy which persists to S3: https://nadrama.com/netsy. Some interesting ideas. Hopefully native postgres support gets added since it's so ubiquitous and performant.
There is a hard-coded warning which says safety is not guaranteed after 8GB. I have tried increasing this after a database has become full and it didn't start. It's definitely not a recovery strategy for a full etcd by itself, maybe as part of a way to eke out a slightly larger margin of safety.
This warning seems to be outdated. We have run etcd at much larger volumes without issues (at least without issues related to its size). Alibaba has been running 100G etcd clusters for a while now, probably others too.
It's totally possible to run tens of thousands of QPS on etcd if your disks are NVMe (or if you disable fdatasync, which is not recommended). If you use kine+cockroachdb or tidb you can go even higher, which I'm guessing is roughly equivalent to their spanner setup.
There was a blogpost about creating an alternative to etcd for super high scale kubernetes clusters. All the code was open too. It was from someone named Benjamin, I think, but I'm not sure.
I’m not able to find the blogpost but maybe someone else can!