Is there a reason for the lack of naming+shaming Crowdstrike in this blogpost? W...

StevenWaterman · on July 22, 2024

If you consider kernel programming to be inherently unsafe, then you would consider this to be inevitable, meaning it's not really the specific company's fault. They were just the unlucky ones.

brendangregg · on July 22, 2024

Right, and we wanted to talk about all security solutions and not make this about one company. We also wanted to avoid shaming since they have been seriously working on eBPF adoption, so in that regard they are at the forefront of doing the right thing.

lordnacho · on July 22, 2024

They could have helped their luck by doing some of the common sense things suggested in the article.

For instance, why not find a subset of your customers that are low risk, push it out to them, and see what happens? Or perhaps have your own fleet of example installations to run things on first. None of which depends on any specific technology.

hello_moto · on July 22, 2024

"find a subset of low risk customers" and use them as test subject?

Repeat that a few times to understand the repercussions.

If I were the customers and I found out that I was used as test subject, how would I feel?

lordnacho · on July 22, 2024

> If I were the customers and I found out that I was used as test subject, how would I feel?

In reality, every business has relationships that it values more than others. If I wasn't paying a lot for it, and if I was running something that wasn't critical (like my side project) then why not? You can price according to what level of service you want to provide.

hello_moto · on July 22, 2024

Customers will ask to opt-out.

ahtihn · on July 22, 2024

Customers will pay to opt out.

whynotminot · on July 22, 2024

Canary deployments are already an industry accepted practice and it’s shocking Crowdstrike apparently doesn’t do them.

hello_moto · on July 22, 2024

Which industry? Cybersecurity or Cloud software?

whynotminot · on July 22, 2024

Any industry that wants to reliably deliver software that doesn’t brick systems at scale? I’m confused by your question.

Are you telling me the cybersecurity scene is special and shouldn’t follow best practices for software deployment?

hello_moto · on July 22, 2024

Canary deployment for subset of Salesforce customers won't see much of revolt from customers compare to AV definition rollout (not software, but AV definition) in Cybersecurity where gaps between 0day and rollout means you're exposed.

If customers found out that some are getting roll out faster than the others, essentially splitting the group into 2, there will be a need for customer opt-in/opt-out.

If everyone is opting-out because of Friday, your Canary deployment becomes meaningless.

Any proof that other Cybersecurity vendors do Canary deployment for their AV definition? :)

PS: not to say that the company should test more internally...

whynotminot · on July 22, 2024

Canary deployment doesn’t necessarily mean massive gaps between deployment waves. You can fast-follow. Sure, there may be scenarios with especially severe vulnerabilities where time is of the essence. I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

If this is just how they roll with regular definition updates, then their deployment practices are garbage and this kind of large scale disaster was inevitable.

hello_moto · on July 22, 2024

Let's walk this through: Canary deployment to Windows machines. If those Windows machines got hit with BSOD, they will go offline. How do you determine if they go offline because of Canary or because of regular maintenance by the customer's IT cycle?

You can guess, but you cannot be 100% sure.

What if the targeted canary deployments are Employees desktops that are OFFLINE during the time of rollout?

>I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

Here's more context to understand Cybersecurity: https://radixweb.com/blog/what-is-mean-time-to-detect

Cybersecurity companies participate in Sec evaluation annually that evaluates (measure) and grade their performance. That grade is an input for Organizations to select vendors outside their own metrics/measurements.

I don't know if MTTD is included in the contract/SLA. If it does, you got some answer as to why certain decision is made.

It's definitely interesting to see Software developers of HN giving out their 2c for a niche Cybersecurity industry.

whynotminot · on July 22, 2024

> You can guess, but you cannot be 100% sure.

I worked in the cyber security space for a decent chunk of my career, and the most frustrating part was cyber security engineers thinking their problems were unique and being completely unaware of the lessons software engineering teams have already learned.

Yes, you need to tune your canary deployment groups to be large and diverse enough to give a reliable indicator of deployment failure, while still keeping them small enough that they achieve their purpose of limiting blast radius.

Again, if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.

> I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?

hello_moto · on July 22, 2024

>if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.

You have to ask the customer if they're okay with that citing "our software might failed and brick your machine".

I'd like to see any Sales and Marketing folks say that ;)

> I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?

Uber valid question and this BSOD incident might be a turning point for customers to pay up more for their IT infrastructure.

It's like: previously Cybersecurity vendors are shy to ask customers to setup Canary systems because that's just "one-more-thing-to-do". After BSOD: customers will smarten up and do it without being asked and to the point where they would ask Vendors to _support_ that type of deployment (unless they continue to be cheap and lazy).

whynotminot · on July 22, 2024

> You have to ask the customer if they're okay with that citing "our software might failed and brick your machine".

I think you’re still missing the point of Canary deployments. The question your sales team should ask is “would you like a 5% chance of a bug harming your system, or a 100% chance?”

> It's like: previously Cybersecurity vendors are shy to ask customers to setup Canary systems because that's just "one-more-thing-to-do"

You should by shy because it is not your customer’s job to set up canary deployments. Crowdstrike owns the software and the deployment process. They should be deploying to a subset of machines, measuring the results, and deciding whether to roll forward or roll back. It is not the customers job to implement good release engineering controls for Crowdstrike (although after this debacle you may well see customers try).

hello_moto · on July 22, 2024

If you refer Canary deployment as the vendor's internal deployment? I definitely agree.

What I find it hard is those in Software that suggested to roll it to a few customers first because this isn't cloud deployment doing A/B test when it comes to Virus Definition.

Customers must know what's going on when it comes to virus definition and the implication of them whether they're being part of the rollout group or not.

whynotminot · on July 22, 2024

> If you refer Canary deployment as the vendor's internal deployment? I definitely agree.

No, I’m talking about external deployment to customers. They clearly also had a massive failure in their internal processes too, since a bug this egregious should never make it to the release stage. But that is not what I am talking about right now.

> What I find it hard is those in Software that suggested to roll it to a few customers first because this isn't cloud deployment doing A/B test when it comes to Virus Definition.

I don’t care what you’re releasing to customers— application binary, configuration change, virus definition, etc, if it has the chance of doing this much damage it must be deployed in a controlled, phased way. You cannot 100% one-shot deploy any change that has the potential to boot-loop a massive amount of systems like this. This current process is unacceptable.

> Customers must know what's going on when it comes to virus definition and the implication of them whether they're being part of the rollout group or not.

Who says they don’t have to know? Telling your customers that an update is planned and giving them a time window for their update seems reasonable to me.

hello_moto · on July 22, 2024

If it's virus defn, what's the process here?

* 0day is happening

* Cybersecurity vendors preparing virus definition

* Vendors send update => new virus definition is about to go down in 1 hour, get ready.

Folks are asleep, nobody reads it?

Let's say now let's do Canary: let's deploy to a few customers (this is unclear how this started: should this be opt-in? opt-out?)

Some customers got it, others... who knows, unclear what the processes are here.

Between here and there, 0day exploited customers because AV defn is not there. What now?

I'm not sure how this plays out tbh.

gtsop · on July 22, 2024

Why even do that? We have virtualization, they could emulate real clients and networks of clients. This particular bug would have been prevented for sure

lordnacho · on July 22, 2024

Yeah I thought maybe the VM thing might not catch the bug for some reason, but it seems like the natural thing to do. Spin up VM, see if there's a crash. I heard the technical reason had something to do with a file being full of nulls, but that sort of thing you should catch.

Honestly, the most generous excuse I can think of is that CS were informed of some sort of vulnerability that would have profound consequences immediately, and that necessitated a YOLO push. But even that doesn't seem too likely.

efee22 · on July 22, 2024

Agree, Crowdstrike was an unlucky one, but it is more about the issue in general. If I remember correctly, also others like sysdig user their own kernel modules for collection.

asynchronous · on July 22, 2024

I still hold true that testing even improperly would have caught this before it hit worldwide. But I suppose you are right, that doesn’t help the argument being made here.

ForOldHack · on July 22, 2024

Wasnt that the job of AI/co-pilot/clippy /D.E.P? "Would you like me to try and execute a random blank file?"

And of course QA.

I was unaffected, but was fielding calls from customers.

My update Tuesday is the week after, so in-between MS and my updates, I am very suspicious of everything.

I was also unaffected by 22H2, and spent time fielding calls.

hiddencost · on July 22, 2024

I think the article isn't about crowd strike. It's about ebpf.

pimlottc · on July 22, 2024

The second paragraph is 100% about Crowdstrike. It even links to the Wikipedia article:

https://en.m.wikipedia.org/wiki/2024_CrowdStrike_incident

hiddencost · on July 22, 2024

CrowdStrike is mentioned, but the goal of the article is to promote eBPF. CrowdStrike is tangentially related because it draws attention to a platform that Gregg has put a lot into.