That's a lot of words to say "We did not test a file that gets ingested by a kernel-level program, not even once."
At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy behavior to every computer they could without even once making sure it wouldn't break from something stupid.
Are these people fucking nuts?
I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.
I think it is worse than that. When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want. I know that we are human and that bugs occasionally appear in code. But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.
I've made changes on personal projects that I thought were simple, and yet broke stuff. But CrowdStrike is a multi-billion dollar company -- how can it be possible to have such a broken process? Their RCA document was interesting, but didn't cover any of the interesting issues. It seems that they either don't know about the 5 Whys process (https://en.wikipedia.org/wiki/Five_whys) or decided that those answers were so embarrassing that they had to omit them.
> When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want.
It's not uncommon for devs to be working against outdated databases / config dumps. Certainly bad practice, but when devs have the option of being lazy vs doing chores, they will pick the path of least resistance.
> But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.
We're assuming that the person who changed the code also made the choice to initiate the rollout. Those are 2 separate actions which can be made by separate individuals, and there could be multiple steps in between, each undertaken by a separate individual as well.
Distance from Prod does introduce a sense of malaise and complacency, I've found.
It's really hard to assign blame, but I'd put more blame on Team 2 for not being defensive enough with their inputs.
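To make "defensive with their inputs" concrete, here's a rough sketch of what I mean. Everything in it is made up for illustration: the comma-separated record format, the field count, and the names `load_content` and `load_or_keep_previous` are my assumptions, not anything from CrowdStrike's actual code. The point is just that the consumer checks the shape of the data before indexing into it, and falls back to the last known-good content instead of crashing.

```python
# Hypothetical content loader: validates every record before anything uses it.
# The record format and field count are assumptions made for this sketch.

EXPECTED_FIELDS = 21  # illustrative value only


class InvalidContentError(ValueError):
    """Raised when a content file fails validation."""


def load_content(text, expected_fields=EXPECTED_FIELDS):
    """Parse comma-separated records, rejecting anything malformed."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split(",")
        # Check the field count up front so later code never indexes
        # past the end of a short record.
        if len(fields) != expected_fields:
            raise InvalidContentError(
                f"line {number}: expected {expected_fields} fields, "
                f"got {len(fields)}"
            )
        records.append(fields)
    if not records:
        raise InvalidContentError("content file contains no records")
    return records


def load_or_keep_previous(text, previous, expected_fields=EXPECTED_FIELDS):
    """Return the new content if it validates, else the last known-good one."""
    try:
        return load_content(text, expected_fields)
    except InvalidContentError:
        return previous
```

With this shape, a bad file gets rejected at load time and the old rules stay in effect, which is a much better failure mode than taking the host down.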
As we all know there are greater issues with their deployment pipelines (lack of canaries, phased rollouts etc.) but no point going over those in this context.
But like, fuck man, come on.