
> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.

I found this to be an extremely deceptive conclusion. This makes it sound like the issue was that Facebook's physical security is just too gosh darn good. But the issue was not Facebook's data center physical security protocols. The issue was glossed over in the middle of the blogpost:

> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.

The issue was faulty audit code. It is disingenuous to then attempt to spin this like the downtime was due to Facebook's amazing physec protocols.
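To be concrete about what "faulty audit code" can mean in practice: we have no visibility into Facebook's actual tooling, but purely as a hypothetical sketch, the kind of gate that's supposed to block a destructive command can fail from something as small as an overly narrow match:

    # Hypothetical sketch only -- not Facebook's real tooling.
    # An "audit" gate meant to block commands that would withdraw
    # all backbone routes, whose check is too narrow to catch an
    # equivalent spelling of the same command.

    def audit_command(cmd: str) -> bool:
        """Return True if the command is considered safe to run."""
        # Intended rule: never allow a global route withdrawal.
        # Bug: only one exact spelling is matched, so a variant
        # sails straight through the audit.
        if cmd.strip() == "withdraw routes --all":
            return False
        return True

    def run(cmd: str) -> None:
        if not audit_command(cmd):
            raise PermissionError("blocked by audit: " + cmd)
        print("executing:", cmd)  # the dangerous part

    run("withdraw routes --scope=global")  # not caught; executes anyway

The names and flags above are made up for illustration; the point is only that a single wrong condition in the "last line of defense" is enough.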



The original error was the network command, but the slower response and lengthy outage were partially due to the physical security they put in place to prevent malicious activity. Any event like this has multiple root causes.


Yes, but the blogpost concludes on this relatively tangential note (which conveniently also lets Facebook brag about its security measures) rather than on the fact that their audit code was apparently itself not sufficiently audited, and that's what makes it read as deceptive spin.


I agree that there's an awkward emphasis on how FB prioritizes security and privacy but nothing is deceptive here. Had the audit bug not subsequently cut off access to internal tools and remote regions it would be easy to revert. Had there not been a global outage nobody would have known that the process for getting access in an emergency was too slow.

Huge events like this always have many factors that have to line up just right. To insist that the one and only true cause was a bug in the auditing system is reductive.


> I agree that there's an awkward emphasis on how FB prioritizes security and privacy but nothing is deceptive here.

I guess deceptive was the wrong word, so whatever's the term for "awkward emphasis" :).


Our postmortems have three sections. Prevention, detection, and mitigation. They all matter.

Shit happens. People ship bugs. People fat-finger commands. An engineering team’s responsibility doesn’t stop there. It also needs to quickly activate responders who know what to do and have the tools & access to fix it. Sometimes the conditions that created the issue are within acceptable bounds; the real need for reform is in why it took so long to fix.


No, they just wanted to cover both "what caused it?" and "why did it take too long to fix it?" since both are topics people were obviously extremely interested in.

It would have been surprising and disappointing if they didn't cover both of them.


Seems like appropriate emphasis given how many people yesterday were asking why they weren't back online yet. For every person asking why they deleted their routes, there were two asking why they didn't just put them back.


And as they say, it led to a slower recovery from the event, which was caused, as they also clearly say, by something else. Given that "why is it taking so long to revert a config change?!" was a common comment yesterday, it's relevant to the discussion.

It's disingenuous to point to one paragraph in the article and complain that it doesn't mention the root cause when, earlier in the same article, they had already said "This was the source of yesterday's outage" about something else.


No one except you is trying to spin anything.


Going on at length about how the trade-off between prolonged downtime and strict security protocols is worth it is erecting a nonsensical strawman, the literal definition of spinning a story. The key issue had nothing to do with Facebook's data center security protocols.


No one except you is erecting a nonsensical strawman.


Except, I quoted exactly where they were.


From "Simple Testing Can Prevent Most Critical Failures" [1]: "We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design."

[1] https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
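The paper's point is mundane in the best way. A hypothetical sketch of the kind of trivial test it's describing, which exercises the failure path instead of only the happy path (all names below are invented for illustration):

    # Hypothetical sketch of "simple testing on error handling code":
    # force the failure path and assert the fallback actually works.

    DEFAULT_ROUTES = ["0.0.0.0/0 via backup"]

    def parse(text):
        return [line for line in text.splitlines() if line.strip()]

    def load_routes(read_file):
        try:
            return parse(read_file("routes.conf"))
        except IOError:
            # the "last line of defense" the paper talks about --
            # exactly the code that tends to ship untested
            return DEFAULT_ROUTES

    def test_read_failure_falls_back_to_defaults():
        def failing_read(path):
            raise IOError("disk unavailable")
        assert load_routes(failing_read) == DEFAULT_ROUTES

    test_read_failure_falls_back_to_defaults()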


That article should be required reading for all of us.


Having a separate testing instance of the internet might not be practical. How exactly would you test such a change? Simulating the effect of router commands is a very daunting challenge.
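You can't stand up a test internet, but one (very reduced) approach is to dry-run the change against an in-memory model of the topology and refuse to apply it if a required prefix would stop being announced anywhere. Purely a hypothetical sketch, not how any real backbone tooling works:

    # Hypothetical sketch: dry-run a route change against a model of
    # the network before touching real routers. "announced" maps each
    # site to the prefixes it advertises.

    def apply_change(announced, site, withdraw_prefixes):
        new = dict(announced)
        new[site] = [p for p in new[site] if p not in withdraw_prefixes]
        return new

    def missing_after(announced, required_prefixes):
        visible = {p for prefixes in announced.values() for p in prefixes}
        return [p for p in required_prefixes if p not in visible]

    current = {
        "dc1": ["203.0.113.0/24", "198.51.100.0/24"],
        "dc2": ["203.0.113.0/24"],
    }
    proposed = apply_change(current, "dc1",
                            ["203.0.113.0/24", "198.51.100.0/24"])
    missing = missing_after(proposed, ["198.51.100.0/24"])
    if missing:
        raise SystemExit("refusing to apply: would blackhole %s" % missing)

Obviously this says nothing about BGP convergence, DNS, or out-of-band access, which is exactly why simulating the real effect of router commands is so daunting.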



