
> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.

I found this to be an extremely deceptive conclusion. This makes it sound like the issue was that Facebook's physical security is just too gosh darn good. But the issue was not Facebook's data center physical security protocols. The issue was glossed over in the middle of the blogpost:

> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.

The issue was faulty audit code. It is disingenuous to then attempt to spin this like the downtime was due to Facebook's amazing physec protocols.
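To be concrete about what "faulty audit code" can mean in practice: we have no visibility into Facebook's actual tooling, but purely as a hypothetical sketch, the kind of gate that's supposed to block a destructive command can fail from something as small as an overly narrow match:

    # Hypothetical sketch only -- not Facebook's real tooling.
    # An "audit" gate meant to block commands that would withdraw
    # all backbone routes, whose check is too narrow to catch an
    # equivalent spelling of the same command.

    def audit_command(cmd: str) -> bool:
        """Return True if the command is considered safe to run."""
        # Intended rule: never allow a global route withdrawal.
        # Bug: only one exact spelling is matched, so a variant
        # sails straight through the audit.
        if cmd.strip() == "withdraw routes --all":
            return False
        return True

    def run(cmd: str) -> None:
        if not audit_command(cmd):
            raise PermissionError("blocked by audit: " + cmd)
        print("executing:", cmd)  # the dangerous part

    run("withdraw routes --scope=global")  # not caught; executes anyway

The names and flags above are made up for illustration; the point is only that a single wrong condition in the "last line of defense" is enough.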



The original error was the network command, but the slower response and lengthy outage were partially due to the physical security they put in place to prevent malicious activity. Any event like this has multiple root causes.


Yes, but the blogpost concludes on this relatively tangential note (which conveniently also lets Facebook brag about its security measures) rather than on the fact that their audit code was apparently itself not sufficiently audited, and that's what makes it read as deceptive spin.


I agree that there's an awkward emphasis on how FB prioritizes security and privacy but nothing is deceptive here. Had the audit bug not subsequently cut off access to internal tools and remote regions it would be easy to revert. Had there not been a global outage nobody would have known that the process for getting access in an emergency was too slow.

Huge events like this always have many factors that have to line up just right. To insist that the one and only true cause was a bug in the auditing system is reductive.


> I agree that there's an awkward emphasis on how FB prioritizes security and privacy but nothing is deceptive here.

I guess deceptive was the wrong word, so whatever's the term for "awkward emphasis" :).


Our postmortems have three sections. Prevention, detection, and mitigation. They all matter.

Shit happens. People ship bugs. People fat-finger commands. An engineering team’s responsibility doesn’t stop there. It also needs to quickly activate responders who know what to do and have the tools & access to fix it. Sometimes the conditions that created the issue are within acceptable bounds; the real need for reform is in why it took so long to fix.


No, they just wanted to cover both "what caused it?" and "why did it take too long to fix it?" since both are topics people were obviously extremely interested in.

It would have been surprising and disappointing if they didn't cover both of them.


Seems like appropriate emphasis given how many people yesterday were asking why they weren't back online yet. For every person asking why they deleted their routes, there were two asking why they didn't just put them back.


And as they say, it led to a slower recovery from the event, which was caused, as they also clearly say, by something else. Given that "why is it taking so long to revert a config change?!" was a common comment yesterday, it's relevant to the discussion.

It's disingenuous to point to one paragraph in the article and complain that it doesn't mention the root cause when, earlier in the same article, they had already said "This was the source of yesterday's outage" about something else.


No one except you is trying to spin anything.


Going on at length about how the trade-off between prolonged downtime and strict security protocols is worth it is erecting a nonsensical strawman, the literal definition of spinning a story. The key issue had nothing to do with Facebook's data center security protocols.


No one except you is erecting a nonsensical strawman.


Except, I quoted exactly where they were.


From "Simple Testing Can Prevent Most Critical Failures" [1]: "We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design."

[1] https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
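The paper's point is mundane in the best way. A hypothetical sketch of the kind of trivial test it's describing, which exercises the failure path instead of only the happy path (all names below are invented for illustration):

    # Hypothetical sketch of "simple testing on error handling code":
    # force the failure path and assert the fallback actually works.

    DEFAULT_ROUTES = ["0.0.0.0/0 via backup"]

    def parse(text):
        return [line for line in text.splitlines() if line.strip()]

    def load_routes(read_file):
        try:
            return parse(read_file("routes.conf"))
        except IOError:
            # the "last line of defense" the paper talks about --
            # exactly the code that tends to ship untested
            return DEFAULT_ROUTES

    def test_read_failure_falls_back_to_defaults():
        def failing_read(path):
            raise IOError("disk unavailable")
        assert load_routes(failing_read) == DEFAULT_ROUTES

    test_read_failure_falls_back_to_defaults()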


That article should be required reading for all of us.


Having a separate testing instance of the internet might not be practical. How exactly would you test such a change? Simulating the effect of router commands is a very daunting challenge.
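You can't stand up a test internet, but one (very reduced) approach is to dry-run the change against an in-memory model of the topology and refuse to apply it if a required prefix would stop being announced anywhere. Purely a hypothetical sketch, not how any real backbone tooling works:

    # Hypothetical sketch: dry-run a route change against a model of
    # the network before touching real routers. "announced" maps each
    # site to the prefixes it advertises.

    def apply_change(announced, site, withdraw_prefixes):
        new = dict(announced)
        new[site] = [p for p in new[site] if p not in withdraw_prefixes]
        return new

    def missing_after(announced, required_prefixes):
        visible = {p for prefixes in announced.values() for p in prefixes}
        return [p for p in required_prefixes if p not in visible]

    current = {
        "dc1": ["203.0.113.0/24", "198.51.100.0/24"],
        "dc2": ["203.0.113.0/24"],
    }
    proposed = apply_change(current, "dc1",
                            ["203.0.113.0/24", "198.51.100.0/24"])
    missing = missing_after(proposed, ["198.51.100.0/24"])
    if missing:
        raise SystemExit("refusing to apply: would blackhole %s" % missing)

Obviously this says nothing about BGP convergence, DNS, or out-of-band access, which is exactly why simulating the real effect of router commands is so daunting.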



