
My old man has been working in radio and electronics for decades. We discussed the Amazon outage, and when I told him that two generators had failed, he smiled grimly and muttered "Murphy's Law".

No matter how much you test, you simply cannot know how a system will behave in a critical state until that state is reached.

The other thing here is availability bias. We see the outages, but we don't hear about the near-outages. We never see a true baseline that includes the occasions where the system behaved resiliently, exactly as designed.



"How complex systems fail" by Dr. Richard Cook makes exactly this point – all complex systems are, by definition, running in a degraded mode, with catastrophe just around the corner. They are keep up through a series of gambles – and you never hear about the good ones.

Interestingly, it was written by a physician about medical environments, but it applies to any complex system.

The essay is reproduced in Web Operations by John Allspaw and Jesse Robbins with a web ops spin on it, and is also available online at http://www.ctlab.org/documents/How%20Complex%20Systems%20Fai...


Yes, I started reading into the literature on failures recently because of that exact essay. It's been a great supplement to my reading on systems thinking.


Point 7 is a new idea to me: "Post-accident attribution to a ‘root cause’ is fundamentally wrong."


I disagree with that nostrum -- I wrote about it in nauseating detail here: http://chester.id.au/2012/04/09/review-drift-into-failure/



