My old man has been working in radio and electronics for decades. We discussed the Amazon outage and when I told him two generators had failed, he smiled grimly and muttered "Murphy's Law".
No matter how much you test, you simply cannot know how a system will behave in a critical state until that state is reached.
The other thing here is availability bias. We see the outages, but we don't hear about the near-outages. We're not seeing a true baseline for the occasions where the system behaved resiliently according to its design.
"How complex systems fail" by Dr. Richard Cook makes exactly this point – all complex systems are, by definition, running in a degraded mode, with catastrophe just around the corner. They are keep up through a series of gambles – and you never hear about the good ones.
Interestingly, it's written by a doctor about medical environments, but it applies to any complex system.
Yes, I started reading into the literature on failures recently because of that exact essay. It's been a great supplement to my reading on systems thinking.