Hacker News

One of the main reasons that you don't get a good post mortem is that it's often directly related to human error or a severe defect brought about by a decision to cut costs. There have been several huge outages in UK banking over the past few years and management have a convenient excuse in blaming "legacy systems". While I respect private enterprise, for systems of systemic importance I would like to see some legislation to implement a mandatory and detailed post mortem like you propose.


Delta being down for four hours is not an issue of "systemic importance". Safety-related issues are already (and rightfully) publicly dissected. It's hard to see what a post-mortem would accomplish here: the cause is likely to be the interplay of a couple of subtle, individually benign problems that became a big issue only when the power went out. Publishing it would mostly enable a large chorus of armchair experts to pronounce on how this is obviously the fault of evil cost-cutting management/because they didn't use the cloud/because they used the cloud/fixed in the latest version of node.js.


I did not say that Delta being down was of "systemic importance". However, a bank being unable to send or receive payments for a week is; hence I would like a regulator to decide what counts as systemically important, e.g. banking, utilities, telecoms. We are seeing more and more cases where broadband is out, banking is out, etc. and the root cause is never made transparent.


If a bank's systems are down consistently, use a different bank. Where public safety is not a concern, it is not any of your business as a private citizen what the inner workings of a private enterprise are, including the cause of outages even if they affected you directly.

That being said, of course I am much more likely to patronize businesses who do give me this type of information. Given a choice between a bank with satisfactory service and amazing transparency, and a bank with great service and no transparency, I will choose the former every time.

But government regulators opening up the working of private businesses is not a proportional response to your bill pay being unavailable for a few days.


A ticketing system may not be "safety" critical, but its malfunction can cause tremendous discomfort and financial strain on people's lives. Being stranded far from home, missing an important event, or being put through hours of incredibly stressful situations is inexcusable.


Then don't ever fly Delta. Consider this your official notice.


You look at it from a narrow consumer perspective. If a bank's IT systems fail for a period, the systemic importance clause kicks in, because the wider economy may be affected. Regulation already exists: you can't fix interest rates, you have to treat consumers fairly, etc. In today's modern world I want regulation on how some of these enterprises manage the technology which underpins their operations.


>it's often directly related to human error or a severe defect brought about by a decision to cut costs

I wonder if companies do cost-benefit analysis on those kinds of situations. "We saved £X by cutting costs; outages as a result of cost cutting cost us £Y." You would think that at least stockholders would insist.


In a past life I was an IT consultant and good organizations do ask this question. Sometimes you have to ask the question like that to get business leaders (or board members) to pay attention to that "new IT spending".


No because doing that would mean knowing what happened and spending time doing a post mortem.


This may come as a shock but most people want to do a good job, even if they work for Citi or Delta or Comcast or wherever.

Okay, maybe not Comcast.


I never said the contrary. But tons of these systems are old, and the people who know how they work are long gone.

Plus doing a proper post mortem and understanding how a system works would take a lot of time and money, things they most of the time do not have budget for.


I love doing that =D

A while ago (when I was in aerospace) there was a senior person who gave us the price of insurance per head.

Given that we're talking about 1M€ per head and that a plane can easily carry 200 people, I have a hard time finding a situation where I can expect a return on investment from doing things poorly and cheaply. We're talking many millions of payout.

(And yet, I'd love to have such a situation, it's always good to attract the attention of attendees at conference or student classrooms).

Then later I realized that I'd made a mistake in my calculation... I was only accounting for A SINGLE plane failing and killing everyone. In the real world, a bad component would be shipped to a whole batch of planes, and they'd drop randomly like flies.

That batch-effect drastically increases the cost of failure. So... I really don't see ANY place where there is a ROI on people dying.
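The batch effect described above can be put in back-of-the-envelope form. A minimal sketch: every number below except the 1M€-per-head insurance figure quoted earlier is a made-up assumption for illustration, not real airline data:

```python
# Illustrative expected-cost comparison: cheap-component savings vs. liability.
# All figures except payout_per_head are hypothetical assumptions.
payout_per_head = 1_000_000   # € per passenger (the insurance figure quoted above)
passengers = 200              # seats on a typical plane
planes_in_batch = 50          # a bad component ships to the whole batch
p_fatal_failure = 0.02        # assumed per-plane chance of a fatal failure
savings_per_plane = 100_000   # assumed € saved per plane by the cheaper component

# Expected liability scales with the whole batch, not a single plane.
expected_payout = p_fatal_failure * planes_in_batch * passengers * payout_per_head
total_savings = savings_per_plane * planes_in_batch

print(f"Expected payout: {expected_payout:,.0f} EUR")  # 200,000,000 EUR
print(f"Total savings:   {total_savings:,.0f} EUR")    # 5,000,000 EUR
```

Even with these deliberately mild assumptions, the expected payout dwarfs the savings by a factor of 40, which is the point: the batch effect multiplies the downside while the savings stay linear.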

BAZINGA!!! I'm from Europe and people are expensive there. The trick is to manufacture things for the 3rd world market where people get peanuts when they die (sometimes even nothing) :D

But then again... we're back to square one, how could I build parts AND guarantee that they're only shipped to planes in low-cost locations? At that point, it's probably closer to a plane-by-plane targeted sabotage rather than a scheme of doing minimum-shit-to-close-sales. And when doing plane-by-plane there is no economy of scale :(

... Anyway! Just wanted to say that any analysis would clearly reveal that THE PLANE IS CHEAPER WHEN IT FLIES!!! (and your life is truly expensive) so you're fine. You can keep on using airplanes ;)

Also, people at work do have a conscience. Once in a while they may forget the importance of what they do, but luckily there is a plane crash every few months to remind them that there are lives at stake, for real (in case it was ever forgotten).


Another big reason is security - the less a nefarious actor knows about a system, the better (at least from the perspective of a large, closed organization).

Now we know this is not always true, but even giving out details about schema, data-organization or technologies used could provide more detail than is necessary.

We could argue the merits of this approach - I can certainly sympathize with the idea - but security by obscurity is still a rampant ethos in huge corporations. I wouldn't be surprised if TSA also enforced a bit of this as well.


This makes sense in a vacuum, but it is pretty easy to figure out any decent sized company's tech stack just by browsing their IT group's LinkedIn profiles.


Good postmortems should be created for human-induced errors or top-down decisions to cut costs. Especially for those.


You don't need a post-mortem when you already know what the problem was.


If you knew what the problem was before the outage happened, and didn't consciously put in the effort to prevent it, then oh well, that's professional negligence. Maybe not on you, because you may not have the decision power to apply large-scale changes, but definitely on somebody within the company. And if you didn't know what caused the outage, and you conducted some type of investigation to get at the heart of the problem, then you need a postmortem.

That's one of the main goals of a postmortem: document what the cause of the problem was, not just for you, but for all interested parties. If some human caused the issue, you need to document what happened and what should be done to prevent it from re-occurring. And fix it, of course.


Can't we all just assume from now on that anything bad that happens is because of some shitty CEO cost-cutting everything to line their pockets? Because 99% of the time that's what's happening.


Financial outages get punished by investors anyway, and they all want to keep their competitors from knowing how they operate, so I'm not sure there is a strong interest in getting this right.

In the case of airlines though, they have a strong collective interest in avoiding incidents.



