The HN headline is quite different from the actual article, which is "Microsoft admits slim staff and broken automation contributed to Azure outage". The headline as submitted is quite focused on the manager vs employee conflict that is so in vogue on HN at the moment, while in the actual headline (and the rest of the article) Microsoft acknowledges they've put too few people on that shift and have already taken mitigations. Mitigations both to increase the team size and to improve the automation that supports them, btw.
The title tag on The Register article starts with "Microsoft blames outage on small staff, automation failures", so it is not something added by the submitter.
That's the heading; the title is still "small staff" for me. The heading is what shows at the top of the page; you'll see the page title in the browser tab's name, among other places.
The word "small" in the HN headline ("Microsoft blames outage on small staff [...]"). I am not native English, but the image in my head was of the staffers being physically small. A bit curious how a native speaker reads it?
As a native speaker I find the phrase "small staff" to be quite awkward and I would avoid writing it.
I'll also note that it's totally standard for theregister.com to have headlines that incorporate puns or colloquial language. In this case the original headline has "slim staff", which is also awkward but evokes a different mental image :-)
To this native speaker, the intended meaning was immediately obvious, but I agree, it is awkwardly phrased. Something like "inadequate staffing levels" would be much better.
It's on brand for The Register, whose editorial staff never miss a chance for a good pun or double entendre, so the awkward phrasing is likely intentional.
This is a good lesson in clear writing, by the way. Take some documentation you've written and try to misunderstand it. Think about synonyms or think about the words in a different context and see how far you can run away from the original meaning.
For example - "original meaning" here is kind of strange. What if the original documentation is wrong or ambiguous? Then we don't want the original meaning, we want the intended meaning. But I'll leave it in here as an example.
Well, the team size increase is only temporary....
> We have temporarily increased the team size from three to seven, until the underlying issues are better understood, and appropriate mitigations can be put in place.
> Microsoft also had trouble understanding why its storage infrastructure didn't come back online.
> Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down.
Should've self-hosted it instead of trusting some cloud vendor.
At this point, after dealing with AWS and Azure for the last decade, I concur. We had lower TCO, higher aggregate reliability, lower staffing costs, and, importantly, far far fewer problems when we had two redundant datacentre cages.
I think our costs are around 7x what we had before with no material improvements.
Cloud vendors should focus more on "low/mid level" stuff (hardware and basic virtualization) and less on all the abstractions that should be left to end users or third-party software vendors.
This is what co-location does. Some hosting companies still do that, so look for those; you bring your own hardware appliances and just use their facilities and uninterruptible power supplies.
Anyway, with "abstractions" I mean all services like AWS App Runner that are build on top of foundational services (ec2, ebs, s3, vpc) that drains resources and money at the expense of the low level stuff.
So the root cause is incompetence. They didn't have enough people and they didn't have a fully tested recovery plan, relying on assumptions and winging it.
"It is difficult to get a man to understand something, when his salary depends upon his not understanding it"
Middle managers aren't going to give up their salaries when there are perfectly good underlings to sacrifice first, especially when they can just tell the Chat GPT to do the codes like they read in that ebook they bought last night.
It'll never be that simple. Middle managers want to imagine and dream of a place where AI is the bottom tier of their worker category, but at the end of the day it makes them the bottom tier.
A way I've always explained it to people is that ChatGPT is based on our knowledge, and if that knowledge is never improved by humans constantly updating it, ChatGPT will never improve; it will just make up shit to fill in the blanks, and it will not be free of error. AI can lose coherency if the data it is training on is full of errors, like data death from compressing what has already been compressed over and over again.
The report says the cooling issue caused "a loss of service availability for a subset of [one] Availability Zone".
How did a single-AZ failure cause outages for two dozen services?
Why did a single-AZ failure mean "approximately half of Cosmos DB clusters in the Australia East region were either down or heavily degraded" and require those clusters to do a cross-region failover?
From an armchair point of view, Microsoft seems like the kind of company that has armies of middle managers whose job is to be seen moving through offices with sheaves of paper and a cup of coffee, or other kinds of bullshit jobs.
Given that 3 staff were inadequate to cover the physical infrastructure whereas 7 are not, clearly the 3 staff needed to try harder to be in 2.3 places at the same time.
I suspect the big dark secret of the public cloud is that some regions/AZs are not all that big, and that there is no secret tech that the big players have and the smaller regional players don't when it comes to the actual physical layer.
Just scanning the actual report / postmortem, they don't seem to "blame" the staff at all - that's The Register's journalistic freedom / interpretation at play. The only thing mentioned about staff size is the following:
> Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.