The HN headline is quite different from the actual article, which is "Microsoft admits slim staff and broken automation contributed to Azure outage". The headline as submitted is quite focused on the manager vs employee conflict that is so in vogue on HN at the moment, while in the actual headline (and the rest of the article) Microsoft acknowledges they've put too few people on that shift and have already taken mitigations. Mitigations both to increase the team size and to improve the automation that supports them, btw.
The title tag on The Register article starts with "Microsoft blames outage on small staff, automation failures", so it is not something added by the submitter.
That's the heading; the title is still "small staff" for me. The heading is what shows at the top of the page; you'll see the page title in the browser tab's name, among other places.
The word "small" in the HN headline ("Microsoft blames outage on small staff [...]"). I am not native English, but the image in my head was of the staffers being physically small. A bit curious how a native speaker reads it?
As a native speaker I find the phrase "small staff" to be quite awkward and I would avoid writing it.
I'll also note that it's totally standard for theregister.com to have headlines that incorporate puns or colloquial language. In this case the original headline has "slim staff", which is also awkward but evokes a different mental image :-)
To this native speaker, the intended meaning was immediately obvious, but I agree, it is awkwardly phrased. Something like "inadequate staffing levels" would be much better.
It's on brand for The Register, whose editorial staff never miss a chance for a good pun or double entendre, so the awkward phrasing is likely intentional.
This is a good lesson in clear writing, by the way. Take some documentation you've written and try to misunderstand it. Think about synonyms or think about the words in a different context and see how far you can run away from the original meaning.
For example - "original meaning" here is kind of strange. What if the original documentation is wrong or ambiguous? Then we don't want the original meaning, we want the intended meaning. But I'll leave it in here as an example.
Well, the team size increase is only temporary....
> We have temporarily increased the team size from three to seven, until the underlying issues are better understood, and appropriate mitigations can be put in place.
> Microsoft also had trouble understanding why its storage infrastructure didn't come back online.
> Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down.
Should've self-hosted it instead of trusting some cloud vendor.
At this point, after dealing with AWS and Azure for the last decade, I concur. We had lower TCO, higher aggregate reliability, lower staffing costs, and, importantly, far far fewer problems when we had two redundant datacentre cages.
I think our costs are around 7x what we had before with no material improvements.
Cloud vendors should focus more on "low/mid level" stuff (hardware and basic virtualization) and less on all the abstractions that should be left to end users or third-party software vendors.
This is what co-location does. Some hosting companies still do that, so look for those; you bring your own hardware appliances and just use their facilities and uninterruptible power supplies.
Anyway, with "abstractions" I mean all services like AWS App Runner that are build on top of foundational services (ec2, ebs, s3, vpc) that drains resources and money at the expense of the low level stuff.
So the root cause is incompetence. They didn't have enough people and they didn't have a fully tested recovery plan, relying on assumptions and winging it.
"It is difficult to get a man to understand something, when his salary depends upon his not understanding it"
Middle managers aren't going to give up their salaries when there are perfectly good underlings to sacrifice first, especially when they can just tell the Chat GPT to do the codes like they read in that ebook they bought last night.
It'll never be that simple. Middle managers want to imagine and dream of a place where AI is the bottom tier of their worker category, but at the end of the day it makes them the bottom tier.
A way I've always explained it to people is that ChatGPT is based on our knowledge, and if that knowledge is never improved by humans constantly updating it, ChatGPT will never improve; it will just make up shit to fill in the blanks, and it will not be free of error. AI can lose coherency if the data it is training on is full of errors, like data death from compressing what has already been compressed over and over again.
The report says the cooling issue caused "a loss of service availability for a subset of [one] Availability Zone".
How did a single-AZ failure cause outages for two dozen services?
Why did a single-AZ failure mean "approximately half of Cosmos DB clusters in the Australia East region were either down or heavily degraded" and require those clusters to do a cross-region failover?
From an armchair point of view, Microsoft seems like the kind of company that has armies of middle managers whose job is to be seen moving through offices with sheaves of paper and a cup of coffee, or other kinds of bullshit jobs.
Given that 3 staff were inadequate to cover the physical infrastructure whereas 7 are not, clearly the 3 staff needed to try harder to be in 2.3 places at the same time.
I suspect the big dark secret of the public cloud is that some regions/AZs are not all that big, and that there is no secret tech that the big players have and the smaller regional players don't when it comes to the actual physical layer.
Just scanning the actual report / postmortem, they don't seem to "blame" the staff at all - that's The Register's journalistic freedom / interpretation at play. The only thing mentioned about staff size is the following:
> Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.