
Initially, it seemed like a DoS to us too, but it was not. This was confirmed by upstream provider metrics: no major traffic spikes. It was a combination of non-malicious things. More info later; some of us need sleep.


DNS was actually the first alert we received, but alas was not the cause.


I'm a little surprised at the response here.

I feel like there is an element of "Body Doubling" here... a strategy used by those with ADD/ADHD. I recently looked into this when curious about my own observation that I work longer and with better focus when working in close proximity to someone else.


They were in two mirrors, each mirror in a different server. Each server in different racks in the same row. The servers were on different power circuits from different panels.


You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.


Let me narrow my guess: they hit 4 years, 206 days and 16 hours... or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).


These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032... I can't be precise because they are dead, and I am going off of when the hardware configurations were entered into our database (which could have been before they were powered on) or when we handed them over to HN, and when the disks failed.

Unbelievable. Thank you for sharing your experience!


Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.


This morning, I googled for issues with that firmware and model of SSD and got nothing. But now I am searching for "40000 hours SSD" and getting a million relevant results. Of course, why would I have searched for 40000 hours?

This thread is making me feel a lot less crazy.


I'm hoping that deep in your spam folder is a critical firmware update notice from Dell/EMC/HP/SanDisk from 2 years ago :).


There are times I don't miss dealing with random hardware mystery bullshit.

This one is just ... maddening.


This kind of thing is why I love Hacker News. Someone runs into a strange technical situation, and someone else happens to share their own obscure, related anecdote, which just happens to precisely solve the mystery. Really cool to see it benefit HN itself this time.


It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)


Easy to imagine why this didn't capture people's attention in late March 2020…


Yes, an enterprisey firmware update - all very boring until BLAM!


Was HN an indirect casualty of Covid?


Interesting how something that is so specifically and unexpectedly devastating, yet known for such a long time without any serious public awareness from companies involved, is referred to as a "bug".

It makes you lose data and forces you to purchase new hardware; where I come from, that's usually referred to as "planned" or "convenient" obsolescence.


The difference between planned and convenient seems to be intent. And in this context that difference very much matters. I wouldn’t conflate the two.


Depends on who exactly we are talking about as having the intent...

Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability, it just becomes normal practice.


> Depends on who exactly we are talking about as having the intent...

The manufacturer, obviously. Who else would it be?

Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause. Which includes intent.


Popularity is a very poor relevance / truth heuristic.


I wanted to upvote this comment but that just feels wrong.


You're a good man, Charlie Brown.


I wonder if it might be closer to 40,032 hours. The official Dell wording [1] is "after approximately 40,000 hours of usage". 2^57 nanoseconds is 40031.996687737745 hours. Not sure what's special about 57, but a power of 2 limit for a counter makes sense. That time might include some manufacturer testing too.

[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
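As a quick sanity check on the arithmetic above (the 2^57 figure is the parent comment's hypothesis, not anything from the vendor), the conversion works out:

```python
# How many hours is 2^57 nanoseconds?
NS_PER_HOUR = 3_600 * 1_000_000_000  # 3.6e12 ns in an hour

hours = 2**57 / NS_PER_HOUR
print(hours)  # 40031.996687737745 -- just past the reported ~40,000-hour mark
```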


It might not be nanoseconds, but something that is a power-of-2 number of nanoseconds going into an appropriately small container seems likely. For example, a 62.5MHz counter going into 53 bits breaks at the same limit. Why 53 bits? That's where things start to get weird with IEEE doubles: adding 1 no longer changes the value, because the increment doesn't fit in the mantissa. So maybe someone was doing a bit of fp math to figure out the time or schedule the next event? Anyway, very likely some kind of clock math that wrapped or saturated and broke a fundamental assumption.
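A quick check of that alternative (the 62.5 MHz and 53-bit figures are this comment's speculation, not documented facts): a 53-bit tick count at 62.5 MHz saturates at exactly the same point as 2^57 ns, since 62.5 MHz is 1 GHz / 16 and 2^57 / 16 = 2^53.

```python
TICK_HZ = 62.5e6   # hypothetical 62.5 MHz counter
MAX_TICKS = 2**53  # where IEEE doubles stop representing every integer

seconds = MAX_TICKS / TICK_HZ
print(seconds / 3600)  # 40031.996687737745 -- identical to the 2^57 ns limit
```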


53 is indeed a magic value for IEEE doubles, but why would anybody count an inherently integer value with floating-point? That's a serious rookie mistake.

Of course there's no law that says SSD firmware writers can't be rookies.
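For anyone who hasn't run into this: beyond 2^53, a double's 53-bit significand can no longer represent every integer, so incrementing a float "counter" by one can silently do nothing:

```python
t = 2.0**53  # last point at which doubles can still represent every integer

# Adding 1 is lost to rounding; the counter stops advancing.
print(t + 1 == t)  # True
print(t + 2 == t)  # False -- 2^53 + 2 is even, so it is still representable
```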


Full stack JS, everything is a double down to the SSD firmware!


See! People should register via email for those important notifications! (Or alternatively, do quarterly checks that your firmware is up to date.)


A lot of companies have teams dedicated to hardware that don’t give a shit about it. And their managers don’t give a shit.

Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.

Basically, IT is cancer from the head down.



Do they use SSDs on space missions as well?


Only for 4 years, 206 days and 16 hours.


Is this leased to HN as dedicated/bare-metal servers, or is it colocation (i.e., HN owns the hardware)?


The former.


It's concerning that a hosting company was unaware of the 40,000-hour situation with the SSDs it was deploying. Anyone in hosting should have been aware of this, or at least kept a better grip on happenings in the market.


Yeah, this is why you run all equipment in a test environment for 4.5 years before deploying it to prod. Really basic stuff.


The drive makers started issuing warnings in 2020... this was foreseeable.


How many other customers will/have hit this?


Every large DC will have hit it (Amazon, Facebook, Google, etc). But it's a shame that all their operational knowledge is kept secret.


I understand Backblaze is more HDD than SSD, but perhaps they might have some level of awareness.


It was part of a mirror of identical SSDs on an LSI MegaRAID RAID card. We see occasional "spectacular" drive failures that take the machine down with a single disk failure. Usually it's just a reboot to come back up, a disk replacement, and then some hours to rebuild the array and get back to situation nominal.


People guess the origin of our name often. Maybe this will give you even more of a chuckle. I was not aware of the name of this computer when I named the company. https://en.m.wikipedia.org/wiki/The_Ultimate_Computer


It was an SSD. A 1.6TB SAS3 SSD. (M5 CEO here)


Stop making stuff up guys, I just know that someone at the YCombinator HQ tripped over the power cable of the Raspberry Pi you're hosting this on.


Unrelated issues, but I did hear from our other clients that O365 was having issues at the same time as our network outage affected HN and many others.


Founder and CEO of M5 Hosting here. We did have a network outage today that affected Hacker News. As with any outage, we will do an RCA and we will learn and improve as a result.

I'm a big fan of HN and YC in general, we host other YC alumni, and I have taken a few things through YC Startup School. During this incident, I spoke to YC personally when they called this morning.


We have been using M5 Hosting for one of our servers since 2011. They have been extremely reliable up until today. Based on what was posted about the Hacker News server setup, we have something similar. We have a "warm spare" server in a different data center. We use Debian, not FreeBSD.

We are in the process of slowly moving to a distributed system (distributed DB) that is going to make failover easier. However, that kind of setup is orders of magnitude more complex than the current (manual failover) setup. I really wonder if the planned design is going to be more reliable in practice. Complexity is almost always a bad idea, in my experience. Distributed systems are just fundamentally very complicated.


Oh hi! Thank you for the kind words. I can't tell who you are by your name here, but if you've been with us since 2011, we have certainly spoken. Are you using our second San Diego data center for your failover location? If you and I aren't already talking directly, ask to speak with Mike in your ticket.


I had used M5 some years ago to host an online rent payment / property management app. I have nothing but positive things to say about that experience. We once had an outage on our single server that was our own fault, and they had someone go in, in the middle of the night, to reboot it for us... and we weren't even on an SLA.


Thank you for sharing your positive experience! We can power-cycle outlets remotely and can connect a console (IP KVM), and we are staffed 24x7... in case you need another server. Thanks again!

