I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.
I just got gigabit bidirectional fiber at home, and honestly, if I were doing personal stuff or very early bootstrapping I'd just host from here with a good UPS. No, it wouldn't be data center reliability, but it'd work at least until I was ready to put in something more resilient.
You can pay for a business class fiber link too. It's about twice as expensive but they have guaranteed outage response times which is really what you pay for.
> enthusiasts would be better served DIY; put a beige box in a local colo
I mean, like, can I provision a zero ops bit of compute from <mystery colo provider> for $20/month?
Edit: looked up colo providers in my city: “get started in 24 hours, pick a rack and amperage, schedule a call now.” Yeaaah, no. This is why people use cloud providers instead.
The thing is, running a good SaaS service requires quite a bit of staff with hard operational skills and a lot of manpower. You know, the kinda stuff people always call useless, zero-value-add, blockers, something to automate away entirely.
Sure, we have most of the day-to-day grunt work for our applications automated. But good operations is just more than that. It's about maintaining control over your infrastructure on the one hand, and making sure your customers feel informed and safe about their data and systems on the other. This is hard and takes a lot of experience to do well, as well as manpower.
And yes, that's entirely a soft skill. You end up with questions such as: should we elevate this issue to an outage on the status page? To a degree you'd be scaring other customers. "Oh no, yellow status page. Something terrible must be happening!" At the same time you're communicating to the affected customers just how seriously you're taking their issue. "It's a thing on the status page after an initial misjudgement - sorry for that." We have many discussions like that during degradations and outages.
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.
We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
All this feedback matters. We hear it even when we drop the ball communicating.
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems "fatal" enough to require manual intervention.
Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and even then we'd just live migrate KVM instances around as needed).
Not that nothing will fail - but some manufacturers just have really good fault management, monitoring, alerting, etc.
And even the simplest shit like SNMP with a few custom MIBs from the vendor helps (though some vendors do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff too, on top of the usual SNMP and other nice built-in Easy Buttons.
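To make the Redfish bit concrete, here's a minimal sketch of polling a BMC for an overall system health rollup. The BMC address, credentials, and self-signed-cert handling are placeholders of my own; only the /redfish/v1/Systems layout comes from the Redfish standard, and the exact member path varies by vendor.

```python
# Minimal sketch: ask a Redfish BMC for its system health rollup.
# The BMC address and read-only account below are placeholders.
import requests

BMC = "https://bmc.example.internal"      # placeholder BMC address
AUTH = ("monitor", "secret")              # placeholder read-only credentials

def system_health(bmc: str) -> str:
    # BMCs usually ship self-signed certs, hence verify=False in this sketch.
    systems = requests.get(f"{bmc}/redfish/v1/Systems", auth=AUTH, verify=False).json()
    member = systems["Members"][0]["@odata.id"]   # first (usually only) system
    info = requests.get(f"{bmc}{member}", auth=AUTH, verify=False).json()
    return info["Status"]["Health"]               # "OK", "Warning", or "Critical"

print(system_health(BMC))
```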
And today's tooling with bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you just absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup can migrate large VMs in mere minutes, resuming on the remote end with a blip of barely 1ms.
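For the migration piece, here's a rough sketch with the libvirt Python bindings of what draining a host can look like. The node names, the qemu+ssh transport, and the flag choices are assumptions; real setups also tune bandwidth limits and storage handling.

```python
# Rough sketch: drain one KVM host by live-migrating its running guests.
# Host URIs are placeholders; shared or mirrored storage is assumed.
import libvirt

SRC = "qemu+ssh://node-a.example/system"   # host being drained (placeholder)
DST = "qemu+ssh://node-b.example/system"   # target host (placeholder)

src = libvirt.open(SRC)
dst = libvirt.open(DST)

for dom in src.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    # VIR_MIGRATE_LIVE keeps the guest running; the switchover pause on a
    # fast link is typically in the low milliseconds.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER,
                None, None, 0)
    print("migrated", dom.name())
```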
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales.
Supermicro is good on a budget, but fault management isn't fully mature, SNMP and Redfish coverage aren't complete, and they can suddenly EOL a new line of gear.
Have not - looks nice though. Around here, you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments.
You can get device manifests before the gear even ships, including MAC addresses, serials, out-of-band NIC MACs, etc. We pre-stage our configurations based on this and have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS).
We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
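As a sketch of that pre-staging step, here's what turning a vendor manifest into dnsmasq DHCP reservations could look like. The CSV columns and file names are assumptions of mine, not any vendor's actual manifest format.

```python
# Sketch: generate dnsmasq host reservations from a shipping manifest so
# DHCP/DNS is ready before the servers arrive. Column names are assumed.
import csv

MANIFEST = "shipment-manifest.csv"   # assumed columns: serial,nic_mac,bmc_mac,rack,ru

with open(MANIFEST, newline="") as f, open("dnsmasq-hosts.conf", "w") as out:
    for row in csv.DictReader(f):
        host = f"node-{row['rack']}-{row['ru']}"
        # One reservation for the in-band NIC, one for the out-of-band BMC.
        out.write(f"dhcp-host={row['nic_mac']},{host}\n")
        out.write(f"dhcp-host={row['bmc_mac']},{host}-bmc\n")
```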
Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.
https://community.fly.io/t/reliability-its-not-great/11253#s...