
Presumably the DNS being down also wreaks havoc in their internal infrastructure as services can no longer resolve each other's names.


I wonder if Facebook has circular 'boot' dependencies on their microservices or something? I.e. they can't restart stuff now when everything is down.


For sure. Reminds me of the difficulties of starting a power grid from total blackout, bringing generators and power stations back into sync...


Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
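
For what it's worth, spotting these loops is mechanically easy once you have the dependency graph written down; the hard part is keeping that graph honest. A minimal sketch, assuming you can export "service X needs service Y to start" as a plain dict (the service names below are made up, and this says nothing about how Facebook actually tracks dependencies):

```python
from typing import Dict, List, Optional


def find_boot_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None.

    deps maps a service to the services it needs before it can start.
    Uses a depth-first search with the usual white/grey/black marking.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    state: Dict[str, int] = {}
    stack: List[str] = []  # services on the current DFS path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        stack.append(node)
        for dep in deps.get(node, []):
            seen = state.get(dep, WHITE)
            if seen == GREY:
                # dep is already on the current path, so we closed a loop
                return stack[stack.index(dep):] + [dep]
            if seen == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for service in deps:
        if state.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Hypothetical services, purely for illustration.
services = {
    "dns": ["config"],
    "config": ["auth"],
    "auth": ["dns"],
    "web": ["auth"],
}
print(find_boot_cycle(services))  # ['dns', 'config', 'auth', 'dns']
```

Running something like this in CI only catches the dependencies people remembered to declare, which is exactly where the discipline part comes in.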


This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).

The most successful systems rely on the property of feedback (https://en.wikipedia.org/wiki/Feedback): evolution, untrained learning, genetic algorithms, the diagonal arguments (https://en.wikipedia.org/wiki/Diagonal_argument), artificial general intelligence (https://en.wikipedia.org/wiki/Technological_singularity), financial markets according to no less than George Soros (https://en.wikipedia.org/wiki/Reflexivity_(social_theory)#In...), etc.

That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to put a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until something goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).
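
For anyone who hasn't seen them, the basic Lotka-Volterra model in its usual textbook form, with x the prey population, y the predator population, and all four parameters positive (the symbol names are just the common convention):

```latex
\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta x y \\
\frac{dy}{dt} &= \delta x y - \gamma y
\end{aligned}
\qquad\Longrightarrow\qquad
x^{*} = \frac{\gamma}{\delta}, \quad y^{*} = \frac{\alpha}{\beta}
```

Setting both derivatives to zero gives the coexistence equilibrium on the right. Note that the prey growth rate α does not appear in x* at all: in this simple model, making life better for the prey raises the equilibrium predator population instead. The paradox of enrichment itself is usually shown with a richer variant of the model than this one, so treat this as a taste of the counterintuitive behavior, not a derivation of it.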


Internal services using public DNS records?


Probably not, but their external and internal DNS may share infrastructure that's at the root of the failure


Yikes, seems like an easy redundancy split.


It seems like an easy redundancy split, but imagine driving two cars down the freeway at the same time just because you got a flat tire in one the other day.

To actually be redundant you need two sets of serving infrastructure, and even then, if the internal one goes down, the external one is basically useless because internal resolution is down anyway. Capacity planning (because you're inside Facebook and can't pretend that all data centers everywhere are connected via an infinitely fast network) becomes twice as much work. Rolling out updates for a couple thousand teams isn't trivial in the first place, and now you have to cordon them off appropriately too?

I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.


Yes, all of that (imo) is an argument in favor.

I never said it was free, but it's worth it as long as it's cheaper than failure.

I don't keep backups because I enjoy having multiple copies of my data. I do it because losing that data would be devastating.



