Presumably the DNS being down also wreaks havoc in their internal infrastructure...

rightbyte · on Oct 4, 2021

I wonder if Facebook has circular 'boot' dependencies on their microservices or something? I.e. they can't restart stuff now when everything is down.

clon · on Oct 4, 2021

For sure. Reminds me of the difficulty of restarting a power grid after a total blackout, bringing generators and power stations back into sync...

kccqzy · on Oct 4, 2021

Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
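To make the "circular boot dependency" idea concrete, here is a minimal sketch of how one might detect such a loop, assuming the dependencies are written down as a plain mapping from each service to the services it needs at startup. The service names and the `find_cycle` helper are hypothetical; nothing here reflects Facebook's actual tooling.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None if there is none."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for service in deps:
        if state.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Hypothetical boot-time dependencies: each service lists what it needs to start.
boot_deps = {
    "dns": ["config"],
    "config": ["auth"],
    "auth": ["dns"],
    "web": ["auth"],
}

print(find_cycle(boot_deps))
```

Running a check like this in CI against a declared dependency graph is one way to enforce the discipline mentioned above, though it only catches dependencies that someone actually declared.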

samhw · on Oct 4, 2021

This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).

The most successful systems rely on the property of feedback (https://en.wikipedia.org/wiki/Feedback): evolution, untrained learning, genetic algorithms, the diagonal arguments (https://en.wikipedia.org/wiki/Diagonal_argument), artificial general intelligence (https://en.wikipedia.org/wiki/Technological_singularity), financial markets according to no less than George Soros (https://en.wikipedia.org/wiki/Reflexivity_(social_theory)#In...), etc.

That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to put a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until something goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).
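For reference, the basic Lotka-Volterra model in its standard textbook form, with x the prey population, y the predator population, and α, β, γ, δ positive constants:

```latex
\begin{align}
\frac{dx}{dt} &= \alpha x - \beta x y \\
\frac{dy}{dt} &= \delta x y - \gamma y
\end{align}
% Nonzero equilibrium (set both derivatives to zero):
%   x^* = \gamma / \delta, \qquad y^* = \alpha / \beta
```

Even this simplest version has a counterintuitive property: the equilibrium prey population x* depends only on the predator's parameters, so raising the prey growth rate α leaves x* unchanged and raises the equilibrium predator population y* instead. The paradox of enrichment itself is usually shown in an extended model with a prey carrying capacity and a saturating predation term, which this basic form does not include.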

qeternity · on Oct 4, 2021

Internal services using public DNS records?

msbarnett · on Oct 4, 2021

Probably not, but their external and internal DNS may share infrastructure that's at the root of the failure.

qeternity · on Oct 4, 2021

Yikes, seems like an easy redundancy split.

fragmede · on Oct 4, 2021

It seems like an easy redundancy split, but imagine driving two cars down the freeway at the same time because you got a flat tire in one the other day.

To actually be redundant you need two sets of infrastructure to serve from, and if the internal one goes down, the external one is basically useless anyway, because internal resolution is down. Capacity planning (because you're inside Facebook and can't pretend that all data centers everywhere are connected via an infinitely fast network) becomes twice as much work. Rolling out updates for a couple thousand teams isn't trivial in the first place, and now you have to cordon them off appropriately?

I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.
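As a toy illustration of what the "split" means on the client side, here is a sketch assuming two independently operated resolver groups that a client tries in order. Everything in it is hypothetical: the group names, the fake resolvers, and the `resolve_with_fallback` helper. It shows only the easy part and none of the capacity planning or rollout work described above.

```python
from typing import Callable, List, Optional, Tuple

# A resolver takes a hostname and returns an address, returns None if it has
# no record, or raises OSError if the resolver itself is unreachable.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(
    name: str, groups: List[Tuple[str, List[Resolver]]]
) -> Tuple[str, str]:
    """Try each resolver group in order; return (group_name, address) for the first answer."""
    for group_name, resolvers in groups:
        for resolve in resolvers:
            try:
                address = resolve(name)
            except OSError:
                continue  # this resolver is down, try the next one
            if address is not None:
                return group_name, address
    raise LookupError(f"no resolver group could resolve {name!r}")


# Fake resolvers standing in for real infrastructure (assumptions for the demo).
def broken_resolver(name: str) -> Optional[str]:
    raise OSError("resolver unreachable")


def static_resolver(name: str) -> Optional[str]:
    # 192.0.2.0/24 is reserved for documentation examples.
    return {"tools.internal.example": "192.0.2.10"}.get(name)


groups = [
    ("primary", [broken_resolver]),
    ("independent-backup", [static_resolver]),
]

print(resolve_with_fallback("tools.internal.example", groups))
```

The fallback only helps if the backup group really shares nothing with the primary one, which is exactly where the cost comes from.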

qeternity · on Oct 4, 2021

Yes, all of that (imo) is an argument in favor.

I never said it was free, but it's worth it as long as it's cheaper than failure.

I don't keep backups because I enjoy having multiple copies of my data. I do it because losing that data would be devastating.