You're missing the point that maybe, just maybe, I'm part of a team that looks after >5 million servers.
You might also divine that while TCP can be a problem, a bigger problem is data affinity. Shuttling data from the next-door rack costs less than from one in the hall next door, and significantly less than from the next datacentre over. With each internal hop, the risk of congestion increases.
You might also divine that changing everything from TCP to a new, untested protocol across all services, with all the associated engineering effort plus translation latency, might not be worth it, especially as all your observability and protocol-routing tools would stop working.
Quick maths: a faster top-of-rack switch possibly costs about the same as 5 days' wages for a mid-level Google engineer. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol and have it stable and observable?
As a side note, "oh but they are Google" is not a selling point. Google has Google problems, half of which relate to their performance/promotion system, which penalises incremental changes in favour of $NEW_THING. HTTP/2 was also a largely Google-led effort designed to tackle latency over lossy network connections, which it fundamentally didn't do, because a whole bunch of people didn't understand how TCP worked and were shocked to find out that mobile performance was shit.
For a future post, please write about how typical cloud customers can design for better data affinity.
Or is it just handled by the provider?
FWIW, at a prev gig, knowing nothing about nothing, I finally persuaded our team to colocate a Redis process on each of our EC2 instances (alongside the HTTP servers). Quick & dirty solution to meet our PHB's silly P99 requirements (for a bog-standard ecommerce site).
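In case it helps picture the setup, here's a minimal sketch of that pattern using redis-py: the app treats the colocated Redis as a read-through cache in front of the real datastore, so cache hits never leave the box. The function fetch_product_from_db, the key scheme, and the TTL are made-up placeholders for illustration, not what we actually ran.

```python
import json
import redis

# Redis colocated on the same EC2 instance: cache hits stay on loopback,
# with no cross-network hop, which is what rescues the P99.
cache = redis.Redis(host="127.0.0.1", port=6379)

def fetch_product_from_db(product_id: str) -> dict:
    # Placeholder for the real (slower, cross-network) datastore call.
    return {"id": product_id, "name": "example"}

def get_product(product_id: str, ttl_seconds: int = 60) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: local loopback only
    product = fetch_product_from_db(product_id)
    cache.setex(key, ttl_seconds, json.dumps(product))  # populate for next request
    return product
```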
> Quick maths: a faster top-of-rack switch possibly costs about the same as 5 days' wages for a mid-level Google engineer. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol and have it stable and observable?
So: your 5M machines, at 40 per rack in the best case of all 1U boxes, is 125K racks, i.e. 125K TOR switches. At one switch per SWE-week (your 5 days), that's 125K SWE-weeks, or roughly 2,400 SWE-years, to invest in new protocols, observability, and testing. Google got to the scale they are by explicitly spending on SWE-hours instead of Cisco.
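Spelling the back-of-the-envelope out (the 40-machines-per-rack and one-SWE-week-per-switch figures are the assumptions from above, not measured numbers):

```python
# Back-of-the-envelope: what is "replace every TOR switch" worth in SWE-years?
machines = 5_000_000
machines_per_rack = 40        # assumed best case: all 1U boxes
swe_weeks_per_switch = 1      # assumed: one switch ~ 5 days of a mid-level SWE
weeks_per_year = 52

racks = machines // machines_per_rack                         # 125_000 TOR switches
swe_years = racks * swe_weeks_per_switch / weeks_per_year
print(f"{racks:,} switches ~= {swe_years:,.0f} SWE-years")    # ~2,404 SWE-years
```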
But to answer your further point: you don't need to replace all the TOR switches, only the ones that deal with high network IO.
To change protocol you need gateways/load balancers, either at the edge of the DC just after the public endpoints, or in the "high speed" areas that are running high network IO. For that to work, you'll need to show it's worth the engineering effort, maintenance, and added latency; a rough sketch of what such a gateway does is below.
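For concreteness, a minimal sketch of the gateway idea at the byte level, assuming a plain asyncio TCP relay: the listen port and backend address are made-up placeholders, and the "translation" step is just a byte copy standing in for whatever the new protocol would actually require. The point it illustrates is that every byte takes an extra hop and an extra copy, which is where the translation latency comes from.

```python
import asyncio

# Hypothetical backend that speaks the "new" protocol; address is a placeholder.
BACKEND_HOST, BACKEND_PORT = "10.0.0.2", 9000

async def pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Copy one direction of the connection until EOF, then close our side.
    try:
        while data := await reader.read(64 * 1024):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_r, client_w) -> None:
    # One extra connection per client: the gateway hop.
    backend_r, backend_w = await asyncio.open_connection(BACKEND_HOST, BACKEND_PORT)
    # A real gateway would re-frame TCP into the new protocol here;
    # this sketch just relays bytes in both directions.
    await asyncio.gather(pump(client_r, backend_w), pump(backend_r, client_w))

async def main() -> None:
    server = await asyncio.start_server(handle_client, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```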