This can be flipped into a simple heuristic known as the "square root staffing rule" (sometimes "... law"). The name comes from call centres.
Basically, the number of servers you need to serve X amount of demand with Y probability of queuing is not linear in X. You need roughly the average load itself plus a safety buffer, and that buffer grows with the square root of X rather than with X.
The intuition is that a call centre agent is either talking to a customer, or they are not. If they are not talking to a customer, then they can immediately serve a customer. How many agents are busy at any moment is random, so it follows a probability distribution, usually approximated by the normal distribution.
And the spread of a normal distribution grows like the square root of its mean, not linearly. Pool more agents together and a busy spell becomes, proportionally, a smaller and smaller excursion above the average, so the idle headroom you need to keep on hand only has to grow with the square root of demand.
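To make that concrete, here is a minimal Python sketch, assuming Poisson arrivals and exponential service times (the textbook M/M/c model); the function names and the 0.2 queuing target are my own illustrative choices, not part of the rule. It searches for the smallest agent count that keeps the probability of queuing under the target:

    from math import sqrt

    def erlang_c(servers, offered_load):
        # Probability that an arriving customer has to wait (Erlang C formula).
        if offered_load >= servers:
            return 1.0  # unstable: the queue grows without bound
        b = 1.0  # Erlang B blocking probability, built up iteratively
        for k in range(1, servers + 1):
            b = offered_load * b / (k + offered_load * b)
        return b / (1 - (offered_load / servers) * (1 - b))

    def agents_needed(offered_load, max_wait_prob):
        # Smallest agent count keeping P(wait) at or below the target.
        c = int(offered_load) + 1
        while erlang_c(c, offered_load) > max_wait_prob:
            c += 1
        return c

    for load in (10, 100, 1000):
        c = agents_needed(load, max_wait_prob=0.2)
        print(load, c, (c - load) / sqrt(load))  # last column hovers around 1

The last column is the beta in the usual statement of the rule, capacity ~ load + beta * sqrt(load): it hovers around one while the load grows by two orders of magnitude.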
Where it sucks for call centres is that demand shows seasonality and shift planning has relatively inflexible lead times. But in a software scenario we can typically acquire additional capacity quickly, so precision in forecasts is less of a problem.
You can plan deferred maintenance and other tasks around seasonality.
For instance at a call center, the cost of training new people is not something you want to pay while the phone is ringing off the hook, right? So you hire too many new people during the preceding lull, train them up, and hopefully the washout rate is low enough that you meet your hiring quota.
Similarly, if you own your own servers, you provision new hardware well in advance of a predictable spike in traffic, and then you keep as many of the old machines running as physics (e.g., building thermal or electrical capacity) or information theory allows, but you prioritize the new hardware because it scales better vertically. Once the hype wears off you decommission the old machines.
Blizzard reportedly used to do this for their MMO. Later they bragged about architecture changes they made to increase capacity during peak load, but that's not what I saw. What I saw was how much they could decrease capacity as active session counts declined. When they started, they sharded their system. Then they added more and more shards, availability zones, and regions. And then that was simultaneously not enough and too much.
Shards are probabilistically fair if the incoming traffic doesn't know about them. Blizzard named their shards, so user clustering happened. So they slowly moved a bunch of jobs to be serviced by a separate cluster of servers, and shared those workers across multiple shards in the same AZ. Now, instead of each shard handling its own peak load, the hardware is proportional to the peak load of the whole AZ, which behaves more like the root-sum-square (your square root) of the shard peaks. If one shard spikes it draws from the pool, and if it declines it hands capacity back.
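A toy version of that pooling arithmetic, assuming each shard's load is roughly normal and independent of the others (all numbers are made up for illustration):

    from math import sqrt

    shards = [(400, 80), (250, 60), (600, 120)]  # (mean load, std dev), illustrative
    z = 2.0  # headroom in standard deviations, roughly the 97.7th percentile

    # Each shard provisioned for its own peak: every shard pays for its own buffer.
    per_shard = sum(mu + z * sigma for mu, sigma in shards)

    # Shared pool: independent fluctuations partly cancel, so the combined buffer
    # follows the root-sum-square of the individual standard deviations.
    pooled = sum(mu for mu, _ in shards) + z * sqrt(sum(s ** 2 for _, s in shards))

    print(per_shard, pooled)  # the pooled plan needs noticeably less headroom

With these made-up numbers the pooled plan comes out roughly 12% smaller, and the relative saving grows as you add more shards of similar size.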
If the pool is undersubscribed, you shut off servers to save money. But you never shut off the servers that the customers know about and have developed an attachment to. That would cause a cascading failure.
Seasonality is worth planning for, but the value of investing in plans for seasonality depends very much on how quickly you can react to changes in demand. Building data centres has a long lead time and costs a lot of money, so spending a lot of money on forecasting demand is a worthy investment. On the other hand, a new container process might take a few seconds to load, so you can get away with more error.
There's one caveat, though, where planning for seasonality becomes valuable again: competition for shared resources. Suppose you are using a cloud provider to provision VMs on demand. If you're caught short by a stockout, then fast reaction time now becomes a long reaction time. Setting an approximate or safe base level of capacity on a schedule increases the odds that you won't be caught dangerously short.
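A minimal sketch of that "safe base on a schedule, burst for the rest" idea, assuming you keep demand history per hour-of-week; the 90th-percentile choice and the helper names are illustrative assumptions, not anything from the thread:

    from statistics import quantiles

    def scheduled_baseline(history_for_hour, pct=90):
        # Reserve roughly the pct-th percentile of historical demand for this hour slot.
        return quantiles(history_for_hour, n=100)[pct - 1]

    def capacity_plan(history_for_hour, live_demand):
        base = scheduled_baseline(history_for_hour)  # provisioned ahead of time
        burst = max(0.0, live_demand - base)         # acquired on demand, if the provider has stock
        return base, burst

The scheduled base absorbs the predictable seasonal shape; the burst term is the part you gamble on being able to acquire quickly.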
(The Blizzard observations were interesting, thank you.)
> If you're caught short by a stockout, then fast reaction time now becomes a long reaction time.
I've seen a little evidence in Netflix's public statements that they have a degree of load shedding. They've reserved enough machines to service all high-priority traffic, plus a couple of layers of buffer; the buffer gets parceled out to lower-priority tasks as space is available, and they buy either more base-load or more spot capacity when the backlog of low-priority work gets too long.
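A speculative sketch of that shape (not Netflix's actual system, just the pattern described): keep capacity sized for the high-priority tier, parcel the buffer out to low-priority work, and signal for more machines when the low-priority backlog grows:

    from collections import deque

    class Pool:
        def __init__(self, reserved, buffer):
            self.reserved = reserved  # sized to carry all high-priority traffic
            self.buffer = buffer      # extra layers, parceled out to low-priority work
            self.high_in_use = 0
            self.low_in_use = 0
            self.backlog = deque()    # low-priority work waiting for buffer space

        def admit(self, task, high_priority):
            total = self.reserved + self.buffer
            if high_priority:
                if self.high_in_use + self.low_in_use < total:
                    self.high_in_use += 1
                    return True
                return False  # past total capacity even high-priority work is shed
            if self.low_in_use < self.buffer:  # low priority only ever gets the buffer
                self.low_in_use += 1
                return True
            self.backlog.append(task)
            return False

        def needs_more_capacity(self, threshold=100):
            # the signal to buy more base-load or spot capacity
            return len(self.backlog) > threshold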
But... we have the cloud because people have been doing a simplistic version of this for decades in private data centers. The person in charge gets build-out fatigue, possibly before you were even hired, and you practically have to harass them to get enough hardware, or give up and use more man-power to do work a computer should be doing, while slowly losing all respect for the company.
Cloud is already built. You just show up and pay 140% of the amortized cost of building and running the system, instead of the up-front cost. Now that person is either easier to talk to or you can just go around them entirely.
As I saw someone say recently, "Love is not the most powerful force in the universe, it's spite."
Where was I? Oh yes. So we can do this, but to an extent the people who manage to pull it off are remarkable, because so often it doesn't work out, or they succeed in silence and go back to what they really were supposed to be working on now that they've figured out how to get more out of their meager capacity.
If we were to extend the first graph a bit more to the right, the linear improvement would quickly trend downwards into negative latencies, a sure sign that it's not the right answer. But if linear is impossible, then super-linear is just as impossible. The line the author describes as such is clearly asymptotic.
If we were discussing overall latency, and not just queue delay, then an asymptotic decrease towards zero could correspond to a super-linear increase in throughput capacity. But that's not what's happening here either: because average latency is bounded below by one second, total throughput can never exceed c. There is no super-linearity to be found here.
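Spelling out the bound, assuming c servers and a service time S of at least one second per request (the setup the graph implies), the utilization law gives

    X \cdot S = \text{(mean number of busy servers)} \le c
    \quad\Longrightarrow\quad X \le \frac{c}{S} \le c \ \text{requests per second}

so latency can creep down toward the one-second floor, but throughput only ever scales linearly in c.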
A decent explanation of the square root staffing rule: https://www.networkpages.nl/the-golden-rule-of-staffing-in-c...
A short scenario: https://www.xaprb.com/blog/square-root-staffing-law/
A simple calculator linked from the short scenario: https://www.desmos.com/calculator/8lazp6txab