I spent a few hours mocking up a solution to this type of status update in code. It seemed fairly obvious, so I assumed it was not a real problem. However, if there is any interest I can turn it into some sort of blog post next weekend.
Sure, it's obvious until you have hundreds of millions of nodes that each link to somewhere between 100 and 2,000,000 other nodes that need to be updated when any given node updates (assume around 5,000 nodes updating every second, with a power-law kind of distribution).
If you think that's obvious, I think you're significantly beyond all the people with "hello world" Twitter clones out there.
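Just to put rough numbers on that (a back-of-envelope with an assumed average fan-out, not anything measured):

    # Back-of-envelope only; the mean fan-out is my assumption, since a
    # power law between 100 and 2,000,000 followers is dominated by the tail.
    updates_per_sec = 5_000
    assumed_mean_fanout = 5_000            # guess: somewhere in the power-law middle
    writes_per_sec = updates_per_sec * assumed_mean_fanout
    print(f"naive fan-out-on-write: {writes_per_sec:,} timeline writes/sec")
    # => 25,000,000 timeline writes/sec -- why "obvious" designs fall over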
Interestingly, the problem isn't so much building a scalable store on a single machine that goes up to a certain point -- a billion edges, for example (about what we've tested our DB up to). Distributing the data becomes a really hard problem, however, because the issue of data locality is tricky. Typical map-reduce patterns are only of limited utility, since each traversal may imply additional traversals, so you end up having to have "smart" (computing) nodes which may themselves trigger a second tier of map-reduces.
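To make the locality point concrete, here's a toy sketch (not our DB, just an illustration with made-up names) of why each level of a traversal over a hash-partitioned graph turns into another scatter/gather round:

    from collections import defaultdict

    NUM_SHARDS = 4
    # toy adjacency list; imagine each shard living on a different machine
    GRAPH = {1: [2, 3], 2: [4, 5], 3: [6], 4: [], 5: [7], 6: [], 7: []}

    def shard_of(node):                    # simple hash partitioning (assumption)
        return node % NUM_SHARDS

    def neighbors(node, round_trips):
        round_trips[shard_of(node)] += 1   # every lookup is a remote call
        return GRAPH.get(node, [])

    def k_hop(start, k):
        round_trips = defaultdict(int)
        frontier, seen = {start}, {start}
        for _ in range(k):
            nxt = set()
            for n in frontier:
                nxt.update(neighbors(n, round_trips))
            frontier = nxt - seen
            seen |= frontier
        return seen, dict(round_trips)

    print(k_hop(1, 2))                     # 2 hops already touch several shards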
This is quite different from the problems of even large-scale web apps, where there's essentially a set of data that's pulled from a caching layer and assembled.
It all comes down to how well you can slice the process. There is no cheap, over-the-counter solution to these problems, but a little custom code can go a long way.
5k node updates per second might sound like a problem, but one core of one machine can easily keep up with that, so you can have several copies and several views of the whole network graph. Public vs. private messages can be handled separately and then joined before presentation to the user. You can separate finding which message to display from the message data itself. You even get to serve dirty reads; as long as the data is less than 2 seconds old, it's plenty good enough.
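A minimal sketch of the kind of split I mean, with invented names and an in-memory store standing in for the real thing: timelines hold only message IDs, message bodies live in a separate store, and the public and private streams get joined at read time.

    import time
    from collections import defaultdict, deque

    MESSAGES = {}                                  # msg_id -> (timestamp, text)
    PUBLIC_TIMELINE = defaultdict(deque)           # follower_id -> msg_ids, newest first
    PRIVATE_TIMELINE = defaultdict(deque)          # recipient_id -> msg_ids, newest first
    FOLLOWERS = defaultdict(set)                   # author_id -> follower ids

    def post(author, text, private_to=None):
        msg_id = len(MESSAGES)
        MESSAGES[msg_id] = (time.time(), text)
        if private_to is not None:                 # private path handled separately
            PRIVATE_TIMELINE[private_to].appendleft(msg_id)
        else:                                      # fan out message IDs only, not bodies
            for follower in FOLLOWERS[author]:
                PUBLIC_TIMELINE[follower].appendleft(msg_id)
        return msg_id

    def read_timeline(user, limit=20):
        # join the public and private streams at presentation time; a
        # couple of seconds of staleness in these views is acceptable
        ids = list(PUBLIC_TIMELINE[user])[:limit] + list(PRIVATE_TIMELINE[user])[:limit]
        ids.sort(key=lambda m: MESSAGES[m][0], reverse=True)
        return [MESSAGES[m][1] for m in ids[:limit]]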
Ok, describing the solution based on the above insights takes some time and pictures, but does it still sound horrible?
PS: Twitter was forced to morph an architecture built to solve a different problem into a working solution. That takes time and can be fairly difficult. But, starting from scratch, it's not that bad.
> Twitter was forced to morph an architecture built to solve a different problem into a working solution.
BINGO! Twitter's initial implementation was a "my first blog" in Rails -- when you're starting from an impedance mismatch that massive, and you have to rearchitect it live, jesus.
I think you underestimate the complexity and performance costs of having a massively distributed data structure. Having N nodes perform computations on a graph many times the size of N is easy. Maintaining some form of coherency throughout all N nodes is hard.
The communication costs are significant, and probably similar to N-body simulations.
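Rough numbers on that point (my own illustration, not measured): naive all-to-all coherency traffic grows quadratically, which is the same scaling that makes naive N-body simulation expensive.

    # Illustrative only: pairwise update propagation among N replicas.
    for n in (10, 100, 1_000):
        print(f"{n:>5} nodes -> {n * (n - 1):>9,} pairwise messages per full sync")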
When you grow from zero to 100 million users, you get to watch as each piece breaks under growing load. Keeping a system running as it just keeps growing is hard.
However, Facebook rolled out a messaging system with little problem. I think the hard part is guessing and simulating the load before people start messing with it, while not wasting millions building for load that never shows up.
But they didn't guess. They rolled out the messaging code to the users with no interface on it at first (e.g. presence only, visible only internally). That gave them excellent information on how many people might be connected, how many of their friends were likely to be connected, and so on. By the time they were ready to deploy the interface to it, they had a good idea of how much power they needed under it.
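That "dark launch" pattern is easy to sketch (illustrative names only, not Facebook's actual code): the new backend path runs for everyone so the load numbers are real, but the result is only surfaced to internal accounts.

    INTERNAL_USERS = {"employee_1", "employee_2"}   # assumed allowlist flag
    load_samples = []                               # stand-in for real metrics

    def fetch_presence(user_id):
        """Pretend this hits the new presence/messaging backend."""
        return {"user": user_id, "online": True}

    def render_chat_sidebar(user_id):
        result = fetch_presence(user_id)            # always runs -> real traffic numbers
        load_samples.append(user_id)                # record load regardless of visibility
        if user_id in INTERNAL_USERS:               # dark launch: UI only internally
            return result
        return None                                 # everyone else sees nothing yet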