I think it's laughable to put Twitter and Digg in the same category of scalability as Facebook.
Maybe things have changed since the last time I visited Digg over 2 years ago, but the social networking aspects are not very significant. The vast majority of their hits are practically fully page cacheable.
Twitter at least has an interesting scaling problem, but they don't have any features and they move at a glacial pace.
Facebook on the other hand has a graph that's almost as nasty as Twitter's (minus the million-followers thing), but they have 100 times the features, and they push new code every week.
Facebook also "has about 30,000 servers supporting its operations, hosts 80 billion photos, and serves up more than 600,000 photos to its users every second."
FB should be more difficult as well because they aren't just showing streaming information. They're making (supposedly) intelligent decisions on what information to show you. Twitter is just showing you a "dumb" stream.
I'm pretty sure Twitter is "guaranteed delivery", but that's just my naive user expectation.
Poking, while needing to be reliable, is a 1:1 activity, so it's light on writes and easy to read, whereas broadcasting is difficult because of the wide variance in update frequency and number of followers from user to user.
Just because Twitter and Digg don't operate at the same scale as Facebook doesn't mean they don't face similar issues.
I actually think having a bunch of large sites tackling similar issues will present a whole bunch of neat new technologies for addressing these sorts of issues in the near future.
I'm not talking about absolute scale. Facebook is small compared to Google scale, but in my (probably poorly) informed opinion it's not as impressive.
All these sites have scaled to multiple servers, which means they've all had to address some of the hard problems, and no doubt that many of the issues are very similar. However even if Facebook's traffic were half of Twitter's it would still be a much more impressive architecture. They support a full-on integrated development platform for crying out loud. That is orders of magnitude more impressive than Twitter's lightweight API whether they're serving 100 or 100,000 requests per second.
I'm not dissing Twitter either, it's just that what Facebook has going is pretty incredible. Consider that failing to scale is basically what killed Friendster.
I understand what you're saying: Facebook is massively complex compared to Digg and Twitter, which is likely why it chews up the majority of the article. But I'm not sure why you can't have a top-level discussion about scaling (which seems to be the point of the article) that is relevant to all three.
Actually I've found that people follow twice as many people on Twitter as on Facebook and receive updates a whole load more often. Facebook does benefit from being an older company with a lot more engineers, though.
What makes Twitter hard is that it is a semi-real-time communications medium that is no longer particularly web-based - a good portion of Tweets come from phones.
But you're right about Digg; I don't see it having any excuses.
I wonder if there is value to be had in migrating users to increase social graph locality.
The probability of (B fr C) is greater than the mean if (A fr B) and (A fr C). That's useful information.
Off the top of my head, I wonder how an algorithm like:
- I have N shards
- pick the top N most-connected users
- assign them each to a shard
- assign their immediate friends to the same shard
- randomly fill in other users
would work.
Possible refinements:
- if the percentage of shared friends between two users in the top N is > X, put both users in the same shard and add the (N+1)th user as a new shard seed
- chase more than one level of immediacy from the shard-seed users to fill the shard
- if a shard is full, don't drop to random allocation for 1st- or 2nd- level friends, but instead put them all onto shard+1
The idea here is that for either pull or push you win if you need to contact fewer shards, i.e. the queries and updates needn't be per-user but per-shard: you can query/update all the users on a shard in one SQL statement.
If you somehow manage to keep all of ashton's friends on 10 shards instead of 100, then that's a big win, surely?
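Rough sketch of that greedy seeding, in Python (the shard count, capacity limit, and the dumb round-robin fill are all just placeholders for illustration):

    from collections import defaultdict

    def assign_shards(friends, num_shards, capacity):
        """Greedy sketch. `friends` maps every user to the set of their friends;
        assumes num_shards * capacity >= the number of users."""
        shard_of = {}
        load = defaultdict(int)

        # Seed each shard with one of the N most-connected users.
        seeds = sorted(friends, key=lambda u: len(friends[u]), reverse=True)[:num_shards]
        for shard, seed in enumerate(seeds):
            shard_of[seed] = shard
            load[shard] += 1
            # Co-locate the seed's immediate friends while there's room.
            for friend in friends[seed]:
                if friend not in shard_of and load[shard] < capacity:
                    shard_of[friend] = shard
                    load[shard] += 1

        # Fill in everyone else (round-robin standing in for "randomly").
        shard = 0
        for user in friends:
            if user not in shard_of:
                while load[shard] >= capacity:
                    shard = (shard + 1) % num_shards
                shard_of[user] = shard
                load[shard] += 1
        return shard_of

The payoff is that a push for one user becomes at most one statement per shard that actually holds followers, rather than one row write per follower.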
I've wondered for a while about some hybrid push/pull distribution scheme. I have 100 followers, just push. Oprah has 1M, 1% of which are currently active, pull. Sounds complicated, but feasible.
But it also has to do with how frequently the information is changed/accessed.
e.g. If Oprah only publishes 3 tweets a day, but her one million followers each check 100 times a day (just to be on the bleeding edge of gossip), it's much less effort to push the change.
On the flip side, if you post status changes several times a day but your followers rarely check (daily/weekly), pull may make more sense.
Although, I suppose that a push inherently requires an update/write, while a pull is generally a read. Seems like this might need to be taken into consideration as well.
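To put rough numbers on that trade-off (the threshold, rates, and follower counts below are invented, just to show the shape of it):

    PUSH_THRESHOLD = 10_000   # hypothetical cutoff between fan-out-on-write and fan-out-on-read

    def delivery_strategy(follower_count):
        """Small accounts fan out on write (push); huge accounts get merged in at read time (pull)."""
        return "push" if follower_count < PUSH_THRESHOLD else "pull"

    def daily_work(followers, posts_per_day, checks_per_day):
        """Rough daily operation counts for each strategy."""
        push_writes = followers * posts_per_day   # each post written into every follower's inbox
        pull_reads = followers * checks_per_day   # every check re-reads the author's own timeline
        return push_writes, pull_reads

    # Oprah-ish: 3 posts/day, 1M followers each checking 100x/day.
    print(daily_work(1_000_000, 3, 100))   # 3M pushed writes vs 100M pulled reads -> push wins
    # Chatty user with lazy followers: 10 posts/day, 100 followers checking once a day.
    print(daily_work(100, 10, 1))          # 1,000 pushed writes vs 100 pulled reads -> pull wins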
This is in fact an unsolved problem. Current database systems cannot handle this level of connectivity. Facebook can't do a "push" because of all the features they support; it gets extremely complicated if they push complex data into users' mailboxes. And Twitter can't do a "pull" if they want to avoid putting a cap on the number of friends/followers, but then that's also why they can't really add any features.
I'll preface this by saying I haven't read those papers (though today I was thinking through the concept of write hooks in information graphs), but I'd assume from the common form of the pattern that it's only really set up for one-dimensional publishes.
The problem in large-scale information networks, more in the Facebook way than the Twitter way, is that you potentially trigger a cascading effect in information updates if you go pub-sub. Specifically, applications that do interesting things with social graphs have to go beyond basic message passing.
I naturally look at things from a recommendations angle, but if a new edge enters the graph, it may affect other edges connected to its endpoints. Those may in turn affect the edges connected to those nodes, and so on. You want to avoid something that effectively becomes a breadth-first traversal of the graph doing updates, since that's, well, slow, to put it mildly.
This is why large-scale graph algorithms like PageRank work by constantly regenerating static matrices rather than regenerating the ranks dynamically, but that naturally is problematic when you're working on data sets where the most recent data is the most important and is being generated at very high rates.
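As a toy illustration of what you're trying to avoid (the graph representation and the depth cap are made up): a new edge dirties its endpoints, their neighbours dirty theirs, and so on, so in practice you either cap the propagation depth or fall back to periodic batch recomputation.

    from collections import deque

    def affected_by_new_edge(graph, u, v, max_depth=2):
        """Breadth-first collection of nodes whose derived data (ranks,
        recommendations, ...) might be stale after adding edge (u, v).
        Without the depth cap this can end up touching most of the graph."""
        frontier = deque([(u, 0), (v, 0)])
        seen = {u, v}
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_depth:
                continue
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, depth + 1))
        return seen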
In some models, sure, but is that true in Facebook's? Does anything go beyond one degree?
I can't think of anything, really. If X and Y become friends, X's friends and Y's friends see it in their feed ("inbox"), but nobody else. Same goes for tagging someone in a photo, etc.
Ad targeting, what's hot in your network, friend recommenders, etc. are all essentially ranking problems. They're moving more in that direction, definitely.
I spent a few hours mocking up a solution to this type of status update in code. It seemed fairly obvious, so I assumed it was not a real problem. However, if there is any interest I can turn it into some sort of blog post next weekend.
Sure it's obvious until you have hundreds of millions of nodes that each link to somewhere between 100 and 2,000,000 other nodes that need to be updated when any given node updates (assume around 5,000 nodes are updating every second with a power law kind of distribution).
If you think that's obvious, I think you're significantly beyond all of the people with all of the "hello world" twitter clones out there.
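Back-of-the-envelope, assuming naive push-to-every-follower (the power-law exponent here is a guess):

    import random

    def simulated_inbox_writes(updates_per_sec=5_000, min_followers=100,
                               max_followers=2_000_000, alpha=2.0):
        """Inbox writes per second if every update is pushed to every follower,
        with follower counts drawn from a rough, clamped power law."""
        total = 0
        for _ in range(updates_per_sec):
            followers = min(int(min_followers * random.paretovariate(alpha)), max_followers)
            total += followers
        return total

    print(simulated_inbox_writes())   # usually a million-plus writes/sec, with huge variance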
Interestingly, the problem isn't so much in building a scalable store on a single machine that goes up to a certain point -- a billion edges, for example (about what we've tested our DB up to). Distributing the data becomes a really hard problem, however, because the issue of data locality is tricky. Typical map-reduce patterns are only of limited utility, since each traversal may imply additional traversals, so you end up having to have "smart" (computing) nodes which may themselves trigger a second tier of map-reduces.
This is quite different from the problems of even large scale web apps where there's essentially a set of data that's pulled from a caching layer and assembled.
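A sketch of what I mean (the shard_for and fetch_neighbours helpers are hypothetical): each hop only tells you which shards you need for the *next* hop, so you can't precompute one scatter/gather up front the way a single map-reduce pass assumes.

    def neighbours_within(start, hops, shard_for, fetch_neighbours):
        """Multi-hop traversal over a sharded graph. fetch_neighbours(shard, nodes)
        is assumed to return the union of neighbour sets for `nodes` stored on
        `shard`; shard_for(node) says where a node lives."""
        seen = {start}
        frontier = {start}
        for _ in range(hops):
            by_shard = {}
            for node in frontier:                     # group this hop's lookups by shard
                by_shard.setdefault(shard_for(node), set()).add(node)
            next_frontier = set()
            for shard, nodes in by_shard.items():     # one batched request per shard, per hop
                next_frontier |= fetch_neighbours(shard, nodes)
            frontier = next_frontier - seen
            seen |= frontier
        return seen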
It all comes down to how well you can slice the process. There is no cheap over-the-counter solution to these problems, but a little custom code can go a long way.
5k node updates per second might sound like a problem, but one core of one machine can easily keep up with that, so you can have several copies and several views of the whole network graph. Public vs. private messages can be handled separately and then joined before presentation to the user. You can separate finding which messages to display from the message data. You even get to serve dirty reads; as long as the data is <2 seconds old, it's plenty good enough.
Ok, describing the solution based on the above insights takes some time and pictures but does it still sound horrible?
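A thumbnail of the shape I mean, anyway (all the names here are invented; it's just one way to arrange it): keep only message ids in each follower's in-memory timeline and join against the message bodies at render time, so 5k updates/sec is mostly appending small integers.

    from collections import defaultdict, deque

    MESSAGES = {}                                        # message_id -> (author, text), stored once
    TIMELINES = defaultdict(lambda: deque(maxlen=800))   # user -> recent message ids only

    def publish(follower_ids, message_id, author, text):
        """Store the body once; push only the id onto each follower's timeline."""
        MESSAGES[message_id] = (author, text)
        for follower in follower_ids:
            TIMELINES[follower].appendleft(message_id)

    def render(user, limit=20):
        """Join ids back to bodies at read time; slightly stale data is acceptable."""
        return [MESSAGES[mid] for mid in list(TIMELINES[user])[:limit] if mid in MESSAGES]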
PS: Twitter was forced to morph an architecture built to solve a different problem into a working solution. That takes time and can be fairly difficult. But, starting from scratch it's not that bad.
Twitter was forced to morph an architecture built to solve a different problem into a working solution.
BINGO! Twitter's initial implementation was a "my first blog" in Rails -- when you're starting from an impedance mismatch that massive, and you have to rearchitect it live, jesus.
I think you under-estimate the complexity and performance costs of having a massively distributed data structure. Having N nodes perform computations on a graph many times the size of N is easy. Maintaining some form of coherency throughout all N nodes is hard.
The communication costs are significant, and probably similar to N-body simulations.
When you grow from zero to 100 million users you get to watch as each piece breaks under growing load. Keeping a system running as it just keeps growing is hard.
However, Facebook rolled out a messaging system with little problem. I think the hard part is guessing and simulating the load before people start messing with it, while not wasting millions building for load that never shows up.
But they didn't. They rolled out the messaging code to the users with no interface on it at first (e.g. presence only, visible only internally). That gave them excellent information on how many people might be connected, how many of their friends were likely to be connected, and so on. By the time they were ready to deploy the interface to it, they had a good idea of how much power they needed under it.
I've been wondering for a while why we don't more often see systems with a huge amount of RAM and asynchronous persistent updates, with a more, how should I put it, computer-science-ish architecture inside. MVC works ok, but does it really mean we _have_ to use it always? Is it so hard to make a system with a terabyte of RAM? Why even use memcache? Why not make a clear decision that stuff like user statuses will never even see the inside of a hard disk, and make use of the simplification it brings? Is there a clear reason, or is it simply not fashionable?
You can buy a machine with a terabyte of RAM. If your check doesn't bounce, IBM, HP, Sun, et al. will deliver it via forklift tomorrow. However, that box will cost much, much more than 16 boxes with 64GB each. Those 16 boxes will also have more CPU, network cards, and aggregate bandwidth to RAM than the terabox. The big honkin' box full of RAM is in most respects an unbalanced system; you can do a lot more with a lot less money if it's acceptable to give up on the single-system image programming model.
The greylist daemon I wrote (http://www.x-grey.com/) keeps everything in memory, with a checkpoint (everything gets written out to disk) every hour (the main process forks, and it's the child that does the writing). A similar approach might be applicable here.
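The pattern, roughly (this isn't the actual daemon code, just a minimal Python sketch of the fork-and-checkpoint idea; the file name and reaping strategy are placeholders):

    import os, pickle, signal

    signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # auto-reap checkpoint children (POSIX only)

    STATE = {}   # everything lives in memory in the parent process

    def checkpoint(path="state.ckpt"):
        """Fork; the child gets a copy-on-write snapshot of STATE, writes it out
        and exits, while the parent keeps answering requests from memory."""
        if os.fork() == 0:                       # child
            with open(path + ".tmp", "wb") as f:
                pickle.dump(STATE, f)
            os.replace(path + ".tmp", path)      # atomic rename: a crash can't leave a half-written file
            os._exit(0)
        # parent returns immediately

    # e.g. call checkpoint() from a timer once an hour, the way the greylist daemon does.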
Based on another article on highscalability, Twitter has one master MySQL server that is used for backups only. If a node crashes, it reloads the status data from MySQL. That MySQL server handles 300 tweets per second @ roughly 2,400 qps.
They now use Scala on the JVM rather than Rails for the backend, and Rails for the frontend.
Exactly how often do servers restart? Your average commodity server - maybe twice a year. For user data it's too much, but for user statuses, I don't think so. And if you save the data in a controlled restart and only consider crashes, with a bit of effort you can probably go down to about once every couple of years or less.
You can't store everything in memory all the time. There simply isn't a computer built which can do that. Horizontal scaling with commodity boxes and layers of cache are the most cheap and effective method of returning a high rate of requests as quickly as possible.