This article would be better if it were properly titled: "The Cloud is Not For Me".
The anecdote provided here – about a specific app with a specific architecture and a specific load profile, deployed by specific employees with specific skills at a specific company with a specific budget, a specific revenue model, and a specific schedule – is fine as anecdotes go, but you can't draw a general conclusion from it. Change any of those specifics and the conclusion may be entirely different.
Some of these may be specifics, but budget, revenue, load profile are not.
Don't have a budget? Good luck with that.
Don't have revenue? See above.
No load profile? Doubt you're going to end up with revenue.
My skill set may be more wide-reaching than others', but I'm confident that any of the engineers at our company (and most others) would be able to do exactly the same thing.
It's not about fitting it within the constraints of what you have, it's about doing something simply because it's a better choice. Why spend more time and money if you don't have to? I may have spent time making the switch, but I'll save that by not having to worry about performance and scaling concerns for the foreseeable future.
My business is such that external events cause immediate spikes. My traffic might double because Apple released a new firmware, or might go up 10x because someone released a jailbreak without warning me. That capacity requirement quickly trickles down and settles to its original levels over the next two months until it spikes again.
Netflix is in a similar position: you never know when A&E runs a new documentary making everyone suddenly want to watch this one specific old movie at the same time, or when a trailer for a sequel hits during the Super Bowl and when the game is over everyone is inspired to watch the original.
Some businesses thrive on environments where capacity planning 24 hours in advance is impossible, and where holding excess capacity seven days later might mean 10x the costs of running your business. Claiming that these people are incompetent or couldn't possibly have profitable businesses is ludicrous: the cloud isn't for you, but I /love it/.
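As a rough illustration of that trade-off, here is a minimal sketch in Python. Every number in it is made up for the example (the baseline fleet size and the linear decay are assumptions, not figures from any real deployment); it only compares the server-days consumed by a fleet pinned at peak size against one that follows a 10x spike as it settles back to baseline over two months.

```python
# Hypothetical numbers throughout: none of these come from a real deployment.
BASELINE_SERVERS = 4    # assumed steady-state fleet size
SPIKE_MULTIPLIER = 10   # traffic goes up 10x on day 0
DECAY_DAYS = 60         # demand settles back to baseline over ~two months

PEAK_SERVERS = BASELINE_SERVERS * SPIKE_MULTIPLIER


def servers_needed(day):
    """Servers required on a given day, assuming a linear decay from peak."""
    if day >= DECAY_DAYS:
        return BASELINE_SERVERS
    extra_at_peak = PEAK_SERVERS - BASELINE_SERVERS
    # Integer ceiling division, so we never round a partial server down.
    extra = -(-extra_at_peak * (DECAY_DAYS - day) // DECAY_DAYS)
    return BASELINE_SERVERS + extra


def server_days(days, fixed):
    """Total server-days over a period, for a fixed-at-peak or elastic fleet."""
    return sum(PEAK_SERVERS if fixed else servers_needed(d) for d in range(days))


if __name__ == "__main__":
    print("fixed at peak:", server_days(DECAY_DAYS, fixed=True))
    print("following demand:", server_days(DECAY_DAYS, fixed=False))
```

Once the spike has fully decayed, the fixed fleet in this model keeps paying for 40 servers a day while the elastic one pays for 4, which is where the 10x figure comes from. Real traffic will not decay linearly, so treat the shape as a placeholder.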
Btw, the extreme irony is that your entire business model works because your argument is often false. Most of the small blogs I know use DISQUS not because it does something cool, but because they only get traffic randomly based on events they have little control over: Apple announces something, to take advantage of it they write an article about it to be published an hour later, and then their site which normally gets no traffic suddenly needs to handle insane load.
Your cloud-hosted (to be clear, not necessarily that you host it in the cloud, but that you are a cloud to your users) comment engine solves that problem (as well as adding the same deployment advantages and simplicity that cloud hosting is known for).
(Note: for those who want to heckle me "your service falls over when jailbreaks come out", it doesn't: the third party repositories do. The core site, the payment processing system, and my repository barely notice, and when they do it is normally due to a bug that I rapidly notice and fix, and tends to have limited effect.)
> My business is such that external events cause immediate spikes. My traffic might double because Apple released a new firmware, or might go up 10x because someone released a jailbreak without warning me. That capacity requirement quickly trickles down and settles to its original levels over the next two months until it spikes again.
Have you actually had this occur in real life, where you had to spin up new instances during these spikes? What kind of database configuration were you using such that it could accommodate all those new application server instances? Do you also add new database slaves on the fly?
When this article got at the idea of "sounds good in theory, but never happens in reality", that was my experience too. We were on PostgreSQL and the notion that we'd just "add 20 instances" when we had a load spike was ridiculous. I'm just curious who is actually doing this, and if they are also using relational databases.
Here is a graph I generated a few weeks ago: we've since had yet another major traffic spike due to the release of Absinthe 2.0 with Rocky Racoon (an untethered jailbreak for iOS 5.1.1) that is actually one of the most intense spikes yet (but I am on my iPhone and can't make new graphs).
I over-allocate the database server for Cydia, but spawn up new web servers on demand. I then keep as much of the CPU-intensive work off the database as I can, store as many static assets as I can on services such as S3, and use distributed queued logging (RELP).
For JailbreakQA's database (where downtime isn't that important) I do an instance stop, change the type of computer it is running on (such as from m1.large to c1.xlarge), start it again, and have a drastically different machine with only a minute of downtime. EC2 is a godsend (for me).
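For anyone wanting to script that stop / change type / start cycle, a minimal sketch in Python follows. It is an assumption that you are using the boto3 EC2 client and an EBS-backed instance (the comment above does not say what tooling is used); the function takes the client as a parameter, and the instance ID in the usage comment is a placeholder.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop an EBS-backed instance, change its type, and start it again.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The instance is unavailable between the stop and the start.
    """
    # Stop the instance and block until EC2 reports it as stopped;
    # the instance type can only be changed while it is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Swap the instance type, e.g. from "m1.large" to "c1.xlarge".
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Bring it back up on the new hardware and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (placeholder instance ID, requires boto3 and AWS credentials):
#   import boto3
#   resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "c1.xlarge")
```

Passing the client in keeps the function easy to exercise against a fake before pointing it at a real database box.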
It's significantly more difficult to scale a traditional relational database (although not impossible!) than to scale the web/app layer that sits in front of it. Snapshot + clone + some kind of sync middleware (like pgpool for postgres) can probably get you 80-90% of the way there. Rearchitecting so that your db server is not the bottleneck should help there as well.
Maybe you need to have a master/slave setup, and on huge load, flip the slave instance over to be an instance type with quadruple the RAM and CPUs for a few hours, then back to a single-core, low-memory instance to keep the data sync flowing. There's a million ways to skin this cat.
If your database itself is the bottleneck, then, yeah, on the fly flexibility might be difficult to achieve.
In his case, a relational database probably isn't the bottleneck at all, and scaling out caches, web front ends, etc. is all fairly straightforward. There are huge numbers of folks taking advantage of this kind of flexibility.
Hell, Amazon has a whole API you can integrate with that handles it for you (even has $ references, so you don't accidentally spend yourself bankrupt because of a TC story).
My company provides dynamic content in emails, and as such gets large traffic spikes when 10 million emails get sent at once and everyone begins opening them. The content's configuration (in postgres) is trivially cacheable, but our app servers render different content based on the user's context.
So we have a bunch of shared-nothing app servers that we can spin up and down based on the emails we know are going out. Automatically detecting spikes and spinning up new instances between the send and the peak is much harder, though.
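The schedule-driven half of that (sizing the fleet from sends you already know about) can be sketched roughly as below, in Python. The per-server throughput, the peak open rate, and the minimum fleet size are all hypothetical placeholders, not our real figures; you would replace them with numbers measured from your own traffic.

```python
from math import ceil

# Assumed (hypothetical) figures: tune these from real measurements.
REQUESTS_PER_SERVER_PER_MIN = 6000  # load one app server can sustain
PEAK_OPEN_FRACTION_PER_MIN = 0.002  # share of recipients opening in the busiest minute
MIN_SERVERS = 2                     # floor kept running between sends


def servers_for_send(recipients):
    """App servers to have running before a send of the given size peaks."""
    peak_requests_per_min = recipients * PEAK_OPEN_FRACTION_PER_MIN
    needed = ceil(peak_requests_per_min / REQUESTS_PER_SERVER_PER_MIN)
    return max(MIN_SERVERS, needed)


def plan_capacity(scheduled_sends):
    """Map each scheduled send (name -> recipient count) to a fleet size."""
    return {name: servers_for_send(n) for name, n in scheduled_sends.items()}


if __name__ == "__main__":
    # Send names and sizes are illustrative only.
    print(plan_capacity({"weekly-newsletter": 10_000_000, "small-test": 1_000}))
```

Because the app servers are shared-nothing, the output is just a count to launch ahead of each send. This says nothing about the harder case of reacting to spikes nobody scheduled.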
Yeah, we're using Cassandra for logging. Not quite as simple to scale up, but it's write-only in the request cycle and hasn't been anywhere near a bottleneck yet.
I'd argue that most people use Disqus because it's more powerful than whatever they had (or didn't have) and it's so easy to set up.
I should redirect the point of my post to be targeted more towards small businesses, startups, random hackers, etc. Obviously if you're a larger company you can do whatever you want, and it will generally work out.
That said, just because "Netflix uses the cloud" doesn't mean it's the right decision for anyone, and they certainly do not use it to handle spikes in traffic. They've publicly stated that their primary reason for AWS/etc was simply the lack of operational complexity needed to manage it. Even given that argument, the cloud at a point becomes just as complex, if not more, than just running servers.
One thing that's helped Disqus out is the fact that we can get ridiculously powerful servers to ease the burden of needing to scale out (horizontally) right away. I believe even our smallest database servers are still the maximum size you can run on AWS (in terms of memory).
AFAIK, Netflix doesn't use Amazon for delivering streams to their customers (they use Level3's CDN). They use Amazon for batch jobs like transcoding and to host their API servers.
Using Amazon for API servers is a little surprising, btw. I would have gone with dedicated hardware. Maybe the sheer number of machines for API service is so high that using AWS APIs is a significant saving in operations complexity?
Netflix uses more than one CDN, and, according to posts by one of their architects (Adrian Cockcroft, http://perfcap.blogspot.com/), they use EC2 for everything but some basic in-house HR/backoffice stuff.
He's got DOZENS of posts on how it works for them, how they fully embraced EC2, etc. and how well it's working for them.