"Build by SpiderOak on the same proven backend storage network which powers hundreds of thousands of backups"
This concerns me slightly: backup storage is a whole different world from real-time data storage. Backups are write once, read occasionally, whereas some people use S3 as a makeshift CDN, constantly reading data.
Parity-based replication is great for backups, but wouldn't it have performance implications if every request is reading from multiple disks/servers/nodes? I'm not an expert on hardware, but I would have thought reading an entire file off one disk is faster than assembling pieces of data from multiple disks. Anyone want to correct/inform me?
If you can offer me a serious alternative to S3 at a cheaper price, with open source software, I can't wait to try it out. I might sound negative, but I just wanted to put across my first thoughts after having a look around the site.
Pretty much, it's always faster to read from multiple disks.
There are many reasons why. First, by splitting things into small blocks spread around the cluster you get more consistent load (a popular file no longer hammers a single disk; its reads are spread across many spindles), you can more easily read ahead from later blocks, etc.
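A toy sketch of why striping helps (`fetch_block` here is a hypothetical stand-in for a network read from one storage node, not anyone's real API): the per-block reads can overlap, so total time approaches the slowest single block rather than the sum of all reads.

```python
# Toy sketch: reading one file striped across several nodes in parallel.
from concurrent.futures import ThreadPoolExecutor

def fetch_block(node_id: int, data: bytes) -> bytes:
    """Stand-in for reading one block from one storage node."""
    return data

def striped_read(blocks: list[bytes]) -> bytes:
    # Each block lives on a different node, so the fetches run
    # concurrently; results come back in order and are reassembled.
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        parts = pool.map(fetch_block, range(len(blocks)), blocks)
    return b"".join(parts)

file_blocks = [b"spre", b"ad a", b"cros", b"s no", b"des!"]
print(striped_read(file_blocks))  # reassembles the original file
```

With a real network fetch per block, the concurrent version's wall time is roughly one round trip instead of five.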
This is exactly what they said on their blog post:
Long term archival data is different than everyday data. It's created in bulk, generally ignored for weeks or months with only small additions and accesses, and restored in bulk (and then often in a hurried panic!)
This access pattern means that a storage system for backup data ought to be designed differently than a storage system for general data. Designed for this purpose, reliable long term archival storage can be delivered at dramatically lower prices.
Their architecture page seems to confirm this. It seems that their service is explicitly designed to have different performance characteristics from Amazon S3, so maybe they aren't quite a direct competitor to S3, but there are probably a lot of people using S3 for the use cases that Nimbus.IO claims to do better on, simply because S3 was available at the time.
Yes exactly. Nimbus.io is designed for long term archival storage at more affordable prices. We think it's a great time to be competing on price.
We may compete with S3 for low-latency service later on (latency can be made arbitrarily low by spending enough money on caching). Initial calculations suggest we could be almost as low-latency as S3 and still undercut its price by a good margin.
Latency can be brought down through caching, but depending on the access distribution, the point at which additional cache becomes uneconomical may come well before the edge of your performance envelope.
How are you calculating your latency? Also, what distribution do you assume your file accesses will come from?
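Nobody in the thread gives numbers, but the "uneconomical point" is easy to see in a back-of-the-envelope model, assuming (purely for illustration) that object popularity follows a Zipf(s=1) distribution. Caching the k most popular of N objects then yields a hit rate of H(k)/H(N), where H(n) is the n-th harmonic number:

```python
# Back-of-the-envelope cache economics under an assumed Zipf(1)
# popularity distribution: hit rate for caching the top-k objects.

def harmonic(n: int) -> float:
    """n-th harmonic number, H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

N = 1_000_000          # total objects in the store
H_N = harmonic(N)
for frac in (0.001, 0.01, 0.1):
    k = int(N * frac)
    print(f"cache top {frac:6.1%} of objects -> hit rate {harmonic(k) / H_N:.1%}")
```

Under this model caching 0.1% of objects already gets you roughly half the requests, and each further ~16 points of hit rate costs 10x as much cache, which is exactly the diminishing-returns curve the parent comment is worried about.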
A Backblaze-type box is circa $12K for 135TB of storage.
Assume a 5% interest rate and 36 months of repayments, and the server itself comes to $725/month.
It uses roughly 1kW of power and 4U of rack space, so say you fit 6 per 30A rack. You can get the rack for say $5K/month, giving a total rack cost per server of $833/month.
Total cost per server is $1558/month.
That works out to $0.011/GB-month.
Add in parity replication (1 in 4, 25%): $0.014/GB-month.
This doesn't include compression or dedup, both of which drop the cost price dramatically.
Compare that to, say, S3's $0.14/GB-month and you can see why I'd say the margins are stupid, especially at the scale they're running at.
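The arithmetic above is easy to reproduce (all inputs are the commenter's assumptions, not measured costs):

```python
# Reproducing the back-of-the-envelope cost calculation above.
# Every figure here is an assumption from the comment, not real data.
server_month = 725.0        # amortized hardware, $/month (assumed)
rack_month = 5000.0 / 6     # $5K/month rack split across 6 servers
total_month = server_month + rack_month
capacity_gb = 135_000       # 135 TB per box

per_gb = total_month / capacity_gb
per_gb_parity = per_gb * 1.25   # 25% parity overhead (1 in 4)

print(f"total  : ${total_month:.0f}/server-month")
print(f"raw    : ${per_gb:.4f}/GB-month")
print(f"parity : ${per_gb_parity:.4f}/GB-month  vs S3 at $0.14/GB-month")
```

Even with the 25% parity overhead included, the result sits around a tenth of S3's then-current $0.14/GB-month list price, which is the "stupid margins" claim in numbers.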
Note that the BackBlaze machines are optimized for very cold data since they only need to support backup and restore. We also do custom hardware at SpiderOak, but we support web/mobile access, real time sync, etc. That makes our hardware slightly more expensive because of the generally warmer data. So you're off by a few pennies, but certainly in the right zone.
For Amazon, I suspect their internal S3 cost is actually quite a bit higher than either BackBlaze or SpiderOak since their data is warmer.