This is a tough one. It's easy to be the expert and point out what will not work.
The crux of it is, as he states, the false advertising, and how many people will see this and start using it without knowing what they are doing. However, it's hard to separate false advertising from overconfidence. Maybe they don't even have to be different things in this context.
The author is correct, this is incredibly hard to get right. Without some experts heavily contributing to and steering a project like this (and by experts, I mean distributed filesystem experts, not just people with general distributed application knowledge), it is going to be a long uphill battle, fraught with terror.
But who's to say that they won't find that type of collaboration? They may just do that, and they may just pull off a decent project in a few years.
But is it hard? Hell yes. Is it going to happen as fast as they claim? Probably not. Is it a silly idea to build this rather than contribute more to prior art? Probably. Is someone a little pissed off about this and taking to the internet to moan about it a bit? Probably.
At the end of the day, file systems are hard, and if your job is to deploy, manage, scale, or recover them, you won't just be blindly throwing your petabytes at a brand new project anyway, and if you are, you should be removed from your current position.
The type of folks that will be early adopters and contributors won't be putting their banking transactions on it.
>Single-threaded sequential 1KB writes for a total of 4GB, without even oflag=sync? Bonnie++? Sorry, but these are not "good benchmarks to run" at all. They're garbage. People who know storage would never suggest these.
As a non-storage person: what should I be using instead of dd and bonnie++?
I usually recommend iozone and fio. Iozone isn't the most powerful tool, but it's simple to use and with the right options can at least provide some useful information. Fio is much more powerful, in terms of the workloads it can generate and the information it can provide, but it's a real PITA to work with.
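For instance (the mount point, file name, and parameters below are placeholders, not tuned recommendations), a basic fio random-write test that bypasses the page cache might look like:

    # 4KB random writes with direct I/O, run for 60 seconds
    fio --name=randwrite-test --filename=/mnt/test/fio.dat \
        --rw=randwrite --bs=4k --size=1g \
        --ioengine=libaio --iodepth=16 --direct=1 \
        --runtime=60 --time_based --group_reporting

The point is that you choose the block size, queue depth, and sync behavior to resemble your actual workload rather than accepting whatever the tool defaults to.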
As it turns out, I've given a few presentations on this stuff. Here's one of the more recent ones:
bonnie++ is, as Jeff says, well-known as a canonical example of a benchmark gone horribly wrong; see Brendan Gregg's excellent post on active benchmarking and bonnie++[1] for the gory details.
Beyond bonnie++, storage benchmarks are fraught with peril; years ago, I dismembered SPEC SFS as being similarly unsafe at any speed when benchmarking storage systems, albeit for much more subtle reasons than the glaring mechanical flaws in bonnie++.[2] And as for dd, I actually think it's okay as long as you explain clearly what it is (and isn't); Jeff's complaint is that they seem to be treating this single dd invocation as "write performance", when in fact the truth is subtler -- and things like block size and synchronicity matter a great deal.
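To make that concrete (paths and sizes here are arbitrary placeholders), the gap between a buffered dd run and one that forces data to stable storage is exactly the detail that has to be stated next to the number:

    # Buffered sequential 1KB writes -- largely measures the page cache
    dd if=/dev/zero of=/mnt/test/ddfile bs=1k count=4194304

    # Same 4GB total, but flush to disk before reporting the rate (GNU dd)
    dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=4096 conv=fdatasync

    # Or force synchronous writes, as the quoted critique suggests
    dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=4096 oflag=sync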
More generally, anyone interested in storage benchmarks would be wise to read essentially everything that Brendan Gregg has written on the subject, starting with his five-part (!!) series on file system latency.[3][4][5][6][7]
TL;DR version: Anything that is not your application won't use the resources in the same way, and will have very different properties (scaling, performance, contention, etc.)
Longer:
bonnie++ is not a load generator. Really. Start a typical run with a command line test case you find online and you'll see no actual IO. Lots of cache hits. We used to use it, 10 years ago, to play around with load generation, but found that it didn't generate enough IO, or in a way that actually matched what people do.
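A quick sanity check in the spirit of the active benchmarking post linked above (the paths and the bonnie++ invocation are placeholders): watch the block devices while the benchmark runs, and if they sit mostly idle, you're measuring the cache, not the storage:

    # Terminal 1: extended per-device stats every second, skipping idle devices
    iostat -xz 1

    # Terminal 2: a typical "found it online" bonnie++ run
    bonnie++ -d /mnt/test -u nobody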
IOzone is marginal ... I wouldn't use it for a serious test, and when people suggest dd or IOzone, I ask them how well the code actually matches their use case. Chances are that it is also largely irrelevant to this. Worse still is that the IOzone throughput measurements are basically bogus, using a naive sum of bandwidths, rather than showing the interesting data (the actual histogram or distribution of performance, including the start/end times, and the rates per thread/process as a function of time).
fio is good, in that you can implement many types of tests that have a reasonable chance of being meaningful. Using fio, I caught SandForce controllers compressing the non-random data that common benchmarks were feeding to the SSDs built on them. Actual SandForce performance was lower than spinning rust once you fed it genuinely random data.
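For what it's worth, that compression catch can be reproduced with something along these lines (file name, sizes, and queue depth are placeholders; use a scratch file, not a device you care about): compare throughput with zero-filled buffers against buffers refilled with fresh random data, and a compressing controller will show a suspicious gap.

    # Highly compressible data: a compressing controller posts inflated numbers
    fio --name=zeroes --filename=/mnt/ssd/fio.dat --rw=write --bs=128k \
        --size=4g --direct=1 --ioengine=libaio --iodepth=32 --zero_buffers

    # Incompressible data: refill buffers with random bytes on every write
    fio --name=random --filename=/mnt/ssd/fio.dat --rw=write --bs=128k \
        --size=4g --direct=1 --ioengine=libaio --iodepth=32 --refill_buffers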
We wrote something called io-bm (https://gitlab.scalableinformatics.com/joe/io-bm) for pounding on parallel file systems (specifically to stress them and see how they scaled and dealt with contention for network, metadata, etc.). I am not sure if the repo has our histogramming and time series bits in it so we can see individual thread performance as well as overall performance, might be in a private repo.
Basically it boils down to the TL;DR above. If it's not your code, then you are likely testing with an application that is somewhere between partially and wholly irrelevant to your use case.
There are two main issues with generalized benchmarking.
1) You (probably) aren't testing the things your application/platform/etc. actually does most. Worse yet, the hardware/software may simply be chasing big numbers on the common benchmarks.
2) The test may or may not be garbage depending on your use cases.
It's not true for Gluster. It's not true for Ceph. It's not true for Lustre, OrangeFS, and so on. It's not even true for Sheepdog, which Torus very strongly resembles. None of these systems were designed for small clusters.
That's true. There is no system that is easy to administer, starts with one node, and can then scale to 3, 5, 7 nodes, etc.
No system addresses this (either they don't want to, or it's too hard, or whatever).
It started off so nice and friendly, and then reads so hostile at the end. Sort of like a shit sandwich, except with the bottom slice of bread missing.
I am curious to understand how Torus is similar to Sheepdog [0].
From the Sheepdog website:
Sheepdog is a distributed object storage system for volume and container services and manages the disks and nodes intelligently. Sheepdog features ease of use, simplicity of code and can scale out to thousands of nodes.
The block level volume abstraction can be attached to QEMU virtual machines and Linux SCSI Target and supports advanced volume management features such as snapshot, cloning, and thin provisioning.
The object level container abstraction is designed to be Openstack Swift and Amazon S3 API compatible and can be used to store and retrieve any amount of data with a simple web services interface.
They're both basically block storage, with similar approaches to sharding and replication. Sheepdog seems to be using the term "object" more than they used to, but it's important to note that sheepdog objects have semantics closer to files or virtual disks than to S3/Swift style objects. The two also use related approaches (consensus vs. virtual synchrony) for coordination. Most of the differences are related to the fact that Sheepdog has already evolved over several years to have many of the features that are still on Torus's nascent road map. Ceph's RADOS/RBD is only a bit further from either one than they are from each other. None of them are identical, of course, and I never said they were, but from a purely technical perspective Torus's stated goals could have been achieved more quickly by contributing to Sheepdog than by starting a new project.
> Anybody who would suggest these is not a storage professional, and should not be making any claims about how long it might take to implement filesystem semantics on top of what Torus already has.
The entire post reeks of condescension and arrogance.
I also don't like the potshots at marketing. I think Torus is off to a very good start. It's a project, after all, and they are making claims to set out their vision. They didn't "lie" about things being here already; they're saying it's going to happen in the near future. What's wrong with that? Because "storage experts" think it takes years to build? Sorry, visionaries don't listen to "experts"; they set out and do things.
The Torus folks really are making huge promises for things that have, in other projects, required years to implement, and they've waved away very complex tasks as merely minor implementation details.