This is great work, but certain parts of the comparison are not accurate, probably due to the authors' limited understanding of Spark.
First and foremost, it would make more sense to compare against the DataFrame API of Spark, which is very Pandas-like.
"Dask gives up high-level understanding to allow users to express more complex parallel algorithms."
- I don't think this is true. Their example of a complex algorithm (SVD) is not that complicated, and there are even implementations of it in Spark's MLlib directly. Spark's DAG/RDD API is essentially a low-level, user-facing task API.
"If you have petabytes of JSON files, a simple workflow, and a thousand node cluster then you should probably use Spark. If you have 10s-1000s of gigabytes of binary or numeric data, complex algorithms, and a large multi-core workstation then you should probably use dask."
- Actually a lot of Spark users use Spark as a way to parallelize Python programs on multi-core machines.
"If you have a terabyte or less of CSV or JSON data then you should forget both Spark and Dask and use Postgres or MongoDB."
- Loading "terabyte" of JSON into Postgres seems pretty painful.
Thanks for the comments. I wrote most of that document so I'll try to explain my reasoning in line.
> First and foremost, it would make more sense to compare against the DataFrame API of Spark, which is very Pandas-like.
It would make more sense to me to compare dask.dataframe to spark's dataframe. This document is comparing dask to spark. dask.dataframe is a relatively small part of dask.
>> "Dask gives up high-level understanding to allow users to express more complex parallel algorithms."
> I don't think this is true. Their example of complex algorithms (SVD) is not that complicated, and there are even implementations of that in Spark's MLlib directly. Spark's DAG/RDD API is essentially the low level user-facing task API.
The point here is that it's quite natural for dask users to create custom graphs (here is another example: matthewrocklin.com/blog/work/2015/07/23/Imperative/). Doing this in Spark requires digging much more deeply into its guts; this sort of work is neither idiomatic in Spark nor really intended by its design.
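For readers who haven't seen what "creating a custom graph" means here: a dask graph is just a dict mapping keys to values or to tasks (tuples of a callable followed by its arguments, which may themselves be keys). Below is a toy single-threaded evaluator I wrote to illustrate the idea — it is not dask's actual scheduler, which adds caching, parallelism, and much more:

```python
# A minimal sketch of the task-graph style that dask exposes to users.
# Graphs are plain dicts; tasks are tuples of (callable, arg, arg, ...),
# where arguments may themselves be keys in the same graph.

def get(dsk, key):
    """Recursively evaluate `key` in the graph `dsk` (toy, single-threaded)."""
    value = dsk[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, *args = value
        # Resolve arguments that are themselves keys in the graph.
        resolved = [get(dsk, a) if a in dsk else a for a in args]
        return func(*resolved)
    return value

def add(x, y):
    return x + y

# A custom graph: no collection API needed, just a dict.
dsk = {
    'a': 1,
    'b': 2,
    'sum': (add, 'a', 'b'),
    'double': (add, 'sum', 'sum'),
}

print(get(dsk, 'double'))  # -> 6
```

Because the graph is plain data, users can generate it programmatically for algorithms that don't fit a map/filter/reduce mold — that's the flexibility the quoted sentence is pointing at.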
> Loading "terabyte" of JSON into Postgres seems pretty painful
I've found loading a terabyte of CSV into Postgres, or a terabyte of JSON into Mongo, to be quite pleasant actually. I'd be curious to know what problems you ran into.
One thing that bothers me with these articles is how they never address the software engineering challenges of writing large complex systems in a pure functional style. There are some tasks where just adding a sprinkle of imperative programming can make the design of the entire system much easier to understand (a kind of "mostly functional"). An article that really crystallized this point is here:
That post's been on HN before. I don't know if it's really making a good point.
> Imagine you've implemented a large program in a purely functional way. All the data is properly threaded in and out of functions, and there are no truly destructive updates to speak of. Now pick the two lowest-level and most isolated functions in the entire codebase. They're used all over the place, but are never called from the same modules. Now make these dependent on each other: function A behaves differently depending on the number of times function B has been called and vice-versa.
This is a complaint that the programming language prevents you from introducing a hidden dependency. That's a strange thing to complain about, given that hidden dependencies are a well-known software maintainability problem.
I mean, yes, it's convenient now to be able to "just" add global state here and there, but it will come back to bite you later.
> [Single-assignment form] is cleaner in that you know variables won't change. They're not variables at all, but names for values. But writing [single-assignment form] directly can be awkward.
Well, you don't have to write single-assignment-form functional code. If you want to modify state within a function in the same way we all know and love from C, you actually can: in Haskell you could do this with the State monad, for example. Purely functional programming lets you compose operations in many different ways, so you have a huge amount of freedom in what style you write your code.
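For readers without a Haskell background, here's a rough Python analogue of the State-monad pattern. This is illustrative only (the real thing comes with do-notation and type checking): a "stateful" computation is modeled as a pure function from a state to a (value, new_state) pair, and `bind` chains two such computations with no hidden mutation anywhere.

```python
# A rough Python analogue of the State monad: stateful code written as
# pure functions of type state -> (value, new_state).

def unit(value):
    """Wrap a plain value as a computation that leaves the state alone."""
    return lambda state: (value, state)

def bind(computation, make_next):
    """Run `computation`, feed its value to `make_next`, thread the state."""
    def composed(state):
        value, new_state = computation(state)
        return make_next(value)(new_state)
    return composed

# Example: a counter we "mutate" C-style, purely functionally.
def increment(by):
    return lambda count: (count, count + by)   # returns old count, bumps state

program = bind(increment(1), lambda _:
          bind(increment(2), lambda _:
               increment(0)))                  # final step reads the count

final_value, final_state = program(0)
print(final_value, final_state)  # -> 3 3
```

The point is that the imperative-looking sequencing is recovered by composition, not by giving up purity.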
> For me, what has worked out is to go down the purely functional path as much as possible, but fall back on imperative techniques when too much code pressure has built up. Some cases of this are well-known and accepted, such as random number generation (where the seed is modified behind the scenes), and most any kind of I/O (where the position in the file is managed for you).
Random number generation doesn't require threading the seed if you really don't want to. In Haskell, for example, you can also:
* Generate an infinite list of random numbers
* Call an action in the IO monad that returns a new random number, advancing the generator behind the scenes
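To make the seed-threading option concrete without Haskell, here's a Python sketch of a pure RNG whose state is passed explicitly, plus an "infinite list" of random numbers derived lazily from one seed. The LCG constants are the classic Numerical Recipes parameters; any linear congruential generator would do for illustration.

```python
from itertools import islice

def next_random(seed):
    """Pure step: return (random_value, new_seed) with no hidden state."""
    new_seed = (1664525 * seed + 1013904223) % 2**32
    return new_seed, new_seed

def random_stream(seed):
    """An infinite stream of random numbers, lazily generated from one seed."""
    while True:
        value, seed = next_random(seed)
        yield value

# Take the first three numbers from the infinite stream.
first_three = list(islice(random_stream(42), 3))

# Determinism falls out for free: the same seed yields the same stream.
assert first_three == list(islice(random_stream(42), 3))
```

No global generator state is ever touched; reproducibility and purity come for free, which is exactly the property the bullet points above are claiming for Haskell.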
While the author might have a point that some things are simpler in imperative code, I don't think their examples really support it. Perhaps they hadn't delved very deeply into functional programming.
I wish I could find that article again; right now it's just a claim in a comment, but the author said his team rewrote a C++ system deemed first in class / world class (real-time trading or something close) and unbeatable by its 30-years-in-the-field authors. Yet the F# rewrite (IIRC) was smaller AND faster, because they had better abstractions for describing the system and for finding and solving bottlenecks, while the previous team was drowning in C++ LoC and hubris.
Sysomos is redefining social media analytics with a powerful product suite that provides customers with the tools to measure, monitor, understand and engage with the social media landscape. Sysomos analyzes huge amounts of social data from Twitter (full firehose), YouTube, Facebook, forums, blogs and many more.
We are looking for extremely bright individuals to join us as Data Scientists in the Toronto Labs team. We are a newly created team tasked to bring new applied research to the Sysomos social analytic platform. We use a whole bunch of techniques (stats modeling, ML, text mining, graph analysis etc.) to analyze the massive amounts of social data we ingest.
Requirements:
* Advanced knowledge of data mining, machine learning, text mining, NLP or information retrieval.
* Hands-on experience in statistical computing software (R, MATLAB, SPSS), big data analytics tools (e.g. Hadoop, Mahout, MapReduce, Impala), or NLP packages (e.g. OpenNLP, LingPipe).
* Experience with database, search, or indexing technologies (e.g. MySQL, Lucene/Solr).
* Strong programming background with advanced knowledge of algorithms and data structures in a popular language (Java, C/C++, Python, etc.) in a Linux environment.
* Bonus points for implementation of big data/streaming data analytics or visualization, or for working with large-scale social media data (e.g. Twitter, Facebook, LinkedIn, Google+).
* Exceptional analytical, problem solving and communication skills are a must for close collaboration with colleagues and customers.
http://dask.pydata.org/en/latest/spark.html