Parallel Out-Of-Core Dataframes in Python: Dask and OpenStreetMap (jakevdp.github.io)
48 points by Lofkin on Aug 14, 2015 | 8 comments


This is awesome. One of the reasons I shifted away from Pandas was its difficulty in dealing with out-of-core data. Can't wait to try this out.
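For context, the manual workaround in plain Pandas looks something like this: you stream the file in chunks and combine partial results yourself (a minimal sketch, with a tiny in-memory CSV standing in for a file that doesn't fit in RAM):

```python
import io
import pandas as pd

# Toy stand-in for a larger-than-memory CSV file.
csv = io.StringIO("key,value\na,1\nb,2\na,3\nb,4\na,5\n")

# Pandas makes you iterate chunks and merge partial aggregates by hand.
totals = {}
for chunk in pd.read_csv(csv, chunksize=2):
    for key, val in chunk.groupby("key")["value"].sum().items():
        totals[key] = totals.get(key, 0) + val

print(totals)  # {'a': 9, 'b': 6}
```

dask.dataframe's appeal is that it hides exactly this chunk-and-combine bookkeeping behind the familiar Pandas API.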


What did you end up using instead of Pandas?


Right now I do my analysis work in SQL (currently MSSQL, previously MySQL), though I always struggle to apply any functions to the data. I then do some sampling and further analysis in R.


Comparison of PySpark vs Dask:

http://dask.pydata.org/en/latest/spark.html


This is great work, but certain parts of the comparison are not accurate, probably due to a lack of understanding of Spark.

First and foremost, it would make more sense to compare against the DataFrame API of Spark, which is very Pandas like.

"Dask gives up high-level understanding to allow users to express more complex parallel algorithms."

- I don't think this is true. Their example of a complex algorithm (SVD) is not that complicated, and there are even implementations of it in Spark's MLlib directly. Spark's DAG/RDD API is essentially a low-level user-facing task API.

"If you have petabytes of JSON files, a simple workflow, and a thousand node cluster then you should probably use Spark. If you have 10s-1000s of gigabytes of binary or numeric data, complex algorithms, and a large multi-core workstation then you should probably use dask."

- Actually, a lot of Spark users use Spark to parallelize Python programs on multi-core machines.

"If you have a terabyte or less of CSV or JSON data then you should forget both Spark and Dask and use Postgres or MongoDB."

- Loading "terabyte" of JSON into Postgres seems pretty painful.


Thanks for the comments. I wrote most of that document so I'll try to explain my reasoning in line.

> First and foremost, it would make more sense to compare against the DataFrame API of Spark, which is very Pandas like.

It would make more sense to me to compare dask.dataframe to spark's dataframe. This document is comparing dask to spark. dask.dataframe is a relatively small part of dask.

>> "Dask gives up high-level understanding to allow users to express more complex parallel algorithms."

> I don't think this is true. Their example of complex algorithms (SVD) is not that complicated, and there are even implementations of that in Spark's MLlib directly. Spark's DAG/RDD API is essentially the low level user-facing task API.

The point here is that it's quite natural for dask users to create custom graphs (here is another example: matthewrocklin.com/blog/work/2015/07/23/Imperative/). Doing this in Spark requires digging more deeply into its guts; this sort of work is neither idiomatic nor really intended in Spark.
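To make "custom graphs" concrete: a dask graph is just a plain Python dict, so you can build one by hand (a minimal sketch; the keys and functions here are made up for illustration):

```python
import operator
import dask

# A dask graph is a plain dict: keys name results, values are
# (function, *args) tuples; args that match other keys become dependencies.
graph = {
    "x": 1,
    "y": (operator.add, "x", 2),   # y = x + 2
    "z": (operator.mul, "y", 10),  # z = y * 10
}

# dask.get is the simple synchronous scheduler; swap in the threaded or
# multiprocessing scheduler without changing the graph.
print(dask.get(graph, "z"))  # 30
```

Because the graph is just data, any parallel algorithm you can phrase as tasks and dependencies can be fed to dask's schedulers directly.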

> Loading "terabyte" of JSON into Postgres seems pretty painful

I've found loading a terabyte of CSV into Postgres, or a terabyte of JSON into Mongo, to be quite pleasant actually. I'd be curious to know what problems you ran into.


Their conclusion is interesting:

  If you have a terabyte or less of CSV or JSON data
  then you should forget both Spark and Dask and use
  Postgres or MongoDB.

I don't really have "big data" problems. I have "annoying data" problems: 500 GB of 10:1-compressed CSV log files that I want to run reports on every now and then. Often just a count or top-k by a column, but sometimes grouping and counting (e.g., sum of column 5 grouped by column 3 where column 2 = 'foo').
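(For what it's worth, that kind of filter-group-sum is a one-liner in pandas once a chunk fits in memory; a toy sketch with made-up rows, mapping the 1-indexed "column 2/3/5" to pandas' 0-indexed positions 1/2/4:)

```python
import io
import pandas as pd

# Toy stand-in for one of the headerless CSV log files.
log = io.StringIO(
    "2015-08-14,foo,us,x,10\n"
    "2015-08-14,bar,us,x,20\n"
    "2015-08-14,foo,eu,x,30\n"
    "2015-08-14,foo,us,x,5\n"
)
df = pd.read_csv(log, header=None)

# sum of column 5 grouped by column 3 where column 2 == 'foo'
out = df[df[1] == "foo"].groupby(2)[4].sum()
print(out.to_dict())  # {'eu': 30, 'us': 15}
```

The hard part isn't expressing the query, it's getting an engine that runs it over 500 GB of compressed flat files at reasonable speed.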

I've been looking into tools like Spark and Drill, but my tests on a single machine found them to be extremely slow. Maybe things would be faster if I converted the log files to their native formats?

I've been considering loading the data into a Postgres database using cstore_fdw, but what I really want is a high-performance SQL engine for flat files, probably something like Kdb.

Like this article that I read recently: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/... I know this can be done efficiently enough on a single machine; I just need the right software.


Flat CSV or JSON files are hard to parse quickly. Fast CSV parsers and gzip decompression both run at around 100 MB/s. If you want to go faster than this you'll need better (ideally binary) formats.
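The gap is easy to demonstrate in miniature: a text format has to render every number to a decimal string and parse it back, while a binary format just copies bytes into the array (a small NumPy sketch; the array and sizes are made up for illustration):

```python
import io
import numpy as np

data = np.arange(100_000, dtype=np.float64)

# Text route: every value is formatted to a decimal string and re-parsed.
buf = io.StringIO()
np.savetxt(buf, data)
parsed = np.loadtxt(io.StringIO(buf.getvalue()))

# Binary route: the bytes go straight into the array, no parsing at all.
raw = data.tobytes()
restored = np.frombuffer(raw, dtype=np.float64)

assert np.array_equal(parsed, restored)
print(len(buf.getvalue()) / len(raw))  # text form is roughly 3x the bytes
```

On top of the extra bytes, the text route pays per-character parsing cost, which is where the ~100 MB/s ceiling comes from.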

This notebook might interest you: http://nbviewer.ipython.org/gist/mrocklin/c16c5c483b2b9859de... , particularly the sections starting at "Eleven minutes is a long time." It compares CSV costs (minutes) to custom binary storage formats (seconds) on a 20 GB dataset.



