
I'm the founder and a core developer of Pachyderm, so I can weigh in on how it compares (there is, of course, some potential bias here). I was also at Airbnb around the time we released Airflow, so I got to see it being built up close and used the system it replaced quite a bit as well.

I think it's fair to say that Mara and Airflow are both in the same category: DAG (directed acyclic graph) schedulers for Python. Python makes a ton of sense as the language to focus on, since it's the de facto lingua franca of data science. I'd also put Luigi in that bucket, although I think Airflow has eroded its mind-share quite a bit. All of them target the data pipeline use case, which is very well represented as a DAG, but the actual management of the data is left up to the user. They (Mara, Airflow, or Luigi) schedule a task for you once all the tasks it depends on have completed, but you have to figure out where to store your data so that downstream tasks can find what their upstream tasks output. At Airbnb we used HDFS as this storage layer, often with Hive or Presto on top; storing in S3 is also a common pattern.
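To make that concrete, here's a minimal Airflow-style sketch (Airflow 2-style imports; the DAG name, S3 bucket, and paths are hypothetical). The scheduler only enforces the ordering of tasks; the data handoff happens because both tasks agree on the same storage path.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical storage locations -- the scheduler knows nothing about them;
    # the tasks simply agree on the paths.
    RAW_PATH = "s3://my-bucket/events/raw.parquet"
    CLEAN_PATH = "s3://my-bucket/events/clean.parquet"

    def extract():
        # ... pull data from some source and write it to RAW_PATH ...
        pass

    def transform():
        # ... read RAW_PATH, clean it, and write CLEAN_PATH ...
        pass

    with DAG("events_pipeline", start_date=datetime(2017, 1, 1),
             schedule_interval="@daily") as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        # Airflow runs transform only after extract succeeds; where the data
        # lives in between is entirely up to your code.
        extract_task >> transform_task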

Pachyderm is also a DAG scheduler, but we're a lot more prescriptive about where you store the data and a lot less prescriptive about what languages and tools you use. Pachyderm ships with its own distributed filesystem (pfs) that we use for storage; it does a few things that other storage solutions can't. In particular, it version-controls your data and records "provenance", i.e. where data comes from. For example, if you train a machine learning model, then its provenance is the data you used to train it. In terms of processing we're much less prescriptive, because we let users express their code as a Docker container rather than only having bindings for one language. So you can use anything that you can fit in a Docker container. Data is exposed to your code via the local filesystem, so regardless of language you have a very natural interface to your data: system calls on files.
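A rough sketch of what user code inside a pipeline container looks like, assuming the usual convention of input repos mounted under /pfs/<repo> and output written to /pfs/out (the repo name "training_data" here is hypothetical):

    import os

    # Inside the container, input data appears as plain files under /pfs/<repo>,
    # and anything written to /pfs/out becomes the pipeline's output, with its
    # provenance (the exact input commits) recorded by Pachyderm.
    INPUT_DIR = "/pfs/training_data"
    OUTPUT_DIR = "/pfs/out"

    def main():
        for name in os.listdir(INPUT_DIR):
            with open(os.path.join(INPUT_DIR, name)) as f:
                records = f.read().splitlines()

            # ... train a model / transform the records here ...

            with open(os.path.join(OUTPUT_DIR, name), "w") as out:
                out.write("\n".join(records))

    if __name__ == "__main__":
        main()

The point is that your code is ordinary file I/O in whatever language you like; Pachyderm handles materializing the right committed data at those paths and tracking where the output came from.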

Hope this helps clarify the differences between the various systems, and thanks for your interest in Pachyderm. Swing by our users' Slack channel [0] if you'd like some help getting started with it.

[0] http://slack.pachyderm.io/



Random question:

I want to automate some workflows on my local machine. Besides the obvious option of just writing a script, I'm interested in a system where I could describe my workflow as a DAG and then have an easy (web?) UI where I could mark which DAG nodes have changed (e.g. my data pre-processing code) and have it automatically run all of the nodes that (recursively) depend on them as inputs, while skipping those whose inputs have not changed.

I am passing very little actual data between these jobs; they mostly write data to a (distributed) file system, so at most I need to pass some paths around.

Some of the stages require launching a remote job and polling to find out if it has completed.

Is there a good system for doing this? Now that I've described it, I could probably hack it together with a command-line UI without too much difficulty, but having a pretty UI for launching and monitoring jobs would be great.
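Something like this rough Python sketch is what I have in mind by "hack it together" (the node names and commands are made up; a stage could just as well launch a remote job and poll it):

    import subprocess
    import time
    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical DAG: node -> list of nodes it depends on.
    DAG = {
        "preprocess": [],
        "train": ["preprocess"],
        "evaluate": ["train"],
        "report": ["evaluate", "preprocess"],
    }

    # Hypothetical commands; each stage reads/writes paths on the file system.
    COMMANDS = {
        "preprocess": ["python", "preprocess.py"],
        "train": ["python", "train.py"],
        "evaluate": ["python", "evaluate.py"],
        "report": ["python", "report.py"],
    }

    def downstream_of(changed):
        """The changed nodes plus everything that (recursively) depends on them."""
        dirty = set(changed)
        # Walk in topological order so a node is dirty if any dependency is.
        for node in TopologicalSorter(DAG).static_order():
            if any(dep in dirty for dep in DAG[node]):
                dirty.add(node)
        return dirty

    def run(changed):
        dirty = downstream_of(changed)
        for node in TopologicalSorter(DAG).static_order():
            if node not in dirty:
                continue  # inputs unchanged, skip this node
            proc = subprocess.Popen(COMMANDS[node])
            while proc.poll() is None:  # stand-in for polling a remote job
                time.sleep(1)

    run({"preprocess"})

The missing piece is really just the pretty UI for marking nodes as changed and watching the runs.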



