
Spark is worse because you need Scala as well as regular Java.

I've tried building it for my day job; I would rather have a colonoscopy without sedation.

It's more pleasant and dignified.



And then add PySpark on top of that. I couldn't leave my last job fast enough when they decided to use Hadoop/PySpark even though the largest incoming files we received were at most a few GBs.


I once had a consulting gig where the customer desperately wanted to build a Spark/Scala ML pipeline for a dataset that was 10 MB. We spent 3 months hammering it together, versus the 2 weeks a flat Python process would've taken us.


> This find xargs mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

https://adamdrake.com/command-line-tools-can-be-235x-faster-...
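The linked article's point is that a parallel map/reduce fits in a one-line pipeline. A minimal sketch of that shape (the filenames and counting logic below are made up for illustration, not taken from the article; `awk` stands in for `mawk`): each `awk` invocation "maps" over one file, and a final `awk` sums the partial counts.

```shell
# Toy input: one token per line across several files (illustrative data).
mkdir -p data
printf 'win\nloss\nwin\n' > data/a.txt
printf 'win\ndraw\n'      > data/b.txt

# Map: count tokens per file, 4 files at a time (-P4).
# Reduce: sum the per-file counts into a single tally.
find data -type f -name '*.txt' -print0 \
  | xargs -0 -n1 -P4 awk '{ c[$1]++ } END { for (k in c) print k, c[k] }' \
  | awk '{ c[$1] += $2 } END { for (k in c) print k, c[k] }' \
  | sort
# → draw 1
#   loss 1
#   win 3
```

The reduce step is order-independent, so it doesn't matter how the parallel map outputs interleave.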


If you'd sent it off to Mechanical Turk, it would have been done in an afternoon.


I built it in a container for work and didn't find it that difficult, to be honest. And Google has plenty of example Dockerfiles that show the steps needed.

The only real system dependencies are Java 8, Maven, and TeX Live (plus Python/R if you build for those). Then it's `make-distribution.sh` with the appropriate flags; Scala and everything else that is needed is downloaded by Maven. The resulting directory is self-contained, assuming you have a Java 8 runtime on your target machine.
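Those steps could be sketched as a Dockerfile roughly like the below. The base image, Spark version, and build profiles are assumptions for illustration, not from any official image:

```dockerfile
# Hypothetical sketch: build Spark from source inside a container.
# Base image provides Java 8 + Maven; version tags here are illustrative.
FROM maven:3-jdk-8

RUN apt-get update && apt-get install -y git

# Fetch a Spark source tag (version chosen only as an example).
RUN git clone --branch v2.4.8 --depth 1 \
    https://github.com/apache/spark.git /opt/spark-src

WORKDIR /opt/spark-src

# make-distribution.sh produces a self-contained distribution directory;
# Maven downloads Scala and all other build-time dependencies itself.
# Profile flags (-Phadoop-2.7, -Pyarn) are example choices.
RUN ./dev/make-distribution.sh --name custom -Phadoop-2.7 -Pyarn
```

The resulting distribution only needs a Java 8 runtime on the target machine, so it can be copied into a slim runtime image in a multi-stage build.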


FWIW, the below linked Dockerfile will download, build and install Spark in a single step.

https://gist.github.com/Mister-Meeseeks/1ebf875b6e1262449cbc...



