I read about an interesting technique, an "all-or-nothing tracker" in a blog post from an Apache Spark engineer.
You dispatch n jobs, where n is quite large, and you want to know whether all n jobs have completed or fewer than n have. How do you answer that with a small, fixed number of bytes, with very high probability?
Give each job a random 128-bit ID. XOR each ID into a single accumulator as you start the job, and XOR it in again as the job completes. If all the jobs have completed, the result is 0. The chance of it being zero by accident while some jobs are still outstanding is negligible.
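A minimal sketch of the idea in Python (the class and method names here are just illustrative, not anything from Storm or Spark):

    import secrets

    class AllOrNothingTracker:
        """Tracks completion of many jobs in a fixed number of bytes
        using the XOR trick described above."""

        def __init__(self):
            self.accumulator = 0  # 128-bit XOR accumulator

        def start_job(self):
            job_id = secrets.randbits(128)  # random 128-bit job ID
            self.accumulator ^= job_id      # XOR in at dispatch time
            return job_id

        def complete_job(self, job_id):
            self.accumulator ^= job_id      # XOR in again at completion

        def all_done(self):
            # Zero only if every started job has also completed
            # (up to a ~2^-128 chance of a false positive).
            return self.accumulator == 0

Because XORing the same ID twice cancels out, the order of starts and completions doesn't matter, and the tracker never grows beyond 128 bits.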
This is basically how Apache Storm's fault tolerance worked. The problem I have with it is: what do you do when one of your workers fails and you're not sure whether it failed before or after sending its job-completion message? Storm just restarts everything, I believe, which is not great. If you only restart that one job, you could be left with 'dangling' acks.
Spark is a great technology, for sure. I was hesitant to get into Spark because I have lots of experience writing Hadoop MapReduce apps. Then a while back I decided to base all of the machine learning examples in my current book project on Spark and MLlib, and I am happy with that decision.
As the article mentioned, IBM certainly did validate the Linux "market." When people would ask me what was great about Linux I used to just say that IBM was investing billions in Linux, and that was an acceptable answer for people.
Curious about what your concerns with Spark were? I work for a company that supports Spark development, but I don't work closely with that project, so my opinion is not sufficiently well-informed, and obviously biased.
As far as I know, virtually any MapReduce job can be translated fairly trivially to Spark's .map() and .reduce() operations (see the sketch below). The downsides are that its model hasn't yet been proven at the largest scales MapReduce has been used at, and possibly the use of Scala (although Java / Python bindings are obviously available). Were there any other major factors in your hesitance?
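For instance, the canonical word-count job looks roughly like this with PySpark's RDD API (the SparkContext `sc` and the paths are just placeholders):

    # Word count, the classic MapReduce example, via Spark's RDD operations.
    # Assumes an existing SparkContext `sc`; the paths are hypothetical.
    counts = (sc.textFile("hdfs:///input/books")
                .flatMap(lambda line: line.split())   # "map" phase: split lines into words
                .map(lambda word: (word, 1))          # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))     # "reduce" phase: sum the counts
    counts.saveAsTextFile("hdfs:///output/wordcounts")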
I didn't have concerns about Spark, rather I already felt comfortable with Hadoop.
Another issue is that I am sort of retired now. I still accept small consulting jobs and do a lot of writing but my technology choices have shifted to fun things like Pharo Smalltalk, Haskell, etc.
"At the core of this commitment, IBM plans to embed Spark into its industry-leading Analytics and Commerce platforms, and to offer Spark as a service on IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its breakthrough IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark."
It will be interesting to see if IBM now bets big on their IPython kernel for Spark - https://github.com/ibm-et/spark-kernel. I've looked at it, and it's way behind Zeppelin and even Spark Notebook. An "Eclipse for Spark" as a notebook-style IDE would be a game-changer.
It's even more interesting to observe the dynamics in this increasingly open source world of software.
By deciding to sponsor Spark, I think IBM is practically becoming its owner, without having had to do anything prior to this move. Does that mean it is possible today to "acquire" a technology project by naming your own price?
This is a really interesting phenomenon: existing enterprises taking lead roles in open source projects in such a way that they almost look like they "own" them. It's not the first time either; Red Hat did exactly this with Linux in some ways, for example.
Teradata recently made a similar move with PrestoDB, which had Facebook as its owner but no dedicated platform-technology backer. And MapR did the same with Apache Drill.
I have been telling everyone I know for a long time: open source is not just a movement, it is a strategy. The potential for change and/or disruption that open source creates is amazing.
And that potential may not always be good; just look at Hadoop fragmentation as an example (although there is a lot of good in that fragmentation as well).
Yes and no. You can steer an open-source project up to a point, but if you turn too far away from what the community wants then they will take matters into their own hands - see e.g. Joyent / node.js / io.js.
Maybe "owner" is not the right word. But IBM is definitely going to be associated with this technology in an exclusive manner. It's like how when you say Bootstrap or Storm you have to mention they're by Twitter, or when you talk about AngularJS you mention it's by Google.
They are all open source, but the main sponsor always ends up taking over the mindshare.
In this case I'd suggest IBM can become to Spark what Google is to Angular, without even having been there from the start.
Dumb idea? Maybe...
SystemML (which is one of the technologies they are donating) looks very interesting:
Declarative large-scale machine learning (ML) in SystemML aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations to distributed computations on MapReduce or Spark. ML algorithms are expressed in an R-like syntax that includes linear algebra primitives, statistical functions, and ML-specific constructs.
The Register's title [1] is a bit more brutal. I wonder how this investment will be spread among committers, tooling, etc.
From an open source platform perspective, it raises interesting questions in terms of finding the right balance for management, as well as sustainability of the project.