I read about an interesting technique, an "all-or-nothing tracker" in a blog post from an Apache Spark engineer.
You dispatch n jobs, where n is quite large, and you want to know whether all n jobs have completed or fewer than n have. How do you answer that with a small, fixed number of bytes, with very high probability?
Give each job a random 128-bit ID. XOR each ID into a single accumulator as you start the job, and XOR it in again as the job completes. If all the jobs have completed, the result is 0. The chance of it being zero by accident while some jobs are still outstanding is negligible.
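A minimal sketch of the idea in Python (the class and method names here are just illustrative, not anything from Storm or Spark):

    import secrets

    class AllOrNothingTracker:
        """Tracks completion of many jobs in a fixed number of bytes
        using the XOR trick described above."""

        def __init__(self):
            self.accumulator = 0  # 128-bit XOR accumulator

        def start_job(self):
            job_id = secrets.randbits(128)  # random 128-bit job ID
            self.accumulator ^= job_id      # XOR in at dispatch time
            return job_id

        def complete_job(self, job_id):
            self.accumulator ^= job_id      # XOR in again at completion

        def all_done(self):
            # Zero only if every started job has also completed
            # (up to a ~2^-128 chance of a false positive).
            return self.accumulator == 0

Because XORing the same ID twice cancels out, the order of starts and completions doesn't matter, and the tracker never grows beyond 128 bits.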
This is basically how Apache Storm's fault tolerance worked. The problem I have with it is: what do you do when one of your workers fails and you're not sure whether it failed before or after sending its job-completion message? Storm just restarts everything, I believe, which is not great. If you only restart that one job, you could be left with 'dangling' acks.
Spark is a great technology, for sure. I was hesitant to get into Spark because I have lots of experience writing Hadoop MapReduce apps. Then a while back I decided to base all of the machine learning examples in my current book project on Spark and MLlib, and I am happy with that decision.
As the article mentioned, IBM certainly did validate the Linux "market." When people would ask me what was great about Linux I used to just say that IBM was investing billions in Linux, and that was an acceptable answer for people.
Curious about what your concerns with Spark were? I work for a company that supports Spark development, but I don't work closely with that project, so my opinion is not sufficiently well-informed, and obviously biased.
As far as I know, virtually any MapReduce job can be translated fairly trivially to Spark's .map() and .reduce() operations (see the sketch below). The downsides are that its model hasn't yet been proven at the largest scales MapReduce has been used at, and possibly the use of Scala (although Java / Python bindings are obviously available). Were there any other major factors in your hesitance?
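For instance, the canonical word-count job looks roughly like this with PySpark's RDD API (the SparkContext `sc` and the paths are just placeholders):

    # Word count, the classic MapReduce example, via Spark's RDD operations.
    # Assumes an existing SparkContext `sc`; the paths are hypothetical.
    counts = (sc.textFile("hdfs:///input/books")
                .flatMap(lambda line: line.split())   # "map" phase: split lines into words
                .map(lambda word: (word, 1))          # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))     # "reduce" phase: sum the counts
    counts.saveAsTextFile("hdfs:///output/wordcounts")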
I didn't have concerns about Spark, rather I already felt comfortable with Hadoop.
Another issue is that I am sort of retired now. I still accept small consulting jobs and do a lot of writing but my technology choices have shifted to fun things like Pharo Smalltalk, Haskell, etc.
"At the core of this commitment, IBM plans to embed Spark into its industry-leading Analytics and Commerce platforms, and to offer Spark as a service on IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its breakthrough IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark."
It will be interesting to see if IBM now bets big on their IPython kernel for Spark - https://github.com/ibm-et/spark-kernel. I've looked at it, and it's way behind Zeppelin and even Spark Notebook. An "Eclipse for Spark" as a notebook-style IDE would be a game-changer.
It's even more interesting to observe the dynamics in this increasingly open source world of software.
By deciding to sponsor Spark, I think IBM is practically becoming its owner, without having had to do anything prior to this move. Does that mean it is possible today to "acquire" a technology project by naming your own price?
This is a really interesting phenomenon: existing enterprises taking lead roles in open source projects in such a way that they almost look like they "own" them. It's not the first time either; Red Hat did exactly this with Linux in some ways, for example.
Teradata recently made a similar move with PrestoDB, which had Facebook as its owner but no dedicated platform-technology backer. And MapR did the same with Apache Drill.
I have been telling everyone I know for a long time: open source is not just a movement, it is a strategy. The potential for change and/or disruption that open source creates is amazing.
And that potential may not always be good; just look at Hadoop fragmentation as an example (although there is a lot of good in that fragmentation as well).
Yes and no. You can steer an open-source project up to a point, but if you turn too far away from what the community wants then they will take matters into their own hands - see e.g. Joyent / node.js / io.js.
Maybe "owner" is not the right word. But IBM is definitely going to be associated with this technology in an exclusive manner. It's like how when you say Bootstrap or Storm you have to mention they're by Twitter, or when you talk about AngularJS you mention it's by Google.
They are all open source, but the main sponsor always ends up taking over the mindshare.
In this case I'd suggest IBM can become to Spark what Google is to Angular, without even having been there from the start.
Dumb idea? Maybe...
SystemML (which is one of the technologies they are donating) looks very interesting:
Declarative large-scale machine learning (ML) in SystemML aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations to distributed computations on MapReduce or Spark. ML algorithms are expressed in an R-like syntax that includes linear algebra primitives, statistical functions, and ML-specific constructs.
The Register's title [1] is a bit more brutal. I wonder how this investment will be spread among committers, tooling, etc.
From an open source platform perspective, it raises interesting questions in terms of finding the right balance for management, as well as sustainability of the project.