I think that if this were true, the data science libraries in Python wouldn't all be written in C under the hood, linking to Fortran, etc. Language can have a huge impact on performance with big data.
Julia is just a small step forward. It still has performance problems where the JIT can't help much - like the DataFrame structure, which is effectively a black box and therefore hard to optimize.
If you need performance on these kinds of structures, you can use structured arrays in Python.
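For the curious, here is a minimal sketch of the structured-array idea mentioned above, using NumPy (assumed available; the field names and data are made up for illustration). The point is that column types are declared up front, so the data isn't a black box of boxed Python objects:

```python
import numpy as np

# One record per row: a 32-bit int id and a 64-bit float value.
# Fixed dtypes let NumPy store rows contiguously and run typed loops.
records = np.array(
    [(1, 0.5), (2, 1.5), (3, 2.5)],
    dtype=[("id", "i4"), ("value", "f8")],
)

# Column access returns a typed view, not a list of Python objects.
print(records["value"].mean())  # 1.5
```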
The main reason I like the idea of Julia is that I don't see the problem of parallelism being addressed in Python, at least not without cumbersome libraries (disclaimer: I haven't tried dask yet). If Python had an equivalent to
#pragma omp parallel for
I wouldn't have Julia ready in mind for my next project that requires creating a high performance algorithm.
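To illustrate the "cumbersome" point: the closest stdlib approximation to that one-line pragma looks something like the sketch below (the loop body is a made-up placeholder). Note the caveats that the pragma doesn't have: threads share memory but hit the GIL on CPU-bound work, while process pools sidestep the GIL but require picklable top-level functions and add serialization overhead.

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    # Stand-in for the per-iteration work of the parallel loop.
    return i * i

# Roughly: "parallel for i in range(8)". For CPU-bound bodies you'd
# swap in ProcessPoolExecutor to get around the GIL, at the cost of
# pickling arguments and results across process boundaries.
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(body, range(8)))

print(results)
```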
This is not quite accurate as stated: the problem is the combination of a data structure that is not amenable to type inference with an API that would only be efficient if type inference had perfect knowledge of the contents of a DataFrame. There is a lot of work being done to develop a new API that has none of these problems and provides very high performance, which demonstrates that Julia's JIT is up to the task so long as you choose an appropriate API. Julia, like any language, has intrinsic limitations, but this is not a good example: it's an example of how good API design for Julia differs from good API design for other systems.
> This is not quite accurate as stated: the problem is the combination of a data structure that is not amenable to type inference with an API that would only be efficient if type inference had perfect knowledge of the contents of a DataFrame.
Sorry if I implied otherwise - what I meant is that the current DataFrame structure is hard for the JIT to optimize, not that this is a fundamental limitation of Julia. Just that JITs are limited. I shouldn't have said Julia is only a small step forward, I think. A JIT gets you a long way.
> There is a lot of work being done to develop a new API that has none of these problems and provides very high performance, which demonstrates that Julia's JIT is up to the task so long as you choose an appropriate API.
Can you link to information on this? I'd love to see information on the design around this.
JITs can only do so much by themselves, but the addition of generated functions in Julia gives library developers the ability to modify generated code just before execution. It's almost like having a fully programmable compiler chain with a fraction of the work or effort.
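A rough Python analogue of that idea, for readers who haven't seen Julia's generated functions: build specialized source for a particular shape of input, compile it once, and cache it, rather than interpreting a generic loop on every call. This is only an illustrative sketch (the function names are hypothetical), not how Julia implements it:

```python
# Cache of specialized functions, keyed by record width.
_cache = {}

def specialized_sum(n_fields):
    """Generate an unrolled field-sum function for n_fields-wide records."""
    if n_fields not in _cache:
        # Emit source like "return r[0] + r[1] + r[2]" for n_fields == 3.
        body = " + ".join(f"r[{i}]" for i in range(n_fields))
        src = f"def f(r):\n    return {body}\n"
        ns = {}
        exec(src, ns)  # compile the specialized code once
        _cache[n_fields] = ns["f"]
    return _cache[n_fields]

f3 = specialized_sum(3)
print(f3((1, 2, 3)))  # 6
```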
Python numerical libraries link to LAPACK/BLAS/etc. because that's how people did scientific computing before Python. Also, scientific computing is mostly CPU-bound, which means no one would use Python if the libraries were slower than Fortran or C. But big data problems are different in the sense that you're typically bound by IO, so while it certainly helps to have fast code, it wouldn't be the first problem you'd want to solve.
It depends on what kind of data you're working with and what kind of operations you perform on it. If you're doing protein folding, that's computationally expensive and the language will matter; if you're running mapreduce jobs, language matters little, as network overhead is going to be your bottleneck in a distributed environment.