I think that if this were true, the data science libraries in Python wouldn't all be written in C under the hood, linking to Fortran, etc. Language can have a huge impact on performance with big data.
Julia is just a small step forward. It still has performance problems where the JIT can't help much - like the DataFrame structure, which is effectively a black box and therefore hard to optimize.
If you need performance on these kinds of structures, you can use structured arrays in Python.
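For the curious, here is a minimal sketch of the structured-array idea mentioned above, using NumPy (assumed available; the field names and data are made up for illustration). The point is that column types are declared up front, so the data isn't a black box of boxed Python objects:

```python
import numpy as np

# One record per row: a 32-bit int id and a 64-bit float value.
# Fixed dtypes let NumPy store rows contiguously and run typed loops.
records = np.array(
    [(1, 0.5), (2, 1.5), (3, 2.5)],
    dtype=[("id", "i4"), ("value", "f8")],
)

# Column access returns a typed view, not a list of Python objects.
print(records["value"].mean())  # 1.5
```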
The main reason I like the idea of Julia is that I don't see the problem of parallelism being addressed in Python, at least not without cumbersome libraries (disclaimer: I haven't tried dask yet). If Python had an equivalent to
#pragma omp parallel for
I wouldn't have Julia ready in mind for my next project that requires creating a high performance algorithm.
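To illustrate the "cumbersome" point: the closest stdlib approximation to that one-line pragma looks something like the sketch below (the loop body is a made-up placeholder). Note the caveats that the pragma doesn't have: threads share memory but hit the GIL on CPU-bound work, while process pools sidestep the GIL but require picklable top-level functions and add serialization overhead.

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    # Stand-in for the per-iteration work of the parallel loop.
    return i * i

# Roughly: "parallel for i in range(8)". For CPU-bound bodies you'd
# swap in ProcessPoolExecutor to get around the GIL, at the cost of
# pickling arguments and results across process boundaries.
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(body, range(8)))

print(results)
```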
This is not quite accurate as stated: the problem is the combination of a data structure that is not amenable to type inference with an API that would only be efficient if type inference had perfect knowledge of the contents of a DataFrame. There is a lot of work being done to develop a new API that has none of these problems and provides very high performance, which demonstrates that Julia's JIT is up to the task so long as you choose an appropriate API. Julia, like any language, has intrinsic limitations, but this is not a good example: it's an example of how good API design for Julia differs from good API design for other systems.
> This is not quite accurate as stated: the problem is the combination of a data structure that is not amenable to type inference with an API that would only be efficient if type inference had perfect knowledge of the contents of a DataFrame.
Sorry if I implied otherwise - what I meant is that the current DataFrame structure is hard for the JIT to optimize, not that this is a fundamental limitation of Julia. Just that JITs are limited. I shouldn't have said Julia is only a small step forward, I think. A JIT gets you a long way.
> There is a lot of work being done to develop a new API that has none of these problems and provides very high performance, which demonstrates that Julia's JIT is up to the task so long as you choose an appropriate API.
Can you link to information on this? I'd love to see information on the design around this.
JITs can only do so much by themselves, but the addition of generated functions in Julia gives library developers the ability to modify generated code just before execution. It's almost like having a fully programmable compiler chain with a fraction of the work or effort.
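A rough Python analogue of that idea, for readers who haven't seen Julia's generated functions: build specialized source for a particular shape of input, compile it once, and cache it, rather than interpreting a generic loop on every call. This is only an illustrative sketch (the function names are hypothetical), not how Julia implements it:

```python
# Cache of specialized functions, keyed by record width.
_cache = {}

def specialized_sum(n_fields):
    """Generate an unrolled field-sum function for n_fields-wide records."""
    if n_fields not in _cache:
        # Emit source like "return r[0] + r[1] + r[2]" for n_fields == 3.
        body = " + ".join(f"r[{i}]" for i in range(n_fields))
        src = f"def f(r):\n    return {body}\n"
        ns = {}
        exec(src, ns)  # compile the specialized code once
        _cache[n_fields] = ns["f"]
    return _cache[n_fields]

f3 = specialized_sum(3)
print(f3((1, 2, 3)))  # 6
```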
Python numerical libraries link to LAPACK/BLAS/etc. because that's how people did scientific computing before Python. Also, scientific computing is mostly CPU-bound, which means no one would use Python if the libraries were slower than Fortran or C. But big data problems are different in the sense that you're typically bound by IO, so while it certainly helps to have fast code, it wouldn't be the first problem you'd want to solve.
It depends on what kind of data you're working with and what kind of operations you perform on it. If you're doing protein folding, that's computationally expensive and the language will matter; if you're running mapreduce jobs, language matters little, as network overhead is going to be your bottleneck in a distributed environment.