Have you used OpenMP (http://en.wikipedia.org/wiki/OpenMP)? It has the flavor of parfor -- you identify the embarrassingly parallel loops in your C or Fortran, put in something like
#pragma omp parallel for
in front of them, and your code transfers through pretty much intact -- it handles the thread wrappers. You can add other pragmas for the times when you need locking.
This is a much less intrusive setup than CUDA; you don't have to worry about loading data, or double/float conflicts.
The OpenMP extensions could be a very good fit for scientific programming on this coprocessor.
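To make the parfor flavor concrete, here is a minimal sketch in C. The function names `scale` and `sum` are made up for illustration. The first loop has fully independent iterations. The second uses a `reduction` clause, which is one of the "other pragmas" that saves you from writing a lock by hand.

```c
#include <stddef.h>

/* Hypothetical helper: multiply every element of x by a. */
void scale(double *x, size_t n, double a)
{
    /* Iterations do not depend on each other, so OpenMP can
       hand chunks of the index range to different threads. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        x[i] *= a;
    }
}

/* Hypothetical helper: add up the elements of x. */
double sum(const double *x, size_t n)
{
    double total = 0.0;

    /* Each thread accumulates a private copy of total; the copies
       are combined at the end, so no explicit locking is needed. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++) {
        total += x[i];
    }
    return total;
}
```

With GCC you would build this with `-fopenmp`. Without that flag the pragmas are ignored and the same code runs serially, which is a large part of why the approach is so unintrusive.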
Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple's got their (currently less portable, admittedly) Grand Central Dispatch that does something similar. But as far as I know, if you want portable GPU code your only option is OpenCL, and even then it requires optimisation depending on what device you're using it on (or so I've heard).
OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the necessary data on to the device, run the computation, and move it back to the host). In fact, that's one of the ways you can use the Phi right now (Intel have extensions to OpenMP).
Or if you can't be bothered to wait for such a standard, you should have a look at OpenACC[1], which does exactly this, and exists now. You end up adding code like
#pragma acc kernels loop
on top of your for loops, and it does the low-level work for you.
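As a rough sketch of what that looks like in C, here is a saxpy loop. The function name and the choice of data clauses are my own illustration: `copyin` asks for `x` to be copied to the device, and `copy` asks for `y` to be copied there and back again afterwards.

```c
/* Hypothetical helper: y = a*x + y, offloaded via OpenACC. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* copyin: x is only read on the device.
       copy:   y is sent to the device and brought back when the loop ends. */
    #pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

A compiler without OpenACC support just ignores the pragma and runs the loop on the host, so you can keep a single source file for both cases.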