You know, with all the "C++11 wooh C++ is back" and "Functional programming is the salvation of efficient computation in a multicore environment", you'd think there'd be a few more attempts out there to provide libraries that combine the two.
But all I've found is a single library called FC++, which is mostly lambdas and lazy lists, and one GitHub repository that could serve as a basis and takes advantage of shared_ptr, which is a clever way to share memory, though the current implementation does come at the expense of locality.
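Just to make the shared_ptr idea concrete, here's a minimal toy sketch of the kind of immutable, structurally-shared list a functional C++ library tends to be built around; this is my own illustration (the Node/List/cons names are mine), not code from FC++ or that repository:

#include <iostream>
#include <memory>

// An immutable cons cell; the tail is shared by every list that extends it.
template <typename T>
struct Node {
    T head;
    std::shared_ptr<const Node<T>> tail;
    Node(T h, std::shared_ptr<const Node<T>> t) : head(std::move(h)), tail(std::move(t)) {}
};

template <typename T>
using List = std::shared_ptr<const Node<T>>;

// "cons" never mutates anything: it allocates one new node and shares the whole tail.
template <typename T>
List<T> cons(T head, List<T> tail) {
    return std::make_shared<const Node<T>>(std::move(head), std::move(tail));
}

int main() {
    List<int> xs = cons(1, cons(2, cons(3, List<int>{})));
    List<int> ys = cons(0, xs); // ys shares every node of xs; nothing is copied
    for (auto n = ys; n; n = n->tail)
        std::cout << n->head << ' ';
    std::cout << '\n'; // prints: 0 1 2 3
}

And that's also where the locality cost comes from: every cell is its own heap allocation with its own reference count, so walking the list hops around memory instead of streaming through a contiguous block.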
Maybe I've just become too enraptured by Scala and Haskell of late...
So the thing is, function pointers (and functional style in general) are only good for efficient computation in a subset of multicore environments.
GPU architecture, which is sort of my specialty at this point, is vastly different. Your typical CPU is Multiple Instruction, Multiple Data (MIMD), whereas GPU architecture is much more oriented towards Single Instruction, Multiple Data (SIMD). What this means is that the CPU runs multiple independent instruction streams, one for each of your cores. Functional programming works well there, since each core is off doing its own thing.

The GPU, on the other hand, shares instruction streams heavily. A typical GPU has thousands of cores, but a complex allocation hierarchy that ends up with groups of around 128-512 cores at a time (the size is user specified) all following the same set of instructions. Within these dynamically allocated, user-specified chunks are manufacturer-specific "warps"/"wavefronts" (the term varies by manufacturer/GPGPU language, but they're all basically the same thing; 32 threads for NVidia, 64 for AMD). These warps consist of a number of compute cores all executing exactly the same instructions in lockstep. Even having 2 code paths is very bad, since this setup runs both code paths, simply idling the threads in the warp which do not enter a given branch. So the following pseudocode:
if (warpId < 16)    // warpId: the thread's index within its 32-wide warp
    DoStuff();
else
    DoOtherStuff();
will first run the DoStuff branch, taking all 32 cores along for the ride, then run the DoOtherStuff branch afterwards, again taking all 32 cores along. The first 16 cores will DoStuff while the last 16 simply idle, waiting for the code irrelevant to their path to finish; afterwards, the first 16 cores simply idle while the last 16 DoOtherStuff. Because these warps are part of the GPU hardware architecture, the cores within them cannot be re-allocated to another process while they idle through branches like this. And so, simply by splitting your cores into 2 groups this way, you're wasting a full 50% of your compute power. So, while functional programming can be nice for some applications, it utterly fails on many of the most powerful forms of parallelism.
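For concreteness, here's roughly what that branch looks like inside an actual CUDA kernel; the kernel name, the data, and the two Do*Stuff placeholders are all made up for illustration, and I'm computing the pseudocode's warpId as the thread's lane within its 32-wide warp:

__device__ void DoStuff(float* data, int i)      { data[i] *= 2.0f; }
__device__ void DoOtherStuff(float* data, int i) { data[i] += 1.0f; }

__global__ void DivergentKernel(float* data)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x % 32; // lane within the 32-wide NVidia warp, as in the pseudocode above

    // The whole warp executes both branches, one after the other, with half
    // the lanes masked off each time, so half the warp's cycles go to waste.
    if (warpId < 16)
        DoStuff(data, tid);
    else
        DoOtherStuff(data, tid);
}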
And really, even on the CPU, I doubt it's the most efficient use of resources. You would likely get higher performance on the most parallelizable problems by using a Data-Oriented Programming technique (super easy to load balance, and thus to parallelize efficiently, while also minimizing the cache misses that can destroy performance on modern CPUs). The only real reason you would want to use functional programming for parallelism is for problems which are not already massively parallel. Or, in summary: parallelizing a problem is only the first step, and usually not the hard one. The next step is taking the hardware architecture into account and making the solution as efficient as possible. Doubling your compute power for a 1.2x speedup is nice; doubling your compute power for a 1.95x speedup is better.
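To give a rough idea of what I mean by data-oriented (again, a toy sketch of my own; the particle example and names are just for illustration): you lay each field out as its own contiguous array, so a pass that only touches positions and velocities streams straight through exactly the memory it needs, and splitting the index range evenly across cores is trivial.

#include <cstddef>
#include <vector>

// Array-of-structures: each particle's fields sit together, so a pass that
// only needs positions and velocities still drags mass through the cache.
struct ParticleAoS { float x, y, z, vx, vy, vz, mass; };

void IntegrateAoS(std::vector<ParticleAoS>& ps, float dt) {
    for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; p.z += p.vz * dt; }
}

// Structure-of-arrays: each field is contiguous in memory, so the same pass
// streams through only the arrays it actually touches, and carving the index
// range into even chunks for each core (or a GPU grid) is trivial.
struct ParticlesSoA {
    std::vector<float> x, y, z, vx, vy, vz, mass;
};

void IntegrateSoA(ParticlesSoA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
        ps.z[i] += ps.vz[i] * dt;
    }
}

Same work either way; the second version just respects how the memory system and the schedulers actually want to see the data.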