It would probably be better if the problem was attacked at the root, i.e. if DF itself was redesigned to support multithreading. One clear candidate for a separate thread is the calculation of events external to the fortress (at least in fortress mode), since those interact very little with the fortress itself (probably only through diplomat/liaison reports) and thus need little in the way of access protection and separation. Another is graphics, a third is equipment wear, and a fourth is liquid flow and temperature calculations (there are probably others as well).
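Roughly what I have in mind, as a minimal sketch: all the struct and function contents below are invented placeholders, since we don't have DF's internals, and the point is only that the off-site simulation needs just one small, locked touch point with the fortress.

[code]
// Hedged sketch, not DF's real code: every name here is an invented stand-in.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct WorldState    { int pending_reports = 0; };  // caravans, armies, diplomats...
struct FortressState { int liaison_reports = 0; };  // dwarves, items, the map...

std::mutex report_mutex;   // guards the single narrow interaction point

// Off-site simulation: barely touches the fortress, so it can live on its own thread.
void world_thread(WorldState& world, FortressState& fort, std::atomic<bool>& running) {
    while (running) {
        ++world.pending_reports;                    // pretend some off-site event happened
        {
            std::lock_guard<std::mutex> lock(report_mutex);
            fort.liaison_reports += world.pending_reports;  // the one shared touch point
            world.pending_reports = 0;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

int main() {
    WorldState world;
    FortressState fort;
    std::atomic<bool> running{true};

    std::thread worker(world_thread, std::ref(world), std::ref(fort), std::ref(running));

    for (int tick = 0; tick < 100; ++tick) {        // stand-in for the main fortress loop
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    running = false;
    worker.join();
    std::printf("liaison reports delivered: %d\n", fort.liaison_reports);
}
[/code]

The design choice is that the worker never blocks on the fortress except at the report hand-off, which is exactly why that subsystem is a good first candidate.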
I doubt a GPU would be of much help for most of DF, since GPUs are very good at doing a very narrow range of repetitive tasks really fast, but are quite poor for general computing. There is, after all, a reason CPUs aren't just replaced by GPUs.
Another problem with GPUs is that they are not nearly as standardized as CPUs are, so you may end up with separate binaries for separate GPU configurations.
I would also suspect virtualization will slow things down rather than speed them up, since emulation carries an overhead cost. If there is something that virtualization improves, it could probably be improved further and more cleanly by making the corresponding change in DF itself (though if the virtualization were targeted at a specific architecture, it might be able to use tricks unsuitable for a general solution).
Of course the best solution would be for DF itself to be optimized, but that's not happening any time soon. I've given this quite a bit of thought, and there are a lot of calculations that could be done on the GPU for a great gain. All the map calculations, for instance. The CPU can't handle the map efficiently: not only does it iterate through every tile for temperature, it also iterates through many tiles to calculate flow, plus probably thousands of iterations for pathfinding. On the GPU, temperature could be updated for all tiles in a single parallel pass, and likewise for flow. Pathfinding is more complex, but its iterations could also run in parallel, so the total work would essentially be bounded by the longest single path rather than by all of them together. Offloading those calculations onto the GPU would be very efficient. The GPU also wouldn't have the problem of only caching part of the map; it could hold all of the map data in its own memory, so there is less data being shuffled back and forth constantly.
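To make the "single pass over all tiles" point concrete, here is a hedged sketch of what a per-tile temperature update could look like as a CUDA kernel. The grid size and the neighbour-averaging rule are made up for illustration; DF's actual temperature model is certainly different, but the structure (one GPU thread per tile, whole map updated per pass) is the relevant part.

[code]
// Illustrative only: a made-up diffusion rule on a flat tile grid, not DF's model.
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>
#include <vector>

__global__ void temperature_step(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int i = y * width + x;
    // Every tile averages with its in-bounds neighbours -- one thread per tile,
    // so the whole map updates in one parallel pass instead of a serial loop.
    float sum = in[i];
    int   n   = 1;
    if (x > 0)          { sum += in[i - 1];     ++n; }
    if (x < width - 1)  { sum += in[i + 1];     ++n; }
    if (y > 0)          { sum += in[i - width]; ++n; }
    if (y < height - 1) { sum += in[i + width]; ++n; }
    out[i] = sum / n;
}

int main() {
    const int W = 192, H = 192;                    // one embark-sized layer, say
    std::vector<float> host(W * H, 10000.0f);      // uniform starting temperature
    host[(H / 2) * W + W / 2] = 12000.0f;          // one hot tile (magma, perhaps)

    float *d_in, *d_out;
    cudaMalloc(&d_in,  W * H * sizeof(float));
    cudaMalloc(&d_out, W * H * sizeof(float));
    cudaMemcpy(d_in, host.data(), W * H * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((W + 15) / 16, (H + 15) / 16);
    for (int step = 0; step < 100; ++step) {       // one whole-map pass per step
        temperature_step<<<grid, block>>>(d_in, d_out, W, H);
        std::swap(d_in, d_out);                    // ping-pong buffers between passes
    }

    cudaMemcpy(host.data(), d_in, W * H * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("centre tile after 100 steps: %.1f\n", host[(H / 2) * W + W / 2]);
    cudaFree(d_in);
    cudaFree(d_out);
}
[/code]

Notice the map lives entirely in GPU memory and only the final result comes back, which is the "no shuffling the map back and forth" part of the argument.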
And that's the idea behind the emulation here. While it does add overhead, an efficient VM can come close to native performance. Separate binaries would be an issue for the VM, not for DF, and could be managed. The calculations that aren't suitable for the GPU, and there are many, could still be done separately. And with some clever memory mapping, we might even be able to perform some of DF's calculations in parallel. Of course, it wouldn't truly be parallel; it would be more like prediction: the VM computes a result ahead of time, and when DF finally calls the specific function, it finds the calculation already done. This would be difficult, if it's possible at all. It's an idea, an intriguing idea, but even if the project were possible, it certainly wouldn't be feasible for a single person.
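In spirit, the "predict it and have the answer ready" part would look something like this toy sketch, where expensive_work() is an invented stand-in for whatever routine the VM would try to precompute (actually doing this against a closed binary is, of course, the hard part).

[code]
// Toy sketch of "predict, precompute, and hope the real call hits the cache".
// expensive_work() is an invented stand-in, not anything from DF.
#include <cstdio>
#include <future>
#include <map>
#include <mutex>

static long expensive_work(int input) {            // pretend this is a costly DF routine
    long acc = 0;
    for (int i = 0; i < 1000000; ++i) acc += (acc + input + i) % 97;
    return acc;
}

std::mutex cache_mutex;
std::map<int, std::shared_future<long>> cache;     // results keyed by predicted input

// The "VM" guesses which inputs will be needed soon and starts them early.
void speculate(int predicted_input) {
    std::lock_guard<std::mutex> lock(cache_mutex);
    if (!cache.count(predicted_input))
        cache[predicted_input] =
            std::async(std::launch::async, expensive_work, predicted_input).share();
}

// When the real call finally happens, use the precomputed result if the guess was right.
long call(int input) {
    std::unique_lock<std::mutex> lock(cache_mutex);
    if (cache.count(input)) {
        auto f = cache[input];
        lock.unlock();
        return f.get();                            // prediction paid off
    }
    lock.unlock();
    return expensive_work(input);                  // misprediction: pay the full cost now
}

int main() {
    speculate(42);                                 // guess the next request...
    std::printf("result: %ld\n", call(42));        // ...and the guess pays off
}
[/code]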
Even if it did work, it would be such a hack, heavily dependent on disassembling DF and plagued by compatibility issues between versions, that it wouldn't be suitable for normal use without a team supporting it. I'd say the community is strained enough supporting DFHack, let alone a Frankenstein-esque VM designed to work around an inefficient program. To Toady's credit, some of the optimizations that might have the biggest benefits would also be very difficult, so there is certainly no fault in not doing them, although some simple multithreading in a few key places would go a long way, as you mentioned.