Was apparently ninjaed even before I started this, and now six more replies, but I've started so I'll finish, and my answer (which could do with being trimmed a bit) is still more or less going to be:
Running a single process on a single core runs instructions in order, e.g. Instruction 1, Instruction 2, Instruction 3, etc.
Running multiple processes on a single core needs each process's instruction queue to be time-shared and dealt with a bit at a time, e.g. Instruction A1, Instruction B1, Instruction C1, Instruction A2, Instruction B2, Instruction C2.
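In toy terms (a sketch only; real schedulers work on time slices and real instructions, not strings, and everything here is made up for illustration), that interleaving looks something like:

```cpp
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// Toy model: each "process" is just a queue of named instructions, and
// the single "core" (this loop) round-robins over the queues, taking
// one instruction from each per pass.
int main() {
    std::vector<std::deque<std::string>> processes = {
        {"A1", "A2", "A3"},
        {"B1", "B2", "B3"},
        {"C1", "C2", "C3"},
    };
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (auto& queue : processes) {
            if (queue.empty()) continue;
            std::printf("executing %s\n", queue.front().c_str());
            queue.pop_front();
            work_left = true;
        }
    }
}
```

which prints A1, B1, C1, A2, B2, C2, A3, B3, C3: one core, three processes, everybody waits their turn.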
If you have multiple cores, you can (ideally, perhaps) have one running Instructions A1, A2, A3, etc. while another is running Instructions B1, B2, B3, etc., faster than a single core could possibly cycle round them both.
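Something like this (again a sketch; run_stream is my own made-up name, and whether the OS actually puts each thread on its own core is its decision, not yours):

```cpp
#include <cstdio>
#include <thread>

// One independent instruction stream per thread; with two free cores
// the OS can run both streams at the same time.
void run_stream(char name, int count) {
    for (int i = 1; i <= count; ++i)
        std::printf("Instruction %c%d\n", name, i);
}

int main() {
    std::thread core_a(run_stream, 'A', 3);  // A1, A2, A3
    std::thread core_b(run_stream, 'B', 3);  // B1, B2, B3
    core_a.join();
    core_b.join();
}
```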
But if you have just one process, are you going to put Instructions 1, 3, 5 on Core A and Instructions 2, 4, 6 on Core B? (Or whatever combo you need when you've got Cores C, D, E, F, etc.) Well, that would shift a set of non-branching instructions requiring absolutely no immediate reference to the result of the previous instruction (which is still being worked on over on the other core, in your effort to save time). But as soon as either complication occurs (and others, left out for simplicity), you're going to have an overhead of "shouldn't have run that, forget it", or you're just having to NOOP while you wait to see what you're now meant to be accessing. Never mind that you're probably sharing a bus to the shared cache, and another, slower one to the main memory (and, onwards, to any disk storage, whether virtual memory or actual file data).
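To make the dependency half of that concrete (a contrived sketch, names all mine):

```cpp
#include <cstdio>

// Each "instruction" needs the result of the one before it, so putting
// Instructions 1 and 3 on Core A and Instruction 2 on Core B gains
// nothing: Core B just stalls (or speculates and throws the work away)
// while it waits for Core A's result.
long dependent_chain(long x) {
    x = x * 3 + 1;  // Instruction 1
    x = x * 3 + 1;  // Instruction 2: needs Instruction 1's result
    x = x * 3 + 1;  // Instruction 3: needs Instruction 2's result
    return x;
}

int main() {
    std::printf("%ld\n", dependent_chain(1));  // 1 -> 4 -> 13 -> 40, strictly in order
}
```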
Which is not to say that you couldn't get something to work that way, by deliberately coding so that result dependencies aren't needed until later, so that decision branches are known well in advance, and so that the work operates nicely on a clump of data that can reside in the respective non-shared cache for a period before synchronising back to the shared areas. But then you're effectively programming for parallel processing... Taking bare (single-threaded) code and pushing it through the system is going to have varying, but probably not fortunate, results.
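For contrast, here's what the deliberately-parallel version of a trivial job looks like (a sketch under the assumptions above: the two chunks have no cross-dependencies, and we only synchronise once at the end):

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Split the data into two independent chunks, let each thread chew on
// its own chunk (which can sit in that core's own cache), and combine
// the partial results in a single synchronisation at the end.
int main() {
    std::vector<long> data(1'000'000, 1);
    auto mid = data.begin() + data.size() / 2;

    long partial_a = 0, partial_b = 0;
    std::thread core_a([&] { partial_a = std::accumulate(data.begin(), mid, 0L); });
    std::thread core_b([&] { partial_b = std::accumulate(mid, data.end(), 0L); });
    core_a.join();
    core_b.join();

    std::printf("total = %ld\n", partial_a + partial_b);  // 1000000
}
```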
Massive multi-processing is an interesting thing: the kind where people get 100+ games consoles and put them in racks to handle subdivisions of a program. But even then, there's an amount of speculative processing being done, on the assumption that another calculation will come back and say it was OK (grabbing separate blocks of data, running them in isolation, then recombining and re-splitting accordingly).
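In miniature, the speculative part looks like this (a sketch of the idea only, nothing like a real console-rack setup):

```cpp
#include <cstdio>
#include <thread>

// While the "other calculation" is still deciding which branch matters,
// compute both candidate results in isolation, then keep one and throw
// the other away once the decision arrives.
int main() {
    int if_taken = 0, if_not_taken = 0;
    std::thread worker_a([&] { if_taken = 2 * 21; });          // speculate: branch taken
    std::thread worker_b([&] { if_not_taken = 2 * 21 + 1; });  // speculate: branch not taken
    worker_a.join();
    worker_b.join();

    bool branch_taken = true;  // the verdict finally comes in
    std::printf("kept %d, wasted the other\n", branch_taken ? if_taken : if_not_taken);
}
```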
As I haven't worked (at any appreciably low level) with this sort of stuff for a decade, there are probably newer thoughts (as well as details I've been forced to leave out or smudge), but... TL;DR: single-threaded code doesn't work that way.
(Although I'm sure we're supposed to have the graphics on a separate thread from the physics, more or less, in current DF. And if you have a good multi-core system (and Windows and the firmware, combined, do their job of scheduling all processes along a path of least resistance), you might at least get your multi-core speed-up because one core gets one half of DF, another core gets the other half of DF, and all the rest of your 126 cores[1] are handling all the AV scanning, Blu-ray ripping, torrent-streaming, encryption-breaking, the additional seventy separately initiated Worldgens, and (above all!) the process scheduling, without slowing your main game down at all... Right?)
[1] On a speculatively futuristic system, here...