So... can I just check I'm taking the right information out of this?
1. Modern computer speed gains are almost entirely dependent on multi-threading.
Not really... but the gains from everything else are hitting diminishing returns. The three major categories of improvement *other* than "more cores" are raw clock speed, architecture improvements, and process size. Raw clock speed improvements have been leveling off for some time, process size improvements (die shrinks) are leveling off sharply, and architecture improvements are starting to level off.
Consider:
- This chart of single-thread performance
- This 2012 article on historical single-core performance

Back in the earlier days of DF, single-core performance was climbing at around 50% per year. Somewhere around 10 years ago, it started to level off toward 20% per year by some measures, or even negative by others. And most of the gain was from Intel.
Clock speeds have leveled off hard; a nice but not exceptional CPU from 2008 might be 3.3 GHz, and fairly few people today have CPUs over 4 GHz. In terms of single-thread performance, that 2008 CPU scores about 700% of an (older) baseline, and the best available desktop today for single thread is only about 1,200%... 1,200 / 700 ≈ 1.7, so that's only about a 70% increase in *eight years*.
2. Multi-threading is different from most other optimising, which would be doing little tricks with the code, rather than changing the way the computer runs DF.
Optimizing, for multi-threading or otherwise, is rarely little tricks. It's frequently a very time-consuming process involving code analyzers, and it doesn't port well; beyond the simplest stuff like taking out debug code, optimizing for 64-bit Linux on an Intel CPU with the gcc compiler may be completely different from optimizing for 32-bit Windows on an Atom with Visual Studio, or whatever.
In particular, optimizing for newer CPUs frequently breaks compatibility with older ones. A significant fraction of the speedup in modern CPUs comes from better ways of doing things, which aren't applicable to or simply don't work on older ones. Toady has been pretty good about leaving in support for older systems, partly because performance increases aren't as big of a concern this early, and maintaining multiple builds was a major hassle with his older setup; so having one "runs everywhere" build was the simplest solution.
Among other things, the 64-bit build doesn't need to support anything older than around 2000 at all, basically the Pentium 4 era, and quite possibly nothing older than 2003 or so. If we are lucky, cleaning out the 386, 486, etc. code paths may buy more improvement than the minor cost of the wider word length.
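To put a concrete face on that (purely illustrative; this is not DF's actual code): x86-64 guarantees SSE2, so a 64-bit build can compile in a vectorized path unconditionally, while a one-size-fits-all 32-bit build has to keep a scalar fallback for pre-Pentium 4 hardware.

```cpp
// Sketch only: summing a per-tile temperature array, names made up.
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics -- part of the x86-64 baseline

int32_t sum_temperatures(const int32_t* t, std::size_t n) {
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)  // four 32-bit values per instruction
        acc = _mm_add_epi32(acc,
              _mm_loadu_si128(reinterpret_cast<const __m128i*>(t + i)));
    alignas(16) int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) sum += t[i];  // leftover elements
    return sum;
}
#else
// Scalar fallback: the only version a "runs everywhere" 32-bit build can assume.
int32_t sum_temperatures(const int32_t* t, std::size_t n) {
    int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += t[i];
    return sum;
}
#endif
```

Multiply that by every hot loop and you can see why supporting 486-era hardware and squeezing modern hardware at the same time is a pain.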
3. Multi-threading, as opposed to optimising the way the linear computing works, would be more efficiently and easily implemented now rather than later.
Multi-threading *today's* code for *today's* computers would be more efficient done sooner. Given that we really don't know what the computers of tomorrow will bring, it's highly speculative to say that optimization for the computers of today would help for the computers around when DF is closing in on 1.0. How much of the effort spent optimizing for the new Pentium CPUs of 20+ years ago would still be valid today? Much of the code has been replaced, and the architecture looks completely different; that's at least the sort of difference we may be looking at.
One very possible answer is that CPUs may go the way of GPUs, with *thousands* of micro-cores. In that case optimization might actually look like assigning a core (thread) per map tile and/or per dwarf, plus a few hundred for overhead. Each micro-core would be responsible for tracking a "Game of Life"-like simple state set for fluids like water, magma, and sand, plus some info about obstructions and occupancy.

(You could do this with GPGPU / OpenCL / CUDA programming on a GPU today, mind you; but that's very non-trivial to set up and would require a *second* powerful GPU in your system, other than the one running your actual graphics, which is still rare. There is a system down the hall that I work with that has four Tesla C2075 cards; each has fourteen 32-core subprocessors for 448 total cores, and 6 GB of on-board RAM, which sounds like a lot... but it's only about 12 MB per core after ECC overhead and dividing it up. Still, that's a 1,792-core system; it could, for instance, process a 32x32 sub-region in a single pass.)
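As a rough sketch of what that per-tile model could look like (a toy rule with made-up names, not anything DF actually does): because each tile's next state depends only on the *current* state of its neighbors, every tile can be updated independently, whether by a handful of CPU threads today or by one micro-core per tile later.

```cpp
// Toy "Game of Life"-style fluid pass: read from the current grid,
// write to a separate next grid, one independent update per tile.
#include <algorithm>
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

struct Tile { uint8_t water = 0; bool wall = false; };

constexpr int W = 256, H = 256;

void step_fluids(const std::vector<Tile>& cur, std::vector<Tile>& next) {
    std::vector<int> idx(cur.size());
    std::iota(idx.begin(), idx.end(), 0);

    // std::execution::par hands the loop to however many threads the
    // machine has; on a thousand-core part, that's a core per few tiles.
    std::for_each(std::execution::par, idx.begin(), idx.end(), [&](int i) {
        const int x = i % W, y = i / W;
        const Tile& t = cur[i];
        if (t.wall) { next[i] = t; return; }
        // Toy rule: new water level is the average of the tile and its
        // open orthogonal neighbors.
        int sum = t.water, count = 1;
        auto look = [&](int nx, int ny) {
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) return;
            const Tile& nb = cur[ny * W + nx];
            if (!nb.wall) { sum += nb.water; ++count; }
        };
        look(x - 1, y); look(x + 1, y); look(x, y - 1); look(x, y + 1);
        next[i] = Tile{ static_cast<uint8_t>(sum / count), false };
    });
}
```

The same structure maps almost directly onto an OpenCL/CUDA kernel, which is exactly why the GPU comparison comes up.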
If quantum processors kick in, one of the things that might be relevant is pathfinding; splitting up pathfinding into 2^(n-1) threads, where n is the number of qubits in your quantum card, might be the thing to do.
4. Whether or not multi-threading could actually work really depends on how important the order in which tasks are completed is in DF: whether the computer can do all or any of them at once, or if there's a set order.
That is ultimately adjustable. For instance, the current fluid flow probably isn't the best way to do it if you're going massively parallel; but the subtle details of how fortresses flood under pressure may well change in the future as that is brought up to more modern standards. I wouldn't count on u-bend tricks, wave behavior to trigger pressure plate timers, and the like working the same way ten years out, for certain.
The important trick for multi-threading is *get the low-hanging fruit first*.
Amdahl's law will limit you eventually; nothing as complex as DF will ever be completely parallel, and cross-process communication takes an increasing toll. It's also possible that things like memory access bottlenecks dominate performance so that parallel processing doesn't even help much; if the limiting factor is really how much information you can move between main RAM and the CPU die, how many threads you're running on the die doesn't make much difference.
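To put numbers on Amdahl's law (toy figures, not measurements of DF): if a fraction p of the work parallelizes perfectly over N cores, the best case is S = 1 / ((1 - p) + p/N).

```cpp
#include <cstdio>

// Amdahl's law: upper bound on speedup when a fraction p of the work
// parallelizes across n cores and the remaining (1 - p) stays serial.
double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    std::printf("p=0.90, n=8    -> %.2fx\n", amdahl(0.90, 8));     // ~4.71x
    std::printf("p=0.90, n=1000 -> %.2fx\n", amdahl(0.90, 1000));  // ~9.91x
    std::printf("p=0.99, n=1000 -> %.2fx\n", amdahl(0.99, 1000));  // ~91x
}
```

Even with 90% of a tick parallelized, a thousand cores top out under 10x, because the serial remainder dominates; and that's before memory bandwidth gets a say.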