The idea in my head looks like this:
With one thread, you do x+y, then you decide you need to do +z. z is too big for the cache, so the CPU has to wait on RAM.
With 100 threads (a hypothetical number), there is already another thread whose z has already been fetched from RAM, so the CPU isn't idle; it can do something else. Even all 4 CPUs (or however many you have) can do something else.
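A minimal sketch of that idea in C++ (the workload, sizes, and thread count are all invented for illustration, nothing here is DF's actual code): several dependent pointer chases run at once, so while one chain is stalled waiting on a RAM fetch, another can make progress.

```cpp
// Hypothetical latency-hiding demo: same total number of dependent loads,
// run as one chain vs. several concurrent chains.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// One dependent chain: each load's address comes from the previous load,
// so the core spends most of its time waiting on RAM.
static uint32_t chase(const std::vector<uint32_t>& next, uint32_t start, size_t steps) {
    uint32_t i = start;
    for (size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

int main() {
    const size_t N = size_t{1} << 26;   // 256 MiB of uint32_t: far bigger than any cache
    std::vector<uint32_t> next(N);
    std::iota(next.begin(), next.end(), 0u);
    std::shuffle(next.begin(), next.end(), std::mt19937{42}); // random access pattern

    const size_t total_steps = size_t{1} << 24;

    // Same total work, split over nthreads independent chains.
    auto run = [&](unsigned nthreads) {
        std::atomic<uint64_t> sink{0};  // keeps the chases from being optimized away
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] { sink += chase(next, t, total_steps / nthreads); });
        for (auto& th : pool) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    std::printf("1 thread : %.2f s\n", run(1));
    std::printf("4 threads: %.2f s\n", run(4)); // "or however many you have"
}
```

On real hardware the benefit of this trick tops out at however many outstanding memory requests the memory system can keep in flight, not at the thread count, so it helps but doesn't make the RAM problem go away.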
But fetched from RAM to *where*? The whole point is that all of the on-die cache put together is smaller than some of the individual data structures DF may need to iterate over. So it has to read a piece from system RAM, work on it, then send it back and get the next piece. In general terms, IF this is the limiting factor, there are four sorts of hypothetical improvement one can make to that kind of process:
* Reduce the size of the data involved so it fits into on-die cache. Usually not applicable, but at some future point we might find that, with continuing improvements in hardware, a no-history 1x1 hermit fort or something would fit entirely on the chip, and it would run dramatically faster than larger ones (the first sketch after this list shows how large the cache-vs-RAM gap can be). Few improvements are expected here in the near future.
* Reduce the number of separate passes that have to be made over the data per "tick" (the second sketch after this list shows the idea). This is largely a compiler optimization issue, although turning features like temperature and weather off can help if the user is willing to accept the reduction in quality. Toady jumping up a couple of compiler versions may have brought some improvement here, although I'd not expect further huge gains.
* Significant increases in the amount of low-level cache; this is Intel's responsibility, and huge jumps aren't likely. Theoretically, you could design a variant of a CPU package that would normally carry a large number of compute cores, but instead had only a few cores swimming in a huge L2 / L3 cache. It would be a niche product, however.
* Improvements in the overall speed of system RAM access, whether in FSB speed, actual RAM latency, the memory controller, or whatever. Moderate improvement is likely here due to continuing progress at Intel, but dramatic improvement is not expected.
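First, a rough, self-contained sketch of the cache-vs-RAM gap behind that first bullet, with invented sizes and an invented access pattern: both runs do the same number of element reads, but one working set fits in L1 cache and the other has to come from system RAM.

```cpp
// Hypothetical benchmark: identical work over a cache-sized vs. RAM-sized
// working set.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Touch `ops` elements with a large odd stride so the hardware prefetcher
// can't hide the misses; the vector size must be a power of two.
static double touch(const std::vector<uint32_t>& data, size_t ops) {
    const size_t mask = data.size() - 1;
    size_t i = 0;
    uint64_t sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t op = 0; op < ops; ++op) {
        sum += data[i];
        i = (i + 4099) & mask;
    }
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::printf("  (checksum %llu)\n", (unsigned long long)sum); // keep `sum` live
    return secs;
}

int main() {
    const size_t ops = size_t{1} << 26;            // identical work for both runs
    std::vector<uint32_t> small(size_t{1} << 13);  // 32 KiB: fits in L1
    std::vector<uint32_t> large(size_t{1} << 26);  // 256 MiB: nowhere near cache
    std::printf("cache-sized set: %.3f s\n", touch(small, ops));
    std::printf("RAM-sized set  : %.3f s\n", touch(large, ops));
}
```

On a typical desktop the RAM-sized run takes several times longer despite doing identical arithmetic; that gap, not raw compute, is what "RAM access limited" means here.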
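Second, a toy sketch of the second bullet, with invented names and fields (this is not DF's actual data layout): two separate passes pull the creature list through the cache twice per tick, while the fused version pulls it through once. In principle a compiler can fuse loops like these itself, which is why it's partly a compiler issue; in practice it often has to be done by hand.

```cpp
// Hypothetical "reduce passes per tick" example via loop fusion.
#include <cstdio>
#include <vector>

struct Creature {        // invented fields, not DF's real layout
    float temperature;
    float wetness;
};

// Two passes: the creature list streams through the cache twice.
void tick_two_passes(std::vector<Creature>& cs, float ambient, float rain) {
    for (auto& c : cs) c.temperature += 0.1f * (ambient - c.temperature);
    for (auto& c : cs) c.wetness += rain;
}

// One fused pass: the same work, but each Creature is fetched from RAM once.
void tick_fused(std::vector<Creature>& cs, float ambient, float rain) {
    for (auto& c : cs) {
        c.temperature += 0.1f * (ambient - c.temperature);
        c.wetness += rain;
    }
}

int main() {
    std::vector<Creature> cs(1000, Creature{10.0f, 0.0f});
    tick_two_passes(cs, 20.0f, 0.5f);
    tick_fused(cs, 20.0f, 0.5f);
    std::printf("sample creature: %.2f deg, %.2f wet\n", cs[0].temperature, cs[0].wetness);
}
```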
Again, we don't know for sure whether DF is constrained this way, or, more usefully, what fraction of the time it is. And this will depend on the rest of the computer as well: it's entirely possible that DF is CPU limited on weaker processors and RAM access limited on stronger ones, for a very simple example. And anything that gets anywhere near a disk, even an SSD, is so ridiculously slow by comparison that even momentary glitches there can have significant effects on overall speed. This discussion assumes you already have enough system RAM to hold all of DF's code, all of DF's working data, and your operating system, web browser, music player, and whatever else; otherwise those will be stealing precious slices of memory access bandwidth through the very slow process of paging or streaming.