Try this for a start:
Even well-researched absolute statements about DF performance are still wrong for some situations, and most direct statements about DF performance are wrong for a lot of situations. (Including this one, of course.)
Optimizing DF is not fundamentally different from optimizing any other heavy-duty program; it has considerable similarity to ordinary real-world tasks such as running large matrix operations in Matlab, or automated circuit routing on complex circuit boards. However, many (if not most) of the reviews and comparisons you'll find on the Internet are made against workloads with considerably different characteristics, such as 3D-heavy games or stream-heavy video transcoding.
At any given combination of gross system resources, competing resource demands, net system resources, and target program state, there will be one element that is the limiting factor (sometimes referred to as the "long pole", as in the long pole in the tent). Improvements to other factors, without changing the long pole, will have little effect, if any. Improving the long pole will have a far more direct effect. After improvement, it may *still* be the long pole, in which case the situation remains similar. At some point, however, the long pole will have been shortened enough that something else becomes the long pole; further improvements to the original long pole will then see sharply diminishing returns.
In the most general terms, optimization comes from getting as much of the actively-manipulated problem space onto the fastest available handlers as practical. These handlers form a hierarchy, which typically looks something like the following, in decreasing order of performance:
* Actual registers, which on some designs may have multiple internal levels
* On-chip cache, usually dedicated to a core and frequently with layers such as L1 and L2
* On-die cache, in modern designs likely to be L3 and shared between cores
* Main RAM, usually these days operating at a unified speed
* "Fast" swap if present (still far slower than RAM), typically an expensive, fast, small hard drive (either high RPM spinning-media or a SSD), but sometimes RAM connected over a slow bus
* Main storage, typically a spinning-media hard drive
* Slow storage, such as network-mounted hard drives, hard drives limited by an external interface such as USB, etc.
* Archival storage, such as tape drives, swappable hard drives, etc.
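You can actually watch those steps happen with a little Science!. The sketch below is my own illustration (plain C, POSIX timer, sizes picked arbitrarily, nothing DF-specific): it chases pointers through a shuffled cycle at several working-set sizes, and the average time per access jumps each time the working set outgrows another cache level.

```c
/* Rough latency probe for the hierarchy above (a sketch, not a tuned
 * benchmark): chase pointers through a random cycle at several working-
 * set sizes. The time per access steps upward roughly each time the
 * working set stops fitting in a cache level (L1 -> L2 -> L3 -> RAM).
 * Uses POSIX clock_gettime; compile with e.g.: cc -O2 probe.c -o probe */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift64 PRNG so the shuffle behaves the same everywhere. */
static unsigned long long rng = 88172645463325252ULL;
static unsigned long long xorshift64(void)
{
    rng ^= rng << 13;
    rng ^= rng >> 7;
    rng ^= rng << 17;
    return rng;
}

int main(void)
{
    /* Working sets from 16 KiB (comfortably in L1) up to 64 MiB (RAM). */
    for (size_t bytes = 16 * 1024; bytes <= 64u * 1024 * 1024; bytes *= 4) {
        size_t n = bytes / sizeof(size_t);
        size_t *chain = malloc(n * sizeof(size_t));
        if (!chain) return 1;

        /* Sattolo's shuffle: one random cycle through every slot, so each
         * load depends on the previous one and the prefetcher gets little help. */
        for (size_t i = 0; i < n; i++)
            chain[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % i);
            size_t tmp = chain[i];
            chain[i] = chain[j];
            chain[j] = tmp;
        }

        const size_t accesses = 20 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < accesses; i++)
            p = chain[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%8zu KiB: %6.1f ns/access  (p=%zu)\n",
               bytes / 1024, ns / accesses, p);
        free(chain);
    }
    return 0;
}
```

On a typical desktop you'd expect something on the order of a nanosecond or two per access while the set fits in L1, tens of nanoseconds once it only fits in L3, and roughly 100ns once it spills into main RAM, which is exactly the ordering in the list above.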
Even a small DF embark is far too large to fit into L3/on-die cache, so the fastest storage that can hold everything DF needs for large-area calculations is main RAM. If your computer cannot load DF's program code, the data for your chosen embark, your operating system's overhead, and anything else you choose to run at the same time into main RAM, then insufficient RAM is almost certainly your long pole. Even a fast hard drive is more than an order of magnitude slower than most RAM, and more commonly two orders of magnitude slower or worse. Other than keeping an eye on memory usage, the other statistic you can use to get a handle on this problem is page faults; you want these to be few. "An average hard disk has an average rotational latency of 3ms, a seek-time of 5ms, and a transfer-time of 0.05ms/page. So the total time for paging comes in near 8ms (8 000 000 ns). If the memory access time is 200ns, then the page fault would make the operation about 40,000 times slower."
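If you want actual numbers rather than a gut feeling, the OS will tell you a process's fault counts directly. The sketch below is Linux-only (it assumes the /proc/<pid>/stat layout from proc(5); Windows exposes similar counters through Task Manager and Performance Monitor) and prints the minor and major fault counts for a given PID, such as a running DF instance. Major faults are the expensive ones -- each is a trip to disk or swap.

```c
/* Minimal sketch (Linux-only): print the minor and major page-fault
 * counts for a process. Per proc(5), minflt is field 10 and majflt is
 * field 12 of /proc/<pid>/stat, counting from 1.
 * Usage: ./faults <pid> */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/stat", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    char buf[4096];
    if (!fgets(buf, sizeof buf, f)) { fclose(f); return 1; }
    fclose(f);

    /* The process name (field 2) may contain spaces, so parse from the
     * last ')' onward; the character after ") " is field 3 (state). */
    char *p = strrchr(buf, ')');
    if (!p) return 1;
    p += 2;

    char state;
    long ppid, pgrp, session, tty, tpgid;
    unsigned long flags, minflt = 0, cminflt, majflt = 0;
    /* Fields 3..12: state ppid pgrp session tty tpgid flags minflt cminflt majflt */
    sscanf(p, "%c %ld %ld %ld %ld %ld %lu %lu %lu %lu",
           &state, &ppid, &pgrp, &session, &tty, &tpgid,
           &flags, &minflt, &cminflt, &majflt);

    printf("minor faults: %lu\nmajor faults (disk hits): %lu\n",
           minflt, majflt);
    return 0;
}
```

Run it twice a minute or so apart while DF is working; if the major-fault count is climbing steadily, you are paging, and more RAM (or a smaller embark) is the fix.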
TL;DR: If you don't have enough memory (RAM) to run your embark, the most significant thing you can do is fix that: run fewer other programs, use a smaller embark, or buy more RAM.

Once you have *enough* memory, however, there is little gain from adding more. A little headroom may offer some moderate improvement, since the OS has to do less work managing allocations and garbage collection and you're less likely to have fragmentation problems; in modern systems I'd guess somewhere between 10% and 25%. Beyond that, it simply doesn't matter. You then need to look at a set of far more complex factors; the biggest are likely to be memory channels, memory bus speed, memory CAS latency, CPU architecture (including L1, L2, and L3 cache), and CPU clock speed. In modern designs, several of these are also affected by traditionally "external" factors such as cooling; most CPUs these days adapt performance based on real-time thermal monitoring, and a common problem in laptops and cheap desktops is real-world performance well under theoretical maximums due to insufficient cooling.
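The cooling point is easy to check for yourself. The watcher below is a crude sketch (Linux with cpufreq support; the sysfs path is an assumption and can vary by driver -- on other OSes, use whatever clock-monitoring tool you have handy): it polls the current clock of one core once a second. Run it while DF is churning; if the reported clock sags well below the chip's rated speed, cooling may be your actual long pole rather than anything memory-related.

```c
/* Crude throttle watch (Linux, assumes cpufreq sysfs support): polls the
 * current frequency of CPU 0 once a second. Stop it with Ctrl-C. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";

    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }

        long khz = 0;                       /* value is reported in kHz */
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu0: %.2f GHz\n", khz / 1e6);
        fclose(f);

        sleep(1);
    }
}
```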
Obviously, at some level, having both a wider pipe and a more responsive pipe is good. Unfortunately, in the real world, these parameters are somewhat mutually exclusive. One of the more interesting places to apply Science! would be characterizing DF's memory access behavior: real-time vs. clock-related CAS latency, how much the burst transfer rate matters versus the line transfer rate, and other elements that can be difficult to reduce to easy-to-explain numbers.
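As a starting point for that Science!, here's a sketch (again my own illustration: plain C, POSIX timer, sizes picked arbitrarily) that times a sequential streaming pass and a dependent pointer-chase over the same 64 MiB buffer. A program whose access pattern behaves like the first case is mostly bandwidth-bound, where channels and bus speed help; one that behaves like the second is mostly latency-bound, where CAS timings and cache behavior matter far more.

```c
/* Bandwidth vs. latency sketch: time a sequential streaming sum and a
 * dependent pointer-chase over the same buffer. Buffer size and access
 * counts are illustrative only. Compile with e.g.: cc -O2 pipes.c -o pipes */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64u * 1024 * 1024 / sizeof(size_t))   /* 64 MiB working set */

static double now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e9 + t.tv_nsec;
}

int main(void)
{
    size_t *buf = malloc(N * sizeof(size_t));
    if (!buf) return 1;

    /* Fill the buffer as a strided cycle: buf[i] holds the "next" index,
     * so the same data serves both access patterns. The stride is odd and
     * N is a power of two, so the cycle visits every element. */
    const size_t stride = 100003;
    for (size_t i = 0; i < N; i++)
        buf[i] = (i + stride) % N;

    /* 1) Sequential streaming pass: sum every element in order. */
    double t0 = now_ns();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += buf[i];
    double seq = (now_ns() - t0) / N;

    /* 2) Dependent chase: follow the cycle, one load at a time. */
    t0 = now_ns();
    size_t p = 0;
    for (size_t i = 0; i < N; i++)
        p = buf[p];
    double chase = (now_ns() - t0) / N;

    printf("sequential: %.2f ns/element   chase: %.2f ns/element (sum=%zu p=%zu)\n",
           seq, chase, sum, p);
    free(buf);
    return 0;
}
```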