I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.
Indeed. Although if you ask for something a tad more complicated (e.g. Roman numeral representations of said numbers) you could (perhaps) dedicate n-1 cores to working out a number each, on demand, and the nth to shepherding the process: allocating the numbers to be processed, retrieving the answers and outputting them in the correct order. The more complex numbers (e.g. 888,888, depending on how you translate the respective numerics) might lag behind some of the following numbers[1], but you can probably absorb that overhead.
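A minimal sketch of that shepherd-and-workers scheme, using Python's standard library (the conversion table and chunk size are just illustrative choices): the pool hands numbers out to worker processes, and `map()` yields the results back in submission order, so the parent can print them sequentially even if a slow conversion lags behind its neighbours.

```python
from concurrent.futures import ProcessPoolExecutor

# Greedy subtractive conversion table; standard notation covers 1..3999.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Classic greedy conversion of a positive integer to Roman numerals."""
    out = []
    for value, glyph in ROMAN:
        while n >= value:
            out.append(glyph)
            n -= value
    return "".join(out)

def main():
    # n-1 workers convert; the parent "shepherds": map() distributes the
    # inputs in chunks but yields results in order, so output stays sequential.
    with ProcessPoolExecutor() as pool:
        for numeral in pool.map(to_roman, range(1, 1001), chunksize=64):
            print(numeral)

if __name__ == "__main__":
    main()
```

The ordered yielding is what absorbs the "lag" mentioned above: a result for a later number may be ready first, but it simply waits in the parent until its turn to be printed.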
(Make it an analysis of the periodicity of a PRNG with each of a range of successive seed conditions, e.g. for crypto-breaking purposes[2], and you might find that some outputs (the better ones!) are delayed enough that a queue of the quickly-ending seed-conditions builds up, although you're probably not concerned about sequential output, just "largest periodicity so far confirmed", and perhaps even "periodicities not yet confirmed, but certainly beyond a certain quantity of terms" alongside, to be enumerated as and when. But I digress.)
But it really needs a programmer's eye to work that sort of stuff out, in general, or at least some way of fulfilling the programmer's intentions (they might not care about the sequentiality of some massively repetitive test; it was only that the linearly-looping code most logically went through the sequence in order and output results in that order). Getting (as per the OP) the single-process code split virtually into a multi-core version could probably be done automatically (distilling a subset of the programmer's own knowledge into the MCP-thingy), but the method for doing so needs code of its own which has enough 'understanding' to be able to rip apart the linear code and re-assemble it in a functionally identical (or sufficiently specification-compliant) asynchronous version. Doing it on the fly would incur overheads (although on-the-fly with a view to looking for patterns that can be cloned and kept running persistently in their own little microcosm might mean it speeds up towards the end), and a suitably-complex pre-processor would certainly work for such a simple program but would add its own difficulties, and I very much suspect that either kind of automated re-optimisation would be murder in the case of DF.

Until and unless Toady personally rejigs the whole thing towards asynchronous code (and I think the majority consensus is composed of those that want the next version ASAP plus those that don't mind waiting a bit, but on the basis of wait==content++[3], not (as would be needed) a huge pause while existing features are just 'multicore-friendlified'), the side-optimisation of the rendering engine and just keeping the main DF-running core largely free of other system stuff is what we've got. (As noted by someone else, it seems to work well for those who have mentioned they've checked this out.)
[1] Actually, bad example, as most of the complexity in that number is shared by the next few numbers, anyway, with just the last one or two decimal digits needing whatever "extra effort" you might consider VIII to be over the following IX, X, [XC] and [XC]I that would be the next few numbers in the queue. BYGTI.
[2] Or, something that I've been wondering... It's possible (obviously increasingly and exponentially unlikely) to roll successively similar numbers with a die, or a continuous sequence of common coin-results (as in Rosencrantz and Guildenstern Are Dead, alas), and yet the first conditional requirement for a good PRNG seems to be that such sequences are less desirable in PRNGs than in actual 'real life'. Obviously, for cryptography purposes, that is all for the best (because it avoids possibly identifiable sequences in a stream-cipher that are essentially encoded by a simple shift-cipher, albeit that it reduces the entropy in the same way as "no letter is ever enciphered to itself" under the Enigma system, which was used to good effect in paring down the possibilities), but in a dice-rolling game you do sometimes get multiple sixes (or ones) in a row, and so you should get that in a computationally-equivalent environment, and with the same average frequency (over many, many iterations, naturally, to make it statistically significant) as you'd get in RL, once you distill the output states down to groups that represent a roll of a one, a roll of a two, a roll of a three, etc[2a]. But I digress once more.
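That "same average frequency of flukes" claim is easy to check empirically. A sketch using Python's stock Mersenne Twister (the seed and run length are arbitrary choices): count how often a run of three consecutive sixes starts at each position, and compare against the theoretical (1/6)^3 chance.

```python
import random

def run_rate(rolls, face=6, length=3):
    """Fraction of positions that start a run of `length` copies of `face`."""
    positions = len(rolls) - length + 1
    hits = sum(
        all(r == face for r in rolls[i:i + length])
        for i in range(positions)
    )
    return hits / positions

# A million simulated d6 rolls -- many iterations, for significance.
rng = random.Random(12345)
rolls = [rng.randint(1, 6) for _ in range(1_000_000)]

# A good generator should land close to the real-life (1/6)**3 ~= 0.00463.
print(run_rate(rolls))
```

A generator that "avoided" such streaks would come in measurably below that figure, which is precisely the biased behaviour the footnote argues a game RNG should not have.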
[2a] Another point of analysis being not just the PRNG mechanism and status/seed, but the final grouping mechanism: given that a high-precision output, bit-reversed (or otherwise chopped and mixed up), might give a differing frequency of group-following-group patterns from the original (once this new output is once more split into several contiguous ranges of values that boil down to the desired lower-precision output states), it might be worth chasing up as an additional (computationally intensive!) variable in pursuing the desired sequence that provides one with the same "possibility of flukes". More digression. Ignore.
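For what it's worth, the regrouping experiment in [2a] can be sketched in a few lines. Everything here is an illustrative assumption: 8-bit outputs, a bit-reversal as the "chopped and mixed up" transform, and a bucketing of 0..255 into six contiguous ranges of 42 (discarding 252..255 to keep the faces equiprobable). It then compares how often a six follows a six in the plain versus the bit-reversed stream.

```python
import random

def bit_reverse8(x):
    """Reverse the 8 bits of x (0..255), e.g. 0b00000001 -> 0b10000000."""
    return int(f"{x:08b}"[::-1], 2)

def faces(stream):
    """Bucket 0..255 into die faces 1..6 by contiguous ranges of 42;
    values 252..255 are discarded so each face is equally likely."""
    return [v // 42 + 1 for v in stream if v < 252]

def six_after_six(seq):
    """Fraction of adjacent pairs that are (6, 6)."""
    pairs = sum(1 for a, b in zip(seq, seq[1:]) if a == b == 6)
    return pairs / max(len(seq) - 1, 1)

rng = random.Random(99)
raw = [rng.randrange(256) for _ in range(200_000)]
plain = faces(raw)
mixed = faces([bit_reverse8(v) for v in raw])

# For a good PRNG both should sit near 1/36; a gap between them would be
# the grouping-dependent effect the footnote speculates about.
print(six_after_six(plain), six_after_six(mixed))
```

With the Mersenne Twister both figures come out close to 1/36, but for a weaker generator the two groupings could genuinely disagree, which is what would make this an extra (expensive) variable to search over.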
[3] Or "++content", I suppose, technically.