Yep. Careful micro-management cut the critically important graphicst::display function's cost by 4x.
I am most disturbed. Compilers are supposed to be smarter. Like GHC.
When given the choice, compilers usually pick "safe" over "smart". But you can often get them to be much more daring by passing the correct flags. Unfortunately, it's a trade-off; the more you let gcc optimize your code, the more likely it's gonna break something by accident. In 99.9999% of cases there's no need to mess with those flags, but if you are trying to squeeze performance out of your code, it can sometimes be useful to experiment with more aggressive optimization settings.
The default is not to optimize *at all* (-O0) to make debugging easier. For a production release, you should be using at least -O2, since that turns on all the 'safe' optimizations as well as the 'should be safe unless you write horrible code' optimizations. (strict-aliasing, BTW, is enabled by -O2). -O2 is generally considered the maximum combination of safe and fast. You could *try* building the code with -O3 but I've heard that gcc 4 does some pretty dumb things at that optimization level, possibly making your code *slower* than before.
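To make the 'horrible code' bit concrete, here's a rough sketch of the kind of thing strict-aliasing punishes, plus the usual safe replacement. The function names are made up for illustration and have nothing to do with the DF source, and the self-check assumes IEEE 754 floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Undefined behavior: reads a float object through a pointer to an
// unrelated type. With -fstrict-aliasing (implied by -O2) gcc is allowed
// to assume a uint32_t* never points at a float, so this can misbehave
// once the optimizer gets involved. Shown for contrast only.
inline std::uint32_t float_bits_unsafe(float f) {
    return *reinterpret_cast<std::uint32_t*>(&f);
}

// Well-defined alternative: copy the bytes instead of aliasing them.
// Optimizing compilers typically turn this small memcpy into a plain move.
inline std::uint32_t float_bits_safe(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Quick self-check, assuming IEEE 754 single-precision floats.
inline void check_float_bits() {
    assert(float_bits_safe(1.0f) == 0x3F800000u);
}
```

Building with -O2 -Wall will usually make gcc warn about the first function, which is a cheap way to find this sort of code before it bites you.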
Since you're building for 32-bit x86, which is ridiculously low on registers, you could also try building your production releases with -fomit-frame-pointer. That option makes it much harder to get a useful backtrace from a crash, but it frees up a register (EBP) for use by the actual code, which can make a big difference, especially in processor-heavy code like gfx. Because it hurts debugging, gcc versions before 4.6 don't enable it by default on 32-bit x86 targets, so check what your gcc version actually does.
Lastly, by default gcc won't enable instructions for things like MMX/SSE/etc unless you tell it those are OK. Again, there's a trade-off: if you enable MMX, your code won't run on anything older than a Pentium MMX (so no 386/486 or original Pentium), but will probably run faster on CPUs that have it. SSE would speed things up even more, but eliminates anything older than a Pentium III or Athlon XP, and SSE2 needs at least a Pentium 4 or Athlon 64. Depending on what you think your target audience is, you could probably get away with at least -march=i686 (will run on any Pentium Pro or later CPU, which is any computer less than 15 years old.)
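If you want to see what a given set of flags actually switched on, gcc predefines macros like __MMX__, __SSE__ and __SSE2__ when the matching instruction sets are enabled. A minimal sketch that reports them (enabled_simd is just a name I picked for the example):

```cpp
#include <cassert>
#include <string>

// Lists the SIMD extensions this file was compiled with, based on the
// macros gcc predefines when -mmmx, -msse, -msse2 (or a -march value
// that implies them) are in effect. Returns "none" if none are defined.
inline std::string enabled_simd() {
    std::string out;
#ifdef __MMX__
    out += "mmx ";
#endif
#ifdef __SSE__
    out += "sse ";
#endif
#ifdef __SSE2__
    out += "sse2 ";
#endif
    return out.empty() ? "none" : out;
}
```

Print the result from a test build and compare, say, a plain -march=i686 build against one with -msse2 added. The same macros can also guard hand-written SIMD code paths so the plain fallback still builds for older CPUs.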
There is also the '-mtune' option, which doesn't change the instruction set used but makes the compiler schedule and order the generated code to suit the pipeline of the CPU in question. This would make the code more likely to run at peak efficiency on that CPU, while having relatively little effect on others, but you'd have to pick a CPU to focus on (e.g., -march=i686 -mtune=core2).
(Of course, this is also where I bring up the idea of doing a parallel 64-bit build of DF, since x64 CPUs have twice as many general-purpose registers, and all of them support at least SSE and SSE2. But that's a completely different endlessly pointless discussion.)