So what you're saying is that unrolled loops are only unconditionally good on systems that lack any form of cache. Once you're on anything even remotely recent, it stops making sense to unroll everything, since you end up stalling the system on memory accesses because your code won't fit in cache. You ironically lose performance by doing something that "should" give you more.
Or is the real solution just giving everything 512 megs of L3 cache, because that's clearly not wasteful at all?
No.
The reason to use unrolled loops has to do with branch prediction.
https://en.wikipedia.org/wiki/Branch_predictor

Basically, with a very tight loop, you want the most-likely-to-be-executed path sitting in the cache, with the data the loop will need prefetched. This avoids that unsightly necessity of fetching data over the memory bus, which is many times slower than operating out of the cache and introduces whole-cycle waiting times. (Which is why unrolling does not make sense, from a performance standpoint, unless your loop uses an absurd number of conditional checks. If your loop is both tight AND lean, it will execute faster than an unrolled blob of code too big for the cache to hold, because even though there is conditional code being executed, the time to evaluate the condition is smaller than the penalty for hitting the memory bus. The exception is when there is a significant chance your conditional logic will need to pull in data other than what was prefetched, in which case you eat the loop-control conditional hit AND the memory bus hit. That latter problem can usually be solved with more clever design of the loop, but that runs into the "My time is VAAAAAASTLY more important than YOURS, you FILTHY USER." attitude of many programmers.)
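To make that concrete, here is a toy sketch in C (my own illustration, not anyone's production code): a tight, lean summing loop. The whole body fits in a handful of cache lines, the loop branch is taken on nearly every iteration so the predictor nails it, and the data streams in sequentially so the prefetcher stays ahead of it.

    /* Tight loop: tiny code footprint, trivially predictable branch,
     * sequential data access that the hardware prefetcher loves. */
    #include <stddef.h>

    long sum_tight(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += data[i];
        return total;
    }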
When your loop is unrolled, there is no branch. The code is one long string of spaghetti. HOWEVER, as the wiki article on unrolling points out, this just HIDES the performance hit. Instead of sailing along at fucking warpspeed, suddenly going "Deerrrrrrrp" for a moment (when branch prediction fails and a cache miss happens, or when memory must be accessed for some other reason), then sailing along at warpspeed again-- it stays at Derrrrrrrp-type speeds the whole time, because it is constantly hitting the memory bus.
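Here is that same toy loop, manually unrolled by four. You pay one loop-control branch per four elements instead of one per element-- but the body is now four times the code bytes, and if you apply that mentality to every loop in a program, the instruction cache goes out the window.

    /* Same computation, unrolled by 4: fewer branches, more code bytes.
     * (Toy sketch again; the second loop handles n not divisible by 4.) */
    #include <stddef.h>

    long sum_unrolled(const long *data, size_t n)
    {
        long total = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {   /* one branch per 4 elements */
            total += data[i];
            total += data[i + 1];
            total += data[i + 2];
            total += data[i + 3];
        }
        for (; i < n; i++)             /* leftovers */
            total += data[i];
        return total;
    }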
It is a useful technique for timing-sensitive code, such as code that needs to run in a multiprocessor environment and needs to keep caches and data coherent.
What I am saying is that, for most purposes where you want actual speed, you want a tight loop, and you want to lean on the branch predictor inside the processor and on your compiler's optimization flags-- NOT glom down a huge memory array with code that chipper-shreds it, using unrolled code to make sure the data stays consistent should another processor need it.
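Concretely (and again, just a sketch): keep the source as a tight loop, hand the unrolling decision to the optimizer (GCC's -O2, plus -funroll-loops if you insist), and hint genuinely lopsided branches with __builtin_expect, a real GCC/Clang builtin. The rare_case() function below is a made-up stand-in for whatever your slow path is.

    /* Compile with: gcc -O2 hot_loop.c   (add -funroll-loops to let
     * the compiler unroll where IT judges it profitable). */
    #include <stddef.h>

    void rare_case(void);   /* hypothetical slow-path handler */

    long sum_checked(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            /* Tell the compiler this branch is "almost never" taken. */
            if (__builtin_expect(data[i] < 0, 0))
                rare_case();
            total += data[i];
        }
        return total;
    }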
Don't have the gall to call it a performance enhancement when it really isn't, and when real-world measurement shows it is quite a lot slower than code that is tightly looped.
Don't pretend it is always OK to do it, either, because when *EVERY* bit of software in the computer tries to do it, you make a high end bit of hardware run like a fucking Idaho potato. (If two or more processors exist in the system, both are trying to use the same data, and you are using unrolled code, then access to the data bus has to wait for one processor to be done with it before the other can work, and that first processor then has to wait for the second to finish with the memory before it can continue. The wait for the memory bus gets EEEEVEENN LOONNNGER-- and when that happens *ALL THE FUCKING TIME*, because *EVERYONE* is doing it in their code, the whole system acts constipated, all the time.)
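If you want to watch that "two processors fighting over the same data" stall yourself, here is a toy pthreads sketch (mine, purely illustrative): two threads bump counters that sit on the same cache line, so the line ping-pongs between cores on every write even though neither thread ever touches the other's counter. Pad the counters onto separate cache lines and most of the stall disappears.

    /* False-sharing demo: shared.a and shared.b almost certainly land on
     * one cache line, so the two cores serialize on it write after write.
     * (volatile just keeps the compiler from hoisting the increments; the
     * increments need not be atomic for the timing effect to show.)
     * Build with: gcc -O2 demo.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    static struct { volatile long a; volatile long b; } shared;

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) shared.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) shared.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared.a, shared.b);
        return 0;
    }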
It gives the programmer rockstars on slashdot severe butthurt to have to re-evaluate their architectural decisions in a "bigger picture" framework. Ideally, only routines that must share data across processors should be unrolled; memory structures should be such that program data and application data have some degree of separation; and the program code should be tolerant of going "Derrrrrrrrrp" every so often when data has to be fetched over the memory bus.
Doing that requires having significantly more "give a damn", though, and significantly less "my time is super duper important, and shit, and your time as an end user is not of concern to me"-- so they get really shitty when you suggest that user experience and actual performance require more give a shit than they are willing to put down.