Yeah, you guys are saying what I was thinking. IIRC loop unrolling helps because it decreases the number of branches executed, but loops are generally easy branches to predict, so branch prediction nowadays removes most of the benefit of unrolling.
The history is mixed up here. CPUs didn't even have branch prediction back when loop unrolling was the main thing you did. So it's basically wrong to say the main rationale behind loop unrolling was decreasing branches. There was no branch prediction to make those branches costly in the first place.
What loop unrolling is for is reducing the per-iteration instruction overhead. By unrolling a loop, you amortize or eliminate the counter increment, the compare, and the jump instruction. You do also save the one hit on the branch predictor, but only on the final iteration, and by the time that mattered, instruction cache size was becoming important too, so heavy unrolling was already becoming counter-productive.
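For a tiny fixed-count loop, the difference looks roughly like this (a minimal sketch in C; the function names are made up for illustration):

    /* Illustrative sketch: the loop form pays an increment, a compare,
       and a jump on every iteration; the fully unrolled form pays none. */
    int sum4_loop(const int *a) {
        int sum = 0;
        for (int i = 0; i < 4; i++)   /* i++, compare, jump: overhead per element */
            sum += a[i];
        return sum;
    }

    int sum4_unrolled(const int *a) {
        /* No counter, no compare, no jump: just the four adds. */
        return a[0] + a[1] + a[2] + a[3];
    }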
The branch penalty is more of a tiebreaker: if it makes sense to unroll the loop for other reasons, or all else is roughly equal, it might tip things over the edge to being worthwhile, rather than being the main show.

For example, say you're looping over something 1024 times and you decide to do a little loop unrolling. You might unroll by 8: process 8 elements, bump the index by 8, process the next 8, and so on (see the sketch below). You get the benefit of unrolling, and you can tune the unroll factor so it doesn't overload the instruction cache and find an optimal number of elements per iteration. However, the branch predictor will only fail on the last iteration, so once per 1024 elements, whether you unrolled the loop or not. So given even the most rudimentary branch prediction on the CPU, unrolling in this case wasn't about branches. Partially unrolling loops is far more common than removing them entirely, which is what you'd have to do to get any benefit from avoiding branch misses.
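A sketch of that partial unrolling, assuming a simple integer sum (the names, and picking 8 as the unroll factor, are just for illustration):

    /* Partial unroll by 8 over 1024 elements: the loop overhead
       (increment, compare, branch) is paid once per 8 elements instead
       of once per element. The loop branch still exists, and it still
       mispredicts exactly once, at loop exit -- same as the plain loop. */
    #define N 1024

    int sum_unrolled_by_8(const int *a) {
        int sum = 0;
        for (int i = 0; i < N; i += 8) {   /* one branch per 8 elements */
            sum += a[i]     + a[i + 1] + a[i + 2] + a[i + 3];
            sum += a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
        }
        return sum;                        /* one miss total, at the exit branch */
    }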