I think I found the problem. It's an alignment issue. See, most video decodes to 8-bit RGB, where the R, G and B components get 8 bits (1 byte) each. When you push this through an audio codec without the proper settings, it's not gonna understand that all 3 bytes in there are important. If it sees the data as 24-bit PCM (which was my first instinct), you'll end up with this:
┌──────────────────────────────┬────────┬────────┬────────┐
│                              │   R    │   G    │   B    │
│ Raw video                    ├────────┼────────┼────────┤
│                              │11110100│11000011│10010001│
├──────────────────────────────┼────────┴────────┴────────┤
│                              │      24-bit sample       │
│                              ├──────────────────────────┤
│ Interpreted as 24-bit PCM    │                          │
│                              │11110100 11000011 10010001│
│                              │                          │
├──────────────────────────────┼──────────────────────────┤
│                              │                          │
│ What an audio codec might do │11110100 00110000 00000001│
│                              │                          │
└──────────────────────────────┴──────────────────────────┘
So in this example, the color #F4C391 becomes #F43001, which is a decidedly different color. The codec thinks the G and B components aren't important, so it just loses precision there, since those are the last 2 bytes of the 24-bit integer anyway. It'd be like rounding off the last 3 digits of a number with 7 decimal places. As for why it produces such "incredible" results, I don't know.
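To make the precision loss concrete, here's a minimal Python sketch. It is not what any real codec does internally: `keep_top_bits` is a hypothetical stand-in for "lossy compression" that simply zeroes the low bits of a sample, and the choice of keeping 6 bits is arbitrary. It only shows that the same amount of loss lands very differently depending on whether a pixel is one 24-bit sample or three 8-bit samples.

```python
def keep_top_bits(value: int, width: int, keep: int) -> int:
    """Zero all but the `keep` most significant bits of a `width`-bit value.

    Hypothetical stand-in for a lossy codec; real codecs are far more complex.
    """
    drop = width - keep
    return (value >> drop) << drop


def as_one_24bit_sample(rgb, keep):
    """Pack R, G, B into one 24-bit sample (R most significant, as in the
    diagram), degrade it, then unpack it back into three bytes."""
    r, g, b = rgb
    sample = (r << 16) | (g << 8) | b
    sample = keep_top_bits(sample, 24, keep)
    return ((sample >> 16) & 0xFF, (sample >> 8) & 0xFF, sample & 0xFF)


def as_three_8bit_samples(rgb, keep):
    """Treat each component as its own 8-bit sample and degrade each one."""
    return tuple(keep_top_bits(c, 8, keep) for c in rgb)


pixel = (0xF4, 0xC3, 0x91)  # the #F4C391 example from the diagram

print("24-bit:", "#%02X%02X%02X" % as_one_24bit_sample(pixel, 6))
print(" 8-bit:", "#%02X%02X%02X" % as_three_8bit_samples(pixel, 6))
```

With these made-up settings, the 24-bit path keeps R and wipes out G and B entirely, while the 8-bit path only nudges each component slightly.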
The way you fix this is by telling FFmpeg that its input is unsigned 8-bit PCM. This way, each color component is its own sample, so the audio codec is "aware" that each one is actually important and won't round color components off by accident. This does end up tripling your encoding and decoding times, since there are now three samples per pixel instead of one, but at least you won't give people seizures.
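As a rough sketch of what "telling FFmpeg" looks like, the snippet below builds (but does not run) an FFmpeg command line for raw PCM input. `-f u8` is FFmpeg's raw unsigned 8-bit PCM format and `-f s24le` is signed 24-bit little-endian. Everything else here is an assumption for illustration: the helper name `build_cmd`, the file names, the 44100 Hz sample rate, the mono channel count and the MP3 output are placeholders, not the settings from the original experiment.

```python
def build_cmd(sample_format, raw_path, out_path, rate=44100, channels=1):
    """Build an FFmpeg argument list that reads `raw_path` as headerless PCM.

    Hypothetical helper: rate, channels and the output file are placeholders.
    """
    return [
        "ffmpeg",
        "-f", sample_format,   # how to interpret the raw bytes, e.g. "u8"
        "-ar", str(rate),      # input sample rate (assumed value)
        "-ac", str(channels),  # input channel count (assumed value)
        "-i", raw_path,        # raw RGB bytes dumped from the video
        out_path,              # codec is picked from the file extension
    ]


# One sample per pixel: the codec is free to mangle G and B.
print(" ".join(build_cmd("s24le", "frames.raw", "out.mp3")))

# One sample per color component: three times the samples, no component ignored.
print(" ".join(build_cmd("u8", "frames.raw", "out.mp3")))
```

You'd pass the resulting list to something like `subprocess.run` once the paths and rates match your own setup.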
I'm not even sure how the 'image-to-audio-to-image' version didn't suffer from this.