I suspect that DF also suffers from a side effect that comes with Von Neumann architecture computers (which is practically all general-purpose computers since the 70s): code is data. Generally only the L1 cache is split into dedicated instruction and data parts; everything beyond that is shared. And being shared, it is also contested.
I assume that DF is programmed in a traditional style, with large functions working on large data structures: updating a dwarf splits into updating their brain, their kidneys, their liver, their temperature, their soul and so on, and each of these steps works on a small part of one large "dwarf" data structure. This is generally called the "array of structures" approach. It leads to the cache constantly throwing out bits of code or bits of data to bring in new stuff.
In the opposite approach, "structure of arrays", you have arrays that each contain one part of a dwarf (or any other creature), and each array is then updated separately in one fell swoop. Since the CPU is only running a short piece of code over a large amount of data, the code stays in cache, and the data can be processed in a streaming fashion rather than being cherry-picked with random accesses.
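To make the contrast concrete, here is a minimal C++ sketch of the two layouts. The fields (body temperature, hunger, thirst) and the update rule are made up for illustration; I have no idea how DF actually lays out its creatures, so treat this as a sketch of the general technique, not of DF's code.

```cpp
#include <cstddef>
#include <vector>

// "Array of structures": each dwarf is one big record. An update pass that
// only cares about one field still drags whole records through the cache.
struct DwarfAoS {
    float body_temperature;
    int   hunger;
    int   thirst;
    // ... imagine dozens more fields: wounds, thoughts, inventory, soul, ...
};

void update_temperature_aos(std::vector<DwarfAoS>& dwarves, float ambient) {
    for (auto& d : dwarves) {
        // Each iteration loads a full DwarfAoS just to adjust one float.
        d.body_temperature += 0.1f * (ambient - d.body_temperature);
    }
}

// "Structure of arrays": one contiguous array per field, each updated in
// its own short pass.
struct DwarvesSoA {
    std::vector<float> body_temperature;
    std::vector<int>   hunger;
    std::vector<int>   thirst;
};

void update_temperature_soa(DwarvesSoA& dwarves, float ambient) {
    // A tight loop streaming over one array: the code stays in the
    // instruction cache and the data is read sequentially.
    for (std::size_t i = 0; i < dwarves.body_temperature.size(); ++i) {
        dwarves.body_temperature[i] +=
            0.1f * (ambient - dwarves.body_temperature[i]);
    }
}
```

The second loop touches only the bytes it actually needs, in order, which is exactly the streaming access pattern described above.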
As a concrete example, the first approach would be a single bottling machine that fills bottles, caps them, glues on labels and prints last-use dates; the latter is a production line where the first machine fills bottles, the second puts on caps, the third glues on labels, and so on. And you won't see the first approach in any larger-scale production facility.