As a semi tech demo, my 7DRL could support what I assume was around 2000 enemies. Each player turn required about 10 loops over the enemy list, so roughly 20k enemy updates per turn. Holding a button down would lag, but it still managed more than one turn per second, so maybe 50k updates per second. The list was a circular doubly linked list, which I assume had a much greater chance of cache misses, so moving to large arrays would likely improve things further. On top of that, the numbers I am referencing come from an old school computer, and the game was compiled without optimization. On any newer computer, even a mostly full map would show little delay at all. I was also using a somewhat inefficient graphics system, so the fact that it only managed a few frames per second was likely a result of that rather than of the AI being slow.

Further, since the AI in my RL acted on average once every 10 loops, the real rate was closer to 30 loops per second, plus rendering delays. A game like this could use a similar system, where each entity counts down and acts when its counter passes 0, but it would average something closer to 100 loops before any recalculation has to be done, and it would only aim to loop once or twice a second. From that I conclude that my idea of simulating the entire population is likely plausible even for (random but plausible number) 100k people.
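
A minimal sketch of the countdown scheme described above, using a flat array of timers rather than a linked list. This is not the 7DRL's actual code: the names `MEAN_DELAY`, `new_delay`, `make_timers`, and `run_loop` are made up for illustration, and the uniform random delay is an assumption chosen only so that the average works out to about 100 loops between decisions.

```python
import random

MEAN_DELAY = 100  # assumed average number of loops between decisions


def new_delay(rng):
    """Pick a fresh countdown, uniform over 1..199 so the mean is MEAN_DELAY."""
    return rng.randint(1, 2 * MEAN_DELAY - 1)


def make_timers(count, rng):
    """One countdown per simulated person, stored in a flat list."""
    return [new_delay(rng) for _ in range(count)]


def run_loop(timers, rng, on_act):
    """One pass over the whole population.

    Every timer is decremented; an entity only runs its (expensive)
    decision logic, here the on_act callback, when its timer reaches 0
    or below. Returns how many entities acted during this pass.
    """
    acted = 0
    for i in range(len(timers)):
        timers[i] -= 1
        if timers[i] <= 0:
            on_act(i)                     # hypothetical AI / recalculation step
            timers[i] = new_delay(rng)    # schedule the next decision
            acted += 1
    return acted
```

With timers like these, most visits in a pass are just a decrement and a comparison, and only around 1 in 100 entities runs its decision logic on a given loop. Calling `run_loop` once or twice per second over `make_timers(100_000, rng)` would be the 100k-person case discussed above.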
Also, the flaw of a generated-as-needed system shows up when you see the whole population at once, or when you leave a view temporarily and then try to return to the same citizen, or worse, view somewhere else and then return. It would end up either caching everyone in the hope of providing realism, or leaving players with subtle, immersion-breaking moments of evident fakeness.