You would likely need a large sample size, with hardware information collected for every test participant. Deciding on your sample size would be a good starting point for organizing this sort of benchmark.
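As a rough sketch of how you might decide on that sample size, here's the standard two-sample z-test approximation. The sigma and delta numbers are made-up placeholders - you'd plug in a real estimate of FPS variance across machines once you have pilot data:

```python
import math

def required_sample_size(sigma, delta, alpha=0.05, power=0.8):
    """Participants needed per group to detect a mean FPS difference
    of `delta`, given between-machine standard deviation `sigma`.
    z-values are hardcoded for the default alpha=0.05 and power=0.8."""
    z_alpha, z_beta = 1.96, 0.84  # only valid for the defaults above
    n = ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Example: FPS varies by ~15 between machines, and we want to reliably
# detect a 5 FPS effect from some fortress variable.
n = required_sample_size(sigma=15, delta=5)
print(n)  # → 71 participants per group
```

The takeaway is that detecting small FPS effects against noisy consumer hardware gets expensive fast, which is another argument for standardizing the test-bed forts as much as possible.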
The largest issue with DF benchmarks has always been isolating variables. You would want to test each world size against a set of worldgen lengths - say 1, 50, 100, 250, 500, and 1000 years. Comparing the effect of world size against gen length would be very helpful.
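That's just a full factorial design, which you could enumerate mechanically. The world size names below are DF's standard presets and the gen lengths are the ones suggested above - purely illustrative:

```python
from itertools import product

# DF's standard world size presets, crossed with the proposed gen lengths
world_sizes = ["pocket", "smaller", "small", "medium", "large"]
gen_lengths = [1, 50, 100, 250, 500, 1000]

# Full factorial design: every world size at every worldgen length
test_matrix = list(product(world_sizes, gen_lengths))
print(len(test_matrix))  # → 30 worlds to generate and benchmark
```

Thirty worlds per participant is already a lot of worldgen time, which is worth keeping in mind when deciding how many other variables you can afford to cross in.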
I suspect DF has a compounding sort of lag, where multiple variables work constructively to cause slowdown. So a large number of items lying around is not so bad on its own, unless you also have gigantic open spaces all over your fortress - then it becomes a FFA for the dwarven pathfinding and things slow down. And I should say that I don't necessarily believe this is actually a cause of lag, since I haven't seen conclusive testing, but in my personal experience that kind of fortress is a recipe for sub-50 FPS.
So the development of the test-bed fortresses would need to be methodical, in that you would need to separate these confounding variables as well as possible. A warped logic puzzle, in a way.
Otherwise, you would merely be producing empirical data sets that, while still interesting and worthwhile, wouldn't set anything in stone regarding exact causes of lag.
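Even so, those empirical data sets would support some untangling after the fact. Here's a sketch of estimating per-variable effects with ordinary least squares - the two predictors (item count, open-space tiles) are hypothetical stand-ins for whatever the test-bed forts actually vary, and a real analysis would want an interaction term to catch the compounding effect suspected above:

```python
import numpy as np

def effect_estimates(item_count, open_space, fps):
    """Fit fps ~ b0 + b1*items + b2*open_space by ordinary least
    squares and return the coefficients. Predictors are illustrative."""
    X = np.column_stack([np.ones(len(fps)), item_count, open_space])
    coef, *_ = np.linalg.lstsq(X, fps, rcond=None)
    return coef  # [baseline fps, per-item effect, per-tile effect]

# Self-check on synthetic data with a known relationship (no noise):
# fps = 100 - 0.01 * items - 0.05 * open tiles
rng = np.random.default_rng(0)
items = rng.integers(0, 5000, 40)
tiles = rng.integers(0, 2000, 40)
fps = 100 - 0.01 * items - 0.05 * tiles

coef = effect_estimates(items, tiles, fps)
print(np.round(coef, 3))
```

With real, noisy, correlated data the coefficients would come with wide error bars, which is exactly why the worldgen-side variables need to be controlled as tightly as possible first.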
I think more questions could be asked of Toady directly, since some knowledge of the code might help us unpack the confounding variables. Though honestly, I don't think it's a focus for him, so I'm not sure this would be fruitful. He's a busy man, you know!
To return to the question of world age, using 1-year-old worlds would make for very fast benchmarking, since a 1000-year-old world would likely run like trash from the get-go. However, testing the full range of worldgen ages would still be important.