Does the built-in FPS counter have accuracy issues?
When working on large, complicated data, one generally tries to minimize the influence of unknown factors.
For example: when trying to figure out the side effects of a new drug, you'd want to give the drug only to healthy people. Ideally, you'd even test these people for conditions they themselves might not be aware of (side note: ethics do not normally allow this).
You'll get very clean data, and produce a very clear picture of what's going on.
The problem with this type of experimentation is that you'll only find results that are valid for the group of healthy individuals, which can mean your results don't apply to any other group. There have been multiple situations in which this has become quite clear.
In testing DF, the dictionary definition of 'healthy' doesn't really apply. Even worse, the variability between playstyles and the resulting fortresses (and the nature of the computational loads, etc.) makes a single representative test fortress impossible; you would need a lot of test fortresses.
This is sometimes seen in medical research: variation between subjects is too big, and comparing them becomes impossible. In some cases there's a very easy way to compensate for this, using difference scores. You expose each subject to two conditions, like a medication and a placebo (not at the same time), and note the difference between the two outcomes.
Since a computer usually doesn't display a placebo effect, I would say this is a valid and easy way to track some big issues, like processor architecture, OS, RAM speed, etc. Simply put: you load the same save on both PCs, and note the FPS on each.
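To make the difference-score idea concrete, here is a minimal sketch in Python. The save names and FPS numbers are made up for illustration, and the function name is my own; the only point is that each save is compared against itself across the two machines, so the huge variation between fortresses drops out.

```python
from statistics import mean

def difference_scores(fps_a, fps_b):
    """Per-save FPS difference (machine B minus machine A).

    Both arguments map a save name to the FPS noted on that machine.
    """
    if fps_a.keys() != fps_b.keys():
        raise ValueError("both machines must be measured on the same saves")
    return {save: fps_b[save] - fps_a[save] for save in fps_a}

# Hypothetical readings, not real measurements.
fps_machine_a = {"fort_small": 95.0, "fort_medium": 48.0, "fort_large": 21.0}
fps_machine_b = {"fort_small": 100.0, "fort_medium": 56.0, "fort_large": 27.0}

diffs = difference_scores(fps_machine_a, fps_machine_b)
mean_diff = mean(diffs.values())

for save, diff in diffs.items():
    print(f"{save}: {diff:+.1f} FPS")
print(f"mean difference: {mean_diff:+.2f} FPS")
```

The raw FPS values span a wide range between forts, but the per-save differences are directly comparable, which is exactly what the medication/placebo design buys you.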
Yes, and it fluctuates over time. It has to be collected for a period of time and then averaged.
Ideally, yes, but then you're introducing time as a factor, which it needn't be. FPS doesn't fluctuate that much if a fort is stable (assuming you let the game run for a bit after loading), and big changes (breaching the caverns, releasing the clowns, etc.) require a re-measurement anyway. I would even say that you can treat them as separate forts for testing purposes, since the load is of a different variety.
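If you do want to average over a period, a rough sketch of what I mean by "let the game run for a bit" could look like this in Python. It assumes you have noted down a series of FPS readings by hand (the numbers below are invented), and the warm-up length is an arbitrary choice, not a recommendation.

```python
from statistics import mean, stdev

def stable_fps(samples, warmup):
    """Mean and sample standard deviation of the FPS readings,
    after discarding the first `warmup` readings taken right after loading."""
    steady = samples[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two readings after the warm-up period")
    return mean(steady), stdev(steady)

# Hypothetical readings: the first few are taken while the game settles.
readings = [12.0, 20.0, 35.0, 50.0, 49.0, 51.0, 50.0, 50.0]

avg, spread = stable_fps(readings, warmup=3)
print(f"average: {avg:.1f} FPS, spread: {spread:.2f} FPS")
```

A small spread suggests the fort is stable enough that a single reading would have done; a large one suggests something changed mid-measurement and the fort should be measured again.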
However, I'm starting to think that setting up a set of 'volunteer testing fortresses' and 'volunteer testing systems' is becoming inevitable if I want to see issues like these brought to light...