I encourage anyone to run a test of their dwarves' attributes (I used Advanced XML Converter, free for trial use, and scalc to get it done). Some of the attributes were averaging LOWER than the wiki's! So do dwarves lose attributes too? Anyway, that is why I was proposing an average based on the current population of dwarves rather than on the wiki. Since I have had two independent samples conclude the same averages, I'm sticking with my averages until I see a different sample that proves the wiki is more accurate.
Also, using the current population's average makes sense (if the average can indeed drift from the wiki's). It would be better optimized for your current population and would avoid arguments over what the "correct" averages are.
I also have a recommendation for how to get a % based on my formula, using statistics. I wasn't sure how to combine the different attributes, but I believe I know how to do it correctly. You combine the attribute arrays into one super array of values and derive the mean and standard deviation from that; then, using elementary statistics, you can derive what % you're at relative to the mean. (This assumes a bell curve, WHICH THE WIKI SAYS THERE ISN'T, but using my sample it's plainly obvious that there is a bell curve of some kind.) The only issue I see is if the distribution curve is skewed to the left or the right; I don't remember whether that matters when using standard deviation and want to look into it. I hope to have an answer today. I'm also hoping that I can just average the standard deviations of all the involved attributes and get a new standard deviation to apply to the school of attributes involved with a specific role, rather than having to derive it from one large super sample of all involved attributes.
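As a sketch of the super-array idea in Python (the attribute values below are made up for illustration, and DT's script engine is not Python; this is only to show the math):

```python
import statistics

# Hypothetical per-attribute samples (made-up values, for illustration only).
strength = [1100, 950, 1300, 875, 1250]
willpower = [900, 1020, 760, 1110, 980]

# Combine the attribute arrays into one super array...
superset = strength + willpower

# ...and derive the mean and standard deviation from that.
mean = statistics.mean(superset)
stdev = statistics.stdev(superset)  # sample standard deviation (n - 1 divisor)

print(len(superset), mean)  # 10 1024.5
```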
UPDATE: Using a 340+ sample of dwarves, averaging standard deviations of 315.15 and 400.67 gave me 357.91, while combining the samples into one and taking the standard deviation of that gave 367.1, so the averaging gave me a bias of about 2%. Upon further reading, I'm finding that I shouldn't average the standard deviations but should use the entirety of the dataset. That's something I don't yet know how to do with the Dwarf Therapist script engine (it took me forever to figure out how to declare a var!).
I recommend using standard deviation; it's quite easy to find:
your list of numbers: 1, 3, 4, 6, 9, 19
mean: (1+3+4+6+9+19) / 6 = 42 / 6 = 7
list of deviations: -6, -4, -3, -1, 2, 12
squares of deviations: 36, 16, 9, 1, 4, 144
sum of squared deviations: 36+16+9+1+4+144 = 210
divided by one less than the number of items in the list: 210 / 5 = 42
square root of this number: square root (42) = about 6.48
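The steps above map one-to-one onto a few lines of Python (shown just to make the arithmetic checkable; a DT script would need its own version):

```python
import math
import statistics

values = [1, 3, 4, 6, 9, 19]

mean = sum(values) / len(values)             # 42 / 6 = 7
deviations = [v - mean for v in values]      # -6, -4, -3, -1, 2, 12
squares = [d * d for d in deviations]        # 36, 16, 9, 1, 4, 144
variance = sum(squares) / (len(values) - 1)  # 210 / 5 = 42
stdev = math.sqrt(variance)                  # sqrt(42), about 6.48

# statistics.stdev does all of the above in one call:
assert abs(stdev - statistics.stdev(values)) < 1e-9
print(round(stdev, 2))  # 6.48
```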
I have yet to re-figure out how to find a % along the distribution curve using standard deviation, but that should be easy. It's important that when getting a mean from a super-set, we apply the weights to each set of attributes individually before we combine them into the superset; you can't just apply weights to the superset afterwards (because they're all combined into one dataset by then). For example, with a strength weight of 1.2 and a willpower weight of 0.8, we would multiply the strength attribute dataset by 1.2, then the willpower attribute dataset by 0.8, then combine them into a superset, then find the mean and standard deviation.
I think I got it right. We then might need to apply the weights to an individual dwarf's attributes before comparing them to the superset (to make them match the superset's weighting); I'm not 100% sure on this. Weights make it confusing for me, but I do know we need to apply the weights to the datasets BEFORE combining them into one superset to get the mean/standard deviation.
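A minimal sketch of weight-then-combine, using made-up attribute values and the 1.2/0.8 weights from the example above:

```python
import statistics

# Hypothetical samples; the 1.2 (strength) and 0.8 (willpower) weights
# are from the example in the post.
strength = [1100, 950, 1300, 875, 1250]
willpower = [900, 1020, 760, 1110, 980]

# Apply each weight to its dataset BEFORE combining into the superset.
weighted_superset = [v * 1.2 for v in strength] + [v * 0.8 for v in willpower]

mean = statistics.mean(weighted_superset)
stdev = statistics.stdev(weighted_superset)

# An individual dwarf's attributes would be weighted the same way
# before comparing them against this superset (hypothetical values):
dwarf_weighted = [1225 * 1.2, 887 * 0.8]
```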
Update:
Okay, I re-remembered how to turn standard deviations into percentages. Unfortunately, there's no easy closed-form formula for it; you have to reference a table (which probably means coding this into an array, or something else convenient for referencing). It's called the Table of the Standard Normal Distribution, aka z-values:
http://www.fmi.uni-sofia.bg/vesta/virtual_labs/tables/tables1.html. Then you take an attribute you want to compare, measure how many standard deviations it is away from the mean, and look that value up in the table; that gives your % above or below 50%. The table I supplied can be mirrored below 50% using the same standard deviations in column z; if that is confusing I can provide a more complete table, but it's simply counting down from 50% the same distance as it counts up from 50%.
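If coding the whole z-table into an array turns out to be a pain, the same numbers can be computed directly from the error function. In Python, `math.erf` gives the standard normal table value in one line (a sketch of the math, not DT script code):

```python
import math

def normal_percentile(z: float) -> float:
    """Standard normal CDF: fraction of the population at or below z
    standard deviations from the mean. Replaces the z-table lookup."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(normal_percentile(0.43), 4))   # 0.6664 (matches the table)
print(normal_percentile(0.0))              # 0.5 (the mean itself)
# Below the mean the curve mirrors, no extra table needed:
print(round(normal_percentile(-0.43), 4))  # 0.3336, i.e. 1 - 0.6664
```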
I have an example for you:
you can use the sample of dwarves I have on
http://dffd.wimbli.com/file.php?id=5834 to verify. Again, because the distribution curve is slightly skewed, we're gonna have some bias, but it's accurate to within 5% (statistics become more accurate with larger data sets).
Avg of column c (analytical): 1052
Standard Deviation: 383.01
Specific instance of dwarf attribute: 1225 (last attribute in column c)
Formula: z = |instance − average| / standard deviation
(1225 − 1052.42)/383.01 = 0.4505813 standard deviations from the mean (using the unrounded mean, 1052.42).
Then compare this value to the Table of the Standard Normal Distribution, and that's your % along the distribution curve.
In this case 0.45 on the table gives .6736, i.e. about 67.4%.
Doing =(COUNTIF(C2:C341;"<1225"))/339 in the spreadsheet resulted in 76.1% (close enough; as I said, the distribution is slightly skewed, which is obvious when the range from the minimum value to the mean differs from the range from the mean to the maximum value).
Note: 339 is the size of the sample (i.e. the number of values), used for determining the %.
Another Example:
second to last value in column c: 887
|(887-1052)|=165
165/383.01=0.4307981514842954
0.43 on the table is .6664. Now, this value (887) is below the mean, so we have to go below 50%; you do this by subtracting: 1 − .6664 = 0.3336.
=(COUNTIF(C2:C341;"<887"))/339 gives 0.3008850.
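Both examples can be reproduced in a few lines, using the error function in place of the table lookup. The mean 1052.42 and standard deviation 383.01 are taken from the sample as given above (using the unrounded mean throughout, so the second result differs slightly from the hand calculation with 1052):

```python
import math

def normal_percentile(z):
    # Standard normal CDF; stands in for the z-table.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean = 1052.42   # column c (Analytical) average from the sample
stdev = 383.01

for value in (1225, 887):
    # Keeping the sign means below-mean values land below 50% automatically,
    # with no separate "subtract from 1" step.
    z = (value - mean) / stdev
    print(value, round(z, 2), round(normal_percentile(z), 4))
# 1225 -> z about  0.45, percentile about 0.6739
#  887 -> z about -0.43, percentile about 0.3329
```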
As a note, the standard normal distribution curve is meant for normal distributions, which dwarf attributes may not follow (there is a curve, though). However, this method is probably the best we're going to get, and it uses the same "normal" logic I was using with my min/max range method (that method assumed a flat distribution rather than a curve, but still a symmetrical one). Good thing I took statistics in high school, otherwise this stuff would have been harder for me.
Update:
I realized that multiplying a dataset by a weight and THEN finding its average is the same as multiplying the dataset's average by the weight (which would mean you could just average all the new weighted averages to get the superset average). Update: no, it's not that simple; the distances from the mean are multiplied too, which means the stdevp is also multiplied by the weight. So a superset of all role attributes (after being weighted) still needs to be built to determine the variation from the new mean and properly figure the standard deviation.
Also, the (sample − 1) divisor can be skipped and a plain (sample) divisor used when dividing the sum of squared deviations from the mean.
Example:
Agility average: 879.59
Analytical average: 1052.42
Creativity average: 1044.32
Sum: 2976.33
# of attributes: 3
Mean of the attribute means: 2976.33 / 3 = 992.11
Update:
My formulas had a messed-up weighted mean: I had addition occurring before multiplication, when it should have been the other way around.
It's ((attribute mean * weight) + (attribute mean * weight) + ... for each attribute) / (# of attributes) = weighted mean.
((1250*1.1)+( 900*1.2))/2 =
1227.5
Not
average of the attribute averages: (1250 + 900)/2 = 1075
average of the weights: (1.1 + 1.2)/2 = 1.15
multiply the attribute average by the weight average: 1075 * 1.15 = 1236.25
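A quick check of the two orders of operation, using the same numbers as above:

```python
means = [1250, 900]
weights = [1.1, 1.2]

# Correct: multiply each attribute mean by its weight first, then average.
weighted_mean = sum(m * w for m, w in zip(means, weights)) / len(means)
print(round(weighted_mean, 2))  # 1227.5

# Wrong: average the means and the weights separately, then multiply.
wrong = (sum(means) / len(means)) * (sum(weights) / len(weights))
print(round(wrong, 2))  # 1236.25
```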
The weighted mean WAS ONLY USED TO DETERMINE THE > number, which was a % (10-20%) of the weighted average. So the impact on the formulas is minimal; the rest of the formula was sound.
Update:
I think the original reason I requested a user-definable column was to be able to do my own sort, whether I wanted to use flat-out numbers or a %. That way I could create a script (hopefully) in DT and then see my own new column that I can run a sort operation on.