I thought we were really really really close to ready on this last beta/alpha until I noticed one problem that was about to derail this whole new version.
A description of the problem can be found here:
http://stats.stackexchange.com/questions/104637/ecdf-skewed-distribution-wish-to-mean-adjust-to-5
Almost every question I've asked on stats.stackexchange I have failed miserably to communicate properly, but I have found my own answers.
The problem is how ecdf returns a %. It works fine when the data is mostly distinct, meaning a majority of distinct, comparable values. But when a set of data is all the same value, or a majority are the same, and those values are 0, or sit either low or high in the distribution, it can have an effect on the %.
RANK() in Excel returns the ordinal position of a value within a list of values. If there is a tie, it returns the earliest position and skips the positions taken up by the tied values. So... if 2 values are tied at rank 3, the next rank displayed is rank 5. Likewise, if two values are tied at rank 36, the next rank would be 38; if 3 values were tied at rank 36, the next rank reported would be 39.
What's confusing is ECDF isn't called a ranking function, yet it derives a % based on rank. I assumed it worked from the same position as RANK() in Excel, but when I actually compared the two, I found that RANK() works from the earliest position of a tie, whereas ecdf works from the last position. It was like ecdf was rank's evil twin, and you needed to combine them to make a centered value.
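Here's a tiny sketch (Python, purely illustrative; none of this is Dwarf Therapist's actual code) of the two tie conventions, using a mostly-zero skill column:

```python
# Purely illustrative sketch of the two tie conventions described above.

def rank_pct(values, x):
    """% from the first position of a tie (Excel RANK-style): share of values strictly below x."""
    return sum(1 for v in values if v < x) / len(values)

def ecdf_pct(values, x):
    """% from the last position of a tie (classic ecdf): share of values at or below x."""
    return sum(1 for v in values if v <= x) / len(values)

skills = [0, 0, 0, 0, 0, 0, 0, 0, 3, 7]   # a mostly-zero skill column
print(rank_pct(skills, 0))   # 0.0 -> the tie starts at the very bottom
print(ecdf_pct(skills, 0))   # 0.8 -> the tie ends 8 positions up
```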
It was this effect that was having a huge impact on the way we were trying to "deskew" skills when most skills were 0's and a few were non-0 values. The non-0 values would have high %'s on ecdf by itself, but the 0's would also get a huge percent, which I back-end corrected with a very convoluted formula that padded a 0's value to almost-but-less-than 50%, and transformed the rest of the non-0 values to above 50%, based on an ecdf of just the non-0 values.
I know it's complicated, but you don't have to remember it, because it's getting removed thanks to our new insight. A very complicated hack.
Well. Our problem occurred when we had **non-0 values** from maklak's skill emulation formula (plus some pre- vs post-weighting issues) for a starting embark of dwarves with no skills: because they had skill levels of 3 being reported, they were getting a high boost due to the default skill rate formula (only old-schoolers who really follow this thread will know what maklak's formula is - *this is in no way meant as a slight to mk, just trying to reference his contributions and our desire to preserve them in any future work*). I know it's confusing, but our deskew method assumed, and really needed, to work with 0 values rather than minimum values. We were about to just replace a minimum value with 0 if it was also the median. Then we thought, what happens if there is one little value below the large skew? Splinterz found a null below a 0! So yeah, we had to come up with something.
Anyways... it was showing a bunch of [0 skilled] dwarves as really good fits for skill-only roles.
It was a conundrum. We had dwarves with 0 skills (but transformed to level 3 due to skill rate) being listed as a better fit for the job than other labors. I had to figure out a way to autocorrect it, but I was failing.
Then came along rank.
It autocorrected it for me. Rank takes values that are tied but low and gives them a % starting at the first position of the tie. ECDF works the opposite way, returning a % from the last position of the tie. So I combined the two (averaging them), and found that low-value skews landed under 50%, and values above them landed at 50%+, which was our desired behaviour.
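In code form the combination is just the average of those two %'s; a minimal sketch (again illustrative only, the function name is mine, not the actual implementation):

```python
# Illustrative sketch: centre a value's % by averaging its first-of-tie (rank)
# and last-of-tie (ecdf) percentages.

def centered_pct(values, x):
    n = len(values)
    below = sum(1 for v in values if v < x)         # first position of the tie
    at_or_below = sum(1 for v in values if v <= x)  # last position of the tie
    return (below + at_or_below) / (2 * n)          # midpoint of the tied block

skills = [0, 0, 0, 0, 0, 0, 0, 0, 3, 7]
print([round(centered_pct(skills, v), 2) for v in skills])
# [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.85, 0.95]
# the big block of 0's lands just under 50%, the few real values land above it
```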
This breakthrough should be able to replace all the other convoluted formulas we had worked on for preferences as well.
I believe it will make the whole system more robust and centered, and 50% will now mean neutral: <50% = bad for the job, 50%+ = good for the job. There will no longer be columns of jarring reds, but instead columns of blanks, or 50%.
It also means the labor optimizer will treat a [starting embark] population with no shearer skills (a skill-only role at the time of this writing) as a 50% drawn value vs 0%.
It's basically saying, this person is neither bad, nor good at this job compared to the rest of the population (as in they are all tied).
This is an important distinction in the behaviour of the labor optimizer: as described above, no skill now means 50%. However, as soon as a dwarf starts to improve in that skill, you'll notice a ~100% value for them and a ~<50% value for the rest. This means that during labor optimization, those who are considered truly bad at a job compared to the rest of the population will be scored lower than these neutral values, so the labor optimizer will assign neutral jobs before bad jobs. It also means that when looking at the screen, 50% = good, and your labor optimizer shouldn't be overextended into assigning values below 50% (as in, trying to assign too many labors).
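For example, with the same illustrative averaging as above and a made-up shearer column:

```python
# Same illustrative first-of-tie / last-of-tie averaging as the earlier sketch.

def centered_pct(values, x):
    below = sum(1 for v in values if v < x)
    at_or_below = sum(1 for v in values if v <= x)
    return (below + at_or_below) / (2 * len(values))

no_shearers = [0, 0, 0, 0, 0, 0, 0]            # starting embark, nobody can shear
print({centered_pct(no_shearers, v) for v in no_shearers})
# {0.5} -> everyone is tied, so everyone is exactly neutral

one_learner = [0, 0, 0, 0, 0, 0, 4]            # one dwarf starts training the skill
print(sorted({round(centered_pct(one_learner, v), 2) for v in one_learner}))
# [0.43, 0.93] -> the learner shoots up toward 100%, the rest drop just below 50%
```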
The way we derive %'s in this new setup is based on the comparable value of items within categories.
What's a category? Attributes, traits, skills, and preferences.
So when we look at a category, we lay it out as a grid, for example:
Attributes
x = dwarves, y = attribute names (19)
Traits
x = dwarves, y = traits (~60)
Skills
x = dwarves, y = skill names (~119)
Preferences
x = dwarves, y = roles (~100+) *
*Preferences was a hard one. We decided to quantify preference %'s as the # of matching preferences / # of preferences defined in the role, so Splinterz had to calculate all the role %'s like twice on the backend: once for the preference category, then fed into ecdf/rank, then back into the preference as an actual distinct % separate from traits, skills, and attributes.
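A rough sketch of that raw preference step (the names and preference strings here are made up, not Splinterz's actual code or data structures):

```python
# Illustrative only: raw preference % = # of matching preferences / # of preferences
# the role defines. That raw ratio is what later gets fed through ecdf/rank like any
# other value.

def raw_preference_pct(dwarf_prefs, role_prefs):
    if not role_prefs:
        return 0.0
    return sum(1 for p in role_prefs if p in dwarf_prefs) / len(role_prefs)

dwarf = {"metal: steel", "weapon: war hammers", "creature: cats"}
role  = {"metal: steel", "weapon: war hammers", "item: anvils"}   # made-up 3-preference role
print(round(raw_preference_pct(dwarf, role), 2))   # 0.67 -> 2 of the role's 3 preferences match
```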
Then we run this through the ecdf/rank % average method mentioned above, and we get a skew-corrected value: our low values at <50%, and our positive values at 50%+.
This gives us a large # of comparable elements of varying size that we can definitely relate to each other internally, within their own category, on a scale of 0 to 100%, but we cannot do it outside of that category. So... what do we do, and how?
We use the ecdf/rank % conversion on each value compared to every other value within a category (you get very large datasets doing this: 53 dwarves gave me 1007 comparable [attribute] values that turned into comparable %'s, which, when combined with the %'s derived for the other categories, allow a distinct % for each and every combination). Even for very small populations, a starting embark will have 133 comparable attribute values from which to draw 133 distinct % values. If your fort dies and you're down to your last man, 1 dwarf will still have 19 distinct % values for his set of attributes.
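Roughly, per category, it looks like this (my reading of it sketched in Python; the real work happens inside Dwarf Therapist):

```python
# Illustrative sketch: flatten one category's dwarf-by-item grid into a single pool,
# convert every cell to a centered % against that pool, then reshape.
import random

def centered_pct(pool, x):
    below = sum(1 for v in pool if v < x)
    at_or_below = sum(1 for v in pool if v <= x)
    return (below + at_or_below) / (2 * len(pool))

random.seed(1)
n_dwarves, n_attributes = 7, 19   # a starting embark: 7 x 19 = 133 comparable values
grid = [[random.randint(150, 2000) for _ in range(n_attributes)] for _ in range(n_dwarves)]

pool = [v for row in grid for v in row]                      # every value in the category
pct_grid = [[centered_pct(pool, v) for v in row] for row in grid]

flat = [p for row in pct_grid for p in row]
print(len(flat), round(sum(flat) / len(flat), 3))            # 133 0.5 -> centered on 50%
```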
Friggin' amazing right?
Here's how the %'s are derived from some standard #'s.
https://docs.google.com/spreadsheets/d/1gitnUzUyaROi-QroCXvXbY2raFBJTHZk7YCMWTlOHjw/edit?usp=sharing
Here's what it looks like raw vs scaled (left is raw, top is attributes, bottom is skills, right is scaled):
http://imgur.com/WlulpOh
What you're seeing is the distributions scaled based on their ordinal rank positions, centered within a % if ties exist.
The reason for the "squish" of values is how ordinal ranks work: every difference in value is only worth 1 point, whereas the raw values span much larger ranges. However, for deriving %'s on a scale of 1 to 100%, this works perfectly, as it retains the ordinal position and achieves a mean/median of .5, with a min of ~1% and a max of ~100%.
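A quick illustration of that "squish", using the same illustrative conversion as above:

```python
# Illustrative: the conversion keeps ordinal order and centres on 0.5 no matter how
# wild the raw spread is - big raw gaps become single ordinal steps.

def centered_pct(pool, x):
    below = sum(1 for v in pool if v < x)
    at_or_below = sum(1 for v in pool if v <= x)
    return (below + at_or_below) / (2 * len(pool))

raw = [5, 5, 5, 7, 9, 40, 41, 500, 2000, 2001]   # wildly different raw scales
pct = [round(centered_pct(raw, v), 2) for v in raw]
print(pct)   # [0.15, 0.15, 0.15, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
print(round(sum(pct) / len(pct), 2), min(pct), max(pct))   # 0.5 0.15 0.95
```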
The reason these individual comparisons don't have a mean of .5 (although the median of most of those skills is our <50% value) is because only the larger category [in this case 'skills'] is centered around a .5 median/mean. The elements within roles are subsection views of attributes and skills within the larger grid of data we normalized. Ordinal positions within categories are respected (compare the median with the min and max). The conversion cuts off excessive differences in values and preserves them as a % based on the # of elements being counted.
However, in the case of a skewed distribution, the range in values was preserved and produced a mixture model for me.
http://m.imgur.com/KKljFOg
So you can see that 0 values land at <50%, and there is a huge gap before the ~90% values start, which is entirely as intended. These 90%'s represent the vast gap that is produced by the skew. It's hard to wrap your head around, but this achieves an overall output target of 50% when we apply the same methods to all categories, which allows for maximally centered comparisons when defining roles.
A final picture showing quartile comparisons of attributes sorted by attribute median.
http://imgur.com/ki3fgyo