Topic: Dwarf Therapist v42.1.7 | DF 50.14 (Read 423521 times)

Clément · « **Reply #315 on:** March 16, 2018, 07:45:07 am »

Quote from: thistleknot on March 16, 2018, 02:04:07 am

I was hoping for a way to list all preferences by category then specifics and scroll left to right to see them all, maybe expand them using a +/- box dialog.

I am not sure what you are describing. Scrolling won't do it: some categories have hundreds of items (see the custom role editor dialog). The grid view editor will need to be redone to be usable with preferences, using a filter search like the custom role editor. The current nested context menus cannot contain the hundreds of creature and material preferences.

Quote from: thistleknot on March 16, 2018, 02:04:07 am

Spreadsheet Berserker

Indeed. Sorry, I cannot comment on your data; I have no idea what is going on here.

If you are the statistics expert, do you have any idea what could be causing this issue?

thistleknot · « **Reply #316 on:** March 16, 2018, 09:31:15 am »

Quote from: Clément on March 16, 2018, 07:45:07 am

Quote from: thistleknot on March 16, 2018, 02:04:07 am
I was hoping for a way to list all preferences by category then specifics and scroll left to right to see them all, maybe expand them using a +/- box dialog.

I am not sure what you are describing. Scrolling won't do it: some categories have hundreds of items (see the custom role editor dialog). The grid view editor will need to be redone to be usable with preferences, using a filter search like the custom role editor. The current nested context menus cannot contain the hundreds of creature and material preferences.

Quote from: thistleknot on March 16, 2018, 02:04:07 am
Spreadsheet Berserker

Indeed. Sorry, I cannot comment on your data; I have no idea what is going on here.

If you are the statistics expert, do you have any idea what could be causing this issue?

I apologize for the lack of context. I worked with Splinterz back with v13 of Dwarf Therapist to help derive scores based on attributes, traits, skills, preferences, etc. I have intimate knowledge of how the role ratings are derived (at least up till v34). I was unaware the project was transferred to a new lead until yesterday.

Either way, I believe raw values are currently converted to %'s by a min/max transformation around the average, then a min/max transformation around the median, and the result is combined with an empirical cumulative distribution function (ECDF) to get a %. These %'s are then fed into a weighted sum algorithm to get role ratings.

I was proposing a new method that uses a simpler weighted standard deviation approach around the median (similar to a median/median absolute deviation approach) and makes allowance for skewed distributions (i.e. data sets with extremely large values, or outliers).

The new proposed method doesn't do any weird transforms around the average/median that would leave the rest of the dataset not proportionally accurate (i.e. when transforming around different centers, the areas around the center get shifted to varying degrees, so the distances around the center are not the same on either side of the center). With this new proposed method, the distances would be equal always, plus the added benefit that there is only one center now (the median)

I could code it directly myself. I'm a bit slow though: I created the first prototype for the labor optimizer in v17 (took me two weeks, and Splinterz said it was more or less a superfunction of spaghetti code). I generally make mockups in Excel for review. That file I recently uploaded is a bit messy as I was trying a lot of things, but I could pare it down.

~~Alternatively, I was looking at using Kernel Density Estimates (and maybe integrating R) to derive probability distributions.~~ Scratch that, KDE's are basically Empirical Cumulative Distribution Functions and do not measure the distance between values. However, this method is much much much easier.

And yes, I am somewhat of a stats expert. I'm currently in a master's program for Data Science, but the issue with that is, most methods for normalizing assume a NORMAL DISTRIBUTION, and the data that Dwarf Fortress produces isn't normal. So non-parametric methods have to be used. This I guess is a novel/hacky non-parametric method that uses a weighted sum algorithm for the standard deviation, which in turn allows for capturing the larger values appropriately while maintaining a ~0% (min) to 50% (median) to ~100% (max) spread.

I could create a complete mockup of roles in excel so you can get a better idea. I could do a comparison with old vs new. Or I could just scratch that and attempt at coding a prototype. It's been a while, and if I do, it will probably take me a minute (a month?)

Edit:

As to the bug... not sure. I noticed a bug a while ago in the way roles were calculated that I would also like to address (especially if I'm proposing a new method to calculate roles).

I would love to work with you if you are willing. I can do qa testing and work on developing a newer version. I can even deliver mockups before [if] we go that route.

Clément · « **Reply #317 on:** March 16, 2018, 12:35:16 pm »

Quote from: thistleknot on March 16, 2018, 09:31:15 am

Either way, I believe raw values are currently converted to %'s by a min/max transformation around the average, then a min/max transformation around the median, and the result is combined with an empirical cumulative distribution function (ECDF) to get a %. These %'s are then fed into a weighted sum algorithm to get role ratings.

If I am reading the code correctly, there are actually different methods applied depending on the distribution. This started with this commit from v28 (and there is your name in the commit message). The one you are describing is RoleCalcRecenter, right?

DT first checks whether the data is "skewed" (the first and second quartiles are equal). For non-skewed distributions, it applies RoleCalcRecenter. For skewed distributions, it applies RoleCalcBase if fewer than 25% of the values in the data set are unique, else it applies RoleCalcMinMax.

I can understand the base_rating (ECDF?) and range_transform, but I am not sure about the linear combinations:
RoleCalcBase's rating is the average of ECDF and 1.
RoleCalcMinMax is the average of ECDF, MinMax (similar to range_transform but without a middle) and 1 (with double weight).
RoleCalcRecenter is the average of ECDF and range transformed value (around average then median).

RoleCalcBase and RoleCalcMinMax produce ratings in the 50%-100% range (actually the minimum may be even higher than 50% because of how base_rating works). I guess they are meant for data sets with a lot of zeros (e.g. rare skills and preferences) where the median and the minimum are the same value (0), so the minimum is considered average.
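As a rough illustration of the selection logic and the two simpler linear combinations described above, here is a minimal Python sketch (not DT's actual C++). The function names, the tie-averaging ECDF, and the quartile method are assumptions made for the example; RoleCalcRecenter is left out because its range transform is not spelled out here.

```python
from statistics import quantiles


def ecdf(values, x):
    """Tie-averaging ECDF (assumed form): mean of the '<' and '<=' proportions."""
    below = sum(v < x for v in values)
    at_or_below = sum(v <= x for v in values)
    return (below + at_or_below) / 2 / len(values)


def min_max(values, x):
    """Linear position of x between the minimum and the maximum."""
    lo, hi = min(values), max(values)
    return 0.5 if hi == lo else (x - lo) / (hi - lo)


def pick_method(values):
    """'Skewed' means the first and second quartiles are equal."""
    q1, q2, _ = quantiles(values, n=4)
    if q1 != q2:
        return "recenter"
    unique_ratio = len(set(values)) / len(values)
    return "base" if unique_ratio < 0.25 else "minmax"


def rating_base(values, x):
    """Average of ECDF and 1."""
    return (ecdf(values, x) + 1) / 2


def rating_minmax(values, x):
    """Average of ECDF, MinMax and 1, with 1 counted twice."""
    return (ecdf(values, x) + min_max(values, x) + 2 * 1) / 4


mostly_zeros = [0] * 18 + [5, 10]
print(pick_method(mostly_zeros), rating_base(mostly_zeros, 0))
```

With a data set that is mostly zeros, the lowest rating from this sketch lands above 50%, which is consistent with the observation about RoleCalcBase.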

Quote from: thistleknot on March 16, 2018, 09:31:15 am

I was proposing a new method that uses a simpler weighted standard deviation approach around the median (similar to a median/median absolute deviation approach) and makes allowance for skewed distributions (i.e. data sets with extremely large values, or outliers).

The new proposed method doesn't do any weird transforms around the average/median that would leave the rest of the dataset not proportionally accurate (i.e. when transforming around different centers, the areas around the center get shifted to varying degrees, so the distances around the center are not the same on either side of the center). With this new proposed method, the distances would be equal always, plus the added benefit that there is only one center now (the median)

I could code it directly myself. I'm a bit slow though: I created the first prototype for the labor optimizer in v17 (took me two weeks, and Splinterz said it was more or less a superfunction of spaghetti code). I generally make mockups in Excel for review. That file I recently uploaded is a bit messy as I was trying a lot of things, but I could pare it down.

~~Alternatively, I was looking at using Kernel Density Estimates (and maybe integrating R) to derive probability distributions.~~ Scratch that, KDE's are basically Empirical Cumulative Distribution Functions and do not measure the distance between values. However, this method is much much much easier.

And yes, I am somewhat of a stats expert. I'm currently in a master's program for Data Science, but the issue with that is, most methods for normalizing assume a NORMAL DISTRIBUTION, and the data that Dwarf Fortress produces isn't normal. So non-parametric methods have to be used. This I guess is a novel/hacky non-parametric method that uses a weighted sum algorithm for the standard deviation, which in turn allows for capturing the larger values appropriately while maintaining a ~0% (min) to 50% (median) to ~100% (max) spread.

I could create a complete mockup of roles in excel so you can get a better idea. I could do a comparison with old vs new. Or I could just scratch that and attempt at coding a prototype. It's been a while, and if I do, it will probably take me a minute (a month?)

If you don't want to write the code, you could give the formulae and explain how they should be used. Even if you write a patch ready to merge, I would not mind a detailed explanation. This kind of work would deserve an article.

Quote from: thistleknot on March 16, 2018, 09:31:15 am

As to the bug... not sure. I noticed a bug a while ago in the way roles were calculated that I would also like to address (especially if I'm proposing a new method to calculate roles).

Rereading the code, I think RoleCalcBase::find_median is incorrect for even-sized vectors (and overkill for sorted vectors). The result may be random (but always lower than the actual median) depending on how std::nth_element is implemented.
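For reference, the usual definition for an even-sized data set averages the two middle order statistics. A minimal Python sketch of that definition (illustrative only, not the DT code; the function name is made up for the example):

```python
def median_of(values):
    """Median that averages the two middle elements for even-sized input."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)  # a full sort keeps the sketch simple
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([4, 1, 3, 2]))  # the two middle values are 2 and 3
```

A partial-selection approach would need both middle elements to be in their final sorted positions before averaging them.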

thistleknot · « **Reply #318 on:** March 16, 2018, 01:12:32 pm »

ECDF stands for empirical cumulative distribution function. With any distribution [especially if it's not normal], what ECDF measures is the FREQUENCY with which a value occurs (a simple countif), not the difference between values. That is why we pair ECDF with a min/max conversion. The min/max does the opposite: it doesn't care about frequency and only scores the DISTANCE between values. Averaging the two methods results in a score that accounts for BOTH DISTANCE AND FREQUENCY.

In hindsight, it may not be necessary to use ECDF (because the data will be present in the form of duplicates/frequency already?). I liked it because ECDF had the property of 0-50-100% respectively, and always averaged to 50%, which helped when merging other distribution methods to achieve a near 50% mean. In other words, if all values are unique, ECDF is literally a flat distribution of equal width.
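To make the frequency-versus-distance point concrete, here is a small Python sketch under stated assumptions: it uses the tie-averaging ECDF form from the spreadsheet formula posted later in this thread and a plain min/max scaling, and the function names are invented for the example.

```python
def ecdf_scores(values):
    """Frequency-only score: average of the '<' and '<=' proportions."""
    n = len(values)
    return [
        (sum(v < x for v in values) + sum(v <= x for v in values)) / (2 * n)
        for x in values
    ]


def min_max_scores(values):
    """Distance-only score: linear position between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def combined_scores(values):
    """Plain average of the two scores, as described in the post."""
    return [(e + m) / 2 for e, m in zip(ecdf_scores(values), min_max_scores(values))]


data = [3, 1, 4, 15, 9]
print(ecdf_scores(data))     # evenly spaced when all values are unique
print(min_max_scores(data))  # 15 stands far apart from the rest
```

With this tie-averaging form the ECDF scores average to 50% for any data set, duplicates included, while the min/max scores are what carry the information about how far apart the values are.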

I would propose still utilizing ECDF along with a new transform method that would replace the linear transform (i.e. min/max around the mean and median).

If you are willing to write the algorithm, I'll just do a mockup then.

A special method handles datasets with a large proportion of nulls (I believe a check of Quartiles determines this). That is something I would like to keep in place. I could draw out the if structures to how I think it actually does the if checks and maybe we can model/remodel it and go from there? I mean, I can also mock it all up in excel so it makes more sense with a separate tab for each transformation method.

The bug I mentioned I wrote up in here, it's one of my earlier posts under splinterz thread: http://www.bay12forums.com/smf/index.php?topic=122968.msg7305942#msg7305942

Edit:

As to preferences. Can't it just limit itself to the preferences that are found amongst the population and discard the rest?

Edit: [Heavily reduced/cleaned up] latest version of proposed labor calculations. Currently just calculated Attributes. (fixed the way ECDF values were averaged back in).

Sheet: MCZNewMAD, Cells P4 and P5 are what are fed into an excel norm.dist() function using the raw attribute values

The weighted sdev is calculated in cells l5:m11 and is based on the 68–95–99.7 rule. I take the cumulative normal percent at z-scores min, -3, -2, -1, 0, 1, 2, 3, max respectively, with min fixed at 0% and max at 100%. The differences between consecutive percents (e6:e11) are the weights, so each weight is the share of the distribution that falls in that band. The quantile positions of the z-scores are fed to a percentile() function (f5:f11) to get the matching data values. The difference between consecutive values (l5:l9) is treated as a pre-defined width (a single deviation), with the outer bands running out to min/max to cover all values beyond 2 deviations. Each width is multiplied by its weight to get the weighted deviations (m6:m11), which are summed together (a weighted sum) to derive a new deviation (m3).
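One possible reading of that calculation, written as a Python sketch. Everything here is an assumption made for illustration: the band edges (min, -2, -1, 0, 1, 2, max), the inclusive linear-interpolation percentile, and the function names. The spreadsheet remains the reference.

```python
from statistics import NormalDist


def percentile(sorted_values, p):
    """Inclusive percentile with linear interpolation between neighbours."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def weighted_band_deviation(values, z_points=(-2, -1, 0, 1, 2)):
    """Weighted sum of band widths, weighted by each band's normal probability."""
    ordered = sorted(values)
    # Cumulative percents: 0% for min, the normal CDF at each z-score, 100% for max.
    probs = [0.0] + [NormalDist().cdf(z) for z in z_points] + [1.0]
    # Data values sitting at those quantile positions.
    cuts = [percentile(ordered, p) for p in probs]
    widths = [b - a for a, b in zip(cuts, cuts[1:])]
    weights = [b - a for a, b in zip(probs, probs[1:])]  # these sum to 1
    return sum(w * d for w, d in zip(weights, widths))


print(weighted_band_deviation(range(101)))
```

Because the outer bands stretch to the actual minimum and maximum, a long tail widens its band and raises the resulting deviation, which is how skew enters the estimate in this reading.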

https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule

https://drive.google.com/file/d/17XyRkO0Ljm0v50ZVDTEHlJyxY44KNJB0/view?usp=sharing

Clément · « **Reply #319 on:** March 17, 2018, 05:17:53 am »

I think I get most of it, but what do you do with this WSMAD value once you have it?

Also shouldn't the [min, -2] and [+2, max] bands be ignored when averaging WSMAD? Their width is not the standard deviation in the theoretical normal distribution. You should add the -3 and +3 z-scores (and use the 99.7 from the 68–95–99.7 rule), if you need more values.

thistleknot · « **Reply #320 on:** March 17, 2018, 10:11:51 am »

The min and max are precisely what I aim to capture. The data ISN'T normal. If you use normal scores then the data will output to normal. By using min/max we allow for skew to be incorporated into the weighted sdev.

I think min, -2, -1, 0, 1, 2, max is probably best practice. I have a few other ideas (such as -3), or trimming x,y outliers, or Chebyshev's theorem vs. the empirical rule, or using a skewed cumulative distribution function.

If you use normal distribution assumptions, it will result in more high-end values capping at 100% (loss in values).

I have a stat exam this morning and while studying I reviewed a lot of other options. Using a weighted sdev isn't unheard of

Edit:

Fyi, I updated the spreadsheet I've uploaded a few times. You may want to check it out from time to time. I came up with 3 different methods to derive the standard deviation using the median (P5:P7). I fixed a few bugs and the values are not capping out at 1 like they were before. If I see them, I see maybe 1 or 2 or 3, not like 10-15+ for 50*19 attributes. One is similar to this function MAD

array formula

Code: [Select]

={sqrt(sum((R-MEDIAN(R))^2)/count(r))}

The entire formula, for any value x in an attribute range that conjoins all columns of attributes:

Code: [Select]


medianSDEV={sqrt(sum((attributeRange-MEDIAN(attributeRange))^2)/count(attributeRange))}

tempMadScore = (x-median(attributeRange))/medianSDEV

//explanation of excel function normdist:
//(x, mean, sdev, cumulative?)

tempMadScore = normdist(tempMadScore, 0, 1, 1)

ECDFScore = (countif(attributeRange,"<="&x) + countif(attributeRange,"<"&x))/2/count(attributeRange)

score = (tempMadScore + ECDFScore)/2

which gives a median based population standard deviation that approximates what I was doing
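For anyone who prefers code to spreadsheet formulas, the block above translates to roughly the following Python sketch. The function names and the guard for a zero deviation are additions for the example, not part of the spreadsheet.

```python
from math import sqrt
from statistics import NormalDist, median


def median_sdev(values):
    """Population standard deviation taken around the median instead of the mean."""
    center = median(values)
    return sqrt(sum((v - center) ** 2 for v in values) / len(values))


def score(values, x):
    """Average of a normal-CDF score around the median and a tie-averaging ECDF."""
    sdev = median_sdev(values)
    # Guard added for the sketch: identical values give a zero deviation.
    z = (x - median(values)) / sdev if sdev else 0.0
    mad_score = NormalDist(0, 1).cdf(z)  # same role as normdist(z, 0, 1, 1)
    at_or_below = sum(v <= x for v in values)
    below = sum(v < x for v in values)
    ecdf_score = (at_or_below + below) / 2 / len(values)
    return (mad_score + ecdf_score) / 2


attributes = [1, 2, 3, 4, 5]
print([round(score(attributes, a), 3) for a in attributes])
```

The median value always scores 50% in the normal-CDF half of this formula, because its z-score is 0.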

Alternatively, instead of min, -2, -1, 0, 1, 2, max, I used the variance formula for uniform distributions given here: https://en.wikibooks.org/wiki/Statistics/Distributions/Uniform which is (max-min)^2/12. Since I had min mapped to 0 and max to 1, the resulting standard deviation was sqrt(1/12), about 28.9% (O12), which I used to derive uniform percentiles (t4:x11).

My opinion is that the original method seemed adequate enough. However, for simplicity and thoroughness sake, I think the array formula listed would be the easiest method to implement.

thistleknot · « **Reply #321 on:** March 18, 2018, 09:56:51 am »

Quote from: thistleknot on March 17, 2018, 10:11:51 am

Code: [Select]
={sqrt(sum((R-MEDIAN(R))^2)/count(r))}

Scratch that. I noticed my calculations were off (missing column), and this method results in standard deviations that are too high, leaving the minimum much farther from 0% than the max is from 100%.

I did find a better method.

It basically is a two pass table lookup (averages two methods to find sdev), p11. It seems to produce values from within 2% of max. I tried merging the two tables, which also seemed to work and produced much lower sdev's, but I felt the values weren't dispersed enough. They tended to hug the center of the distribution (high peak).

https://drive.google.com/open?id=17XyRkO0Ljm0v50ZVDTEHlJyxY44KNJB0

boxplots of before and after

TheDorf · « **Reply #322 on:** March 18, 2018, 11:51:53 am »

Is there any way to get this to work with the latest version? Or will I have to wait for a 44.07 release?

PatrikLundell · « **Reply #323 on:** March 18, 2018, 12:00:37 pm »

Look further up in this thread...

Clément · « **Reply #324 on:** March 19, 2018, 06:21:38 am »

New version released: 39.3.0

Memory layouts for the new versions and some cosmetic changes.

Changelog:

added memory layouts for DF 0.44.06 and 0.44.07
added a retry button in the lost connection dialog
added profession icon for monster slayers
changed some colored text to be more adapted to the palette in use
updated links in help menu

Windows builds are also available on DFFD (win32, win64).

Pvt. Pirate · « **Reply #325 on:** March 19, 2018, 07:13:41 am »

downloading - awaiting the result of the avast-lab... but at least it didn't crash right away.

Clément · « **Reply #326 on:** March 24, 2018, 09:17:47 am »

To solve an issue with where the updater saves the downloaded memory layouts, I need to change how DT looks up data files. The simplest solution is to rely only on QStandardPaths to get the directories to search (and write to).

On Windows and MacOS, QStandardPaths contains directories relative to the application directory. Some directory names may change, but it should continue to work as before. But on Linux, QStandardPaths only contains the standard prefixes (/usr, /usr/local, ~/.local); this will prevent DT from finding the data files (memory layouts or grid views) if it is installed in a non-standard prefix, unless the XDG_* environment variables are set.

I can also add a portable mode (through a run-time command line parameter or a build-time option) where all the files (including the settings?) will be looked up relative to the application directory. The application directory would need to be writable so the settings and updated memory layouts can be written.

The current updater behavior is to write it in the working directory, which can be anywhere depending on how you launched DT. Only changing the directory where the memory layout is written won't work, since it will not be in the first priority directory and may be shadowed by another file when it needs to be read later. And I don't think I can find a directory order that can fit every use case at once. That is why I want to propose two distinct modes: one that uses standard paths (user files are stored in APPDATA on Windows, XDG_*_HOME on Linux), the other fully portable, where everything is contained in the same (writable) directory. Am I missing some use cases?
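To illustrate the two modes on Linux, here is a rough Python model of the lookup order. It is not Qt's implementation: the function name, the "dwarftherapist" directory name, and the choice of the bare application directory for portable mode are assumptions; the XDG defaults follow the XDG Base Directory convention.

```python
import os
from pathlib import PurePosixPath

APP_NAME = "dwarftherapist"  # assumed directory name


def search_dirs(portable, app_dir, env=None, home="~"):
    """Return data directories in priority order; the first one is the write target."""
    env = os.environ if env is None else env
    if portable:
        # Portable mode: everything lives next to the application.
        return [PurePosixPath(app_dir)]
    # Standard mode: user directory first, then the system-wide directories.
    data_home = env.get("XDG_DATA_HOME") or str(PurePosixPath(home) / ".local" / "share")
    data_dirs = env.get("XDG_DATA_DIRS") or "/usr/local/share:/usr/share"
    roots = [data_home] + [d for d in data_dirs.split(":") if d]
    return [PurePosixPath(root) / APP_NAME for root in roots]


for path in search_dirs(False, "/opt/dt", env={}, home="/home/urist"):
    print(path)
```

In this model the updater would write to the first directory, so a freshly downloaded memory layout is never shadowed by an older copy further down the list. It also shows why an install under a prefix such as /opt is not found unless XDG_DATA_DIRS includes it.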

I am also not sure how it affects MacOS (the example paths given in the doc look fine). Actually I have no idea where the memory layouts currently are. They are not copied in the deployment scripts. Are they downloaded by the updater on the first run? Where are they written?

I may use the opportunity of breaking all file paths to change where the settings file is stored. I find the current "UDP Software" hard to guess and I am thinking about removing the organization name and storing it in QStandardPaths::writableLocation(QStandardPaths::AppConfigLocation) + "/settings.ini" instead of the default path (e.g. ~/.config/dwarftherapist/settings.ini instead of ~/.config/UDP Software/Dwarf Therapist.ini).

feelotraveller · « **Reply #327 on:** March 24, 2018, 09:42:44 pm »

Like the proposed changes but would it be better to use 'dtsettings.ini' or similar to aid in searching when a user does not know where the file is located? ('settings.ini' will likely result in many hits)

jecowa · « **Reply #328 on:** March 25, 2018, 01:24:00 am »

Quote from: Clément on March 24, 2018, 09:17:47 am

I am also not sure how it affects MacOS (the example paths given in the doc look fine). Actually I have no idea where the memory layouts currently are. They are not copied in the deployment scripts. Are they downloaded by the updater on the first run? Where are they written?

Yes, the memory layouts get downloaded by the auto updater just fine on Mac last time I checked. Applications on MacOS are actually folders called packages. This package contains the executable, the icon file, some libraries, and maybe UI graphics, and other stuff. The Mac version of Dwarf Therapist stores the memory layouts in a folder inside of its application package. If you need to know the exact location, it's /DwarfTherapist.app/Contents/MacOS/share/memory_layouts/osx/

Clément · « **Reply #329 on:** March 25, 2018, 03:34:03 am »

Quote from: feelotraveller on March 24, 2018, 09:42:44 pm

Like the proposed changes but would it be better to use 'dtsettings.ini' or similar to aid in searching when a user does not know where the file is located? ('settings.ini' will likely result in many hits)

dwarftherapist.ini then, dtsettings.ini is not very good for search either. Maybe I should use different application names on different platforms. Linux prefers "dwarftherapist" (all lower case, no space), but Windows (and MacOS?) may prefer "Dwarf Therapist".

Quote from: jecowa on March 25, 2018, 01:24:00 am

Yes, the memory layouts get downloaded by the auto updater just fine on Mac last time I checked. Applications on MacOS are actually folders called packages. This package contains the executable, the icon file, some libraries, and maybe UI graphics, and other stuff. The Mac version of Dwarf Therapist stores the memory layouts in a folder inside of its application package. If you need to know the exact location, it's /DwarfTherapist.app/Contents/MacOS/share/memory_layouts/osx/

So the bundle is writable. You still need internet access on the first run to get the memory layout.

If I understand the paths correctly, the new standard path would be /DwarfTherapist.app/Contents/Resources/memory_layouts inside the bundle and new memory layouts would be downloaded to ~/Library/Application Support/dwarftherapist/memory_layouts.
