I'm also doing a start-over rewrite... I'm already at the declutter phase of the original dictionary and will have a modified, uncluttered dictionary in a week; side by side, I'm also writing the scripts that will parse that modified dictionary and produce the word files. The funny thing is that the first time around I ignored about 99% of what was in the original dictionary. Most of the bad spellings would have been fixed by the definitions following the words. I also didn't realize a few things about word tokens... really, I could have tossed all the ADJ:X-ing words from the get-go, as they are covered by their verb forms. [STANDARD VERB] provides participles as adjectives as if they had THE_COMPOUND_ADJ and FRONT_COMPOUND_ADJ. Tags in the definitions also provide plural forms, identify N.Plurals, and all sorts of other things I didn't bother to look for. I'm currently running a fix script that seeks out target information that flags an entry as potentially bad (the original dictionary was 30k entries deep, and I deleted from it manually as I rushed through, mostly by hand). My goal is to finish the deletions by next weekend.
Language-wise, things are not simple enough that you can just create a script and expect it to make sense of the sheer complexity of the various tenses of the words and so on. Like it or not, in order to get a good outcome you have to trawl through every word in your word file to make sure everything is fine, because lots of words do not follow the 'rules' that most of them do.
Another problem is that these are *not* words as they actually appear in the game. What we are doing is translating words into another language and using the symbolic tokens to actually make use of them, and a lot of the time you need more than one copy of the same word, because the same word (in English) does not always mean the exact same thing.
A lot of the time you have nouns that mean a different thing symbolically than their verb or adjective forms of the same spelling. You basically have to iron out the quirks of the English language by hand in order to create a universal language that can sustain a large number of other languages. A computer basically cannot do that; the errors it has added to the spelling are trivial compared to the time-consuming process of Google-searching all the different meanings a word has to check whether the verb, noun, and adjective all mean basically the same thing, and duplicating words where needed.
Of course a lot of the subtlety of the language can be lost with bad scripting... but the bad thing is that a lot of what upsets me about the current version I have came from my own hand. I shoved many words together trying to shrink the script in ways that were bad. The original dictionary actually had all the rules for spelling, words were separated better by meaning, and really all it needed, more than anything, was the bad words culled... which is what I'm doing right now with the rewrite. Before, I culled many things without checking whether they were even in the game, then left a lot of other stuff in for variety that was really just clutter. We don't need cabin, house, cottage, etc., when house alone would be enough... and further attempts to cull the script turned into a fiasco.
If you want to see the original dictionary, I've got a place it's uploaded to where you can download it.
The entire thing was a multistage project. How it should have been done was:
1. Cull all the bad words from the original dictionary, period; this results in a modified dictionary. I messed up here, because I let people rush me and goad me into getting the job done fast. Culling properly at this stage wouldn't have forced the rash decisions I took in the later stages.
2. Turn the modified dictionary into a script dictionary: a dictionary that is easily read by a parser to create the word files, so that manually editing each file separately is not necessary. Better scripts at this point would have caught the spelling errors by using information nested in the definitions (it literally has tags for when a different past tense and past participle should be used, or when a word requires an x-to-xx transformation). That would have caught, by my estimate, 95% of the spelling errors. Other scripts at this point could identify invalid adjectives: adjectives that are unnecessary because a past/past-participle verb word is already present.
3. Create scripts that take the script dictionary and parse it into the proper files. This part I had actually perfected pretty well. A little too well in some respects: I provided way too many options, which made the script dictionary extremely hard to read and caused a lot of the later failures at culling properly or adding symbolism.
4. MANUALLY select the symbol tables. I made several attempts to add symbolism through scripting, and here you are absolutely right, it's nigh impossible to accomplish... I don't own a Watson supercomputer.
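To illustrate what I mean in step 2, here's a minimal Python sketch of driving conjugation from tags nested in the definitions instead of blindly appending "ed". The tag syntax below ([imp. ...; p. p. ...] and [double]) is just my guess for illustration; the real dictionary's markup may look different:

```python
import re

# Sketch of the step-2 idea: read the tags nested in a definition to pick
# the right past tense / past participle, and handle the x-to-xx
# (consonant-doubling) transformation. Tag syntax here is assumed.

def conjugate(word, definition):
    """Return (past, past_participle) for a verb entry."""
    # Irregular forms spelled out in the definition,
    # e.g. "Run, v. [imp. Ran; p. p. Run] ..."
    m = re.search(r"\[imp\.\s*(\w+);\s*p\.\s*p\.\s*(\w+)\]", definition)
    if m:
        return m.group(1).lower(), m.group(2).lower()
    # x-to-xx doubling tag: final consonant doubles before -ed
    if "[double]" in definition:
        doubled = word + word[-1] + "ed"    # stop -> stopped
        return doubled, doubled
    if word.endswith("e"):
        return word + "d", word + "d"       # bake -> baked
    return word + "ed", word + "ed"         # walk -> walked
```

Something this dumb, run over every verb entry, is what would have caught most of the spelling errors automatically.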
But through it all, when I say I target selections for removal and then manually delete, these are rather mundane scripts that basically scan the dictionary for terms such as, say, "abbr." (the dictionary's term for abbreviation), and then I check the line and remove it if it is an abbreviation. My targeting script shows me when a line is a duplicate line, is a duplicate start word, contains "abbr.", or contains various other "dictionary" terms that generally mark a word for removal. I work each section down, remove the target tags, and rerun the script on the new file until each target returns zero tagged lines; rinse and repeat with a new target. If I don't think a line deserves deletion, I remove the offending target text instead (in other words, when targeting "american", I cut "american" out of the definitions of all the plants and animals that do exist inside of the vanilla DF scripts). This works better than searching all the entries line by line manually, since some definitions are truly massive... I can focus on one thing at a time. When I run out of targets to remove, I can scan a section (100 or so lines) and find another handful of "targets" for removal.
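For anyone curious, the targeting pass is roughly this simple (a Python sketch, not my actual script; the target list and the headword-first line layout are assumptions). It deletes nothing itself, it just tags suspect lines so I can review them by hand:

```python
# Tag suspect dictionary lines for manual review: exact duplicate lines,
# duplicate start words (headwords), and lines containing target terms
# like "abbr.". Target terms and line layout are assumed for illustration.

TARGETS = ["abbr.", "Obs.", "Amer."]  # hypothetical target terms

def tag_lines(lines, targets=TARGETS):
    """Return (line_number, tags, line) for every line that hits a target."""
    tagged = []
    seen_lines = set()
    seen_heads = set()
    for num, line in enumerate(lines, 1):
        tags = []
        if line in seen_lines:                  # exact duplicate line
            tags.append("DUP-LINE")
        seen_lines.add(line)
        head = line.split(None, 1)[0] if line.strip() else ""
        if head and head in seen_heads:         # duplicate start word
            tags.append("DUP-HEAD")
        seen_heads.add(head)
        tags += [t for t in targets if t in line]
        if tags:
            tagged.append((num, tags, line))
    return tagged
```

Each run narrows the list; once a target returns zero tagged lines, I swap in the next one.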