So I haven't been working on this in a long time, ever since I tested out the variable-order Markov model approach and found that the words it generated weren't noticeably better than those from the regular fixed-order chain I was already using.
Not sure what you mean by "variable-order Markov model", but constructing words from phonemes - or a workable approximation thereof - instead of letters works remarkably better for creating believable words.
As an approximation, you can replace multi-letter constructs that usually form a single phoneme or a diphthong with a pseudo-letter before constructing the Markov chains, then reverse the process when outputting words. The list can be built automatically (whenever a specific two- or three-letter combination is very common in the input corpus) or by hand, as desired.
For English, such a list could start with "th", "ch", "sh", "ph", "qu", "ea", "au", "ee" and some doubling of consonants (especially sonorants like "ll", "rr" and "mm").
For German, that could be "pf", "sch", "eu", "ch", "ph", "qu", "tz" and again consonant doubling.
For Polish, that would be "ch", "cz", "sz", "rz", "dz", "dź", "dż" and some palatalised combinations with "i" ("si", "ci", "dzi", "ni", "zi" and so on).
Fantasy languages could use some of this too.
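To make the round trip concrete, here's a minimal Python sketch of the idea, assuming a tiny hand-picked English digraph list, single-character placeholders and an ordinary fixed-order chain; the names and the toy corpus are purely illustrative:

```python
import random
from collections import defaultdict

# Hand-picked digraphs treated as single pseudo-letters (any list like the
# ones above works; placeholders are arbitrary unused characters).
DIGRAPHS = {"th": "1", "ch": "2", "sh": "3", "qu": "4", "ea": "5", "ee": "6", "ll": "7"}
REVERSE = {v: k for k, v in DIGRAPHS.items()}

def encode(word):
    # Replace each digraph with its pseudo-letter before chain construction.
    for di, ch in DIGRAPHS.items():
        word = word.replace(di, ch)
    return word

def decode(word):
    # Reverse the substitution for the final output.
    return "".join(REVERSE.get(c, c) for c in word)

def build_chain(words, order=2):
    # Plain fixed-order Markov chain over the encoded alphabet.
    chain = defaultdict(list)
    for w in words:
        w = "^" * order + encode(w) + "$"
        for i in range(len(w) - order):
            chain[w[i:i + order]].append(w[i + order])
    return chain

def generate(chain, order=2, max_len=20):
    state, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(chain[state])
        if nxt == "$":            # end-of-word marker reached
            break
        out.append(nxt)
        state = state[1:] + nxt
    return decode("".join(out))

corpus = ["thatch", "cheese", "quell", "shear", "thrill"]  # toy example corpus
chain = build_chain(corpus)
print(generate(chain))
```

The same scheme works for the German or Polish lists above; only the DIGRAPHS dictionary changes.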
EDIT: Actually, I just ran the English translation of "War and Peace" through a little analyser, and the most common digraphs there are "th", "ng", "ou", "ea", "ll", "wh", "sh", "ch", "ow", "ss", "ai", "ee", "oo", "gh", "ay", "rr", "tt", "ts", "ff", "ck", "pp", "au", "qu", "oi", "aw", "nn", "ue", "ui", "eo", "mm" and "yi" - in that order.
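For what it's worth, that kind of count is easy to reproduce with a plain bigram counter; a rough sketch follows (the actual analyser presumably filtered the pairs further, e.g. dropping combinations like "he" or "er" that are frequent but never a single sound, and the corpus file name here is purely hypothetical):

```python
from collections import Counter
import re

def top_pairs(text, n=30):
    # Count every adjacent two-letter pair inside words, most common first.
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts.most_common(n)

# Hypothetical corpus file; any large plain-text corpus will do.
# with open("war_and_peace.txt", encoding="utf-8") as f:
#     print(top_pairs(f.read()))
```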