Bay 12 Games Forum

Pages: 1 [2]

Author Topic: Fixing accented letters in names  (Read 5286 times)

mifki

  • Bay Watcher
  • works secretly...
    • View Profile
    • mifki
Re: Fixing accented letters in names
« Reply #15 on: August 24, 2014, 06:09:44 pm »

"Text will be Text" plugin for DFHack by Mifki allows multiple tilesets to override font and objects that normally use the same tile, but I'm unsure if there is a way to direct an override of capital letters. I would suggest asking Mifki. Perhaps he could do something about it?

Edit: Fixed link.

Well, characters that are present in the encoding should be displayed correctly with TWBT and appropriate tileset. Characters that are not present, like È... wait, what's displayed instead of it?

Hello, thanks for stopping by!
Lower case è is displayed in all instances of a name, even when used as a capital/uppercase letter. Even if there is a tile for È available on the tileset to be used... it's just not used. Plus there's letters that don't have equivalent uppercase versions in the main vanilla tileset, so we would need a second tileset to do this?  :)

No, it's not about tilesets. As lethosor said, these letters just can't be represented in the encoding. I think it's better just not to use such letters.

King Mir

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #16 on: August 24, 2014, 07:05:03 pm »

The convention for diacritics is often to omit them when the letter is capitalized. That's probably why the code page only has lowercase for many accented letters. It's also true that diacritics aren't as common at the beginning of words in real languages.

Larix

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #17 on: August 24, 2014, 07:30:53 pm »

The convention for diacritics is often to omit them when the letter is capitalized.

Where is such a convention in force? Diacritics have a clear phonetic meaning, they're not optional; if you use them in lower-case, you should also use them in upper-case.

Quote
That's probably why the code page only has lowercase for many accented letters. It's also true that diacritics aren't as common at the beginning of words in real languages.

No, that really depends on the accented letters you're thinking of - the Umlaute as well as the É commonly appear at the beginnings of words[1], and sure enough, the capitals É, Ä, Ö, Ü (and Å and Æ) are found in the tilesets. The other types - Î and Ù and the like - which are indeed extremely rare in the beginnings of words, are missing.

[1] I'm not going to make up frequency tables, but they're definitely common enough that a tileset missing those capitals would end up looking criminally shitty when trying to render German or French.
Logged

Spectre Incarnate

  • Bay Watcher
  • Possibly inside a dragon's toothy maw.
    • View Profile
Re: Fixing accented letters in names
« Reply #18 on: August 25, 2014, 12:38:01 am »

Well, darn. What makes a tile allowed to act as a capital letter then? I know this can't be fixed in vanilla cause there's only one tileset to go by and no room for new tiles, but I guess I thought with TWBT we could redirect lower case letters (while acting as capitals) to new sprites for capital letters instead.

*confused*  :-\
Logged
The in-game text has punctuation!  Who knew?
Mister Adams,
How many licks does it take to get to the [candy] center of a Dwarf Fortress?

mifki

  • Bay Watcher
  • works secretly...
    • View Profile
    • mifki
Re: Fixing accented letters in names
« Reply #19 on: August 25, 2014, 06:36:50 pm »

I'll try one more time :)

DF uses numbers 0-255 to encode all possible characters. In this range THERE'S NO NUMBER FOR CERTAIN CAPITAL LETTERS, just as there are no numbers for Chinese symbols, Cyrillic letters and so on. For the game engine they just don't exist, and the game can't ask the renderer to draw something that doesn't exist. So it always asks for the lowercase letters, and the renderer can't tell whether the game wanted a lower- or upper-case letter.
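
To make this concrete, here is a minimal Python sketch. It assumes the 0-255 range in question is code page 437 and relies on Python's built-in `cp437` codec; the helper name `has_cp437_code` is made up for illustration and has nothing to do with DF's own code.

```python
def has_cp437_code(ch: str) -> bool:
    """Return True if the character has a number (0-255) in code page 437."""
    try:
        ch.encode("cp437")
        return True
    except UnicodeEncodeError:
        return False


# Check a few lowercase/uppercase pairs of accented letters.
for ch in "éÉèÈîÎùÙ":
    if has_cp437_code(ch):
        print(f"{ch}: code {ch.encode('cp437')[0]}")
    else:
        print(f"{ch}: no code in this range")
```

Letters reported as having no code cannot be stored in an 8-bit string at all under this assumption, so no tileset can help with them.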

lethosor

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #20 on: August 25, 2014, 06:41:42 pm »

Essentially, TwbT can render uppercase diacritics (from a separate image), but there's no way to tell which lowercase diacritics should be uppercase when rendered.
Logged
DFHack - Dwarf Manipulator (Lua) - DF Wiki talk

There was a typo in the siegers' campfire code. When the fires went out, so did the game.

Spectre Incarnate

  • Bay Watcher
  • Possibly inside a dragon's toothy maw.
    • View Profile
Re: Fixing accented letters in names
« Reply #21 on: August 25, 2014, 07:25:27 pm »

I'll try one more time :)

DF uses numbers 0-255 to encode all possible characters. In this range THERE'S NO NUMBER FOR CERTAIN CAPITAL LETTERS.

OKAY. :P

I'm sorry for not following at first. I did not mean to question your judgement, just misunderstood something.

Logged

draeath

  • Bay Watcher
  • So it has come to this...
    • View Profile
Re: Fixing accented letters in names
« Reply #22 on: August 26, 2014, 03:17:22 pm »

You know, if Toady was going to change the encoding to use... I really think putting in UTF-16 would be quite awesome. There's an absolute TON of glyphs in there... (0000-FFFF fit in a single 16-bit unit; anything beyond that needs a surrogate pair in UTF-16 or a 32-bit encoding like UTF-32, which would be crazy ridiculous to support. Modern OSes shouldn't have trouble with UTF-8 or UTF-16, but I have no idea about UTF-32.)

For example, these (assuming font support, and there are plenty of freely available fonts that have lots).

Might as well go all the way and not gimp it with some silly encoding!
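
For a rough feel of what the three Unicode encodings cost per character, here is a small Python sketch. The sample characters and the helper name `encoded_sizes` are arbitrary choices for illustration. It also shows that characters above FFFF are still representable in UTF-16, as a pair of 16-bit units.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character takes in each encoding (no byte-order mark)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# U+00C8 and U+042F are inside 0000-FFFF; U+1F4A9 is beyond it.
for ch in ("\u00c8", "\u042f", "\U0001f4a9"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```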
Logged
Urist McAlchemist cancels extract isotope: interrupted by supercriticality accident.
This kea is so raw it stole my wheelbarrow!

Larix

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #23 on: August 26, 2014, 04:34:28 pm »

I'll try one more time :)

DF uses numbers 0-255 to encode all possible characters. In this range THERE'S NO NUMBER FOR CERTAIN CAPITAL LETTERS.

In the interest of precision: DF seems to use codepage 437 for display. This codepage doesn't include the capitals in question. They certainly fit into a byte-encoding, e.g. in the widely used codepage 850. That one seems to be a common encoding on the web today - at least for me the following string
ÁÂÀÃÊÐÊËÈÍÎÏÌÓÔÒÕÚÛÙÝ
shows up as a line of accented capitals, notably including the capital i-with-circumflex, e-with-grave and u-with-circumflex that have no capitals in the game.

In theory, the problem could be solved by changing the codepage, but I suspect Toady already uses the symbols that would get replaced.
(Incidentally, there are also 256-entry codepages with the full Russian alphabet, so Cyrillic is also entirely possible; for most alphabets it's a question of the codepage you choose.)
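
The difference between the two codepages can be checked with a short Python sketch using the built-in `cp437` and `cp850` codecs. The helper name `encodable` is hypothetical, and the sketch says nothing about how DF itself stores text; it only compares the two tables.

```python
CAPITALS = "ÁÂÀÃÊÐÊËÈÍÎÏÌÓÔÒÕÚÛÙÝ"


def encodable(text: str, codec: str) -> str:
    """Return the characters of `text` that have a single-byte code in `codec`."""
    kept = []
    for ch in text:
        try:
            ch.encode(codec)
            kept.append(ch)
        except UnicodeEncodeError:
            pass
    return "".join(kept)


print("cp437:", encodable(CAPITALS, "cp437") or "(none)")
print("cp850:", encodable(CAPITALS, "cp850"))

# The byte that means È in cp850 is already taken by another symbol in cp437.
code = "È".encode("cp850")[0]
print(code, "in cp437 is", bytes([code]).decode("cp437"))
```

The last line is the catch: every capital gained by switching tables takes over a code that the other table uses for a different symbol.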

I'm also in favour of going all Unicode ;)
Logged

fbo

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #24 on: September 21, 2017, 08:13:30 am »

In theory, the problem could be solved by changing the codepage, but I suspect Toady already uses the symbols that would get replaced.

The codepage can easily be changed with TWBT, which separates text and map tiles. The real problem is the hard-coded behaviour of lower-casing such letters; disabling it would take a DFHack binpatch.
Or ask Toady nicely to introduce an option in d_init.txt for this :)
Logged

PatrikLundell

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #25 on: September 21, 2017, 10:11:23 am »

As was touched upon in the original thread posts from 3 years ago, the good way to solve the issue is for DF to be changed to use a 16-32 bit character set. This would allow DF to use distinct characters for everything in the game, at the cost of not only re-coding all string handling, but also of Toady and Threetoe going through every mapping to separate multiple usages (and possibly reallocating some of the current ones to "better" ones).
It would also have the major advantage that tile sets can look reasonable when used for text even without TwbT's magickery, and would keep tile set creators occupied with creating new tiles ;)

Changing code pages does not achieve anything useful. Toady has selected a code page that fits as well as possible with what he wants to display, and changing to another code page means a lot of things currently displayed as something (sort of) meaningful get replaced by things that make no sense in the context of DF. (Random unconnected example: by getting "Ð" to display nicely as a "dh" character in text, the same code in the "graphics" represents an Imp, or something else without any connection to that letter.) Switching code pages on the fly is really just a rather cumbersome (and extremely inefficient, given the character overlap) way of extending the code set with more bits, and would probably end up using one byte for the code set and another for the code, giving an extremely inefficient 16 bit "character set".

Skipping code pages and using 8 bit character building (where multiple 8 bit "characters" combine to build e.g. characters with diacriticals) would probably work for text, but not for the DF "graphics", as each "thing" that should be displayed "graphically" would need to be a string of characters to cater for the built characters, which totally messes up how display info is stored.
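
As an analogy for character building, Unicode's own combining marks behave the same way. The Python sketch below uses the standard `unicodedata` module to split a precomposed letter into a base letter plus a combining accent; the word "Èle" and the one-code-per-tile list are made-up stand-ins, not DF's actual display storage.

```python
import unicodedata

composed = "\u00c8"  # È as a single code
built = unicodedata.normalize("NFD", composed)  # base letter + combining accent

print(len(composed), len(built))
print([f"U+{ord(c):04X}" for c in built])

# A display grid that stores exactly one code per tile:
tiles_single = list("\u00c8le")
tiles_built = list(unicodedata.normalize("NFD", "\u00c8le"))
print(len(tiles_single), "tiles vs", len(tiles_built), "codes for the same three glyphs")
```

With built characters the number of stored codes no longer matches the number of tiles on screen, which is the mismatch described above.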
Logged

mikekchar

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #26 on: September 21, 2017, 08:56:22 pm »

Yes, in fact if you use TEXT mode and have your terminal set to UTF-8, you will get UTF-8 characters being sent to you -- not that it will do any good :-).  A lower case letter is a lower case letter.

Often people miss the idea that there is a difference between an encoding for a character (the way it is stored in memory) and the glyph for a character (the way it is displayed on the screen).  A font (or tileset in DF) maps an encoding to a glyph.  So, in memory when we want to print "A" we might store the number 65 in memory.  When we want to print the "A", we will see the 65 in memory and look up the glyph in the font.  Normally it has a shape like: A.  Now, I can change the glyph easily.  I can make it look like: B.  Whenever you see the word "BAD", it will be displayed on the screen as BBD, but in memory it is still "BAD".

Each lower case character and upper case character has a different encoding.  So "A" is 65 and "a" is 97.  If I print "Bad", it won't print "Bbd" because I only changed the glyph for capital A (encoding 65), not lower case A (encoding 97).
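
The "BAD" example can be sketched in a few lines of Python. Here a "font" is just a hypothetical dictionary from encoding to glyph, and `render` is a made-up helper; real fonts and tilesets map to images rather than to other characters.

```python
# Default "font": every code 0-255 maps to the glyph normally shown for it.
font = {code: chr(code) for code in range(256)}


def render(text: str, font: dict) -> str:
    """Look up the glyph for each stored code; the stored text is not modified."""
    return "".join(font[code] for code in text.encode("ascii"))


font[65] = "B"  # change only the glyph for encoding 65 (capital A)

print(render("BAD", font))  # capital A now draws as B
print(render("Bad", font))  # lower case a (encoding 97) is untouched
print("BAD".encode("ascii")[1])  # in memory the middle letter is still 65
```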

So the original problem that's being talked about is *not* that the glyph is incorrect -- it's that the string is incorrect.  They put a lower case letter with a diacritical in the text rather than an uppercase letter with a diacritical.  Why is this?  Because Toady originally chose to encode the characters using only the numbers from 0 to 255 (i.e., there are only 256 distinct characters).  This is not because Toady is stupid, it's just the way we wrote *all* programs a long time ago.  An encoding that can represent exactly 256 characters is called a "code page".  These days everybody uses one of the "unicode" encodings (there are several -- the most popular being UTF-16 (Windows) and UTF-8 (The rest of the world)).  These allow you to encode every character in virtually every script known to man (*and* as a bonus allows you to encode emojis like pile-of-poo).

So the problem here is that in a 256 character code set, there isn't enough space to have *all* of the characters you might like.  Importantly (in this case), there isn't enough room to encode both the lower case characters with diacriticals *and* the upper case characters with diacriticals.  So, they chose to encode only the lower case characters in this code page.  If it frustrates you, you have something in common with every person in Eastern Europe and Scandinavia in the 1990s ;-).

As PatrikLundell says, we could use a different encoding that allows more than 256 characters.  This would be ideal, but would require a massive change in the source code, unfortunately.  Importantly, even if you change the encoding, the game is still only using the lower case letters with diacriticals.  For example, if you type an email using only lower case letters, we can't wave a magic wand and suddenly display it with correct capitalisation.   So, essentially, there is nothing that can be done unless Toady does the overhaul himself -- which he is unlikely to do because it's basically good enough.  I mean, maybe Dwarfs don't use upper case characters with diacriticals...
Logged

PatrikLundell

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #27 on: September 22, 2017, 03:12:16 am »

Changing the encoding to allow a more complete set of characters (16-32 bits) would allow you to assign codes for characters representing upper case decorated characters that currently are impossible to represent. However, for those characters to actually be used, the code will need to be changed to produce them (although a change of the code to support them would probably allow you to write them yourself in nicknames).
Now, upper/lower case letters follow a logic in how they're used, so programs typically "know" how to convert a lower case letter into an upper case one if it has a lexicon of words written in lower case and is to produce sentences which follow the normal logic of capitalizing the first word (or writing names, e.g. for a tavern, which typically uses capitalization as well). For normal ASCII characters (including balanced code pages for other languages) this typically is done by flipping the bit with the value 32. It can also be noted that character sets typically contain both characters (which have an upper/lower case representation) and non characters (everything else, such as digits, control characters, and special characters, e.g. '&'), and the case flipping logic applies only to characters. With an unbalanced code page the mismatched decorated letters are probably not letters from this point of view.
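
A small Python sketch of the bit-32 trick, using Latin-1 as an example of a "balanced" table and code page 437 as an "unbalanced" one. Both choices are assumptions made for illustration; whether DF converts case this way is not something the sketch can show.

```python
CASE_BIT = 0x20  # the bit with the value 32


def flip_case(code: int) -> int:
    """Toggle the case bit of a single character code."""
    return code ^ CASE_BIT


# Plain ASCII: 'a' (97) becomes 'A' (65).
print(chr(flip_case(ord("a"))))

# Latin-1: é and É sit exactly 32 apart, so the same trick works.
latin1_code = "é".encode("latin-1")[0]
print(bytes([flip_case(latin1_code)]).decode("latin-1"))

# Code page 437: flipping the same bit on é lands on an unrelated letter.
cp437_code = "é".encode("cp437")[0]
print(bytes([flip_case(cp437_code)]).decode("cp437"))
```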

What this boils down to is that if a wider character set was used (assuming Toady used an existing one, and didn't try to make one from scratch), it should also bring with it a small library that contains functions like conversion between upper and lower case, so when a current word (encoded with the wide set) containing decorated non matched lower case characters was poured into a lexicon using the wide encoding, the upper/lower case translation function ought to magically be able to find the capital version for the character, because according to this translation function, the decorated character is actually a character that has an opposite case version.
Still, data structures would have to be overhauled and the strings transferred (although that can probably be done simply by replacing the functions that read them from the UTF encoded raws and store them as the replacement wide "code page" characters).
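
Python strings are already a wide character set with such a case library attached, so the idea can be sketched directly. The word "èle" is a made-up example, the 8-bit storage is assumed to be code page 437, and `fits` is a hypothetical helper.

```python
def fits(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be stored in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


stored = "èle".encode("cp437")  # the word as an 8-bit string
word = stored.decode("cp437")   # poured into a wide string

# The wide set's own case function knows the capital of the decorated letter.
capitalized = word[0].upper() + word[1:]
print(capitalized)

print("fits in cp437:", fits(capitalized, "cp437"))
print("fits in utf-8:", fits(capitalized, "utf-8"))
```

The capitalized word only exists in the wide representation; it cannot be written back into the 8-bit one, which is why the data structures would have to change as well.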
Logged

Starver

  • Bay Watcher
    • View Profile
Re: Fixing accented letters in names
« Reply #28 on: September 22, 2017, 03:30:38 am »

I mean, maybe Dwarfs don't use upper case characters with diacriticals...
That was always my personal headcanon. (Though not my headcannon.)
Logged