Just to give some history, Windows originally used a 16-bit Unicode encoding for its first foray into "Unicode everything". There are still people who feel that this encoding should be (or already is) the standard for Windows. However, it's important to note that the 16-bit Unicode Windows used was UCS2, not UTF-16. The former can only represent the first 64K code points (the Basic Multilingual Plane) and can *not* do any of the extended characters beyond that. There are a lot of advantages to doing this, the main one being that every character is *exactly* 16 bits, while both UTF-8 and UTF-16 are variable-length encodings. The downside is that UCS2 (and especially the Windows version of UCS2 -- because MS is MS and they need their special spice on everything) isn't really used by anybody else. Also, there are things you can't encode in UCS2 (I *think* things like skin tones on emoji and some historic scripts are examples, since those live outside the 16-bit range, but it's been a *very* long time since I was really in the middle of this stuff, so I can't remember).
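To make the fixed-width-vs-variable-width point concrete, here's a quick C++ sketch (entirely my own, nothing to do with DF's actual code, and it assumes C++17, where u8"" literals are plain char):

```cpp
#include <cstdio>
#include <cstring>
#include <string>

int main() {
    // Three "characters":
    //   'A'  (U+0041):  1 byte in UTF-8, one 16-bit unit in UCS2/UTF-16
    //   '☺'  (U+263A):  3 bytes in UTF-8, still one 16-bit unit
    //   '😀' (U+1F600): 4 bytes in UTF-8, a surrogate *pair* (two 16-bit
    //                   units) in UTF-16, and unrepresentable in UCS2
    const char*     utf8  = u8"A\u263A\U0001F600";
    const char16_t* utf16 = u"A\u263A\U0001F600";

    std::printf("UTF-8 bytes:       %zu\n", std::strlen(utf8));                         // 8
    std::printf("UTF-16 code units: %zu\n", std::char_traits<char16_t>::length(utf16)); // 4
    // Three characters, but neither count is 3 -- that's the whole
    // problem in a nutshell.
    return 0;
}
```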
My personal feeling is that switching to UCS2 would probably be a lot easier for DF because you can guarantee character indexing: you just make every character 2 bytes instead of 1 (see the sketch below). The downside is that there are some edge cases for fonts on other systems, which don't tend to operate with this encoding. UTF-8 is basically guaranteed to work everywhere, but it would probably be a much bigger job -- I'm thinking especially of tools like DFHack or Dwarf Therapist, because they likely make assumptions about string length and the number of bytes per character. Having said that, if you restrict the character range to the characters already in use, it gets a lot easier -- at the expense of making the whole exercise a bit of a no-op in practice. If you don't want to use the full range of UTF-8, then you might as well stick with CP437. It's only slightly annoying as it is.
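Here's roughly what I mean about indexing -- a minimal sketch with made-up helper names (nth_char_ucs2 / nth_char_utf8 are mine, not anything from DFHack or Therapist):

```cpp
#include <cstddef>

// Fixed-width (UCS2-style): the nth character is literally buf[n].
char16_t nth_char_ucs2(const char16_t* buf, std::size_t n) {
    return buf[n]; // O(1), no decoding required
}

// Variable-width (UTF-8): you have to walk from the start, skipping
// continuation bytes (the ones shaped like 10xxxxxx), just to find
// where the nth character begins. O(n) instead of O(1).
const char* nth_char_utf8(const char* buf, std::size_t n) {
    while (n > 0 && *buf != '\0') {
        ++buf;                               // step past the lead byte
        while ((*buf & 0xC0) == 0x80) ++buf; // skip continuation bytes
        --n;
    }
    return buf; // points at the start of character n
}
```

With the fixed-width version you get "give me character n" for free; with UTF-8 every such lookup becomes a scan, and that's exactly the kind of assumption the third-party tools would have to unpick.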