
utf8-aware iterator

Status
Not open for further replies.

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
This script lets you iterate over whole characters in a string, even if they are multi-byte.

Example and Comparison:
JASS:
local string test = "aäöo"
local UTF8_Iterator it = UTF8_Iterator.create(test)
local integer i = 0


loop
    exitwhen not it.hasNext()
    call BJDebugMsg(I2S(i) + ": " + it.next())
    set i = i + 1
endloop


set i = 0
loop
    exitwhen i == StringLength(test)
    call BJDebugMsg(I2S(i) + ": " + SubString(test, i, i+1))
    set i = i + 1
endloop

This will print:
Code:
0: a
1: ä
2: ö
3: o

0: a
1:
2:
3:
4:
5: o
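
The same contrast can be reproduced outside JASS. Here is a minimal Python sketch (illustration only, not part of the library) of byte-wise slicing versus lead-byte-aware iteration; it assumes valid UTF-8 input:

```python
# Contrast byte-wise slicing with UTF-8-aware iteration over the
# same string as the JASS example above. Assumes valid UTF-8 input.
test = "aäöo"
raw = test.encode("utf-8")  # 6 bytes: 'ä' and 'ö' take two each

# Naive: one byte at a time (what SubString effectively does).
byte_slices = [raw[i:i + 1] for i in range(len(raw))]

def utf8_chars(data: bytes):
    """Yield whole characters by advancing by the lead byte's length."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:      # 0xxxxxxx: ASCII
            n = 1
        elif b < 0xE0:    # 110xxxxx: 2-byte sequence
            n = 2
        elif b < 0xF0:    # 1110xxxx: 3-byte sequence
            n = 3
        else:             # 11110xxx: 4-byte sequence
            n = 4
        yield data[i:i + n].decode("utf-8")
        i += n

chars = list(utf8_chars(raw))  # ['a', 'ä', 'ö', 'o']
```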

And here is the code:

JASS:
library UTF8 initializer init
globals
    private integer array offset
endglobals
private function hash takes string s returns integer
    return StringHash(s) / 1366202 + 1572
endfunction

public struct Iterator
    private integer length
    private integer off
    private string input
   
    static method create takes string input returns thistype
        local thistype this = allocate()
        set .length = StringLength(input)
        set .input = input
        set .off = 0
        return this
    endmethod
   
    method hasNext takes nothing returns boolean
        return .off < .length
    endmethod
   
    method next takes nothing returns string
        local string ret = SubString(.input, .off, .off+1)
        local integer t = .off + offset[hash(ret)]
        set ret = SubString(.input, .off, t)
        set .off = t
        return ret
    endmethod
endstruct
private function init takes nothing returns nothing
set offset[753] = 1
set offset[703] = 1
set offset[2592] = 1
set offset[2534] = 1
set offset[2146] = 1
set offset[2119] = 1
set offset[2512] = 1
set offset[301] = 1
set offset[1466] = 1
set offset[1681] = 1
set offset[2442] = 1
set offset[372] = 1
set offset[2075] = 1
set offset[694] = 1
set offset[1222] = 1
set offset[1184] = 1
set offset[1599] = 1
set offset[1223] = 1
set offset[2330] = 1
set offset[2459] = 1
set offset[2699] = 1
set offset[2843] = 1
set offset[663] = 1
set offset[1657] = 1
set offset[647] = 1
set offset[207] = 1
set offset[737] = 1
set offset[1676] = 1
set offset[710] = 1
set offset[1671] = 1
set offset[413] = 1
set offset[243] = 1
set offset[375] = 1
set offset[3026] = 1
set offset[2035] = 1
set offset[1573] = 1
set offset[1962] = 1
set offset[1544] = 1
set offset[1982] = 1
set offset[536] = 1
set offset[1426] = 1
set offset[1470] = 1
set offset[1446] = 1
set offset[3017] = 1
set offset[2134] = 1
set offset[3035] = 1
set offset[1374] = 1
set offset[2917] = 1
set offset[1395] = 1
set offset[2400] = 1
set offset[1098] = 1
set offset[947] = 1
set offset[1074] = 1
set offset[882] = 1
set offset[1081] = 1
set offset[1827] = 1
set offset[185] = 1
set offset[493] = 1
set offset[213] = 1
set offset[1779] = 1
set offset[138] = 1
set offset[1824] = 1
set offset[156] = 1
set offset[535] = 1
set offset[80] = 1
set offset[411] = 1
set offset[345] = 1
set offset[431] = 1
set offset[355] = 1
set offset[2009] = 1
set offset[2150] = 1
set offset[1980] = 1
set offset[382] = 1
set offset[2036] = 1
set offset[392] = 1
set offset[2038] = 1
set offset[333] = 1
set offset[2100] = 1
set offset[2080] = 1
set offset[2161] = 1
set offset[2194] = 1
set offset[2193] = 1
set offset[2158] = 1
set offset[2126] = 1
set offset[2175] = 1
set offset[824] = 1
set offset[2262] = 1
set offset[1176] = 1
set offset[511] = 1
set offset[1547] = 1
set offset[137] = 1
set offset[1912] = 1
set offset[2917] = 1
set offset[2516] = 1
set offset[2188] = 1
set offset[2934] = 1
set offset[403] = 1
set offset[411] = 1
set offset[345] = 1
set offset[431] = 1
set offset[355] = 1
set offset[2009] = 1
set offset[2150] = 1
set offset[1980] = 1
set offset[382] = 1
set offset[2036] = 1
set offset[392] = 1
set offset[2038] = 1
set offset[333] = 1
set offset[2100] = 1
set offset[2080] = 1
set offset[2161] = 1
set offset[2194] = 1
set offset[2193] = 1
set offset[2158] = 1
set offset[2126] = 1
set offset[2175] = 1
set offset[824] = 1
set offset[2262] = 1
set offset[1176] = 1
set offset[511] = 1
set offset[1547] = 1
set offset[137] = 1
set offset[1321] = 1
set offset[1895] = 1
set offset[903] = 1
set offset[167] = 1
set offset[186] = 1
set offset[1304] = 1
set offset[1467] = 1
set offset[322] = 1
set offset[1116] = 1
set offset[640] = 1
set offset[2317] = 1
set offset[2640] = 1
set offset[1898] = 1
set offset[1944] = 1
set offset[433] = 1
set offset[3062] = 1
set offset[806] = 1
set offset[2696] = 1
set offset[2777] = 1
set offset[766] = 1
set offset[94] = 1
set offset[1631] = 1
set offset[2418] = 1
set offset[3142] = 1
set offset[2586] = 1
set offset[632] = 1
set offset[2159] = 1
set offset[2517] = 1
set offset[1746] = 1
set offset[1036] = 1
set offset[2836] = 1
set offset[1890] = 1
set offset[2854] = 1
set offset[1862] = 1
set offset[2102] = 1
set offset[774] = 1
set offset[2059] = 1
set offset[744] = 1
set offset[1149] = 1
set offset[800] = 1
set offset[1182] = 1
set offset[19] = 1
set offset[1925] = 1
set offset[58] = 1
set offset[1906] = 1
set offset[871] = 1
set offset[2282] = 1
set offset[1332] = 1
set offset[2303] = 1
set offset[471] = 1
set offset[3086] = 1
set offset[489] = 1
set offset[3061] = 1
set offset[705] = 1
set offset[1680] = 1
set offset[1809] = 1
set offset[1690] = 1
set offset[1847] = 1
set offset[937] = 1
set offset[2689] = 1
set offset[927] = 1
set offset[2518] = 1
set offset[1220] = 1
set offset[2714] = 1
set offset[1151] = 1
set offset[2682] = 1
set offset[1945] = 1
set offset[1887] = 1
set offset[1972] = 1
set offset[1905] = 2
set offset[627] = 2
set offset[972] = 2
set offset[3010] = 2
set offset[1688] = 2
set offset[3018] = 2
set offset[1698] = 2
set offset[484] = 2
set offset[2196] = 2
set offset[475] = 2
set offset[1856] = 2
set offset[1231] = 2
set offset[1148] = 2
set offset[1248] = 2
set offset[1108] = 2
set offset[532] = 2
set offset[671] = 2
set offset[698] = 2
set offset[734] = 2
set offset[2352] = 2
set offset[3077] = 2
set offset[1542] = 2
set offset[3103] = 2
set offset[2085] = 2
set offset[97] = 2
set offset[1609] = 2
set offset[668] = 2
set offset[1849] = 2
set offset[2658] = 2
set offset[1685] = 2
set offset[678] = 2
set offset[2192] = 2
set offset[1101] = 3
set offset[2814] = 3
set offset[1683] = 3
set offset[805] = 3
set offset[2902] = 3
set offset[2869] = 3
set offset[1736] = 3
set offset[883] = 3
set offset[135] = 3
set offset[216] = 3
set offset[1] = 3
set offset[603] = 3
set offset[1150] = 3
set offset[2882] = 3
set offset[36] = 3
set offset[646] = 3
set offset[2789] = 4
set offset[2535] = 4
set offset[88] = 4
set offset[2970] = 4
set offset[2128] = 4
set offset[2523] = 4
set offset[171] = 4
set offset[2885] = 4
set offset[558] = 4
set offset[1240] = 4
set offset[315] = 4
set offset[597] = 4
set offset[1780] = 4
set offset[1375] = 4
set offset[275] = 4
set offset[956] = 4
endfunction

endlibrary
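
The offset table above effectively stores, for every possible byte value, how many bytes a character starting with that byte occupies; the StringHash trick is only needed because JASS cannot index strings by raw byte value. A Python sketch of the same table, keyed by the byte itself (illustration only):

```python
# Rebuild the lead-byte -> sequence-length table that the library's
# init function precomputes, keyed by the raw byte value instead of
# StringHash. Continuation and invalid lead bytes fall back to 1 so
# iteration always advances.
def utf8_length(lead: int) -> int:
    if lead < 0x80:           # 0xxxxxxx: ASCII
        return 1
    if 0xC0 <= lead < 0xE0:   # 110xxxxx: 2-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:   # 1110xxxx: 3-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:   # 11110xxx: 4-byte sequence
        return 4
    return 1                  # continuation/invalid byte: advance one

offset = {b: utf8_length(b) for b in range(256)}
```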
 
Level 13
Joined
Nov 7, 2014
Messages
571
It seems to work.

Why does the offset table have 255 entries instead of
JASS:
  (0b0_0000000 .. 0b0_1111111)
+ (0b110_00000 .. 0b110_11111)
+ (0b1110_0000 .. 0b1110_1111)
+ (0b11110_000 .. 0b11110_111)
= 128 + 32 + 16 + 8  = 184

184 - 1 (the null byte) = 183

Which byte has StringHash(<byte>) / 1366202 + 1572 = 753 (set offset[753] = 1)? The other entries seemed to be the bytes 0x01 .. 0xFE.
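
The lead-byte arithmetic above can be checked mechanically (a quick sanity check, illustration only):

```python
# Count the byte values that can legally start a UTF-8 sequence.
ascii_leads = len(range(0x00, 0x80))   # 0xxxxxxx -> 128
two_byte    = len(range(0xC0, 0xE0))   # 110xxxxx -> 32
three_byte  = len(range(0xE0, 0xF0))   # 1110xxxx -> 16
four_byte   = len(range(0xF0, 0xF8))   # 11110xxx -> 8
total = ascii_leads + two_byte + three_byte + four_byte  # 184
```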
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
I have a feeling the logic is wrong... Nowhere does it check whether the Unicode sequence is valid. As per the Unicode specification, invalid sequences must be processed 1 byte at a time and must either return the code point with the byte's value or a special invalid-character code point, usually U+FFFD.
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
It seems to work.

Why does the offset table have 255 entries instead of
JASS:
  (0b0_0000000 .. 0b0_1111111)
+ (0b110_00000 .. 0b110_11111)
+ (0b1110_0000 .. 0b1110_1111)
+ (0b11110_000 .. 0b11110_111)
= 128 + 32 + 16 + 8  = 184

184 - 1 (the null byte) = 183
Because I was too lazy to filter out ['A' .. 'Z', '/']. It doesn't matter much.

Which byte has StringHash(<byte>) / 1366202 + 1572 = 753 (set offset[753] = 1)? The other entries seemed to be the bytes 0x01 .. 0xFE.

That would be \0. The init is done for 0 to 255. Although I think SStrHash2("") behaves differently than StringHash("")...
I don't think that matters, though, since you hardly get \0 into your JASS strings.

I have a feeling the logic is wrong... Nowhere is it checking if the Unicode sequence is valid or not. As per Unicode specifications, invalid Unicode sequences must be processed 1 byte at a time and must either return the Unicode code point with the byte value or a special invalid Unicode code pointer, usually 0xFFFD.
It's not a Unicode sequence. If you don't input a valid UTF-8-encoded string, that's your problem. Also, it doesn't decode anything; it just iterates over the string multiple bytes at a time.
 
Level 13
Joined
Nov 7, 2014
Messages
571
Because I was too lazy to filter out ['A' .. 'Z', '/']. It doesn't matter much.

Actually, I thought only 183 entries were needed because strings in JASS were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.

The init is done for 0 to 255
So there need to be 256 entries, right?
JASS:
private function init takes nothing returns nothing
set offset[753] = 1 // 0x00
set offset[703] = 1 // 0x01
...
set offset[275] = 4 // 0xFE
set offset[956] = 4 // 0xFF <-- this one is missing?
endfunction
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
Actually, I thought only 183 entries were needed because strings in JASS were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.


So there need to be 256 entries, right?
JASS:
private function init takes nothing returns nothing
set offset[753] = 1 // 0x00
set offset[703] = 1 // 0x01
...
set offset[275] = 4 // 0xFE
set offset[956] = 4 // 0xFF <-- this one is missing?
endfunction
Yeah, I guess. Updated the OP.

RE: validation
We can use the same technique, like so:
for all c matching 0b10xx_xxxx: set valid[hash(c)] = true
Then, when a multi-byte character is found, check valid[hash(SubString(str, .off + i, .off + i + 1))] for 1 <= i < offset[hash(c)].
This should be faster than converting to an integer and then doing bit math.
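
The proposed validation could look like this in Python (a sketch with hypothetical names; the JASS version would use the precomputed valid[hash(...)] table instead of set membership):

```python
# Sketch of the proposed validation: every tail byte of a multi-byte
# sequence must be a continuation byte (0b10xxxxxx). Hypothetical
# names; the JASS version would precompute a valid[hash(byte)] table.
CONTINUATION = {bytes([b]) for b in range(0x80, 0xC0)}

def next_char(data: bytes, off: int, length: int):
    """Return (char, new_offset), or (None, off + 1) on an invalid tail."""
    for i in range(1, length):
        if data[off + i:off + i + 1] not in CONTINUATION:
            return None, off + 1      # invalid: advance a single byte
    # "replace" keeps the sketch robust if the lead byte itself is bad.
    return data[off:off + length].decode("utf-8", "replace"), off + length
```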
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
It's not a unicode sequence. If you don't input valid utf8 encoded string it's your problem.
No, it is the problem of the Unicode decoder, as specified by the Unicode standard and complementary standards.
Also it doesn't decode it, it just iterates over the string multiple bytes at a time.
Except that, according to the Unicode standard, invalid sequences must be advanced 1 byte at a time, with no attempt made to read further bytes past the point where the sequence becomes invalid.
Actually, I thought only 183 entries were needed because strings in jass were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.
Actually, I am guessing JASS uses the local multi-byte code-page encoding. That would explain why WC3 has problems displaying foreign characters, since each regional version of WC3 uses a different multi-byte code page. It is also another reason why updating WC3 appears to be very difficult and buggy for the developers, as each region needed its own branch, which strongly defines the multi-byte encoding used for that region.

Before Unicode, Windows used multi-byte code-page encodings. Each region defined a specific encoding to be used, and applications that were not aware of the different encodings could break in different regions. For example, one could not copy Japanese text onto an English-region computer, because English used the extension pages for character accents as opposed to different symbols. Doing so resulted in the text being mangled, with the impossible-to-represent characters replaced by some programmer-defined default character. If this sounds familiar, it is because that is why different WC3 regions cannot show some symbols, even if technically they can be made to now.

Modern Unicode-aware games such as SC2 and HotS have no problem showing most Unicode symbols, especially from the first primary code page. I think WC3 was made partly Unicode aware, at least for chat messages, as I recall someone saying they can be forced to be visible if one imports a highly custom font; however, I have some doubts about this, and it might simply be changing the used code page.

Here is a C++ iterator snippet I wrote for Simutrans. The open-source game also had incorrect Unicode decoding logic before, hence why I wrote this. The full source code can be found with Google.
Code:
utf32 utf8_decoder_t::decode(utf8 const *const buff, size_t &len) {
	// Implementation derived from RFC 3629.

	// Process character byte.
	size_t i = 0;
	len = 0;
	utf8 const character = buff[i++];
	utf32 cp = 0;
	if(  character <= 0x7F  ) {
		// ASCII character.
		cp = character;
		len = 1;
	} else if(  character < 0xC2  ) {
		// Invalid character.
	} else if(  character <= 0xDF  ) {
		// 2 byte character.
		cp = character & 0x1F;
		len = 2;
	} else if(  character <= 0xEF  ) {
		// 3 byte character.
		if(  !((character == 0xE0 && buff[i] < 0xA0) ||
			(character == 0xED && buff[i] > 0x9F))  ) {
			cp = character & 0xF;
			len = 3;
		}
	} else if(  character <= 0xF4  ) {
		// 4 byte character.
		if(  !((character == 0xF0 && buff[i] < 0x90) ||
			(character == 0xF4 && buff[i] > 0x8F))  ) {
			cp = character & 0x7;
			len = 4;
		}
	} else {
		// Invalid character.
	}

	// Process tail bytes.
	for(  ; i < len ; i++  ) {
		utf8 const tail = buff[i];
		if(  0x80 <= tail && tail <= 0xBF  ) {
			cp <<= 6;
			cp |= tail & 0x3F;
		} else {
			// Invalid tail.
			len = 0;
		}
	}

	if(  len == 0  ) {
		// Replace invalid sequences with code point of the single decoded character (ISO-8859-1).
		len = 1;
		cp = character;
	}

	return cp;
}
Most of the magic numbers used were defined by RFC 3629. They are derived from all invalid Unicode sequences: sequences that should not be decoded and must be advanced 1 byte at a time.

This snippet is robust enough to cope with surrogate pair rejection, overlong rejection, out of range code point rejection and invalid tail rejection.
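
For comparison, the same RFC 3629 rejection rules in a short Python sketch (not the Simutrans code, just an illustration of the same cases, using U+FFFD for invalid input rather than the ISO-8859-1 fallback in the snippet above):

```python
# The same RFC 3629 validity rules as the C++ snippet, as a sketch.
# Invalid sequences advance one byte and yield U+FFFD. Assumes at
# least one byte follows a 3/4-byte lead (as in the C++ version).
REPLACEMENT = 0xFFFD

def decode_one(buf: bytes, i: int = 0):
    """Decode one code point at buf[i]; return (code_point, length)."""
    c = buf[i]
    if c <= 0x7F:                       # ASCII
        return c, 1
    if c < 0xC2 or c > 0xF4:            # stray tail, overlong, > U+10FFFF
        return REPLACEMENT, 1
    if c <= 0xDF:                       # 2-byte sequence
        cp, length = c & 0x1F, 2
    elif c <= 0xEF:                     # 3-byte sequence
        # E0 A0.. floor rejects overlongs; ED ..9F ceiling rejects surrogates.
        if (c == 0xE0 and buf[i + 1] < 0xA0) or (c == 0xED and buf[i + 1] > 0x9F):
            return REPLACEMENT, 1
        cp, length = c & 0x0F, 3
    else:                               # 4-byte sequence
        # F0 90.. floor rejects overlongs; F4 ..8F ceiling caps at U+10FFFF.
        if (c == 0xF0 and buf[i + 1] < 0x90) or (c == 0xF4 and buf[i + 1] > 0x8F):
            return REPLACEMENT, 1
        cp, length = c & 0x07, 4
    for j in range(1, length):          # fold in the tail bytes
        tail = buf[i + j]
        if not 0x80 <= tail <= 0xBF:
            return REPLACEMENT, 1       # invalid tail
        cp = (cp << 6) | (tail & 0x3F)
    return cp, length
```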
 

~El

Level 17
Joined
Jun 13, 2016
Messages
556
Actually I am guessing JASS is in local multi-byte code page encoding. Hence why WC3 has problems displaying foreign characters since each region version of WC3 uses a different multi-byte code page encoding. Another reason why updating WC3 appears to be very difficult and buggy for the developers as each region needed its own branch with strongly defines the multi-byte encoding used for that region.

I am sorry, but this is horseshit. The only reason why WC3 has trouble displaying regional characters is that each language version ships with a different custom font. The English version doesn't support Cyrillic, for example.

If you replace the font in the .mpq with a more modern one with better Unicode support, then WC3 has no trouble at all displaying the characters using that font. I've done that myself for Russian on the English version of WC3, and it works like a charm.

The font doesn't have to be "highly custom" or anything. Any modern UTF-8 compliant font will work, like Georgia.

EDIT: I suspect that despite this, WC3 only ever uses one font for displaying characters. I imagine Japanese/Korean/Chinese characters use completely different fonts from those supporting Cyrillic/Latin/whatever, hence the need for special support in those languages, since WC3 can't fall back to a different font to display those characters.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
I am sorry, but this is horseshit.
Not considering when WC3 was developed. UTF-8 was only defined as it is today in 2003, which means Warcraft III was already in development before UTF-8 standardization. Furthermore, if WC3 did use Unicode, it would likely be in the form of UTF-16, as that is what Windows uses for Unicode natively.
The font doesn't have to be "highly custom" or anything. Any modern UTF-8 compliant font will work, like Georgia.
Fonts have nothing to do with UTF-8 encoding. A font maps a Unicode code point to a glyph. This Unicode code point can come from decoding UTF-8, UTF-16, UTF-32 (native Unicode code point form) or even another local multi-byte encoding that is translated.
EDIT: I suspect that despite this, WC3 only ever uses one font for displaying characters. I imagine Japanese/Korean/Chinese characters use completely different fonts than those supporting cyrillic/latinic/whatever, hence the need for special support in those languages, since WC3 can't fall-back to a different font to display those characters.
If WC3 were fully Unicode aware, this would not matter. The fonts would be mapped at different Unicode code-point ranges.

It is not even a GPU limitation since the game only draws the characters needed onto a GPU texture as seen below.

Seeing how WC3 is being maintained, it is possible that some effort has been made to make Warcraft III fully Unicode aware, similar to all modern Blizzard games. As such, UTF-8 or UTF-16 might be mixed in now, although some of the mechanics are still left over from when local multi-byte encoding was used.
 

Attachments

  • Image_2D_0474_1148.png

~El

Level 17
Joined
Jun 13, 2016
Messages
556
Not considering when WC3 was developed. UTF-8 was only defined as it is today in 2003, which means Warcraft III was already in development before UTF-8 standardization. Furthermore, if WC3 did use Unicode, it would likely be in the form of UTF-16, as that is what Windows uses for Unicode natively.
Fonts have nothing to do with UTF-8 encoding. A font maps a Unicode code point to a glyph. This Unicode code point can come from decoding UTF-8, UTF-16, UTF-32 (native Unicode code point form) or even another local multi-byte encoding that is translated.
If WC3 were fully Unicode aware, this would not matter. The fonts would be mapped at different Unicode code-point ranges.

It is not even a GPU limitation since the game only draws the characters needed onto a GPU texture as seen below.

Seeing how WC3 is being maintained, it is possible that some effort has been made to make Warcraft III fully Unicode aware, similar to all modern Blizzard games. As such, UTF-8 or UTF-16 might be mixed in now, although some of the mechanics are still left over from when local multi-byte encoding was used.

My bad about the UTF semantics. Yet my point still stands: WC3's inability to render at least Cyrillic characters in the English version stems from the simple fact that the WC3 fonts in the English distribution do not support them. The first time I tried it was way back on 1.26, if not 1.24, before any of the modern patches hit the fan. Those are old patches. I don't know about other languages, but Cyrillic and Latin could always be mixed without issues with a proper font installed, and it worked everywhere, with Cyrillic always taking 2 bytes and Latin taking 1. When the font has no Cyrillic glyphs, it simply skips the character, which turns up as a blank. Simply changing the font fixed that.
 