
utf8-aware iterator

Status
Not open for further replies.

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
This script lets you iterate over whole characters in a string, even if they are multi-byte.

Example and Comparison:
JASS:
local string test = "aäöo"
local UTF8_Iterator it = UTF8_Iterator.create(test)
local integer i = 0


loop
    exitwhen not it.hasNext()
    call BJDebugMsg(I2S(i) + ": " + it.next())
    set i = i + 1
endloop


set i = 0
loop
    exitwhen i == StringLength(test)
    call BJDebugMsg(I2S(i) + ": " + SubString(test, i, i+1))
    set i = i + 1
endloop

This will print:
Code:
0: a
1: ä
2: ö
3: o

0: a
1:
2:
3:
4:
5: o
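
The same contrast can be reproduced outside JASS. Here is a minimal Python sketch (illustration only, not part of the library) of byte-wise slicing versus lead-byte-aware iteration; it assumes valid UTF-8 input:

```python
# Contrast byte-wise slicing with UTF-8-aware iteration over the
# same string as the JASS example above. Assumes valid UTF-8 input.
test = "aäöo"
raw = test.encode("utf-8")  # 6 bytes: 'ä' and 'ö' take two each

# Naive: one byte at a time (what SubString effectively does).
byte_slices = [raw[i:i + 1] for i in range(len(raw))]

def utf8_chars(data: bytes):
    """Yield whole characters by advancing by the lead byte's length."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:      # 0xxxxxxx: ASCII
            n = 1
        elif b < 0xE0:    # 110xxxxx: 2-byte sequence
            n = 2
        elif b < 0xF0:    # 1110xxxx: 3-byte sequence
            n = 3
        else:             # 11110xxx: 4-byte sequence
            n = 4
        yield data[i:i + n].decode("utf-8")
        i += n

chars = list(utf8_chars(raw))  # ['a', 'ä', 'ö', 'o']
```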

And here is the code:

JASS:
library UTF8 initializer init
globals
    private integer array offset
endglobals
private function hash takes string s returns integer
    return StringHash(s) / 1366202 + 1572
endfunction

public struct Iterator
    private integer length
    private integer off
    private string input
   
    static method create takes string input returns thistype
        local thistype this = allocate()
        set .length = StringLength(input)
        set .input = input
        set .off = 0
        return this
    endmethod
   
    method hasNext takes nothing returns boolean
        return .off < .length
    endmethod
   
    method next takes nothing returns string
        local string ret = SubString(.input, .off, .off+1)
        local integer t = .off + offset[hash(ret)]
        set ret = SubString(.input, .off, t)
        set .off = t
        return ret
    endmethod
endstruct
private function init takes nothing returns nothing
set offset[753] = 1
set offset[703] = 1
set offset[2592] = 1
set offset[2534] = 1
set offset[2146] = 1
set offset[2119] = 1
set offset[2512] = 1
set offset[301] = 1
set offset[1466] = 1
set offset[1681] = 1
set offset[2442] = 1
set offset[372] = 1
set offset[2075] = 1
set offset[694] = 1
set offset[1222] = 1
set offset[1184] = 1
set offset[1599] = 1
set offset[1223] = 1
set offset[2330] = 1
set offset[2459] = 1
set offset[2699] = 1
set offset[2843] = 1
set offset[663] = 1
set offset[1657] = 1
set offset[647] = 1
set offset[207] = 1
set offset[737] = 1
set offset[1676] = 1
set offset[710] = 1
set offset[1671] = 1
set offset[413] = 1
set offset[243] = 1
set offset[375] = 1
set offset[3026] = 1
set offset[2035] = 1
set offset[1573] = 1
set offset[1962] = 1
set offset[1544] = 1
set offset[1982] = 1
set offset[536] = 1
set offset[1426] = 1
set offset[1470] = 1
set offset[1446] = 1
set offset[3017] = 1
set offset[2134] = 1
set offset[3035] = 1
set offset[1374] = 1
set offset[2917] = 1
set offset[1395] = 1
set offset[2400] = 1
set offset[1098] = 1
set offset[947] = 1
set offset[1074] = 1
set offset[882] = 1
set offset[1081] = 1
set offset[1827] = 1
set offset[185] = 1
set offset[493] = 1
set offset[213] = 1
set offset[1779] = 1
set offset[138] = 1
set offset[1824] = 1
set offset[156] = 1
set offset[535] = 1
set offset[80] = 1
set offset[411] = 1
set offset[345] = 1
set offset[431] = 1
set offset[355] = 1
set offset[2009] = 1
set offset[2150] = 1
set offset[1980] = 1
set offset[382] = 1
set offset[2036] = 1
set offset[392] = 1
set offset[2038] = 1
set offset[333] = 1
set offset[2100] = 1
set offset[2080] = 1
set offset[2161] = 1
set offset[2194] = 1
set offset[2193] = 1
set offset[2158] = 1
set offset[2126] = 1
set offset[2175] = 1
set offset[824] = 1
set offset[2262] = 1
set offset[1176] = 1
set offset[511] = 1
set offset[1547] = 1
set offset[137] = 1
set offset[1912] = 1
set offset[2917] = 1
set offset[2516] = 1
set offset[2188] = 1
set offset[2934] = 1
set offset[403] = 1
set offset[411] = 1
set offset[345] = 1
set offset[431] = 1
set offset[355] = 1
set offset[2009] = 1
set offset[2150] = 1
set offset[1980] = 1
set offset[382] = 1
set offset[2036] = 1
set offset[392] = 1
set offset[2038] = 1
set offset[333] = 1
set offset[2100] = 1
set offset[2080] = 1
set offset[2161] = 1
set offset[2194] = 1
set offset[2193] = 1
set offset[2158] = 1
set offset[2126] = 1
set offset[2175] = 1
set offset[824] = 1
set offset[2262] = 1
set offset[1176] = 1
set offset[511] = 1
set offset[1547] = 1
set offset[137] = 1
set offset[1321] = 1
set offset[1895] = 1
set offset[903] = 1
set offset[167] = 1
set offset[186] = 1
set offset[1304] = 1
set offset[1467] = 1
set offset[322] = 1
set offset[1116] = 1
set offset[640] = 1
set offset[2317] = 1
set offset[2640] = 1
set offset[1898] = 1
set offset[1944] = 1
set offset[433] = 1
set offset[3062] = 1
set offset[806] = 1
set offset[2696] = 1
set offset[2777] = 1
set offset[766] = 1
set offset[94] = 1
set offset[1631] = 1
set offset[2418] = 1
set offset[3142] = 1
set offset[2586] = 1
set offset[632] = 1
set offset[2159] = 1
set offset[2517] = 1
set offset[1746] = 1
set offset[1036] = 1
set offset[2836] = 1
set offset[1890] = 1
set offset[2854] = 1
set offset[1862] = 1
set offset[2102] = 1
set offset[774] = 1
set offset[2059] = 1
set offset[744] = 1
set offset[1149] = 1
set offset[800] = 1
set offset[1182] = 1
set offset[19] = 1
set offset[1925] = 1
set offset[58] = 1
set offset[1906] = 1
set offset[871] = 1
set offset[2282] = 1
set offset[1332] = 1
set offset[2303] = 1
set offset[471] = 1
set offset[3086] = 1
set offset[489] = 1
set offset[3061] = 1
set offset[705] = 1
set offset[1680] = 1
set offset[1809] = 1
set offset[1690] = 1
set offset[1847] = 1
set offset[937] = 1
set offset[2689] = 1
set offset[927] = 1
set offset[2518] = 1
set offset[1220] = 1
set offset[2714] = 1
set offset[1151] = 1
set offset[2682] = 1
set offset[1945] = 1
set offset[1887] = 1
set offset[1972] = 1
set offset[1905] = 2
set offset[627] = 2
set offset[972] = 2
set offset[3010] = 2
set offset[1688] = 2
set offset[3018] = 2
set offset[1698] = 2
set offset[484] = 2
set offset[2196] = 2
set offset[475] = 2
set offset[1856] = 2
set offset[1231] = 2
set offset[1148] = 2
set offset[1248] = 2
set offset[1108] = 2
set offset[532] = 2
set offset[671] = 2
set offset[698] = 2
set offset[734] = 2
set offset[2352] = 2
set offset[3077] = 2
set offset[1542] = 2
set offset[3103] = 2
set offset[2085] = 2
set offset[97] = 2
set offset[1609] = 2
set offset[668] = 2
set offset[1849] = 2
set offset[2658] = 2
set offset[1685] = 2
set offset[678] = 2
set offset[2192] = 2
set offset[1101] = 3
set offset[2814] = 3
set offset[1683] = 3
set offset[805] = 3
set offset[2902] = 3
set offset[2869] = 3
set offset[1736] = 3
set offset[883] = 3
set offset[135] = 3
set offset[216] = 3
set offset[1] = 3
set offset[603] = 3
set offset[1150] = 3
set offset[2882] = 3
set offset[36] = 3
set offset[646] = 3
set offset[2789] = 4
set offset[2535] = 4
set offset[88] = 4
set offset[2970] = 4
set offset[2128] = 4
set offset[2523] = 4
set offset[171] = 4
set offset[2885] = 4
set offset[558] = 4
set offset[1240] = 4
set offset[315] = 4
set offset[597] = 4
set offset[1780] = 4
set offset[1375] = 4
set offset[275] = 4
set offset[956] = 4
endfunction

endlibrary
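
The offset table above effectively stores, for every possible byte value, how many bytes a character starting with that byte occupies; the StringHash trick is only needed because JASS cannot index strings by raw byte value. A Python sketch of the same table, keyed by the byte itself (illustration only):

```python
# Rebuild the lead-byte -> sequence-length table that the library's
# init function precomputes, keyed by the raw byte value instead of
# StringHash. Continuation and invalid lead bytes fall back to 1 so
# iteration always advances.
def utf8_length(lead: int) -> int:
    if lead < 0x80:           # 0xxxxxxx: ASCII
        return 1
    if 0xC0 <= lead < 0xE0:   # 110xxxxx: 2-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:   # 1110xxxx: 3-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:   # 11110xxx: 4-byte sequence
        return 4
    return 1                  # continuation/invalid byte: advance one

offset = {b: utf8_length(b) for b in range(256)}
```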
 
Level 13
Joined
Nov 7, 2014
Messages
571
It seems to work.

Why does the offset table have 255 entries instead of
JASS:
  (0b0_0000000 .. 0b0_1111111)
+ (0b110_00000 .. 0b110_11111)
+ (0b1110_0000 .. 0b1110_1111)
+ (0b11110_000 .. 0b11110_111)
= 128 + 32 + 16 + 8  = 184

184 - 1 (the null byte) = 183

Which byte has StringHash(<byte>) / 1366202 + 1572 = 753 (set offset[753] = 1)? The other entries seemed to be the bytes 0x01 .. 0xFE.
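
The lead-byte arithmetic above can be checked mechanically (a quick sanity check, illustration only):

```python
# Count the byte values that can legally start a UTF-8 sequence.
ascii_leads = len(range(0x00, 0x80))   # 0xxxxxxx -> 128
two_byte    = len(range(0xC0, 0xE0))   # 110xxxxx -> 32
three_byte  = len(range(0xE0, 0xF0))   # 1110xxxx -> 16
four_byte   = len(range(0xF0, 0xF8))   # 11110xxx -> 8
total = ascii_leads + two_byte + three_byte + four_byte  # 184
```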
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
I have a feeling the logic is wrong... Nowhere does it check whether the Unicode sequence is valid. As per the Unicode specification, invalid sequences must be processed 1 byte at a time and must either return the code point with the byte's value or a special invalid-character code point, usually U+FFFD.
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
It seems to work.

Why does the offset table have 255 entries instead of
JASS:
  (0b0_0000000 .. 0b0_1111111)
+ (0b110_00000 .. 0b110_11111)
+ (0b1110_0000 .. 0b1110_1111)
+ (0b11110_000 .. 0b11110_111)
= 128 + 32 + 16 + 8  = 184

184 - 1 (the null byte) = 183
Because I was too lazy to filter out ['A' .. 'Z', '/']. It doesn't matter much.

Which byte has StringHash(<byte>) / 1366202 + 1572 = 753 (set offset[753] = 1)? The other entries seemed to be the bytes 0x01 .. 0xFE.

That would be \0. The init is done for 0 to 255. Although I think SStrHash2("") behaves differently than StringHash("")...
I don't think that matters, though, since you hardly get \0 into your JASS strings.

I have a feeling the logic is wrong... Nowhere is it checking if the Unicode sequence is valid or not. As per Unicode specifications, invalid Unicode sequences must be processed 1 byte at a time and must either return the Unicode code point with the byte value or a special invalid Unicode code pointer, usually 0xFFFD.
It's not a Unicode sequence. If you don't input a valid UTF-8-encoded string, that's your problem. Also, it doesn't decode anything; it just iterates over the string multiple bytes at a time.
 
Level 13
Joined
Nov 7, 2014
Messages
571
Because I was too lazy to filter out ['A' .. 'Z', '/']. It doesn't matter much.

Actually, I thought only 183 entries were needed because strings in JASS were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.

The init is done for 0 to 255
So there need to be 256 entries, right?
JASS:
private function init takes nothing returns nothing
set offset[753] = 1 // 0x00
set offset[703] = 1 // 0x01
...
set offset[275] = 4 // 0xFE
set offset[956] = 4 // 0xFF <-- this one is missing?
endfunction
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
Actually, I thought only 183 entries were needed because strings in JASS were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.


So there need to be 256 entries, right?
JASS:
private function init takes nothing returns nothing
set offset[753] = 1 // 0x00
set offset[703] = 1 // 0x01
...
set offset[275] = 4 // 0xFE
set offset[956] = 4 // 0xFF <-- this one is missing?
endfunction
Yeah, I guess. Updated the OP.

RE: validation
We can use the same technique, like so:
for all c matching 0b10xx_xxxx: set valid[hash(c)] = true
Then, when a multi-byte character is found, check valid[hash(SubString(str, .off + i, .off + i + 1))] for 1 <= i < offset[hash(c)].
This should be faster than converting to an integer and then doing bit math.
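
The proposed validation could look like this in Python (a sketch with hypothetical names; the JASS version would use the precomputed valid[hash(...)] table instead of set membership):

```python
# Sketch of the proposed validation: every tail byte of a multi-byte
# sequence must be a continuation byte (0b10xxxxxx). Hypothetical
# names; the JASS version would precompute a valid[hash(byte)] table.
CONTINUATION = {bytes([b]) for b in range(0x80, 0xC0)}

def next_char(data: bytes, off: int, length: int):
    """Return (char, new_offset), or (None, off + 1) on an invalid tail."""
    for i in range(1, length):
        if data[off + i:off + i + 1] not in CONTINUATION:
            return None, off + 1      # invalid: advance a single byte
    # "replace" keeps the sketch robust if the lead byte itself is bad.
    return data[off:off + length].decode("utf-8", "replace"), off + length
```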
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
It's not a unicode sequence. If you don't input valid utf8 encoded string it's your problem.
No, it is the problem of the Unicode decoder, as specified by the Unicode standard and complementary standards.
Also it doesn't decode it, it just iterates over the string multiple bytes at a time.
Except that, according to the Unicode standard, invalid sequences must be advanced 1 byte at a time, with no attempt made to read further bytes past the point where the sequence becomes invalid.
Actually, I thought only 183 entries were needed because strings in jass were guaranteed to be valid UTF-8, but that's not true. One can easily SubString on bytes and concatenate an invalid UTF-8 byte sequence, I suppose.
Actually, I am guessing JASS uses the local multi-byte code-page encoding. That would explain why WC3 has problems displaying foreign characters, since each regional version of WC3 uses a different multi-byte code page. It is also another reason why updating WC3 appears to be very difficult and buggy for the developers, as each region needed its own branch, which strongly defines the multi-byte encoding used for that region.

Before Unicode, Windows used multi-byte code-page encodings. Each region defined a specific encoding to be used, and applications that were not aware of the different encodings could break in different regions. For example, one could not copy Japanese text onto an English-region computer, because English used the extension pages for character accents as opposed to different symbols. Doing so resulted in the text being mangled, with the impossible-to-represent characters replaced by some programmer-defined default character. If this sounds familiar, it is because that is why different WC3 regions cannot show some symbols, even if technically they can be made to now.

Modern Unicode-aware games such as SC2 and HotS have no problem showing most Unicode symbols, especially from the first primary code page. I think WC3 was made partly Unicode aware, at least for chat messages, as I recall someone saying they can be forced to be visible if one imports a highly custom font; however, I have some doubts about this, and it might simply be changing the used code page.

Here is a C++ iterator snippet I wrote for Simutrans. The open-source game also had incorrect Unicode decoding logic before, hence why I wrote this. The full source code can be found with Google.
Code:
utf32 utf8_decoder_t::decode(utf8 const *const buff, size_t &len) {
	// Implementation derived from RFC 3629.

	// Process character byte.
	size_t i = 0;
	len = 0;
	utf8 const character = buff[i++];
	utf32 cp = 0;
	if(  character <= 0x7F  ) {
		// ASCII character.
		cp = character;
		len = 1;
	} else if(  character < 0xC2  ) {
		// Invalid character.
	} else if(  character <= 0xDF  ) {
		// 2 byte character.
		cp = character & 0x1F;
		len = 2;
	} else if(  character <= 0xEF  ) {
		// 3 byte character.
		if(  !((character == 0xE0 && buff[i] < 0xA0) ||
			(character == 0xED && buff[i] > 0x9F))  ) {
			cp = character & 0xF;
			len = 3;
		}
	} else if(  character <= 0xF4  ) {
		// 4 byte character.
		if(  !((character == 0xF0 && buff[i] < 0x90) ||
			(character == 0xF4 && buff[i] > 0x8F))  ) {
			cp = character & 0x7;
			len = 4;
		}
	} else {
		// Invalid character.
	}

	// Process tail bytes.
	for(  ; i < len ; i++  ) {
		utf8 const tail = buff[i];
		if(  0x80 <= tail && tail <= 0xBF  ) {
			cp <<= 6;
			cp |= tail & 0x3F;
		} else {
			// Invalid tail.
			len = 0;
		}
	}

	if(  len == 0  ) {
		// Replace invalid sequences with code point of the single decoded character (ISO-8859-1).
		len = 1;
		cp = character;
	}

	return cp;
}
Most of the magic numbers used were defined by RFC 3629. They are derived from all invalid Unicode sequences: sequences that should not be decoded and must be advanced 1 byte at a time.

This snippet is robust enough to cope with surrogate pair rejection, overlong rejection, out of range code point rejection and invalid tail rejection.
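
For comparison, the same RFC 3629 rejection rules in a short Python sketch (not the Simutrans code, just an illustration of the same cases, using U+FFFD for invalid input rather than the ISO-8859-1 fallback in the snippet above):

```python
# The same RFC 3629 validity rules as the C++ snippet, as a sketch.
# Invalid sequences advance one byte and yield U+FFFD. Assumes at
# least one byte follows a 3/4-byte lead (as in the C++ version).
REPLACEMENT = 0xFFFD

def decode_one(buf: bytes, i: int = 0):
    """Decode one code point at buf[i]; return (code_point, length)."""
    c = buf[i]
    if c <= 0x7F:                       # ASCII
        return c, 1
    if c < 0xC2 or c > 0xF4:            # stray tail, overlong, > U+10FFFF
        return REPLACEMENT, 1
    if c <= 0xDF:                       # 2-byte sequence
        cp, length = c & 0x1F, 2
    elif c <= 0xEF:                     # 3-byte sequence
        # E0 A0.. floor rejects overlongs; ED ..9F ceiling rejects surrogates.
        if (c == 0xE0 and buf[i + 1] < 0xA0) or (c == 0xED and buf[i + 1] > 0x9F):
            return REPLACEMENT, 1
        cp, length = c & 0x0F, 3
    else:                               # 4-byte sequence
        # F0 90.. floor rejects overlongs; F4 ..8F ceiling caps at U+10FFFF.
        if (c == 0xF0 and buf[i + 1] < 0x90) or (c == 0xF4 and buf[i + 1] > 0x8F):
            return REPLACEMENT, 1
        cp, length = c & 0x07, 4
    for j in range(1, length):          # fold in the tail bytes
        tail = buf[i + j]
        if not 0x80 <= tail <= 0xBF:
            return REPLACEMENT, 1       # invalid tail
        cp = (cp << 6) | (tail & 0x3F)
    return cp, length
```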
 

~El

Level 17
Joined
Jun 13, 2016
Messages
556
Actually I am guessing JASS is in local multi-byte code page encoding. Hence why WC3 has problems displaying foreign characters since each region version of WC3 uses a different multi-byte code page encoding. Another reason why updating WC3 appears to be very difficult and buggy for the developers as each region needed its own branch with strongly defines the multi-byte encoding used for that region.

I am sorry, but this is horseshit. The only reason why WC3 has trouble displaying regional characters is that each language version ships with a different custom font. The English version doesn't support Cyrillic, for example.

If you replace the font in the .mpq with a more modern one with better Unicode support, then WC3 has no trouble at all displaying the characters using that font. I've done that myself for Russian on the English version of WC3, and it works like a charm.

The font doesn't have to be "highly custom" or anything. Any modern UTF-8 compliant font will work, like Georgia.

EDIT: I suspect that despite this, WC3 only ever uses one font for displaying characters. I imagine Japanese/Korean/Chinese characters use completely different fonts from those supporting Cyrillic/Latin/whatever, hence the need for special support in those languages, since WC3 can't fall back to a different font to display those characters.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
I am sorry, but this is horseshit.
Not considering when WC3 was developed. UTF-8 was only defined as it is today in 2003, which means Warcraft III was already in development before UTF-8 standardization. Furthermore, if WC3 did use Unicode, it would likely be in the form of UTF-16, as that is what Windows uses for Unicode natively.
The font doesn't have to be "highly custom" or anything. Any modern UTF-8 compliant font will work, like Georgia.
Fonts have nothing to do with UTF-8 encoding. A font maps a Unicode code point to a glyph. This Unicode code point can come from decoding UTF-8, UTF-16, UTF-32 (native Unicode code point form) or even another local multi-byte encoding that is translated.
EDIT: I suspect that despite this, WC3 only ever uses one font for displaying characters. I imagine Japanese/Korean/Chinese characters use completely different fonts than those supporting cyrillic/latinic/whatever, hence the need for special support in those languages, since WC3 can't fall-back to a different font to display those characters.
If WC3 were fully Unicode aware, this would not matter. The fonts would be mapped at different Unicode code-point ranges.

It is not even a GPU limitation since the game only draws the characters needed onto a GPU texture as seen below.

Seeing how WC3 is being maintained, it is possible that some effort has been made to make Warcraft III fully Unicode aware, similar to all modern Blizzard games. As such, UTF-8 or UTF-16 might be mixed in now, although some of the mechanics are still left over from when local multi-byte encoding was used.
 

Attachments

  • Image_2D_0474_1148.png

~El

Level 17
Joined
Jun 13, 2016
Messages
556
Not considering when WC3 was developed. UTF-8 was only defined as it is today in 2003, which means Warcraft III was already in development before UTF-8 standardization. Furthermore, if WC3 did use Unicode, it would likely be in the form of UTF-16, as that is what Windows uses for Unicode natively.
Fonts have nothing to do with UTF-8 encoding. A font maps a Unicode code point to a glyph. This Unicode code point can come from decoding UTF-8, UTF-16, UTF-32 (native Unicode code point form) or even another local multi-byte encoding that is translated.
If WC3 were fully Unicode aware, this would not matter. The fonts would be mapped at different Unicode code-point ranges.

It is not even a GPU limitation since the game only draws the characters needed onto a GPU texture as seen below.

Seeing how WC3 is being maintained, it is possible that some effort has been made to make Warcraft III fully Unicode aware, similar to all modern Blizzard games. As such, UTF-8 or UTF-16 might be mixed in now, although some of the mechanics are still left over from when local multi-byte encoding was used.

My bad about the UTF semantics. Yet my point still stands: WC3's inability to render at least Cyrillic characters in the English version stems from the simple fact that the WC3 fonts in the English distribution do not support them. The first time I tried it was way back on 1.26, if not 1.24, before any of the modern patches hit the fan. Those are old patches. I don't know about other languages, but Cyrillic and Latin could always be mixed without issues with a proper font installed, and it worked everywhere, with Cyrillic always taking 2 bytes and Latin taking 1. When the font has no Cyrillic glyphs, it simply skips the character, which turns up as a blank. Simply changing the font fixed that.
 