Count characters in strings, not bytes

darkravenbest · Dec 22, 2022

Hello!

I have a really basic question for you today. How do I count the actual characters in a string? The only function I know is StringLength(), which returns the string's size in bytes. That is fine when we deal with digits or English letters, but when the string contains something foreign, it counts "2" or "3" instead of "1" for each such character. Is there a function for that? I searched the whole Hive and had no luck finding a solution.

Thank you.

Beeša · Dec 22, 2022

Hello, from what I've gathered the StringLength() function returns the byte length of a string ([Documentation] String Type). I also was not able to find a solution or think of any way to count the character length of a string. So I guess I'm just bumping this one up.
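
To make the difference concrete outside of JASS, here is a minimal Python sketch. It assumes, as reported above, that StringLength() counts UTF-8 bytes; Python is used only because it exposes both counts directly, and the variable names are illustrative.

```python
# The example string "aäöo", written with escapes so the encoding is unambiguous.
s = "a\u00e4\u00f6o"

# Size of the UTF-8 encoding in bytes (what a byte-based length would report).
byte_length = len(s.encode("utf-8"))

# Python's len() on a str counts Unicode code points instead.
code_points = len(s)

print(byte_length)  # 6: 'ä' and 'ö' take two bytes each
print(code_points)  # 4
```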

Luashine · Dec 23, 2022

There's no such thing as a single character that would be easy to determine: https://utf8everywhere.org/#characters

darkravenbest · Dec 23, 2022

Luashine said:
There's no such thing as a single character that would be easy to determine: https://utf8everywhere.org/#characters

But in other languages (Visual Basic, for example) there are functions that return the actual number of characters in a string no matter which characters it contains, and I'm asking about something similar in JASS.

LeP · Dec 23, 2022

I did write a UTF-8-aware string iterator which uses StringHash. But for some bloody reason Blizzard changed StringHash for multi-byte characters. I don't know if it still works in newer patches. I think @Frotty made some adjustments for Wurst, but I don't know for certain. I think if you were to recreate the table it could still work for newer patches. Again, I will repeat my request to reverse it if someone can dump the new asm for me.

Frotty · Dec 23, 2022

@LeP Yes, I use a variant similar to yours, checking the hash of a partial multi-byte string, which returns the hash 1843378377 for most multi-byte chars.

Wurst:

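    function next() returns string
        // Read a single byte at the current position.
        var val = s.substring(currentpos, currentpos + 1)
        // This hash marks a partial multi-byte char, so take two bytes instead.
        if ENABLE_MULTIBYTE_SUPPORT and val.getHash() == 1843378377
            val = s.substring(currentpos, currentpos + 2)
            currentpos += 2
        else
            currentpos++
        return val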

This doesn't catch everything: there are some other multi-byte chars that it misses, so a complete table would be nice to have. But it works for the most part.
You can configure ENABLE_MULTIBYTE_SUPPORT = true for the stdlib to work with multi byte chars.

darkravenbest · Dec 23, 2022

LeP said:
I did write a UTF-8-aware string iterator which uses StringHash. But for some bloody reason Blizzard changed StringHash for multi-byte characters. I don't know if it still works in newer patches. I think @Frotty made some adjustments for Wurst, but I don't know for certain. I think if you were to recreate the table it could still work for newer patches. Again, I will repeat my request to reverse it if someone can dump the new asm for me.

Well, since when has this changed? If you mean after Reforged, that is fine, because I'm working on 1.31; but if you mean something like 1.29, it surely does affect me. How am I supposed to call this function to simply get the actual length of a string? Because when I try to use it like

Code:

set udg_Variable = UTF8_Iterator.create(udg_String)

It seems to return an ever-increasing number each time I call it.

darkravenbest · Dec 23, 2022

Frotty said:
@LeP Yes, I use a variant similar to yours, checking the hash of a partial multi-byte string, which returns the hash 1843378377 for most multi-byte chars.

Wurst:

    function next() returns string
        var val = s.substring(currentpos, currentpos + 1)
        if ENABLE_MULTIBYTE_SUPPORT and val.getHash() == 1843378377
            val = s.substring(currentpos, currentpos + 2)
            currentpos += 2
        else
            currentpos++
        return val

This doesn't catch everything: there are some other multi-byte chars that it misses, so a complete table would be nice to have. But it works for the most part.
You can configure ENABLE_MULTIBYTE_SUPPORT = true for the stdlib to work with multi byte chars.

Thank you, but can you attach the whole method, please?

LeP · Dec 23, 2022

darkravenbest said:
Well, since when has this changed? If you mean after Reforged, that is fine, because I'm working on 1.31; but if you mean something like 1.29, it surely does affect me. How am I supposed to call this function to simply get the actual length of a string? Because when I try to use it like

Code:

set udg_Variable = UTF8_Iterator.create(udg_String)

It seems to return an ever-increasing number each time I call it.

It returns a vJass struct, which is internally represented as an integer. If you don't know what structs are, there are plenty of tutorials and the jasshelpermanual.html (somewhere). This struct exposes two methods one can use: next and hasNext. The very first code block in my thread shows how to use them in tandem. To count the characters you would loop over the string using next and hasNext and simply increment an integer. The result would be the number of characters.

Also according to the jassbot it should still work in 1.29
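
For readers who do not use vJass, the counting loop described above could be sketched in Python as follows. The `Utf8Iterator` class and `count_characters` function are hypothetical stand-ins written for this illustration; they are not the library's code, and only the hasNext/next pattern is taken from the post.

```python
class Utf8Iterator:
    """Hypothetical stand-in for a hasNext/next style string iterator."""

    def __init__(self, text: str) -> None:
        self._chars = list(text)  # one entry per code point
        self._pos = 0

    def has_next(self) -> bool:
        # True while there are characters left to visit.
        return self._pos < len(self._chars)

    def next(self) -> str:
        # Return the current character and advance by one.
        ch = self._chars[self._pos]
        self._pos += 1
        return ch


def count_characters(text: str) -> int:
    """Count characters by walking the iterator and incrementing an integer."""
    it = Utf8Iterator(text)
    count = 0
    while it.has_next():
        it.next()
        count += 1
    return count


print(count_characters("a\u00e4\u00f6o"))  # 4
```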

darkravenbest · Dec 23, 2022

LeP said:
It returns a vJass struct, which is internally represented as an integer. If you don't know what structs are, there are plenty of tutorials and the jasshelpermanual.html (somewhere). This struct exposes two methods one can use: next and hasNext. The very first code block in my thread shows how to use them in tandem. To count the characters you would loop over the string using next and hasNext and simply increment an integer. The result would be the number of characters.

Also according to the jassbot it should still work in 1.29

Sadly it doesn't work for 1.31, as you said. When I ran the example you provided with your library on the example string "aäöo", it showed a long list of four-digit numbers ending with 3768. However, it works correctly with standard English text, which leads me to assume that the cause is the StringHash behavior change you mentioned. I think recreating the table to fix this is not an easy task, am I right?

Dr Super Good · Dec 24, 2022

Resolving string character length is very difficult, which is why even in actual programming it is often avoided, and the buffer byte length or Unicode code point count is usually good enough instead.

This is because many Unicode code points can be combined to form a single glyph/character. Some languages depend heavily on this feature.

If UTF-8 is used internally, which I hope is the case in Reforged, then a Unicode code point can be 1, 2, 3 or 4 bytes long. The approach to resolving a single code point might involve iteratively trying substrings between 1 and 4 bytes long until something valid is returned. UTF-8 was designed to fail gracefully when trying to resolve incomplete byte sequences.
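
A small Python sketch of that point, using only the standard `unicodedata` module: the same visible character "é" can be stored as one code point or as a base letter plus a combining accent, so even a code point count does not always match what a reader would call one character. The variable names are illustrative.

```python
import unicodedata

# 'é' as a single precomposed code point (U+00E9).
composed = "\u00e9"

# The same character decomposed into 'e' followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))                    # 1 code point
print(len(decomposed))                  # 2 code points, still one visible character
print(len(decomposed.encode("utf-8")))  # 3 bytes in UTF-8
```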
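
Where raw bytes are accessible, the 1 to 4 byte lengths can be read straight from the first byte of each sequence. The Python sketch below assumes valid UTF-8 input, and the function names are made up for this example. JASS, as far as this thread shows, does not expose raw bytes this way, which is why the approaches above probe substrings or hashes instead.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC0 <= lead < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4  # 11110xxx
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 data by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        i += utf8_sequence_length(data[i])
        count += 1
    return count


print(count_code_points("a\u00e4\u00f6o".encode("utf-8")))  # 4
```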

Frotty · Dec 24, 2022

darkravenbest said:
Thank you, but can you attach the whole method, please?

That method is complete, as evidenced by the final return statement, after which no code may follow. It demonstrates one way to detect a multi-byte char.

darkravenbest · Dec 25, 2022

Dr Super Good said:
Resolving string character length is very difficult, which is why even in actual programming it is often avoided, and the buffer byte length or Unicode code point count is usually good enough instead.

This is because many Unicode code points can be combined to form a single glyph/character. Some languages depend heavily on this feature.

If UTF-8 is used internally, which I hope is the case in Reforged, then a Unicode code point can be 1, 2, 3 or 4 bytes long. The approach to resolving a single code point might involve iteratively trying substrings between 1 and 4 bytes long until something valid is returned. UTF-8 was designed to fail gracefully when trying to resolve incomplete byte sequences.

I didn't dive so deep into operations with symbols. I just tried to project my experience with other languages, where I have the luxury of using something like len() to get the actual character count, onto the World Editor. Maybe I overlooked something, but I used it with simple strings of text or key codes without complex glyphs and didn't encounter any difficulties, so I thought that maybe there is something similar in Warcraft 3 and I just can't find it. I need something similar here, just to return the actual character count in a unit's tooltip, but sadly in my case it can contain a lot of foreign characters, because I want my method to work for different languages. Of course I came up with a crude workaround, but I want to find something proper, like a method that just returns the actual string size for simple text. I don't know whether something like "In" should count as a glyph, but I don't think it would cause me serious problems.

Frotty said:
That method is complete, as evidenced by the final return statement, after which no code may follow. It demonstrates one way to detect a multi-byte char.

Alright, thank you. I will try this out.

Dr Super Good · Dec 26, 2022

darkravenbest said:
where I have the luxury of using something like len() to get the actual character count

Again, in most modern languages this returns either the underlying byte length or the Unicode code unit or code point count. The actual number of glyphs produced, if distinct glyphs are even produced, requires parsing the string with very complicated logic.

Why do you need to know the "character" length of unit tooltips in game? Is it because you are trying to dynamically change tooltips but have limited "character" count to work with? Understanding the whole problem might allow for alternative solutions.

darkravenbest · Dec 26, 2022

Dr Super Good said:
Again, in most modern languages this returns either the underlying byte length or the Unicode code unit or code point count. The actual number of glyphs produced, if distinct glyphs are even produced, requires parsing the string with very complicated logic.

Why do you need to know the "character" length of unit tooltips in game? Is it because you are trying to dynamically change tooltips but have limited "character" count to work with? Understanding the whole problem might allow for alternative solutions.

My goal was to add some elements to the existing unit tooltip as an overlay. As far as I have heard, I can't interact with the displayed tooltip to get its height and work out where the new elements must be displayed. So I came up with the idea of counting the characters in the unit's tooltip, since that is possible, and working out at which character count a new line would be created, taking cases such as "ln" into account. Based on this data I wanted to get the coordinates where the frames should be placed. But this method has some obvious flaws.

Then I realized I should first come up with a solid design for the unit tooltip itself, which would make what I need easier to implement, and only then proceed to implement it. So it seems I have already come up with an alternative solution: the amount of information in the tooltips will make them all the same size, and it won't look like as unnatural an adjustment as I thought before. The problem before was that tooltips could differ in size because they held different amounts of information, and in that case forcing them all to the same size would make them look unnatural. Imagine a unit with a few sentences of information having a tooltip twice as big as its description, let's say a castle. Now, it seems, the problem has solved itself.

But it is interesting to know for the future anyway, in case I really need to know the exact length of a line next time. So I haven't abandoned this topic.

Pyrogasm · Dec 26, 2022

darkravenbest said:
So I came up with the idea of counting the characters in the unit's tooltip, since that is possible, and working out at which character count a new line would be created.

This would only be a reliable approach if the font being used for tooltips is monospaced. The English WC3 font isn't, so different characters have different widths. I am not familiar enough with the default fonts of other languages to say whether they are monospaced.
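
A toy Python sketch of why character count alone cannot predict line width in a proportional font. The width values below are invented for illustration and are not measurements of any Warcraft 3 font; `WIDTHS`, `DEFAULT_WIDTH` and `line_width` are hypothetical names.

```python
# Hypothetical per-character advance widths in arbitrary units (not real font data).
WIDTHS = {"i": 3, "l": 3, "m": 9, "w": 9}
DEFAULT_WIDTH = 6  # assumed width for any character not listed above


def line_width(text: str) -> int:
    """Sum the advance widths of all characters in the text."""
    return sum(WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text)


narrow = "illi"
wide = "mwmw"

# Same number of characters, very different rendered widths.
print(len(narrow), line_width(narrow))  # 4 12
print(len(wide), line_width(wide))      # 4 36
```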

darkravenbest · Dec 26, 2022

Pyrogasm said:
This would only be a reliable approach if the font being used for tooltips is monospaced. The English WC3 font isn't, so different characters have different widths. I am not familiar enough with the default fonts of other languages to say whether they are monospaced.

Indeed. The more I learned about my approach, the more I realized its flaws. So I moved back to a different way.

The other problem was that the new line symbol "lr" could be at different positions in a line, so it may split off different amounts of text and affect the total tooltip width per text length differently, and handling that would need more work and CPU resources in the future.

pr114 · Dec 26, 2022

Maybe you can try indexing the heights of all tooltips that are selectable for your purpose? Could take a fair amount of work though.
