
Count characters in strings, not bytes

Status
Not open for further replies.
Level 8
Joined
Aug 5, 2014
Messages
194
Hello!

I have a really basic question for you today. How do I count the actual characters in a string? The only function I know is StringLength(), which returns the string's weight (its size in bytes). That is fine when we deal with digits or English letters, but when the string contains non-English characters, it returns 2 or 3 instead of 1 for each such character. Is there a function for that? I searched the whole Hive and had no luck finding a solution.

Thank you.
 
Level 3
Joined
Jul 5, 2022
Messages
16
Hello, from what I've gathered, the StringLength() function returns the byte length of a string ([Documentation] String Type). I also was not able to find a solution or think of any way to count the character length of a string. So I guess I'm just bumping this one up.
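
To make the byte-versus-character difference concrete, here is a minimal sketch in Python rather than JASS, run outside the game. It assumes the game stores strings as UTF-8, which the thread does not confirm; the function names are made up for illustration.

```python
# Assumption: strings are UTF-8 encoded, so a byte count like the one
# StringLength() reports corresponds to the encoded length below.

def byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def code_point_count(text: str) -> int:
    """Number of Unicode code points; Python's len() counts these."""
    return len(text)


# "aäöo" written with escapes so the precomposed letters are unambiguous.
sample = "a\u00e4\u00f6o"

print(byte_length(sample))       # 6: a and o take 1 byte, ä and ö take 2 each
print(code_point_count(sample))  # 4
```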
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
I did write a UTF-8-aware string iterator which uses StringHash. But for some bloody reason Blizzard changed StringHash for multi-byte characters. I don't know if it still works in newer patches. I think @Frotty made some adjustments for Wurst, but I don't know for certain. I think if you were to recreate the table, it could still work for newer patches. Again, I will repeat my request: I'd reverse it if someone can dump me the new asm.
 
Level 23
Joined
Jan 1, 2009
Messages
1,610
@LeP Yes, I use a variant similar to yours: it checks the hash of a partial multi-byte string, which is 1843378377 for most multi-byte chars.

Wurst:
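    // Returns the next character and advances the iterator past it.
    function next() returns string
        // Start by reading a single byte.
        var val = s.substring(currentpos, currentpos + 1)
        // A partial multi-byte char hashes to 1843378377 for most such chars.
        if ENABLE_MULTIBYTE_SUPPORT and val.getHash() == 1843378377
            // Re-read as a two-byte character and skip both bytes.
            val = s.substring(currentpos, currentpos + 2)
            currentpos += 2
        else
            currentpos++
        return val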

This doesn't work for all of them; there are some other multi-byte chars not caught by this, so a complete table would be nice to have. But it works for the most part.
You can set ENABLE_MULTIBYTE_SUPPORT = true for the stdlib to work with multi-byte chars.
 
Level 8
Joined
Aug 5, 2014
Messages
194
Well, when did this change? If you mean after Reforged, that is fine, because I'm working on 1.31; but if you mean something like 1.29, it certainly affects me. How am I supposed to call this function to simply get the actual string length? Because when I try to use it like
Code:
set udg_Variable =  UTF8_Iterator.create(udg_String)
it seems to return an ever-increasing number each time I call it.
 
Level 8
Joined
Aug 5, 2014
Messages
194
Thank you, but can you attach the whole method, please?
 

LeP

Level 13
Joined
Feb 13, 2008
Messages
539
It returns a vJass struct, which is internally represented as an integer. If you don't know what structs are, there are plenty of tutorials, plus the jasshelpermanual.html (somewhere). This struct exposes two methods you can use: next and hasNext. The very first code block in my thread shows how to use them in tandem. To count the characters, you would loop over the string using next and hasNext and simply increment an integer. The result would be the number of characters.

Also, according to the jassbot, it should still work in 1.29.
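
The counting loop described above can be sketched as follows. This is a Python analogue run outside the game, not the vJass library from the thread: the class `Utf8Iterator`, its `has_next`/`next` methods, and `count_characters` are hypothetical names that only mirror the pattern. Unlike JASS, Python can read the raw bytes, so the sketch detects character boundaries from the UTF-8 bit patterns instead of StringHash.

```python
class Utf8Iterator:
    """Hypothetical stand-in: walks UTF-8 bytes one code point at a time."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")
        self._pos = 0

    def has_next(self) -> bool:
        return self._pos < len(self._data)

    def next(self) -> str:
        start = self._pos
        self._pos += 1
        # Continuation bytes have the bit pattern 10xxxxxx; consume them
        # so the whole multi-byte sequence is returned as one character.
        while (self._pos < len(self._data)
               and (self._data[self._pos] & 0xC0) == 0x80):
            self._pos += 1
        return self._data[start:self._pos].decode("utf-8")


def count_characters(text: str) -> int:
    """Loop with has_next/next and increment a counter per character."""
    iterator = Utf8Iterator(text)
    count = 0
    while iterator.has_next():
        iterator.next()
        count += 1
    return count


print(count_characters("a\u00e4\u00f6o"))  # 4
```

Note that the iterator object itself is not the length; the count only comes out of running the loop to the end.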
 
Level 8
Joined
Aug 5, 2014
Messages
194
Sadly, it doesn't work on 1.31, as you suspected. When I ran the example that you provide with your library on the sample string "aäöo", it printed a long list of four-digit numbers ending with 3768. However, it works correctly with standard English text, which leads me to assume that it is because of the StringHash behavior change you mentioned. I take it that recreating the table to fix this is not an easy task, am I right?
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
Resolving the character length of a string is very difficult, which is why even in actual programming it is often avoided; the buffer byte length or the Unicode code point count is usually good enough.

This is because many Unicode code points can be combined to form a single glyph/character. Some languages depend heavily on this feature.

If UTF-8 is used internally, which I hope Reforged does, then a code point can be 1, 2, 3 or 4 bytes long. The approach to resolve a single code point might involve iteratively trying substrings between 1 and 4 bytes until something valid is returned. UTF-8 was designed to allow graceful failure when trying to resolve incomplete byte sequences.
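
Both points can be illustrated with a short Python sketch, run outside the game. The first function reads the length of a UTF-8 sequence off its lead byte; the second shows that two code points can render as one glyph. The function names are made up for illustration, and skipping combining marks is only a rough approximation of glyph counting, not full grapheme segmentation.

```python
import unicodedata


def sequence_length(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:             # 0xxxxxxx: ASCII
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def count_without_combining_marks(text: str) -> int:
    """Rough glyph estimate: skip code points that are combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(len(composed), len(decomposed))             # 1 2
print(count_without_combining_marks(decomposed))  # 1
print(sequence_length("\u00e4".encode("utf-8")[0]))  # 2
```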
 
Level 8
Joined
Aug 5, 2014
Messages
194
I haven't dived that deep into character handling. I just tried to project my experience with other languages, where I have the luxury of just using something like len() to get the actual character count, onto the World Editor. Maybe I overlooked something, but I used it with simple strings of text or key codes without complex glyphs and didn't encounter any difficulties, so I thought there might be something similar in Warcraft 3 that I just couldn't find. I need something similar here, just to return the actual character count of a unit's tooltip, but in my case it can contain a lot of foreign characters, because I want my method to work for different languages. Of course I came up with a crude workaround, but I want to find something proper, like a method that simply returns the actual length of a plain text string. I don't know whether something like "|n" should count as a glyph, but I don't think it would cause me serious problems.

That method is complete, as evident by the last return statement after which no code may follow. It demonstrates one way to detect a multi byte char.
Alright, thank you. I will try this out.
 

Dr Super Good

Spell Reviewer
Level 64
Joined
Jan 18, 2005
Messages
27,198
where I have the luxury of just using something like len() to get the actual character count
Again, in most modern languages this returns either the underlying byte length or the count of Unicode code units or code points. The actual number of glyphs produced, if distinct glyphs are even produced, requires parsing the string with very complicated logic.

Why do you need to know the "character" length of unit tooltips in game? Is it because you are trying to dynamically change tooltips but have limited "character" count to work with? Understanding the whole problem might allow for alternative solutions.
 
Level 8
Joined
Aug 5, 2014
Messages
194
My goal was to add some elements to existing unit tooltips as an overlay. As far as I have heard, I can't query a displayed tooltip for its height to work out where the new elements should be displayed. So I came up with the idea of counting the characters in the unit's tooltip, since that is possible, and working out the character count at which a new line in the tooltip would start, taking cases such as "|n" into account. Based on this data I wanted to get the coordinates where the frames should be placed. But this method has some obvious flaws.

Then I realized I should first settle on a solid design for the unit tooltip itself, one that makes what I need easier to implement, and only then implement what I want. So it seems I have already come up with an alternative solution: the amount of information in the tooltips will make them all the same size, and it won't look like as unnatural an adjustment as I thought before. The problem before was that tooltips could differ in size because they held different amounts of information, and in that case forcing them all to the same size would look unnatural. Imagine a unit with a few sentences of information having a tooltip twice as big as its description, let's say a castle. Now, it seems, the problem has solved itself.

But it is interesting to know for the future anyway, in case I really do need the exact length of a line next time. So I haven't abandoned this topic.
 
Level 39
Joined
Feb 27, 2007
Messages
5,014
So I came up with the idea of counting the characters in the unit's tooltip, since that is possible, and working out the character count at which a new line in the tooltip would start
This would only be a reliable approach if the font used for tooltips were monospaced. The English WC3 font isn't, so different characters have different widths. I am not familiar enough with the default fonts of other languages to say whether they are monospaced.
 
Level 8
Joined
Aug 5, 2014
Messages
194
Indeed. The more I learned about my approach, the more I realized its flaws. So I went back to a different way.

The other problem was that the new line symbol "lr" could sit at different positions in a line, so it may split off a different amount of text each time and affect the total tooltip width per text length differently. Handling that would have needed more work and more CPU resources down the line.
 