Hashtable vs array

Waffle · Dec 30, 2016

Not directly to the point but might be interesting for those of you interested in hashtable internals:
Modern Dictionaries by Raymond Hettinger (python hashtables, history of different implementations, with analysis and good humor, highly recommended)

i know python dicts are a not the same as jass hashtables (API wise python is superior for sure, perf wise i'm not sure the comparisons are worth anything given jass vm overhead, but python is still prolyl better)

also i want to note i recently stumbled on a lecture about some JVM internals (sry can't seem to find the link right now) and there was a slide about how the call convention differs and for a small signature the assembly to do the conversion was like a full page of thick text. to think JASS does it that well is very optimistic, most likely in calls to most natives the overheads of the VM will drown out a lot of the actual costs.

also i have seen what a c++ vector for loop looks in inlined assembly before the optimiser kicks in and erases all of that crap, that slide of text was so thick it was unreadable.. ofc a good optimiser can strip out 90% of that but you should know the semantics is 100% of the code, that it shrinks to 10% or less is compiler heroics. not directly of relevance to jass loops or anything but it goes to show what looks like trivial bit of dirt on your screen or even something you don't realise you have written can really imply a lot of work.

Finally about hashtables once again. it is all nice and all to compare perf in a tight loop and it CAN give useful data.. however it can be quite misleading. the use cases for an array and hashtable are different.

an array is contiguous memory u can access by index, it is meant to be accessed by that index, and to be commonly iterated over, an excellent general purpose data structure you should use unless otherwise required. (also vjass structs are structore-of-arrays type of constructs, google that if you care, perf implications there both good and bad).

hashtables are essentialyl meant as mappings between 2 objects, they are dynamically sized and are not necessarily contiguous. they are NOT meant to be iterated over, they are meant to be queried for the value mapped for the key. the key might be reduced to an integer for jass case, (shitty api design for blizz..) but semantically that is just boilerplate.

if i want to do a reverse lookup for some value (eg player name) then with a hashtable i can do it with 1 hashtable lookup usually(99.999% of time kind of usually).
with an array i need to iterate over it do a lot if string comparisons till i find the right one, then i can return the player index.

if u benchmark "array vs hashtable" it may look like a hashtable is slower, but in reality it is often a trade off of
(linear search of array with a loop and expensive object equality comparisons ) OR a single lookup from hashtable. in this real scenario the slow hashtable often is equal or faster even if the implementation is slow. because it enables you to use a more efficcient lookup algorithm (for the hashtable case i would hesitate to even use the word algorithm since it is so simple) in addition the hashtable case in this case is much simpler and thus more bug resistant. simpler code is more often correct, hashtables help you write correct code, use them! (also might end up faster sometimes

)

MyPad · Dec 30, 2016

Waffle said:
Not directly to the point but might be interesting for those of you interested in hashtable internals:
Modern Dictionaries by Raymond Hettinger (python hashtables, history of different implementations, with analysis and good humor, highly recommended)

i know python dicts are a not the same as jass hashtables (API wise python is superior for sure, perf wise i'm not sure the comparisons are worth anything given jass vm overhead, but python is still prolyl better)

also i want to note i recently stumbled on a lecture about some JVM internals (sry can't seem to find the link right now) and there was a slide about how the call convention differs and for a small signature the assembly to do the conversion was like a full page of thick text. to think JASS does it that well is very optimistic, most likely in calls to most natives the overheads of the VM will drown out a lot of the actual costs.

also i have seen what a c++ vector for loop looks in inlined assembly before the optimiser kicks in and erases all of that crap, that slide of text was so thick it was unreadable.. ofc a good optimiser can strip out 90% of that but you should know the semantics is 100% of the code, that it shrinks to 10% or less is compiler heroics. not directly of relevance to jass loops or anything but it goes to show what looks like trivial bit of dirt on your screen or even something you don't realise you have written can really imply a lot of work.

Finally about hashtables once again. it is all nice and all to compare perf in a tight loop and it CAN give useful data.. however it can be quite misleading. the use cases for an array and hashtable are different.

an array is contiguous memory u can access by index, it is meant to be accessed by that index, and to be commonly iterated over, an excellent general purpose data structure you should use unless otherwise required. (also vjass structs are structore-of-arrays type of constructs, google that if you care, perf implications there both good and bad).

hashtables are essentialyl meant as mappings between 2 objects, they are dynamically sized and are not necessarily contiguous. they are NOT meant to be iterated over, they are meant to be queried for the value mapped for the key. the key might be reduced to an integer for jass case, (shitty api design for blizz..) but semantically that is just boilerplate.

if i want to do a reverse lookup for some value (eg player name) then with a hashtable i can do it with 1 hashtable lookup usually(99.999% of time kind of usually).
with an array i need to iterate over it do a lot if string comparisons till i find the right one, then i can return the player index.

if u benchmark "array vs hashtable" it may look like a hashtable is slower, but in reality it is often a trade off of
(linear search of array with a loop and expensive object equality comparisons ) OR a single lookup from hashtable. in this real scenario the slow hashtable often is equal or faster even if the implementation is slow. because it enables you to use a more efficcient lookup algorithm (for the hashtable case i would hesitate to even use the word algorithm since it is so simple) in addition the hashtable case in this case is much simpler and thus more bug resistant. simpler code is more often correct, hashtables help you write correct code, use them! (also might end up faster sometimes )

So to summarize this:

Hashtables are for direct access (sacrifices some speed) while Arrays are for continuous iterations of objects (Built for speed)

Ka-Chow!

Waffle · Dec 30, 2016

DracoL1ch said:

/.../

JASS:

function HashvsArray takes nothing returns nothing
    local integer i=1
    local integer key1='cnst'
    local integer key2
    local integer array ar
    set i=1
    loop
        set key2=GetRandomInt(1,8000)
        call SaveInteger(HY,key1,key2,23)
        set ar[key2]=23
        set i=i+1
        exitwhen i>5000
    endloop
    set i=1
    call StartPerfCounter()
    loop                                                     // OVERHEAD
        set key2=GetRandomInt(1,8000)          // OVERHEAD non trivial math involved, here
        set key2=LoadInteger(HY,key1,key2)    // thing under test
        set i=i+1                                          // OVERHEAD
        exitwhen i>5000                                // OVERHEAD branch is potentially expensive
    endloop                                                // OVERHEAD
    call StopPerfCounter()
    set i=1
    call StartPerfCounter()
    loop                                                    // OVERHEAD
        set key2=GetRandomInt(1,8000)         // OVERHEAD non trivial math involved, here
        set key2=ar[key2]                            // thing under test
        set i=i+1                                         // OVERHEAD
        exitwhen i>5000                               // OVERHEAD branch is potentially expensive
    endloop                                               // OVERHEAD
    call StopPerfCounter()
endfunction

/.../

since only 1 in 6 lines of the benchmark is overhead then how the heck are we even arguaing about anything?

all you need do is add like..

JASS:

    set i=1
    call StartPerfCounter()
    loop                                                    // OVERHEAD
        set key2=GetRandomInt(1,8000)         // OVERHEAD non trivial math involved, here
        // set key2=ar[key2]                         // dont run thing under test so we can measure overhead only
        set i=i+1                                         // OVERHEAD
        exitwhen i>5000                               // OVERHEAD branch is potentially expensive
    endloop                                               // OVERHEAD
    call StopPerfCounter()

then remove overhead cost from actual measurements and DSG and we all will be happy(er).. how come this is so hard?!

Also to be pedantic StopPerfCounter calls themselves also have some overhead since jass calls are nontrivial and they do some work that is also inside the timing range before/after the actual work. this should not be too big but who knows .. ?

Also i DO appreciate your efforts, but benchmarking is very hard, there are a LOT of non-obvious sources of error that can mess with your results, please don't presume everything is correct just because you think so. even proffessionals often mess this up(they often realise when they have done so tho..), not to mention added hardships of doing it inside warcraft 3.

EDIT: it would be nice if jass sections used a monospace font.. sigh..

DracoL1ch · Dec 30, 2016

^ I provide compare, not direct values, so I dont care how precious my calls are. they can even provide different timestamps about the same function due to non-linear CPU work, it doesnt really matter.
I give info, its up to you to believe the proofs or waste another hour benchmarking dead game. My point is - except the loop there's nothing else to affect 500 calls, so it's about right
also wc3 uses "lazy" hashtables without hashing so they're bit faster than intended and suffers of collisions more likely.

Waffle · Dec 30, 2016

lets say GetRandomInt(1,8000) is expensive.. costs say 1000 units of time..
and array access is cheap .. say 1 unit of time..

total cost 1001 units of time if we ignore other overheads..

for hashtables to take 2x longer overall they need to cost 2002 units of time..
2002 - 1000 = 1002 units of time for the hashtable.

1 unit of time for array.
1002 units of time for hashtable.
array is 1002x faster?? (1/1002)

what if GetRandomInt(1,8000) is cheap? say 1 unit of time.. and array access is expensive.. say 1000 units of time..
1+ 1000 = 1001

hashtables costing tiwce in total thus cost 2002
2002 - 1 = 2001

array thus is only 2x faster??(1000/2001)

depending on how big the overhead is the ratio in performance can vary by orders of magnitude.. it may not be important but to just dismiss it is irresponsible imo. people will make coding decisions based on this thread, we should at least say upfront when there are obvious unknowns that we have not taken into account..

wheter hashtables are 2x or 20x or 200x or 2000x is significant. for small ratios using hashtables is a reasonable thing, for large x is is best to never use hashtables and emulate them with an array if needed.

DracoL1ch · Dec 30, 2016

Waffle said:
people will make coding decisions based on this thread

gui users have no idea what written here, and jassers should be able to judge for themselves. as GhostWolf said, there's no common rules.
Im stick with hashtables disregard my wish to be FAST. Because they're just as fast as it fits, no need to waste time. If you already use arrays - have fun, it's good. If you're using hashtable, nothing wrong with that either, just FYI they're bit slower.
statistic doesn't mean anything if you have no idea how to extrapolate it on your project

GhostWolf · Dec 30, 2016

Waffle said:
people will make coding decisions based on this thread

The people on this site that care about this already know the truth. Array access is superior in literally every way, and you should never, NEVER used hashtables. Right? Right?
Need to store agent IDs? well, obviously you should subtract from them the base agent ID, which is an implementation detail that you shouldn't know or be able to rely on, and then store them in arrays.
Need to map from a timer to a struct? obviously use arrays, and a loop to find a specific instance.
In fact, whenever you need to map from any kind of object (also non-integral) to another object, use arrays.

Also don't forget to rename all of your variables and functions to use one or two letters, it's faster.

jondrean said:
^ I see lots of C++ people saying "don't help the compiler", it's true for C++, but JASS is slow and JassHelper only inlines, nothing else.

First of all, and this is just on a theoretical level, since it will never happen, but this was the case with C++ too in the past.
Compilers get better.
They never will for Jass though.
Just a thought.

Regardless, if you show me actual real-world Jass code that suffers from performance since it uses hashtables, and doesn't suffer from performance after being converted to arrays...well, then you found a case where your argument is legitimate, and kudos! You are totally right in this case.
But instead, every argument on this site is generalized as "it's always bad for every code". Now it's hashtables, before it was something else.
In fact, this isn't quite specific to the Hive, people do love making generalizations based on specific cases, it's just one of those things that humans do.

Waffle · Dec 30, 2016

the second promised link:

some rant bout the internals of the JVM.. linked time will be the point with the example about how call convention coercion works for native calls (call to C code from Java code basically)

also it is overall an interesting talk not just to prove a point so pls do watch the whole talk

Flux · Dec 30, 2016

GhostWolf said:
The people on this site that care about this already know the truth. Array access is superior in literally every way, and you should never, NEVER used hashtables. Right? Right?
Need to store agent IDs? well, obviously you should subtract from them the base agent ID, which is an implementation detail that you shouldn't know or be able to rely on, and then store them in arrays.

I assume you're referring to TimerUtils about the handle id subtraction. Well, the USE_HASHTABLE option was the default and the author merely provides an option to select what "flavor" you want to use. I honestly never used the unsafe version anyway and I don't recommend to. But the point is, the hashtable option was still an option. Besides, that's a very old thread. I still didn't see any recent resources that subtracts handle ids by an offset to use as an array indices anywhere.

GhostWolf said:
Need to map from a timer to a struct? obviously use arrays, and a loop to find a specific instance.

I don't know what scripts are you reading but I've never seen anything that loops at all instances to retrieve an object when using a hashtable could prevent it. That's just a bad practice no matter how you look at it.

GhostWolf said:
Also don't forget to rename all of your variables and functions to use one or two letters, it's faster.

Depends on the situation. I mean would you rather use 'xPos' than 'x'? You write 'index' instead of 'i' when using loops? Sometimes it's better to use a single lettered variable name if the purpose is clear.

DracoL1ch · Dec 30, 2016

Flux said:
Depends on the situation. I mean would you rather use 'xPos' than 'x'? You write 'index' instead of 'i' when using loops? Sometimes it's better to use a single lettered variable name if the purpose is clear.

he's refering not to speed of typing but common believe that length of var affects perfomance

TriggerHappy · Dec 30, 2016

DracoL1ch said:
he's refering not to speed of typing but common believe that length of var affects perfomance

I don't see anyone writing short variable names because it's faster. We all know the optimizer does it for you. GW is just having a little fit.

GhostWolf · Dec 30, 2016

I know at least one person that used to do that. He also used every other stupid micro optimization the language allowed (mostly vJass nonsense). I think we all know who it is.

Hashtable vs array

Waffle

Waffle

MyPad

MyPad

Resources

Waffle

Waffle

DracoL1ch

DracoL1ch

Resources

Waffle

Waffle

DracoL1ch

DracoL1ch

Resources

GhostWolf

GhostWolf

Resources

Waffle

Waffle

Flux

Flux

Resources

DracoL1ch

DracoL1ch

Resources

TriggerHappy

TriggerHappy

Resources

GhostWolf

GhostWolf

Resources

Similar threads