Question hashtables and arrays

UUcounterspell · Nov 9, 2017

Hello all script writers sorry I'm noob on this topic.
I think that hashtables look similar to arrays of arrays (2d arrays) because 2 indexes are used to recover one value, just like column and row number in a paper table, but they have different names and people write that hashtables hold a larger number than a 2d array.

I am sure that they are different somehow. I want to understand more completely: how is that hash tables can hold so many values that not even an array of an array could? What are other differences?

Please keep simple the answers, because things too technical I would'nt understand. If you could make simple examples to make distinction appear evident, it'd be the best.

Jampion · Nov 10, 2017

Hashtables can be seen as 2d arrays, but they offer some additional functionality.
Arrays are faster than hashtables and you can only have 256 hashtables.
Hashtables have more space. The maximum array index is 8191, whereas hashtables can go up to something like 2^31, if I remember correctly.
Hashtables can also store different types at the same time.
The biggest advantage of hashtables in my opinion is, that you can use GetHandleId on any handle to get its id. With this id you can store data for that handle.
Additionally you can easily get rid of all data attached to that id with one function.

Let's say you want to store the number of kills for every unit. With hashtables you can simply use
hashtable[unit_id][0]=5 (the syntax looks different, but as I said you can see it as a 2d array with enough space, so you can use the ids of handles, which are greater than 8191)
If you would want to use arrays, you would need to index all units (Unit Indexer), because unit_id is greater than 8191.

Another example where hashtables can be very useful is, if you want to store data for a unit type.
You can't use array[unit_type], because unit type is greater than 8191.
You could use a mapping, that maps every unit type to an integer: footman -> 1, rifleman -> 2, etc. and then
use array[1] to acces the data for the footman. Creating this kind of mapping is a lot of work.
With the hashtable you can again use hashtable[unit_type][0] and if you wanted to store another value, you could use hashtable[unit_type][1].

UUcounterspell · Nov 10, 2017

Thank you very much!
To read something more, is this topic "data structures"?
And a mapping is one more, distinct from hashtable and array, data structure?

Jampion · Nov 10, 2017

Arrays and hashtables are both data structures.
A mapping is also realized by a data structure. Ironically in warcraft the best way to realize this mapping is a hashtable. Then you would have a hashtables that maps unit type to an id and you would use this id as index in an array. So in this case it would not make sense to use hashtable mapping and array, if you can simply use a hashtable.

TempMail · Nov 10, 2017

hashtable dynamically change it's memory placement whenever needed. If you fill a new parentkey, a new memory chunk being provided. When you fill a new childkey, yet another chunk being provided, and so on. They are bit slower (irrelevant unless you're workin on really heavy stuff) than arrays due to those lookups.
arrays meanwhile are completely static structures which doesnt change it's address once created

UUcounterspell · Nov 10, 2017

Very interesting.
Compared to arrays, there is much going on in the background of me using a hashtable: something takes my two arbitrary keys I just invented and transforms them into a unique address for a memory chunk. That's some kind of magic!
Also I didn't know that hash table are dynamically allocated.

Arrays seem intuitive to understand in principle, but I still think the hash tables mysterious.
For example: the index of one array is clear what it means: it is reflected by an increase in the address, it points to the next arrayed data.
What dark magic is used by the hashtabel when it receives two keys that I invented arbitrarily and returns one unique memory chunk address? What is a collision?

You know some good introduction tutorial webpage that teaches noobs this topic, I would really ike to read it to receive these answers and others.

Kaijyuu · Nov 10, 2017

Hash table - Wikipedia

Collision (computer science) - Wikipedia

In layman's terms, you take something (like a string, number, object) to be used as a "key" and transform it into a number. You then use that number to index an array, and place a "value" at that position in the array. That is a hashtable.

Collisions are when two separate keys end up being transformed into the same number to index the array. They would point to the same address. Some systems allow this and have a secondary system to distinguish the two, some just bug and have overwritten data. The latter is used when it's deemed impossibly rare for a collision to occur, or laziness (in wc3's case).

UUcounterspell · Nov 11, 2017

wikipedia could be written in turkish for what i care, i don't understand that. After all wikipedia's an encyclopedia clone, it's not a tutorial or a course book, it's not meant to teach to people with no prior knowledge.

layman terms are much better. Excuse me but how do you know that there is no collision management that prevents data overwriting upon collision in warcraft 3? I could watch at w3 hash tables for years and not realize that: how did you?

Dr Super Good · Nov 11, 2017

UUcounterspell said:
I am sure that they are different somehow. I want to understand more completely: how is that hash tables can hold so many values that not even an array of an array could? What are other differences?

Answer is in the name, they are hash tables.

They work by mapping a key to a value. This is done using a bucket array which the key and value is placed into by taking a hash of the key, or restricting the range of the key if it is an integer so it is inside the array. To deal with collisions, where multiple keys fall into the same array index, the key and value is added to a linked list at the array index and the specified key is searched for using a linear search. I do not know how Blizzard implemented the hash table structure exactly, but it either remaps both keys into a single key it uses internally, or it is separated into a parent hash table that can have multiple children hash tables.

The advantage of hash table structures is it allows mapping an arbitrary key to a value with near O(1) complexity. This complexity degrades towards O(n) as the number of collisions in the hash table increases.

WC3 hashtable lookup and set is only 2 or 3 times slower than array lookup, assuming it contains a reasonable number of mappings. This means one can use them instead of arrays to some extent without worrying about poor performance.

UUcounterspell · Nov 11, 2017

Hello friend.
This is great but not enough layman. My brain shortcircuits and smoke comes out of my ears reading this.

Many words need more explanation, like linked list, linear search, the O(1) and On) complexity.
The discussion would go overboard. Is there an introduction tutorial that is easy to read on these terms, that is not wikipedia (too difficult

)?

Dr Super Good · Nov 11, 2017

UUcounterspell said:
linked list

A list composed of nodes that reference each other in a chain. It is a dynamic data structure as nodes can be allocated separately as required. Each node holds an entry in the list.

UUcounterspell said:
linear search

Searching for an entry by checking through a list sequentially until either the entry is found, or the end of list is reached.

UUcounterspell said:
the O(1) and On) complexity

Complexity defines how computational difficulty scales with size of a data structure. Something with O(1) complexity will take as long to perform on a data structure holding 1 element as it would holding a million elements. On the other hand O(n) complexity means that it will take a million times longer to perform when holding a million elements than 1 element, bad scaling in comparison. Complexities can get much worse, like the infamous traveling salesman problem with a complexity of O(n!), of order n factorial.

I suggest Google and Wikipeda any further questions.

Kaijyuu · Nov 11, 2017

UUcounterspell said:
Excuse me but how do you know that there is no collision management that prevents data overwriting upon collision in warcraft 3? I could watch at w3 hash tables for years and not realize that: how did you?

Honestly I'm just repeating what I heard from other people. Could be completely wrong.

UUcounterspell · Nov 11, 2017

Dr Super Good said:
I suggest Google and Wikipeda any further questions

Will try not to get lost in the ocean of information

Your answer unlike google or wikipedia was concise, quick, perfect for one without ground knowledge like me.

Fingolfin · Nov 28, 2017

Hash tables are similar to arrays - usually, the key is also similar to an index, except you use modulo to turn the key into a valid index (since keys can be practically any number). Sometimes, this produces the same index for two keys, and in that case you need to have a backup plan for when the index is already filled - you can compare the hash keys (without the modulo) to see if the items are the same (in which case we overwrite the old value), and if not, we can try jumping to the next index and repeat the process until we find one which is empty (this is called linear probing). You can also use a linked list like Dr Super Good suggested, or running the value through a second hash function (double hashing).

One common trick of array-based hash tables is to make the length of the array a prime number. This help reduce clustering when using the modulo (clustering = many keys end up in the same spot in the array). Also, the amount of collisions depend on how populated your array is. Usually, you want to keep it less than 60-70% populated. Some hash tables solve this by increasing the length of the array to the prime closest to 2x of the current length. This involves copying over all values into the new array though, which is why the lookup times of hash tables can be inconsistent.

Dr Super Good · Nov 29, 2017

Fingolfin said:
One common trick of array-based hash tables is to make the length of the array a prime number. This help reduce clustering when using the modulo (clustering = many keys end up in the same spot in the array).

This comes at the huge cost of having to use integer modulo, one of the slowest instructions of any CPU. Hence most hash table implementations generally use a power of 2, since then bit masking can be used, one of the fastest instructions of any CPU.

Example of such hash table is the hash table used by MPQ archives. It can only be a power of 2.

Overlap is of little concern because a proper hashing algorithm, eg SHA, have the property that one can consider each bit as being deterministic but random for a given input. Hence masking the final N bits out and using that as the index to the bucket array guarantees extremely low collision chance. It may only be a problem with faster hash algorithms or when dealing with arbitrary numbers.

Fingolfin said:
This involves copying over all values into the new array though, which is why the lookup times of hash tables can be inconsistent.

Lookup times are always consistent and approximately O(1) because the hash table is not being changed. Only when modifying the hash table can the times be inconsistent. They will only be inconsistent if the hash table is set to dynamically expand its bucket array, which many are not and instead they rely on increasing complexity of collision management.

Example of a hash table that does not dynamically expand its bucket array is the hash table used by JASS for variable name lookup as well as the JASS hashtable type object.

Fingolfin · Nov 29, 2017

Dr Super Good said:
This comes at the huge cost of having to use integer modulo, one of the slowest instructions of any CPU. Hence most hash table implementations generally use a power of 2, since then bit masking can be used, one of the fastest instructions of any CPU.

Example of such hash table is the hash table used by MPQ archives. It can only be a power of 2.

Overlap is of little concern because a proper hashing algorithm, eg SHA, have the property that one can consider each bit as being deterministic but random for a given input. Hence masking the final N bits out and using that as the index to the bucket array guarantees extremely low collision chance. It may only be a problem with faster hash algorithms or when dealing with arbitrary numbers.

I have never heard of anyone using SHA for any high-performant hash table, it's too large and only has the benefit of encryption safety. CRC32 is much better as a hash key, and has its own x86 instruction. You are right that bit masking modulo is faster though, using prime length is a more old-school approach which is still taught at many universities. The benefit of this approach though is that all the bits of the hash become significant, rather than just the lower bits.

Dr Super Good said:
Lookup times are always consistent and approximately O(1) because the hash table is not being changed. Only when modifying the hash table can the times be inconsistent. They will only be inconsistent if the hash table is set to dynamically expand its bucket array, which many are not and instead they rely on increasing complexity of collision management.

Yes, i meant insertion time, sorry. I wasn't really arguing whether dynamic size hash tables are a good idea or not (which obviously depends on the situation), just answering the question of "what happens when values no longer fit in the table" which would be a logical thing to wonder regarding the implementation of hash tables. Collision management doesn't really help you if your bucket array becomes full, and choosing a too large fixed size leads to wasted memory. I strongly suspect that JASS hash tables are fixed length though, since that seems to be the trend in this game.

Dr Super Good · Nov 29, 2017

Fingolfin said:
I strongly suspect that JASS hash tables are fixed length though, since that seems to be the trend in this game.

It does, purely because the performance degrades when it contains a lot of mappings. If it was dynamically resized the performance would remain constant for any number of mappings.

Deleted member 219079 · Dec 3, 2017

Hash table:

Array:

TempMail · Dec 6, 2017

wrong
arrays are russian car - you can drive it, yet its fucking hard to handle
hashtables are easier to handle yet it costs A BIT more than array.
for normal tasks speed WON'T MATTER A THING.

Deleted member 219079 · Dec 7, 2017

I don't see how arrays are hard to handle.

TempMail · Dec 7, 2017

BlueSaint said:
I don't see how arrays are hard to handle.

try to save some data per unit. Oh, now you gonna get unitindexer. Then you may need to attach data to destructible. To a rect. Gonna index shit out of it? Sorry to say that, but once you called indexer once, you're already slower than hashtable call. Like, A LOT slower.
Arrays support only 8k elements. Thats hard to manage for multi-instance events.

Kaijyuu · Dec 7, 2017

I would honestly expect the overhead of a unit indexer to be more expensive than using a hashtable that runs on handle IDs.

I only use arrays for things that are immediately suited for it by having an obvious index readily available, like player related data.

Deleted member 219079 · Dec 7, 2017

Kaijyuu said:
I would honestly expect the overhead of a unit indexer to be more expensive than using a hashtable that runs on handle IDs.

Can you post an example of what you're referring to?

I only use hash table for key - value pairing.

Edit: @TempMail try to use singleton variable for storing multiple elements of its kind. See, they're hard to handle?

Kaijyuu · Dec 7, 2017

Well hashtables are hashtables, so I don't think I need an example for that. Unit indexers... I've never actually used one, so maybe my understanding of them is wrong, but I was under the impression they assigned a custom value to a unit for array usage. Doing that properly can be a wee bit more complicated than simply taking an integer mod 8192, and since it's at the script level and not compiled code level, a hashtable's hash function seems like it'd end up being faster.

If one is rarely generating units but very often referring to their index, then the indexer's overhead would be inconsequential, of course.

TempMail · Dec 8, 2017

BlueSaint said:
Can you post an example of what you're referring to?

I only use hash table for key - value pairing.

Edit: @TempMail try to use singleton variable for storing multiple elements of its kind. See, they're hard to handle?

try to use PURE FREAKING JASS. All of your "singletons" and shit is out of my picture. Im writing effective code instead of clean. Any wrapper you can EVER imagine, you could EVER use, is still slower than hashtable access.

Dr Super Good · Dec 8, 2017

I think people are prematurely optimizing... Chances are maps will perform badly due to making WC3 do stupid things like handle 10,000 units rather than because a memory accesses made occasionally are hashtables function calls rather than array lookups.

Question hashtables and arrays

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Deleted member 219079

Similar threads