Hashes as a Dictionary
An excellent analogy for the hash data structure is a dictionary. Imagine that you're spending a leisurely Sunday afternoon reading a good book. You come across a word you've never seen before: "Sisyphean". You want to look up the definition, so you consult your always-reliable Oxford dictionary. How do you retrieve the definition of the word "Sisyphean"? Do you look at the entry for the word "shellfish"? Or do you look at the entry for the word "anthropology"? No, of course you look at the entry for the word "Sisyphean" in order to get its definition.
At the risk of being Ms. Obvious, what this analogy brings out is that you can go directly to a particular definition if you know the word that it's a definition of. Or in more general terms, you can go directly to get a value as long as you know what its label is. Hashes work in exactly the same way, except the labels are called "keys". And each "key" is associated with its own "value".
The hash data structure is really just a plain old array that's been dressed up in some rather clever ways. Recall that we get O(1) index access for arrays. While O(1) reads are great, it's pretty hard to remember what index position a particular value lives in.
For instance, in the array above, it seems completely arbitrary that "spatula" lives in index position 2. Is there a way to retrieve that value without knowing its index position? The answer is to use a key as an alias for a specific index. The key will be a label that's completely obvious for the value we want to retrieve, sort of like looking up "Sisyphean" in order to get its definition. That way, we can forget about indices altogether, yet still use them to achieve O(1) reads!
But how do we determine how a key will map to a particular index? This is where hashing functions come in. Let's say I want to store words and their definitions in a hash. My first entry to input is "Sisyphean" and its definition. The word "Sisyphean" will be a key, and its definition will be the value. So given the key "Sisyphean", which index position should I store its definition in?
There are many ways to do this, but here's one very simple way. Since there are 26 letters in the alphabet, let's assign a number to each letter. So "a" will be 1, "b" will be 2, "c" will be 3, and so on. For each letter in the key, let's translate the letter to its number and then take the sum of all the numbers. Finally, let's divide the total by n, where n is the size of our array. The remainder will be an index position between 0 and the size of our array minus 1.
(19+9+19+25+16+8+5+1+14) % n = 116 % n
The calculation above shows how we would use this algorithm to compute an index position for the word "Sisyphean". Let's suppose the size of our array is 20. 116 modulo 20 is 16. So we put the definition of "Sisyphean" in index position 16 of our array. And we can repeat the same process in order to insert "anthropology" and its definition into our hash. We can also do the same for "shellfish" and its definition.
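The letter-sum hashing function described above can be sketched in a few lines of Python (the function name letter_sum_hash is just an illustrative choice, not a standard API):

```python
def letter_sum_hash(word, n):
    """Map a word to an index in [0, n-1] by summing letter positions."""
    # 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    total = sum(ord(ch) - ord('a') + 1 for ch in word.lower())
    return total % n

print(letter_sum_hash("Sisyphean", 20))  # -> 16, matching the calculation above
```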
(1+14+20+8+18+15+16+15+12+15+7+25) % n = 166 % n
(19+8+5+12+12+6+9+19+8) % n = 98 % n
Assuming the size of our array is still 20, the algorithm spits out index position 6 for "anthropology", so we insert its definition there. And it spits out index position 18 for "shellfish", so we insert that definition there. Following this procedure, we can fill up our hash map with any word and its definition. Now whenever we want to look up the definition of a word in our hash map, we can apply the same hashing function to the word in order to compute the index position of where its definition is stored. And this means we can read or search for any particular definition in O(1) time!
Dealing with collisions
So far it's been smooth sailing, since "Sisyphean", "anthropology", and "shellfish" all happened to map onto different indices. However, it's possible for two different words to map onto the same index position. For instance, "nebula" and "unable" are anagrams of each other, so their letters sum to the same total (55), and both words map onto index position 15 (55 % 20 = 15). How can we deal with these collisions?
The most common way is through a method known as chaining. Instead of storing a single value in each index position, we're going to store a linked list of values in each index position. Each node of the linked list will contain an entry's key and value.
For instance, in the array above, indices 0 and 2 each contain a linked list of two items, while indices 4 and 7 each contain a linked list of one item. For lookups, we still use our hashing function to compute the index position of a given key, and we still visit that index position to look for our value. The only difference is that now there's a linked list stored there. We linearly scan through this linked list until we reach a node that contains our key, then stop and return that node's value.
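A chained hash map along these lines can be sketched in Python. This is a minimal illustration, not a production implementation: for simplicity it uses Python lists as the per-slot chains instead of hand-rolled linked lists, and it reuses the letter-sum hashing function from earlier.

```python
class ChainedHashMap:
    """Hash map using chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, size=20):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        # The letter-sum hashing function from earlier
        return sum(ord(c) - ord('a') + 1 for c in key.lower()) % self.size

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))    # new key: append to the chain

    def get(self, key):
        # Scan the chain at the computed index until we find our key
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None

m = ChainedHashMap()
m.put("nebula", "a cloud of gas and dust in space")
m.put("unable", "lacking the ability to do something")
# Both keys hash to index 15, but the scan still finds the right entry
print(m.get("unable"))
```

Storing the key alongside the value in each node is what makes the scan possible: without it, we couldn't tell which entry in the chain belongs to which key.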
Chaining introduces another problem though. It's theoretically possible that through sheer terrible luck, all of your entries end up in a single linked list at one index position of the array. Even though this is unlikely to happen, we still don't want our linked lists to get too long. If they do, our lookups will cease to be true O(1) operations. A measurement that can be used to guard against this is the load factor.
Don't panic! Load factor is just a fancy word for the number of entries in the hash map divided by the size of the array. If we have 10 entries and our array size is 20, the load factor is 0.5. If we have 20 entries and our array size is 20, the load factor is 1. And if we have 30 entries and our array size is 20, the load factor is 1.5.
load factor = number of entries / size of the array
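The arithmetic above is simple enough to express directly (the function name load_factor is just illustrative):

```python
def load_factor(num_entries, array_size):
    """Load factor = entries stored / slots available."""
    return num_entries / array_size

print(load_factor(10, 20))  # -> 0.5
print(load_factor(20, 20))  # -> 1.0
print(load_factor(30, 20))  # -> 1.5
```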
If the load factor is too high, the probability of collisions increases drastically. At this point, you may be tempted to think that the lower the load factor, the better. Why not keep a 0.1 load factor? Or even a 0.01 load factor? Then we'll never have collisions! Hold on a minute though. We don't want the load factor to be too low either. If it's too low, we'll have too many unused slots in our array. This is wasteful of memory. So we want to keep the load factor at a happy medium. But what is this happy medium? To keep things simple, we'll say that the ideal load factor is 0.7.
When the load factor surpasses 0.7, the hash table should be resized appropriately. Just don't forget, you also need to change your hashing function to account for this change. Specifically, you need to change n to reflect the new size of the array; otherwise, your hashing function will never spit out any index positions that cover the newly available slots! And since n has changed, every existing entry must be rehashed and moved to its new index position.
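The resize step can be sketched as follows. This is a minimal illustration assuming the letter-sum hashing function and list-based chains from earlier; the function name resize is hypothetical.

```python
def resize(old_slots, new_size):
    """Rehash every entry into a larger array so the new slots become reachable."""
    new_slots = [[] for _ in range(new_size)]
    for chain in old_slots:
        for key, value in chain:
            # n has changed, so recompute each key's index with the new size
            index = sum(ord(c) - ord('a') + 1 for c in key.lower()) % new_size
            new_slots[index].append((key, value))
    return new_slots

# "Sisyphean" lived at index 16 when n was 20;
# after doubling the array to 40 slots, it moves to 116 % 40 = 36
old = [[] for _ in range(20)]
old[16].append(("Sisyphean", "endless and futile"))
new = resize(old, 40)
```

Note that resizing is an O(n) operation over all stored entries, which is why implementations typically double the array size each time: the cost amortizes out across many inserts.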
Although we didn't get into each operation, all operations for a hash table are O(1) on average. This makes the hash table a very powerful data structure indeed. There are many implementation details to keep in mind though in order to maintain O(1) operations in practice. Hopefully this post has shed some light on a few of these details. We'll continue our discussion of data structures next time by talking about binary trees.