Hash Tables

A hash function in computer science uses an algorithm that maps a given input into an output that falls within a restricted length or range.

For an example, imagine a simple hashing algorithm that takes the ASCII value of each character in an input, and gives the sum of all the values as the output.

pop -> 335 (because 'p' = 112 'o' = 111)
jazz -> 447 ('j' = 106 'a' = 97 'z' = 122)

Assuming we are only ever allowed an input of a maximum length of 10 characters ‘A’ to ‘Z’ and ‘a’ to ‘z’ , the output would be a value between 0 and 1220 since the highest possible value would be 10 times 122 (the ASCII value for ‘z’ for an input of ‘zzzzzzzzzz’).

Activity 1

Use the algorithm above with an ASCII table and a calculator to write out with working the output from the hash function for:

  • “Jazz”
  • “rock”

Activity 2

Discussion points:

  • Is it possible to guess what the input was from the output?
  • Where might this be useful?
  • What do you think the meaning of the term cryptographic hash function might mean?

Index

You will be familiar with the concept of an index from arrays – the index lets you locate a particular item in an array.

In the example below bandMembers[i] will give you different values depending on the value of i. If i is 2, “George” is the element selected.

JohnPaulGeorgePete
0123
bandMembers[ ]

A hash table takes the idea of the hash function, and combines it with an array to create a data structure that allows us to store and retrieve data quickly.

Why is a hash table needed? You may recall that a linear search is one way to locate items in an array. It is easy to understand and easy for a beginner programmer to code. However it has a huge disadvantage which is that as the array grows, it gets slower and slower to search. This order of growth is known as O(n) meaning the average time for the algorithm to complete will increase in a linear relationship with the size of the array. If the array size doubles, the time taken to search the array will double on average. Hash tables don’t have this disadvantage. They can find items very quickly no matter what the size of the array that they are stored in. In fact, a well implemented hash table will have order of growth of O(1) meaning no matter how big the table, the item will be found in constant time.

The value gives the index

With a hash table, a clever idea is used. The data value being stored is used to decide WHERE to store the data value itself in an array. A hashing function maps the values to locations in the table.

Let’s say we have a table with 5 spaces available to store 5 integers. We could use a simple hashing algorithm that simply takes a number and does a modulo operation on it to decide where to store it. Let’s say our algorithm uses modulo 5:

35 -> 0 (35 divided by 5 gives remainder 0)
13 -> 3 (13 divided by 5 gives remainder 3)
121 -> 1 (121 divided by 5 gives remainder 1)
19 -> 4  (19 divided by 5 gives remainder 4)

The resulting hash table would look like so:

351211319
01234
Hash table resulting from a simple hashing algorithm of input modulo 5

To see if a particular value is present in the hash table, we calculate its index, and see if the value is present at that index. If we ran the number 121 through our hashing algorithm, we will get an index of 1, and on inspecting the table at position 1, we see that it does indeed contain 121.

However if we ran the number 232 through the algorithm, we would get position 2. Inspecting the table at position 2 we find that it is empty. Therefore 232 is not present in the hash table. You can see that this is much quicker than a linear search which would have to step through every element in the array comparing it with the value 232, only stopping once it reaches the end of the array. In contrast, the hash table is only checked in one location (position 2).

Activity 3

An algorithm uses a modulo of 11 to map values to locations in a table/array of 11 elements. Draw the table that will result from the values below being passed to the function. For example the first value 511, will be stored in location 5 because 511 % 11 is 5.

511, 123, 4223, 23423, 511, 634, 22, 4096

Clashes

As you can see from Activity 3, it is possible that a particular input value generates exactly the same output as another, different, input value.

23423 is already stored at index 4, but 4096 generates the same index. We have what we call a clash.

What should we do in such a situation? Overwrite the value that is already there, or discard the new value that we are trying to store? Neither is acceptable as it will result in the loss of the data we are trying to store.

A simple solution is to look for the next free location in the table/array and simply store the value there. If we are already at the end of the table, then start again from the beginning of the table and try to find the first empty location there. In this case, 5 is also occupied so we cannot store 4096 there, however the next location, 6, IS free so we can store it there.

This technique, called linear probing, does make searching for values in the hash table slightly more complicated, as if we don’t find the data in the expected location we have to search subsequent locations in a predetermined fashion (e.g. the next location, or the location after that etc) until we either find it or find an empty location which means the data is not in the table.

Activity 4

We have seen already in Activity 1 that you can hash strings of text by converting each character in the string to an ASCII value before indexing and storing.

Store the following inputs in a table of length 11, using modulo 11 on the sum of the ASCII character values. In the event of a clash, store the value in the next available consecutive location in the table. If the end of the table is reached before any empty slots are found, start at the beginning of the table and search for empty slots from there.

Inputs: “pop”, “Jazz”, “rock”, “folk”, “Indie”, “House”, “Hip Hop”, “ska”

Handling Clashes

If we are having too many clashes, it will slow down both storing and searching for data in the hash table. It’s important to find solutions to prevent high numbers of clashes.

One way is to choose the table size and modulo value used carefully. If the table is too sparsely populated (i.e. lots of empty slots) that is an inefficient use of memory. If the table is too densely populated it is likely to result in more clashes. We need to find a balance between the two. A suggested minimum is to have 75% coverage so that at any one time no more than 25% of slots are free.

Another solution is to use a linked list at each location in the table. The first item stored at a particular location in the table will be the first item in the linked list. Subsequent items that need to be stored at the SAME location in the table, will be added as the next item in the linked list. This solution, called chaining, is going to be slightly slower to find items in, as the linked list has to be searched too, but is a more efficient solution than simply looking for the next free location as is done in linear probing.

A Dictionary

There is a data structure called a dictionary in some programming languages, that is closely related to a hash table. The difference here is that instead of using the data itself to decide where to store it, a separate key value is used. For example in Python, we could define a dictionary as follows:

Genres = { 20: "pop", 23: "rock", 11: "Jazz"}

If we wanted to retrieve an item we give the key value and it is returned

print ( Genres[11] ) # Will print Jazz

The order in which the items are stored in memory is not known, but will in fact be determined by a hash table. The key is hashed to give the location where the value will be stored in the hash table. Locating items in the dictionary is therefore quick.

Activity 5

Complete questions 5b, 5c and 5d from the OCR Computer Science A Level Paper 1 June 2019. You have 15 minutes.

If you finish early, answer 7b) iv) from Paper 1 June 2021

Knowledge Check

  • What is a hashing function?
  • What is a clash and what causes it to happen?
  • Where could you store a value when there is a clash?
  • What happens when you reach the end of the table and have not found an empty slot to store a value?
  • What can be done to minimise the impact of clashes?