15110 Summer 2017

Lab 7

Goals

This lab is aimed at developing your understanding of the importance of hash function design for the efficiency of search using a hash table. When you are done, you should be able to:

State the worst-case performance of search in a hash table (with an arbitrary hash function).
Explain what features of a hash function affect the performance of search in a hash table and why.
Explain why search in a hash table can be expected to take constant time, given a good enough hash function and a large enough table.

Deliverables

experiments.txt
mean_bucket.py
largest_bucket.py
my_hash_function.py (challenge)

Place these files in a lab7 folder. Before leaving lab, zip up the lab7 folder and hand the zip file in.

Part 1 - Hashing Functions

1.1 CA Demonstration

Overview of hash tables and 4 hash functions

The correct operation of a hash table relies on a hash function that maps keys (in our case, these are strings) to non-negative integers that indicate which bucket the key belongs to. The hash table will work as long as the hash function maps each key to a non-negative integer less than the size of the hash table, but it will only perform well if it distributes keys relatively uniformly across buckets.

Consider the following hash functions:

	Hashing Function	Description
Very Simple	def h0(key,ht_size): return 0	Perhaps the simplest possible hash function is one that just ignores the `key` and always returns a constant such as 0.
First Character ASCII	def h1(key,ht_size): return ord(key[0]) % ht_size	Another hash function obtains the ASCII code of the first character in the key, divides that code by the size of the hash table, and returns the remainder of that division.
Totaled ASCII Characters	def h2(key, ht_size): x = 0 for i in range(len(key)): ascii_code = ord(key[i]) x = x + ascii_code return x % ht_size	Another hash function totals up the ASCII codes of all of the characters in the key, divides that total by the size of the hash table, and returns the remainder of that division. This scheme examines all the letters in the string, but is not sensitive to order. So the strings "act", "cat", and "tac" will all hash to the same value.
Radix-128	def h3(key,ht_size): x = 0 for i in range(len(key)): ascii_code = ord(key[i]) x = 128 * x + ascii_code return x % ht_size	It is also possible to build a hash function that is position sensitive, based on the idea of treating each character of the string as a digit in a number. The most familiar way of representing a number is to use a radix-10 system with the Arabic numerals: "0", "1", "2", "3", "4", "5" "6", "7", "8", "9". So, for example, "52473" means ((((((510+2)10)+4)10)+7)10)+3. One can, however, also represent numbers using other systems with different sets of symbols. For example, one could represent numbers using the letters "a" - "z" for the numbers 0 through 25 to obtain the Radix-26 system described in your textbook. Characters in the computer are often stored using 7-bit ASCII codes (with values 0-127). So we can treat a string of ASCII characters as a radix-128 number.

1.2 Student Activity

Hash Functions

Download and save the code from hash_table.py into your lab7 folder.

For each of the hash functions, h0, h1, h2, and h3, find the result of hashing the following strings for a table of size 100,000.
1. "Hash table"
2. "Table hash"
3. "Towers of Hanoi"
For each of these hash functions, would you get any collisions (i.e., more than one key hashing to the same bucket) if you were to insert these keys into a hash table?

Record the results in experiments.txt.
Hash Table Statistics

Our hash tables are implemented as Python lists where each element is a bucket. That bucket is a list that holds the entries that hashed to that bucket's index. In order to evaluate the effectiveness of different hash functions it is useful to be able to gather some statistics about the hash table after we've inserted some entries.
1. Define a Python function largest_bucket(ht) in largest_bucket.py that returns the maximum number of entries contained within any single bucket of the hash table ht. You can follow this algorithm:
  1. Set max_so_far to 0.
  2. For each element, bucket, in the hash table, do the following:
    1. If the length of bucket is greater than max_so_far, then:
      1. Set max_so_far to the length of bucket
  3. Return max_so_far.
  Example:
```
>>> largest_bucket([["a","b"],[],["w","x","y","z"],["c"]])
4
```
2. Define a Python function mean_bucket(ht) in mean_bucket.py that returns the arithmetic mean of the number of entries in non-empty buckets of the hash table ht. You can follow this algorithm:
  1. Set nonempty_buckets to 0.
  2. Set entries to 0.
  3. For each element, bucket, in the hash table, do the following:
    1. If the length of bucket is greater than 0.
      1. Add one to nonempty_buckets.
      2. Add the length of bucket to entries.
  4. Return entries/nonempty_buckets
  Example:
```
>>> mean_bucket([["a","b"],[],["w","x","y","z"],["c"]])
2.33333333333333
```
3. What is the point of gathering these statistics? Ideally, would we want the maximum bucket size and the mean bucket size to be small or large? Write a brief explanation in experiments.txt.

Part 2 - Hash Table Run Time Experimentation

2.1 TA Demonstration

using the run function for run("h0")

2.2 Student Activity

Hash Table Experiments

Look at the code from hash_table.py that you saved into your lab7 folder.

This code implements hash table creation (new_hash_table), insertion (ht_insert), search (ht_search), and the four hash functions (h0, h1, h2, h3). There is a "main" function called run(hash_fun) that sets up a hash table, inserts a large number of words, and then does some searches. It measures the elapsed time for these operations, and reports some statistics about the buckets (using your functions largest_bucket and mean_bucket). If you haven't finished writing these functions earlier use the code given in bucket_stat.py

Load hash_table.py by opening the file from IDLE and runing it (or by typing python3 -i hash_table.py into the terminal), and call run("h0"), run("h1"), run("h2"), and run("h3"), which will perform the timing and statistical measurements for hash tables with the respective hash function. In interpreting the results, take note of the fact that run() inserts 100,000 words but only searches 1000 times.

Record the output of these runs in experiments.txt. How well do the different hash functions perform for the insertion and search operations? What correlation do you see between how evenly the hash function distributes keys across buckets and the search performance of the hash functions?
Challenge: Devise a Better Hash Function

This question is open-ended. We do not expect a specific answer (because there is none) or approach here. We just want to give you the option to try out those functions that you think should improve upon the given hash functions.
Design your own hash function and call it h4. Modify the run function in order to experiment on your hash function. (You will need to modify the ht_insert and ht_search functions as well). Remember that your goal for a hash function is to avoid as many collisions as possible.
How does your function compare with the four described above? Try to beat h0, h1, h2, and h3.

Save your hash function in my_hash_function.py. Include the results of your experimental evaluation of your hash function in experiments.txt