15-200 Lecture 7 (Wednesday, September 13, 2006)

Bucket Sort

Picture a bunch of buckets all lined up in a row, facing you.

Now imagine that you have 1 green tennis ball, 1 blue tennis ball, 1 yellow tennis ball, 1 purple tennis ball, and 1 red tennis ball, and that you need to put each ball into its proper bucket. Not too hard to do, is it? You simply put each ball into its corresponding bucket. Red maps to red, blue maps to blue, etc.
Now imagine that you have a collection of numbers from 0-99 and an array of size 100. If you simply put the 0 into index 0, 1 into index 1, 2 into index 2, 3 into index 3, etc., you'll have no trouble finding anything in the array. If you want to find 1, you go to index 1. If you want to find 33, you go to index 33. Simple.
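To make the idea concrete, here is a minimal Java sketch of this "value is its own index" scheme. The class and method names (DirectAddressDemo, store) are made up for illustration, and the sketch only records whether a value is present.

```java
// Direct addressing: each value from 0-99 is stored at the index equal to itself.
class DirectAddressDemo {
    // Returns a table of size 100 in which slot v is true if v was stored.
    static boolean[] store(int[] values) {
        boolean[] present = new boolean[100];
        for (int v : values) {
            present[v] = true; // value v lives at index v
        }
        return present;
    }

    public static void main(String[] args) {
        boolean[] table = store(new int[] {1, 33, 99});
        System.out.println(table[33]); // true: go straight to index 33
        System.out.println(table[34]); // false: nothing was stored there
    }
}
```

Finding a value takes a single array access, no matter how many values are stored.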
With a collection of numbers from 0-99, it's easy to put each item into its proper place by putting 0 into index 0, 1 into index 1, 2 into index 2, etc. But what do you do in the real world? What if you want to store information about people in a medical database and find their information quickly in an emergency?
If you have a database, based on a hash table, of Person objects, you assign a number (ideally unique), called a key, to each Person object when you insert it. The process we use to map data to keys is called hashing. To hash, say, a Person object to a number, we write a hash function.

Hashing and Hash Functions

If you're hashing Person objects, one simple hash function could be to assign a number to each letter of the person's first name: A=1, B=2, C=3, etc. Then, if you store a Person object with a firstName field of "Fred," you add up F=6, R=18, E=5, D=4 and store Fred at array index 6+18+5+4=33.
This simple hash function converted, or hashed, the Person object, Fred, into a number. All inserts, lookups, and removals of Fred will be based on this number. The hash function is a mapping from data (e.g., Person objects) to numbers between 0 and one less than the size of the hash table.
A hash function must map the data to a number, then return that number modulo the size of the hash table (think of a circular hash table), so that the result is always a valid index.
This process of obliterating the actual item into something you'd never recognize as the real thing -- 33 in the case of Fred -- is called hashing because it comes from the idea of chopping up corned beef into something you'd never recognize as corned beef: corned beef hash. There's no meaningful relationship between the actual data value and the hash key, so there's no practical way to traverse a hash table in any useful order. Hash table items are not in any order. The purpose of hash tables is to provide fast lookups. And the function we used to turn our friend Fred into the number 33 (turning the letters of his first name into numbers and adding them together) is the hash function.
When I want to find Fred in the hash table, I simply use my hash function again (6+18+5+4). Fred is in hash table position 6+18+5+4=33. That's the way a hash table works. You store and find your items with the same hash function so that you can always find what you're looking for, even though, by just eyeballing the hash key 33, you'd have no idea that Fred hashes to it.

Collisions

So if every first name hashes to a different hash key, we're set. It's just like our buckets. We only had one blue tennis ball, which went into the blue bucket, and so on for each color tennis ball. But the real world isn't like that. What if we try to store a Person object Kim into our hash table using our hash function? K=11, I=9, M=13 is 11+9+13=index 33. Now we've got a problem, because Fred is at 33.
When we try to put an item into a spot in the hash table that's already occupied, this is called a collision. We'll talk about how to manage collisions next.
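Here is a minimal Java sketch of the letter-sum hash function, so you can check the arithmetic yourself and watch a collision happen. The class name, method name, and table size of 101 are assumptions made for illustration; Kim is used as a second name whose letters add up to the same total as Fred's.

```java
// Letter-sum hash: A=1, B=2, ..., Z=26, summed over the first name,
// then reduced modulo the table size so the result is a valid index.
class LetterSumHash {
    static int hash(String firstName, int tableSize) {
        int sum = 0;
        for (char c : firstName.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') { // skip anything that is not a letter
                sum += upper - 'A' + 1;
            }
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        int tableSize = 101; // hypothetical table size
        System.out.println(hash("Fred", tableSize)); // 6+18+5+4 = 33
        System.out.println(hash("Kim", tableSize));  // 11+9+13 = 33: same index, a collision
    }
}
```

Any two names whose letters add up to the same total, including all anagrams of a name, collide under this function.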

(Separate) Chaining, a.k.a. Closed Addressing

One common technique for resolving collisions within in-memory hash tables is known as separate chaining. It works like this: each position in the hash table holds a linked list. When we try to insert something into a spot that's already taken, we just make a new node and insert it into that hash table element's list.
This technique is known as separate chaining because each hash table element is a separate chain (linked list). This is easy to do, and this way you always have a place for anything you want to put into the table.
But if there are many collisions, it becomes less efficient, because you can end up with long linked lists at certain array indices and nothing at others. This problem is called clustering.
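A minimal sketch of separate chaining in Java might look like the following. It assumes the letter-sum hash (A=1 ... Z=26) on first names; the class ChainedNameTable and its methods are hypothetical names, and Amy and May are picked only because anagrams always collide under a letter-sum hash.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Separate chaining: every table position holds its own linked list.
class ChainedNameTable {
    private final List<LinkedList<String>> buckets = new ArrayList<>();

    ChainedNameTable(int tableSize) {
        for (int i = 0; i < tableSize; i++) {
            buckets.add(new LinkedList<>()); // start every position with an empty chain
        }
    }

    // Letter-sum hash (A=1 ... Z=26), modulo the table size.
    int indexFor(String name) {
        int sum = 0;
        for (char c : name.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                sum += upper - 'A' + 1;
            }
        }
        return sum % buckets.size();
    }

    // On a collision the new name simply joins the chain at that index.
    void insert(String name) {
        buckets.get(indexFor(name)).addFirst(name);
    }

    // Returns a copy of the chain stored at the given index.
    List<String> chainAt(int index) {
        return new ArrayList<>(buckets.get(index));
    }

    public static void main(String[] args) {
        ChainedNameTable table = new ChainedNameTable(11);
        table.insert("Amy"); // 1+13+25 = 39, and 39 mod 11 = 6
        table.insert("May"); // same letters, same index
        System.out.println(table.chainAt(6)); // prints [May, Amy]
    }
}
```

Inserting at the front of the chain keeps insertion cheap; the cost shows up later, when a lookup has to walk a long chain.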

Avoiding Collisions

Chaining, at the cost of longer searches, allows a hash table to function despite collisions. But it is certainly better to have few or no collisions than to spend much time searching lists.
The example hash function for Fred is really horrible: any two names whose letters add up to the same total collide. But to make things worse, what if the table size were, say, 11, and the implementation ignored index 0 and put items into indices 1-10? Then only 10 slots are usable, and every item whose hash is a multiple of 10, like 1560, would compete for the same slot and collide!
Avoiding collisions depends on having the right table size (large enough and prime) and an effective hash function.
Table Size should be:

Double the amount of data you expect to put into the table
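A prime number. This is because if my hash table size is, say, 100, then every hash key that's a multiple of 100 maps to index 0 (the key modulo 100), so those keys collide after the first insert. More generally, keys that share a common factor with the table size crowd into a fraction of the slots; a prime table size avoids this.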
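To see why a prime table size helps, the Java sketch below counts how many different slots a run of evenly spaced keys lands in. The class TableSizeDemo and the particular keys (multiples of 10) are assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class TableSizeDemo {
    // Counts the distinct slots used by the keys 0, step, 2*step, ...
    // (count keys in total) when each key is stored at key % tableSize.
    static int distinctSlots(int tableSize, int step, int count) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < count; i++) {
            used.add((i * step) % tableSize);
        }
        return used.size();
    }

    public static void main(String[] args) {
        // 50 keys that are all multiples of 10
        System.out.println(distinctSlots(100, 10, 50)); // 10: only slots 0, 10, ..., 90 get used
        System.out.println(distinctSlots(101, 10, 50)); // 50: every key gets its own slot
    }
}
```

With a size of 100, the keys share the factor 10 with the table size and pile into one slot in ten. With the prime size 101, the same keys spread out.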

A good hash function will scatter the data evenly throughout the hash table. In Java, the Object class, from which all objects in Java descend, has a method called hashCode, which returns a hash code (computed using a hash function) for the object it is called on.
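As a rough sketch of how hashCode fits in, the Java example below overrides it for a Person class and turns the result into a table index. The fields firstName and lastName, the HashCodeDemo helper, and the table size are all assumptions for illustration. Because hashCode can return a negative number, the sketch uses Math.floorMod rather than % so that the index is never negative.

```java
import java.util.Objects;

// A hypothetical Person class that supplies its own hash code.
class Person {
    private final String firstName;
    private final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Two Person objects are equal when both names match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Person)) return false;
        Person p = (Person) other;
        return firstName.equals(p.firstName) && lastName.equals(p.lastName);
    }

    // Equal objects must return equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}

class HashCodeDemo {
    // Maps any object's hash code to an index from 0 to tableSize - 1.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        Person fred = new Person("Fred", "Smith");
        System.out.println(indexFor(fred, 101)); // some index from 0 to 100
    }
}
```

If you override hashCode, override equals as well, so that two objects that compare equal always land at the same index.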

Resizing and Rehashing

So if you want to avoid collisions, it's a good idea to start out with a table size double the number of elements you expect to put into it. But what if you didn't know what you needed to begin with, and now you're having so many collisions that it's slowing things down? You'll have to:

"Resize" the hash table by making a new, larger table
Rehash all of the items in the old table into the new table

You can't simply double the size of the current hash table, because your new table size should be prime. And after creating a new table of prime size, you'll need to rehash all of the items in the old table, because your hash function is taken modulo the size of the hash table, and that size has changed. So you'll need to apply the hash function with the new table size to every item from the old table before putting it into the new table.
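The two steps can be sketched in Java roughly as follows. This is only one way to do it: the class RehashingTable is hypothetical, it stores plain integer keys in chains, and it picks the new size as the first prime above double the old size.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class RehashingTable {
    private List<LinkedList<Integer>> buckets;

    RehashingTable(int tableSize) {
        buckets = emptyBuckets(tableSize);
    }

    private static List<LinkedList<Integer>> emptyBuckets(int size) {
        List<LinkedList<Integer>> fresh = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            fresh.add(new LinkedList<>());
        }
        return fresh;
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime that is greater than or equal to n.
    static int nextPrime(int n) {
        while (!isPrime(n)) {
            n++;
        }
        return n;
    }

    int size() {
        return buckets.size();
    }

    // The index depends on the current table size.
    int indexFor(int key) {
        return Math.floorMod(key, buckets.size());
    }

    void insert(int key) {
        buckets.get(indexFor(key)).add(key);
    }

    int chainLength(int index) {
        return buckets.get(index).size();
    }

    // Step 1: make a new, larger table of prime size.
    // Step 2: rehash every old item using the new size.
    void resize() {
        List<LinkedList<Integer>> old = buckets;
        buckets = emptyBuckets(nextPrime(2 * old.size() + 1));
        for (LinkedList<Integer> chain : old) {
            for (int key : chain) {
                insert(key); // indexFor now uses the new size
            }
        }
    }

    public static void main(String[] args) {
        RehashingTable table = new RehashingTable(11);
        table.insert(28); // 28 mod 11 = 6
        table.insert(39); // 39 mod 11 = 6, a collision
        table.resize();
        System.out.println(table.size());       // 23
        System.out.println(table.indexFor(28)); // 5
        System.out.println(table.indexFor(39)); // 16
    }
}
```

Copying the old chains across without rehashing would leave items at indices where a lookup in the new table would never find them.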

Lookups

So how do you find things stored in a hash table?
You first look for your target at its hash key index in the array.
If there is nothing at the hash key index, then the element you're looking for isn't in the hash table. If the data in that position is what you're looking for then you've found it and you're done.
But if the data in that position is something other than what you're looking for, then it might still be in the hash table, since there could have been a collision while inserting. In this case you resolve the collision, just like you would have done when you inserted it. In a hash table, you go through the same process to find the item that you did to store the item. This allows you to access the item directly (random access) rather than searching the whole table.
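Here is a minimal Java sketch of that lookup process for a table that resolves collisions by separate chaining. It assumes the letter-sum hash (A=1 ... Z=26) on first names; the class NameTable and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class NameTable {
    private final List<LinkedList<String>> buckets = new ArrayList<>();

    NameTable(int tableSize) {
        for (int i = 0; i < tableSize; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Letter-sum hash (A=1 ... Z=26), modulo the table size.
    private int indexFor(String name) {
        int sum = 0;
        for (char c : name.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                sum += upper - 'A' + 1;
            }
        }
        return sum % buckets.size();
    }

    void insert(String name) {
        buckets.get(indexFor(name)).add(name);
    }

    // Uses the same hash function as insert to decide where to look.
    boolean lookup(String name) {
        LinkedList<String> chain = buckets.get(indexFor(name));
        if (chain.isEmpty()) {
            return false; // nothing at the hash key index: not in the table
        }
        for (String stored : chain) { // resolve collisions by walking the chain
            if (stored.equals(name)) {
                return true; // found it
            }
        }
        return false; // the index holds only other names
    }

    public static void main(String[] args) {
        NameTable table = new NameTable(11);
        table.insert("Amy");
        table.insert("May");
        System.out.println(table.lookup("May")); // true: found after walking past Amy
        System.out.println(table.lookup("Mya")); // false: same index, but never stored
        System.out.println(table.lookup("Bob")); // false: its index is empty
    }
}
```

Only the one chain at the hashed index is ever examined, which is what keeps lookups fast as long as the chains stay short.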