15-100 Lecture 26 (Monday, April 7, 2008)

15-100 Lecture 26 (Monday, April 7, 2008)

Bucket Sort

Picture a bunch of buckets all lined up in a row, facing you.

Now imagine that you have 1 green tennis ball, 1 blue tennis ball, 1 yellow tennis ball, 1 purple tennis ball, and 1 red tennis ball, and that you need to put each ball into its proper bucket. Not too hard to do, is it? You simply put each ball into its corresponding bucket. Red maps to red, blue maps to blue, etc.
Now imagine that you have a collection of numbers from 0-99 and an array of size 100. If you simply put the 0 into index 0, 1 into index 1, 2 into index 2, 3 into index 3, etc., you'll have no trouble finding anything in the array. If you want to find 1, you go to index 1. If you want to find 33, you go to index 33. Simple.
With a collection of numbers from 0-99, it's easy to put each item into its proper place by putting 0 into index 0, 1 into index 1, 2 into index 2, etc. But what do you do in the real world? What if you want to store information about people in a medical database and find their information quickly in an emergency?
If you have a database, based on a hash table, of Person objects, you assign a unique number, called a key, to the Person object when you insert it. The process we use to map keys data is called hashing. To hash, say, a Person object to a number, we write a hash function.

Hashing and Hash Functions

If you're hashing Person objects, one simple hash function could be to assign a number to each letter of the person's first name; A=1, B=2, C=3, etc. Then, if you store a Person object with a firstName field of "Fred," you can store Fred in F=6, R=13, E=5, D=4, 6+13+5+4= array index 28.
This simple hash function converted, or hashed, the Person object, Fred, into a number. All inserts, lookups, and removals of Fred will be based on this number. The hash function is a mapping from data (e.g., Person objects) to numbers between 0 and size of the hash table.
A hash function must map the data to a number, then return that number modulo the size of the hash table (think of a circular hash table).
This process of obliterating the actual item into something you'd never recognize as the real thing -- 28 in the case of Fred -- is called hashing because it comes from the idea of chopping up corned beef into something you'd never recognize as corned beef - corned beef hash. There's no meaning between the actual data value and the hash key, so there's no practical way to traverse a hash table. Hash table items are not in any order. The purpose of hash tables is to provide fast lookups. And the function we used to turn our friend Fred into the number 28 (turning the letters of his first name into numbers and adding them together) is the hash function.
When I want to find Fred in the hash table, I simply use my hash function again (6+13+5+4). Fred is in hash table position 6+13+5+4. That's the way a hash table works. You store and find your items with the same hash function so that you can always find what you're looking for, even though, by just eyeballing the hash key 28, you'd have no idea that it hashes to Fred.

Collisions

So if every first name hashes to a different hash key, we're set. It's just like our buckets. We only had one blue tennis ball, which went into the blue bucket, and so on for each color tennis ball. But the real world isn't like that. What if we try to store a Person object Ned into our hash table using our hash function? N=19, E=5, D=4 is 19+5+4=index 28. Now we've got a problem, because Fred is at 28.
When we try to put an item into a spot in the hash table that's occupied, this is called a collision. We'll talk about how to manage collision next.

(Separate) Chaining, a.k.a. Closed Addressing

One common technique for resolving collisions within in memory hash tables is known as separate chaining. It works like this: each position in the hash table is a linked list. When we try to insert something into a spot that's already taken, we just make a new node and insert it into that hash table element's list.
This technique is known as separate chaining, because each hash table element is a separate chain (linked list). This is easy to do, and this way, you always have a place for anything you want to put into the table.
But, if there are many collisions, it can become less efficient, because, you can end up with long linked lists at certain array indices and nothing at others. This problem is called clustering.

Linear Probing

Linked lists are reasonably efficient in memory, but they are very expensive on disk. The reason for this is that main memory access time is (more-or-less) constant -- it does not vary by (if we ignore caching, etc). So, we can skip around RAM to access each node of the linked list without penalty. But, when it comes to disks, it is much cheaper to access the next sector than the prior sector or another track. This is because it can take milliseconds to move the disk heads and wait for the disk to rotate into the right position. This delay can be avoided if we read linearly -- the next sector will be readable as soon as we're done with the first. And, with caching, many nearby sectors, a whole track, are often read at once.
As a result for hash tables on disk, we need to use other techniques to manage collision. These techniques are generally known as probing, or looking for the record in other buckets. Basically, if we can't insert the record in its asigned bucket, we try to use other buckets, until we can insert it. Similarly, when we must look for the record, we search (or probe) for it beginning with its assigned bucket, but then through other buckets.
Linear probing simply tries the next bucket, and then the one after that, &c, wrapping around when needed, until it finds an empty bucket. So if Fred hashes to 28, but there's already something at index 28, we just put him into index 29. If, when we try to put Fred into position 29, we find that position 29 is filled, too, we just keep going, trying position 30, then 31, &c, until we find an empty place for Fred.
Linear probing is easily implemented, but often suffers from a problem known as primary clustering. Although the hashn function should uniformly distribute the records across the address space, sometimes clusters appear in the distribution. This might be the result of similarities in the original data that aren't randomized by the hash function, or side-effects of the hash function, itself.
Regardless, if linear probing is used, it might spend a significant amount of time probing within a cluster, instead of "getting past the crowd" and using the subsequent available space.

Quadratic Probing

Quadratic Probing is just like linear probing, except that, instead of looking just trying one ndex ahead each time until it findd an empty index, it takes bigger and bigger steps each time.
On the first collision it looks ahead 1 position. On the second it looks ahead 4 (2²), and on the third collision it looks ahead 9 (3³), wrapping around as necessary (modulo the size of our hash table).
This is just as easy to implement as linear probing, and tends to step beyond primary clusters faster than linear probing.

Double Hashing

When a collision occurs, this approach switches to a second hash function. So if Fred=28 hashes to a filled spot, it will try another hash function, say, using multiplication instead of addition, and hashing to F=6, R=13, E=5, D=4, 6*13*5*4= 1560 (modulo the table size).
In some sense, this is the "blame the hash function" approach. If one hash function seems to be causing, or at least allowing, clustering, it tries another.

Avoiding Collisions

Chaining, Linear and Quadratic Probing, and Double Hashing are ways to resolve collisions. But it's better not to have a collision in the first place.
Now, the example hashing function for Fred is really horrible. But to make things worse, what if the table size were, say, 11, and the implementation ignored index 0 and put items into indices 1-10? Then each time an item hashed to a multiple of 10, like 1560, there would be a collision!
Avoiding collisions depends on having the right table size (large enough and prime) and an effective hash function.
Table Size should be:

Double the amount of data you expect to put into the table
A prime number. This is because if my hash table size is, say, 100, then any hash key that's a multiple of 100 will first try to go into index 100, causing a collision after the first insert.

A good hash function will scatter the data evenly throughout the hash table. In Java, the Object class, from which all objects in Java descend, has a method called hashCode, which returns a hash code (computed using a hash function) of whatever object you give it.

Resizing and Rehashing

So if you want to avoid collisions, it's a good idea to start out with a table size double the number of elements you expect to put into it. But what if you didn't know what you needed to begin with, and now you're having so many collisions that it's slowing things down? You'll have to:

"Resize" the hash table by making a new, larger table
Rehash all of the items in the old table into the new table

You can't simply double the size of the current hash table, because your new table size should be prime. And after creating a new table of size prime, you'll need to rehash all of the items in the old table with a new hash function, because your current hash function is modulo the size of the hash table. So you'll need to apply your new hash function to every item from the old table into the new table before putting them into the new table.

Lookups

So how do you find things stored in a hash table?
You first look for your target at its hash key index in the array.
If there is nothing at the hash key index, then the element you're looking for isn't in the hash table. If the data in that position is what you're looking for then you've found it and you're done.
But if the data in that position is something other than you're looking for, then it might still be in the hash table, since there could have been a collision while inserting. In this case you resolve the collision, just like you would have done when you inserted it. In a hash table, you go through the same process to find the item that you did to store the item. This allows you to randomly access the item.

Deletes

How do you delete an item from a hash table?
First you perform a lookup and find the item. After you've found the item, if you're resolving collisions using chaining, then the data can be removed from the linked list. But what if you used linear or quadratic probing to resolve collisions? Removing items would cause serious problems. If you remove an item from a hashtable like you would in a binary tree, you won't be able to find previous items that collided with it, and you'll end up with a mess.
So you don't literally delete items. You just mark that position in the array. These marked items are sometimes called tombstones. Future inserts can overwrite these markers, but lookups treat them as collisions. Without these markers, we might insert two elements with the same hash value, then remove the first one and leave the second item unreachable.

Using Hash Tables

We've talked a lot about binary trees this semester. They're great for finding things in log₂n time. If you have a balanced binary tree of 10,000 items, you'll find what you're looking for in no more than 13 steps (the tree will be no more than 13 levels deep). But if you want to access something "instantly," hash tables provide the random access you need to find something almost immediately.