
15-200 Lecture 8 (Friday, September 15, 2006)

Linear Probing

Dynamic data structures are reasonably efficient in memory, but they are very expensive on disk. The reason for this is that main memory access time is (more-or-less) constant -- it does not vary by location (if we ignore caching, etc.). So, we can skip around RAM to access parts of an array list or linked list without penalty. But, when it comes to disks, it is much cheaper to access the next sector than the prior sector or another track. This is because it can take milliseconds to move the disk heads and wait for the disk to rotate into the right position. This delay can be avoided if we read linearly -- the next sector is readable as soon as we're done with the first. And, with caching, many nearby sectors, often a whole track, are read at once.

As a result, for hash tables on disk, we need other techniques to manage collisions. These techniques are generally known as probing: looking for the record in other buckets. Basically, if we can't insert the record into its assigned bucket, we try other buckets until we can insert it. Similarly, when we must look for the record, we search (or probe) for it beginning with its assigned bucket, and then moving through other buckets.

Linear probing simply tries the next bucket, and then the one after that, &c, wrapping around when needed, until it finds an empty bucket. So if Fred hashes to 28, but there's already something at index 28, we just put him into index 29. If, when we try to put Fred into position 29, we find that position 29 is filled, too, we just keep going, trying position 30, then 31, &c, until we find an empty place for Fred.
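In code, an insert that resolves collisions this way might look like the following minimal sketch. The class, Entry, and hash function here are all our own illustrative names, not a prescribed implementation, and it assumes the table never becomes completely full.

    // A minimal sketch of a hash table that resolves collisions
    // with linear probing.
    public class ProbingHashTable {

        private static class Entry {
            String key;
            Object value;
            Entry(String key, Object value) { this.key = key; this.value = value; }
        }

        private Entry[] table = new Entry[11];   // a small prime size (see below)

        // One possible hash function: the key's built-in hashCode,
        // modulo the table size.
        private int hash(String key) {
            return Math.abs(key.hashCode() % table.length);
        }

        // Insert with linear probing: if the home bucket is occupied,
        // try the next one, wrapping around, until an empty bucket
        // turns up. (Assumes the table is never completely full.)
        public void insert(String key, Object value) {
            int index = hash(key);
            while (table[index] != null) {
                index = (index + 1) % table.length;
            }
            table[index] = new Entry(key, value);
        }
    }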

Linear probing is easily implemented, but often suffers from a problem known as primary clustering. Although the hash function should uniformly distribute the records across the address space, sometimes clusters appear in the distribution. This might be the result of similarities in the original data that aren't randomized by the hash function, or of side-effects of the hash function itself.

Regardless, if linear probing is used, it might spend a significant amount of time probing within a cluster, instead of "getting past the crowd" and using the subsequent available space.

Quadratic Probing

Quadratic probing is just like linear probing, except that, instead of trying just one index ahead each time until it finds an empty index, it takes bigger and bigger steps each time.

On the first collision it looks ahead 1 position. On the second it looks ahead 4 (2*2), and on the third collision it looks ahead 9 (3*3), wrapping around as necessary (modulo the size of our hash table).

This is just as easy to implement as linear probing, and tends to step beyond primary clusters faster than linear probing.
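Continuing the sketch above, only the probe computation changes: on the i-th collision we look ahead i*i positions from the home bucket. (As before, this assumes the table never fills; quadratic probing can also cycle without visiting every bucket, so in practice the table must be kept sparse.)

    // Quadratic probing: on the i-th collision, try the bucket
    // i*i positions past the home bucket, wrapping around.
    public void insertQuadratic(String key, Object value) {
        int home = hash(key);
        int index = home;
        int i = 0;
        while (table[index] != null) {
            i++;
            index = (home + i * i) % table.length;  // looks ahead 1, 4, 9, 16, ...
        }
        table[index] = new Entry(key, value);
    }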

Double Hashing

When a collision occurs, this approach switches to a second hash function. So if Fred=28 hashes to a filled spot, it will try another hash function, say, using multiplication instead of addition: with F=6, R=13, E=5, D=4, that hashes to 6*13*5*4 = 1560 (modulo the table size).

In some sense, this is the "blame the hash function" approach. If one hash function seems to be causing, or at least allowing, clustering, it tries another.
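As a sketch of this version of double hashing -- re-hashing with a second, different function on a collision, then probing onward from there if needed -- we might add something like the following to the class above. The hash2 here is a made-up multiplicative function using alphabet positions, so its letter values won't exactly match the Fred example's numbers.

    // A hypothetical second hash function: multiply letter values
    // (A=1, B=2, ...) instead of adding them. Illustrative only.
    private int hash2(String key) {
        int product = 1;
        for (char c : key.toUpperCase().toCharArray()) {
            product *= (c - 'A' + 1);
        }
        return Math.abs(product % table.length);
    }

    // On a collision, switch to the second hash function; if that
    // bucket is taken too, fall back to probing from there.
    public void insertDoubleHash(String key, Object value) {
        int index = hash(key);               // first hash function
        if (table[index] != null) {
            index = hash2(key);              // "blame the hash function"
        }
        while (table[index] != null) {
            index = (index + 1) % table.length;
        }
        table[index] = new Entry(key, value);
    }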

Avoiding Collisions

Chaining, Linear and Quadratic Probing, and Double Hashing are ways to resolve collisions. But it's better not to have a collision in the first place.

Now, the example hash function for Fred is really horrible. But to make things worse, what if the table size were, say, 11, and the implementation ignored index 0 and put items into indices 1-10? Then there would effectively be only 10 usable buckets, and every item that hashed to a multiple of 10, like 1560, would land in the same bucket and collide. With a multiplicative hash, that happens often: any product whose letter values include the factors 2 and 5 is a multiple of 10.

Avoiding collisions depends on having the right table size (large enough and prime) and an effective hash function.

Table Size should be:

- large enough -- roughly double the number of elements you expect to store -- so that the table stays sparse, and
- prime, so that taking hash values modulo the size spreads items across all of the buckets.

A good hash function will scatter the data evenly throughout the hash table. In Java, the Object class, from which all Java classes descend, has a method called hashCode, which returns a hash code (computed using a hash function) for the object on which it is called.
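For example, here is a tiny self-contained demonstration. (Turning the hash code into a bucket index with modulo, as below, is just one common approach.)

    public class HashCodeDemo {
        public static void main(String[] args) {
            String key = "Fred";
            int tableSize = 11;  // a prime table size

            // String overrides Object's hashCode() with a hash of its
            // characters; Math.abs guards against a negative hash code.
            int index = Math.abs(key.hashCode() % tableSize);

            System.out.println(key + " hashes to bucket " + index);
        }
    }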

Resizing and Rehashing

So if you want to avoid collisions, it's a good idea to start out with a table size double the number of elements you expect to put into it. But what if you didn't know what you needed to begin with, and now you're having so many collisions that it's slowing things down? You'll have to:

- create a new, larger table with a prime size, and
- rehash every item from the old table into the new one.

You can't simply double the size of the current hash table, because your new table size should be prime. And after creating a new table of prime size, you'll need to rehash all of the items from the old table, because your current hash function is modulo the size of the hash table. So you'll need to apply the new hash function to every item from the old table before putting it into the new table.
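A sketch of resizing, continuing the hash table class from earlier. The nextPrime helper is our own, not a library method.

    // Grow the table: pick a new prime size at least double the old
    // one, then rehash every existing entry into the new table.
    private void resize() {
        Entry[] oldTable = table;
        table = new Entry[nextPrime(2 * oldTable.length)];
        for (Entry entry : oldTable) {
            if (entry != null) {
                insert(entry.key, entry.value);  // hash() now uses the new size
            }
        }
    }

    // Find the next prime at or above n, by trial division.
    private static int nextPrime(int n) {
        while (!isPrime(n)) {
            n++;
        }
        return n;
    }

    private static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }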

Lookups

So how do you find things stored in a hash table?

You first look for your target at its hash key index in the array.

If there is nothing at the hash key index, then the element you're looking for isn't in the hash table. If the data in that position is what you're looking for, then you've found it and you're done.

But if the data in that position is something other than what you're looking for, the item might still be in the hash table, since there could have been a collision while inserting it. In this case, you resolve the collision just as you would have during insertion: in a hash table, you go through the same process to find an item that you went through to store it. This is what allows you to access the item randomly -- that is, directly, without a sequential search.
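In code, a lookup under linear probing retraces the insert's probe sequence, continuing the same sketch as before. (Like the insert, it assumes the table never completely fills, so probing always reaches an empty bucket.)

    // Look up a key: start at its home bucket and probe exactly as
    // insert did, stopping at a match or at an empty bucket.
    public Object lookup(String key) {
        int index = hash(key);
        while (table[index] != null) {
            if (key.equals(table[index].key)) {
                return table[index].value;       // found it
            }
            index = (index + 1) % table.length;  // a collision: keep probing
        }
        return null;  // an empty bucket means the key isn't in the table
    }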

Deletes

How do you delete an item from a hash table?

First you perform a lookup to find the item. After you've found it, if you're resolving collisions using chaining, the data can simply be removed from its bucket. But what if you used linear or quadratic probing to resolve collisions? Removing items outright would cause serious problems. If you remove an item from a hash table the way you would from a binary tree, you won't be able to find items that collided with it and were stored further along the probe sequence, and you'll end up with a mess.

So you don't literally delete items. You just mark that position in the array as deleted. These markers are sometimes called tombstones. Future inserts can overwrite these markers, but lookups treat them as collisions and keep probing. Without these markers, we might insert two elements with the same hash value, remove the first one, and leave the second item unreachable.
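Here is a sketch of tombstone deletion for the probing table above. TOMBSTONE is our own sentinel name; insert and lookup would also need small changes so that inserts reuse tombstoned buckets and lookups probe past them.

    // A sentinel entry marking a deleted bucket.
    private static final Entry TOMBSTONE = new Entry(null, null);

    // Delete by marking: find the key along its probe sequence,
    // then leave a tombstone instead of emptying the bucket, so
    // later lookups keep probing past this position.
    public void delete(String key) {
        int index = hash(key);
        while (table[index] != null) {
            if (table[index] != TOMBSTONE && key.equals(table[index].key)) {
                table[index] = TOMBSTONE;        // marked, not removed
                return;
            }
            index = (index + 1) % table.length;
        }
        // Hit an empty bucket: the key was never in the table.
    }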