
15-111 Lecture 19 (Wednesday, June 16, 2004)
Hash tables

Now we're going to talk about a great way of accessing values. So far we've learned about indexed data structures, such as arrays, ArrayLists, and Vectors, and about structures built on relationships between elements, such as LinkedLists and BSTs.

We're now going to learn a technique called hashing, which stores items in buckets. Think of how a department works: each faculty member has his or her own office, so if you want to see someone, you go to that office. If you want to contact a person, you call their cellphone.

We can model this in Java as well, but how would you find the right office or phone number?

We're going to make a magic function called getBucket(): given an item, it tells you which bucket that item belongs in. If you want to add something to the table, call getBucket() and put the item in the bucket it names. To retrieve something, call getBucket() and look for the item in that bucket.

To make this getBucket() method, you pass in an object, and the method crunches the object's many properties together to produce a number. This number has to look randomized, in the sense that if you pass in a different object, it should (ideally) return a different number. However, every time you pass in the same object, it must return the same number, and the distribution of the numbers has to be uniform across the table. We treat this number as an address for where the object's bucket lies.
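Here's a minimal sketch of what such a getBucket() method might look like. The Person class, its getName() and getPhone() accessors, and the table array are all made up for illustration; the exact constants don't matter, as long as the same object always crunches down to the same number.

    // A sketch of getBucket() for a hypothetical Person class.
    public int getBucket(Person p) {
        int h = 17;                              // arbitrary starting value
        h = 31 * h + p.getName().hashCode();     // crunch the properties together
        h = 31 * h + p.getPhone().hashCode();
        h = h & 0x7fffffff;                      // force the result non-negative
        return h % table.length;                 // treat it as a bucket address
    }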

Often this is implemented as a key-value relationship: you pass the table some sort of key, and it returns the value associated with it. This is often called a map or dictionary.
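Java's own libraries provide exactly this: java.util.HashMap (and the older Hashtable) is a hash table keyed this way. A small usage sketch, with made-up names and office numbers:

    import java.util.HashMap;

    public class OfficeDirectory {
        public static void main(String[] args) {
            // Key: the faculty member's name.  Value: the office.
            HashMap<String, String> offices = new HashMap<String, String>();
            offices.put("Smith", "Wean 8212");
            offices.put("Jones", "Wean 4101");

            // Hand the table a key; it hashes it and returns the value.
            System.out.println(offices.get("Smith"));   // prints Wean 8212
        }
    }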

This method is not actually magic, though; it's just a scrambling function. You very well might get two independent objects that map to the same bucket. This is called a collision. With plenty of open space the problem is rare, but what if you have 11 items to store and only 10 buckets? Some bucket must hold two items (the pigeonhole principle), so a collision is guaranteed. If you have 10 items and 1 million buckets, is a collision likely? It's possible, but not very likely. In practice, filling a hash table to about 40% of its capacity is a good ratio: if you have 40 items, you'd want a table size of around 100.

One solution to collisions is an approach called open-chaining (often known as separate chaining). Rather than buckets consisting of single Object references, each bucket holds a LinkedList. Every time you have a collision, just add the item to the end of that bucket's list.
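Here's a minimal sketch of an open-chained table holding Strings. The size and hash function are placeholders; the point is that each bucket is a LinkedList, so a collision just makes a list grow.

    import java.util.LinkedList;

    public class ChainedHashTable {
        private LinkedList<String>[] buckets;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int size) {
            buckets = new LinkedList[size];
            for (int i = 0; i < size; i++)
                buckets[i] = new LinkedList<String>();
        }

        private int getBucket(String item) {
            return (item.hashCode() & 0x7fffffff) % buckets.length;
        }

        public void add(String item) {
            buckets[getBucket(item)].add(item);     // collisions go on the end
        }

        public boolean contains(String item) {
            return buckets[getBucket(item)].contains(item);
        }
    }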

Like quicksort, hashing has great average-case behavior but a bad worst case. In practice it is O(1) on average, constant time: you take your item, hash it, and go straight to the bucket. However, what if every single item gets hashed to one spot? Then your hash table has degenerated into a linked list, which must be searched element by element and thus has an O(n) worst case.

The next approach is often used for disks. A hard disk must first seek, moving the read/write head over to the right track, then wait for the platter to rotate to the right point, and only then transfer the data. These are very time-consuming steps, so linked lists don't work well on disk: you don't want to jump back and forth between completely different parts of the disk. Instead, an approach called linear probing is usually used. If a collision occurs in bucket 10, for example, you try the next buckets immediately: seeing 10 is full, try 11, then if that's full try 12, then 13, and so on until you find an open one. To read, you do the same walk: start at the bucket the hash gives you and step forward until you either find the item or hit an empty bucket, which means the item isn't there.
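A sketch of linear probing over an array of Strings, assuming the table never fills completely (a real implementation would grow it before that). null marks an empty bucket:

    public class LinearProbingTable {
        private String[] table = new String[11];

        private int getBucket(String item) {
            return (item.hashCode() & 0x7fffffff) % table.length;
        }

        public void add(String item) {
            int i = getBucket(item);
            while (table[i] != null)           // bucket full: try the next
                i = (i + 1) % table.length;    // wrap around at the end
            table[i] = item;
        }

        public boolean contains(String item) {
            int i = getBucket(item);
            while (table[i] != null) {         // an empty bucket ends the search
                if (table[i].equals(item))
                    return true;
                i = (i + 1) % table.length;
            }
            return false;
        }
    }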

This creates something called clustering, where records clump together instead of keeping a nice uniform distribution. To avoid this, we can use what is called quadratic probing: if the bucket your hash returns is full, you try the bucket 1^2 away, then 2^2 away, then 3^2, then 4^2, and so on. This tends to spread out the values some. Another technique is called double hashing: you hash, try to insert the item, and if the bucket is full you hash again with a second function and probe at the sum of the two hash values, stepping by the second value until you find an open bucket.
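One common way to write these probe sequences, assuming home is the bucket the first hash returned, step is the value of the second hash function, and size is the table length:

    public class Probing {
        // Quadratic probing: attempt 1 lands 1*1 away, attempt 2 lands
        // 2*2 away, attempt 3 lands 3*3 away, and so on.
        static int quadraticProbe(int home, int attempt, int size) {
            return (home + attempt * attempt) % size;
        }

        // Double hashing: the second hash supplies the step, so attempt i
        // lands at home + i * step (everything mod the table size).
        static int doubleHashProbe(int home, int step, int attempt, int size) {
            return (home + attempt * step) % size;
        }
    }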

With all this probing, how are we going to remove? Suppose we add an item that takes 3 hops to insert, and then someone else deletes the value that caused the collision. If you search for our item now, you reach an empty spot partway along the probe path. What do you conclude? You would probably think the item doesn't exist. So when you delete, you can't leave that spot 100% empty, or else later searches will give up too early. You don't want your old data there, but you have to fill the spot with a marker that says something was once there. These markers are often referred to as tombstones. You can remove a tombstone only by writing over it.
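Extending the linear-probing sketch above, a tombstone can be a sentinel object that is distinct from every real entry. Searches probe right past it; only a truly empty (null) bucket stops them:

    private static final String TOMBSTONE = new String("<deleted>");

    public void remove(String item) {
        int i = getBucket(item);
        while (table[i] != null) {
            if (table[i] != TOMBSTONE && table[i].equals(item)) {
                table[i] = TOMBSTONE;          // mark "something was once here"
                return;
            }
            i = (i + 1) % table.length;
        }
    }

    public boolean contains(String item) {
        int i = getBucket(item);
        while (table[i] != null) {             // tombstones keep the probe going
            if (table[i] != TOMBSTONE && table[i].equals(item))
                return true;
            i = (i + 1) % table.length;
        }
        return false;
    }

A real add() would also reuse tombstone slots, writing over them as described.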

When you create your hash function, the common practice is to produce a very large value to prevent collisions, and then mod that number by the table size to get an index. Because of this, the size of your table should be a prime number, for reasons only a mathematician can fully explain (roughly, a prime size avoids patterns when the hash values share common factors with the table size). What if you want to increase the size of your table? Once you take the modulus by the new table size, your values will all point to different indices than before. So if you want to grow the table, you have to do what is called rehashing and move each of the elements to its new bucket.
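A sketch of rehashing, again building on the linear-probing table above: allocate the bigger array (rounded up to a prime, in a careful implementation) and re-insert every live item, since each one's hash mod the new size points somewhere new. Tombstones are simply dropped along the way:

    public void rehash(int newSize) {
        String[] old = table;
        table = new String[newSize];
        for (int i = 0; i < old.length; i++) {
            if (old[i] != null && old[i] != TOMBSTONE)
                add(old[i]);                   // recompute bucket in the new table
        }
    }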