15-200 Lecture 4 (Wednesday, January 25, 2006)
Indexed Data Structures

Indexed data structures are a means of storing information in a manner that lets us retrieve data by its index. In an abstract sense it does not matter whether it is an array, an ArrayList, or a Vector. They are all rich data structures, but when you take two steps back and look at them, you basically have an array. They are all indexed structures. This means instant access to an item in the array by address. Accessing data by address is known as random access: there is no penalty (or cost) for jumping around in the data structure. There is also sequential access: we can walk through the items forwards or backwards, in order.

So indexed data structures offer instant random access as well as sequential access.

What is the cost of jumping right to something in this array? We do not know exactly, so we will call it C. The cost is the same no matter which element we are accessing, so C is a constant. If we want to travel through this array forwards or backwards, the cost is C for each element: if there are 10 elements in the array the cost is 10*C, and if there are n elements the cost is n*C.

Getting something by index is what we call a constant-time operation: no matter how long the array is, the cost of getting to the third element is always the same. The total cost of a traversal, on the other hand, is directly proportional to the number of elements in the array.

Knowing these costs helps us understand other operations on the data structure. If the array is completely out of order, then finding something inside it will cost n*C in the worst case. If I get lucky, I could take my first step and find it. But suppose I am not lucky: I walk through the whole array and the item isn't there. The cost of that is n*C, because I had to visit every element of the array to get to the end.
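A minimal Java sketch of these two costs (the method names are mine, not from the lecture):

    // Constant-time access: one step, regardless of i.
    static int getAt(int[] a, int i) {
        return a[i];
    }

    // Linear search: the worst case (item absent) visits all n elements.
    static int indexOf(int[] a, int target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) {
                return i;   // the lucky case: found it early
            }
        }
        return -1;          // walked the whole array and it wasn't there
    }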

The Big-O Notation

Since I have no idea what C is, I am just going to factor out the C and call it 1. Big-O of 1, O(1), is constant time. Big-O notation is an approximate notation: it is there to capture the shape of the curve, the big picture, and that is the main reason I can throw out the C. Mathematicians may cringe, but for computer programmers constants like 2, 10, and C are all the same, and we do not care about them when it comes to Big-O notation.

As n grows, the curve takes its shape. In this notation I have thrown out the constant, and I also throw out the lesser terms: if the cost were (n + 2), I would have thrown out the +2 and called it O(n).

Do you remember end behavior? If you take a look at x^2, everything that has x^2 as its largest term has the same basic shape. Big-O is concerned with that basic shape, the big picture: the end behavior.

Say we had 3n^3 + 9n^2 + (1/2)n + 2000000

The 2000000 may be significant for smaller data sets, but if I have enough items to amortize it against, it stops mattering. Once n itself is large, the 2000000 is negligible compared to the first term. It is the 3n^3 that is going to control the end behavior. And if you remember from high school algebra, we don't care about the coefficient 3 either, because eventually the n^3 term will be significantly bigger at the end behavior, no matter what. In the end, all that matters is the n^3.
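To see this concretely, plug in n = 1000 (my numbers, not from the lecture):

    3n^3    = 3,000,000,000
    9n^2    =     9,000,000
    (1/2)n  =           500
    2000000 =     2,000,000

The three lesser terms together are about 11 million, well under one percent of the 3n^3 term, and the gap only widens as n grows.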

So in introducing Big-O notation, I am talking about worst-case behavior and end behavior.

But what about an insertion? What if I want to insert something in the middle and push everything else back a space? Now what is the cost going to be?

If I have n items and insert at index x, the detailed cost is ((n-1)-x) shifts, but the Big-O of the cost is O(n): the -1 and the x are lower-order details that get thrown away. If we take two steps backwards, we know that if we have to insert in the first slot, 0, then we will have to shift n things, so Big-O = O(n).

We know from programming that each shift is a variable assignment, and all variable assignments cost roughly the same, so O(n) makes sense.
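A minimal sketch of that insert, assuming a partially filled array with room to spare (count is the number of slots in use; the names are mine):

    // Insert item at index x by shifting elements x..count-1 right one slot.
    // Worst case x = 0: the loop does count assignments, hence O(n).
    static void insertAt(int[] a, int count, int x, int item) {
        for (int i = count; i > x; i--) {
            a[i] = a[i - 1];   // one variable assignment per shift
        }
        a[x] = item;
    }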

What if I flip this problem around and want to remove an item from this list? From first principles the cost would be ((n-1)-x), so the Big-O is O(n). Intuition agrees: in the worst case I am removing the 0th element, so I am shifting all the other elements, and each shift is still the same variable assignment operation. So for removal the Big-O is still O(n).
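And the mirror-image sketch for removal (same assumptions as the insert above):

    // Remove the element at index x by shifting elements x+1..count-1 left.
    // Worst case x = 0: count - 1 assignments, still O(n).
    static void removeAt(int[] a, int count, int x) {
        for (int i = x; i < count - 1; i++) {
            a[i] = a[i + 1];
        }
    }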

So we see that, for an indexed data structure:

    access by index:  O(1)
    traversal:        O(n)
    search:           O(n)
    insertion:        O(n)
    removal:          O(n)

Generally speaking, when we are talking about Big-O we are talking about time. The old thinking was that storage can be bought, but processor speed is hard to come by. Therefore Big-O usually refers to time unless it is specified to be for space.

Insertion Sort

What if we want to sort the array? There are many ways to do this, and we will study more advanced algorithms later in the course, but for now we will use a very simple sorting algorithm, Insertion Sort.

We already know we can insert an object into a given position of an array in linear time. However, if we know that our array is sorted, we can go one step further and insert an object into the array in order.

For example, if we have an array containing A,B,D,E, and we wanted to insert C, we can insert it such that our final array is A,B,C,D,E. We accomplish this by simply traversing the array until we encounter an element that comes after the object we are inserting, and inserting the new object in that position. Like before, we then have to shift everything to the right of this position one spot over.

Like the regular insert, our worst case is still when we have to insert into the first position, meaning every single element in the array has to be shifted to the right. So our insertInOrder algorithm will also be O(n).
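Here is what insertInOrder might look like in Java (a sketch; the lecture describes scanning forward for the spot and then shifting, while this version finds the spot and shifts in one right-to-left pass, doing the same O(n) work):

    // Insert item into the first count slots of sorted (already in order),
    // keeping them in order. Larger elements are shifted right as we scan.
    static void insertInOrder(int[] sorted, int count, int item) {
        int i = count;
        while (i > 0 && sorted[i - 1] > item) {
            sorted[i] = sorted[i - 1];   // shift right
            i--;
        }
        sorted[i] = item;
    }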

Now that we have an insertInOrder algorithm that is O(n), we can think about Insertion Sort. We simply make a new empty array, and insertInOrder every element from the original array. Based on the above properties of insertInOrder, once we've inserted everything into the new array, the new array will contain all the elements in sorted order.

So what is the Big-O for Insertion Sort? Well, we have to call insertInOrder on n items. Each of these inserts is O(n), which has a cost of about C*n, so our total cost will be about C*n^2. Like before, we don't care about the constant when determining Big-O, so our Insertion Sort is O(n^2).
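Putting the pieces together with the insertInOrder sketch above:

    // Insertion Sort: insertInOrder every element into a new array.
    // n inserts at O(n) each gives O(n^2) overall.
    static int[] insertionSort(int[] original) {
        int[] sorted = new int[original.length];
        for (int count = 0; count < original.length; count++) {
            insertInOrder(sorted, count, original[count]);
        }
        return sorted;
    }

Building a second array keeps the sketch close to the lecture's description; the classic in-place version instead treats the already-processed front of the original array as the "new" array.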

Bucket Sort

Picture a bunch of buckets all lined up in a row, facing you.

Now imagine that you have 1 green tennis ball, 1 blue tennis ball, 1 yellow tennis ball, 1 purple tennis ball, and 1 red tennis ball, and that you need to put each ball into its proper bucket. Not too hard to do, is it? You simply put each ball into its corresponding bucket. Red maps to red, blue maps to blue, etc.

Now imagine that you have a collection of numbers from 0-99 and an array of size 100. If you simply put the 0 into index 0, 1 into index 1, 2 into index 2, 3 into index 3, etc., you'll have no trouble finding anything in the array. If you want to find 1, you go to index 1. If you want to find 33, you go to index 33. Simple.
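As a minimal sketch (assuming the values are distinct and all in the range 0-99; the names are mine):

    // Each value doubles as its own index: value 33 goes into slot 33.
    static int[] intoBuckets(int[] values) {
        int[] buckets = new int[100];
        java.util.Arrays.fill(buckets, -1);   // -1 marks an empty bucket
        for (int v : values) {
            buckets[v] = v;
        }
        return buckets;
    }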

That is easy because each number maps directly onto an array index. But what do you do in the real world? What if you want to store information about people in a medical database and find their information quickly in an emergency?

If you have a database of Person objects based on a hash table, you assign a number, called a key, to each Person object when you insert it. The process we use to map data to keys is called hashing. To hash, say, a Person object to a number, we write a hash function.

Hashing and Hash Functions

If you're hashing Person objects, one simple hash function could be to assign a number to each letter of the person's first name: A=1, B=2, C=3, etc. Then, if you store a Person object with a firstName field of "Fred," you can store Fred at F=6, R=18, E=5, D=4, 6+18+5+4 = array index 33.

This simple hash function converted, or hashed, the Person object Fred into a number. All inserts, lookups, and removals of Fred will be based on this number. The hash function is a mapping from data (e.g., Person objects) to numbers between 0 and the size of the hash table.

A hash function must map the data to a number, then return that number modulo the size of the hash table (think of a circular hash table).
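A sketch of the lecture's letter-sum idea as a Java method (the method name is mine):

    // Letter-sum hash: A=1, B=2, ..., Z=26; sum the letters of the name,
    // then take the result modulo the table size so it lands inside the
    // (circular) table.
    static int hash(String firstName, int tableSize) {
        int sum = 0;
        for (char c : firstName.toUpperCase().toCharArray()) {
            sum += c - 'A' + 1;
        }
        return sum % tableSize;
    }

With a table of size 100, hash("Fred", 100) sums to 33 and returns 33; with a smaller table, the modulo wraps it around.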

This process of obliterating the actual item into something you'd never recognize as the real thing -- 33 in the case of Fred -- is called hashing because it comes from the idea of chopping up corned beef into something you'd never recognize as corned beef: corned beef hash. There's no meaningful relationship between the actual data value and the hash key, so there's no practical way to traverse a hash table; hash table items are not in any order. The purpose of hash tables is to provide fast lookups. And the function we used to turn our friend Fred into the number 33 (turning the letters of his first name into numbers and adding them together) is the hash function.

When I want to find Fred in the hash table, I simply use my hash function again (6+18+5+4): Fred is in hash table position 33. That's the way a hash table works. You store and find your items with the same hash function, so you can always find what you're looking for, even though, by just eyeballing the hash key 33, you'd have no idea that it hashes to Fred.

Collisions

So if every first name hashes to a different hash key, we're set. It's just like our buckets: we only had one blue tennis ball, which went into the blue bucket, and so on for each color of tennis ball. But the real world isn't like that. What if we try to store a Person object Moe into our hash table using our hash function? M=13, O=15, E=5 is 13+15+5 = index 33. Now we've got a problem, because Fred is at 33.

When we try to put an item into a spot in the hash table that's already occupied, that's called a collision. We'll talk about how to manage collisions next.

(Separate) Chaining, a.k.a. Closed Addressing

One common technique for resolving collisions within in-memory hash tables is known as separate chaining. It works like this: each position in the hash table is a linked list. When we try to insert something into a spot that's already taken, we just make a new node and insert it into that hash table element's list.

This technique is known as separate chaining, because each hash table element is a separate chain (linked list). This is easy to do, and this way, you always have a place for anything you want to put into the table.
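A minimal chaining sketch, assuming the hash method from earlier (class and field names are mine):

    import java.util.LinkedList;

    // Each table slot is a chain; a collision just extends the chain.
    class ChainedHashTable {
        private final LinkedList<String>[] chains;

        @SuppressWarnings("unchecked")
        ChainedHashTable(int size) {
            chains = new LinkedList[size];
            for (int i = 0; i < size; i++) {
                chains[i] = new LinkedList<String>();
            }
        }

        void insert(String name) {
            chains[hash(name, chains.length)].add(name);
        }
    }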

But if there are many collisions it can become less efficient, because you can end up with long linked lists at certain array indices and nothing at others. This problem is called clustering.

Linear Probing

Linked lists are reasonably efficient in memory, but they are very expensive on disk. The reason is that main memory access time is (more or less) constant: it does not vary by location (if we ignore caching, etc.). So we can skip around RAM to access each node of the linked list without penalty. But when it comes to disks, it is much cheaper to access the next sector than a prior sector or another track, because it can take milliseconds to move the disk heads and wait for the disk to rotate into the right position. This delay can be avoided if we read linearly: the next sector is readable as soon as we're done with the first. And with caching, many nearby sectors, often a whole track, are read at once.

As a result, for hash tables on disk we need other techniques to manage collisions. These techniques are generally known as probing: looking for the record in other buckets. Basically, if we can't insert the record into its assigned bucket, we try other buckets until we can insert it. Similarly, when we must look for the record, we search (or probe) for it beginning with its assigned bucket, and then through the other buckets.

Linear probing simply tries the next bucket, and then the one after that, etc., wrapping around when needed, until it finds an empty bucket. So if Fred hashes to 33, but there's already something at index 33, we just put him into index 34. If, when we try to put Fred into position 34, we find that position 34 is filled too, we just keep going, trying position 35, then 36, etc., until we find an empty place for Fred.
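A sketch, again assuming the hash method from earlier (and at least one empty slot in the table, or the loop would never terminate):

    // Linear probing: if the home bucket is taken, try the next one,
    // wrapping around, until an empty bucket turns up.
    static void insertLinear(String[] table, String name) {
        int i = hash(name, table.length);     // home bucket
        while (table[i] != null) {
            i = (i + 1) % table.length;       // collision: try the next bucket
        }
        table[i] = name;
    }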

Linear probing is easily implemented, but often suffers from a problem known as primary clustering. Although the hash function should uniformly distribute the records across the address space, sometimes clusters appear in the distribution. These might be the result of similarities in the original data that aren't randomized by the hash function, or side effects of the hash function itself.

Regardless, if linear probing is used, it might spend a significant amount of time probing within a cluster, instead of "getting past the crowd" and using the subsequent available space.

Quadratic Probing

Quadratic Probing is just like linear probing, except that, instead of trying one index ahead each time until it finds an empty index, it takes bigger and bigger steps each time.

On the first collision it looks ahead 1 position (1^2). On the second it looks ahead 4 (2^2), and on the third collision it looks ahead 9 (3^2), wrapping around as necessary (modulo the size of our hash table).

This is just as easy to implement as linear probing, and tends to step beyond primary clusters faster than linear probing.
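The same sketch with quadratic steps from the home bucket. One caveat worth hedging: unlike linear probing, quadratic probing can cycle without ever finding an empty slot unless the table is kept sparse enough (for example, prime-sized and less than half full):

    // Quadratic probing: after k collisions, look k^2 past the home bucket.
    static void insertQuadratic(String[] table, String name) {
        int home = hash(name, table.length);
        int k = 0;
        while (table[(home + k * k) % table.length] != null) {
            k++;   // offsets grow: 1, 4, 9, 16, ...
        }
        table[(home + k * k) % table.length] = name;
    }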

Double Hashing

When a collision occurs, this approach switches to a second hash function. So if Fred=33 hashes to a filled spot, it will try another hash function, say, using multiplication instead of addition, hashing to F=6, R=18, E=5, D=4, 6*18*5*4 = 2160 (modulo the table size).

In some sense, this is the "blame the hash function" approach. If one hash function seems to be causing, or at least allowing, clustering, it tries another.

Avoiding Collisions

Chaining, Linear and Quadratic Probing, and Double Hashing are ways to resolve collisions. But it's better not to have a collision in the first place.

Now, the example hash function for Fred is really horrible. But to make things worse, what if the table size were, say, 11, and the implementation ignored index 0 and put items into indices 1-10? Then each time an item hashed to a multiple of 10, like 2160, there would be a collision!

Avoiding collisions depends on having the right table size (large enough and prime) and an effective hash function.

Table size should be:

- large enough (roughly double the number of elements you expect to store)
- a prime number

A good hash function will scatter the data evenly throughout the hash table. In Java, the Object class, from which all Java classes descend, has a method called hashCode, which returns a hash code (computed using a hash function) for the object it is called on.
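For example, a Person class might override hashCode using the lecture's letter-sum idea (illustrative only; real hash functions are chosen far more carefully, and equals must be overridden to match, since equal objects must have equal hash codes):

    class Person {
        private final String firstName;

        Person(String firstName) { this.firstName = firstName; }

        @Override
        public int hashCode() {
            int sum = 0;
            for (char c : firstName.toUpperCase().toCharArray()) {
                sum += c - 'A' + 1;          // A=1, B=2, ..., Z=26
            }
            return sum;                      // "Fred" -> 33
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Person
                && ((Person) other).firstName.equals(firstName);
        }
    }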

Resizing and Rehashing

So if you want to avoid collisions, it's a good idea to start out with a table size double the number of elements you expect to put into it. But what if you didn't know what you needed to begin with, and now you're having so many collisions that it's slowing things down? You'll have to:

- make a new, larger table (a prime number roughly double the old size), and
- rehash every item from the old table into the new one.

You can't simply double the size of the current hash table, because your new table size should be prime. And after creating a new table of prime size, you'll need to rehash all of the items from the old table, because your hash function is modulo the size of the hash table. So you'll need to apply the new hash function to every item from the old table before putting it into the new table.
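A sketch of the rehash, reusing the hash and insertLinear sketches from earlier so that each item gets a new home under the new modulus:

    // Move every item into a new, larger (prime-sized) table,
    // recomputing each item's hash with the new table size.
    static String[] rehash(String[] oldTable, int newPrimeSize) {
        String[] newTable = new String[newPrimeSize];
        for (String name : oldTable) {
            if (name != null) {
                insertLinear(newTable, name);
            }
        }
        return newTable;
    }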

Lookups

So how do you find things stored in a hash table?

You first look for your target at its hash key index in the array.

If there is nothing at the hash key index, then the element you're looking for isn't in the hash table. If the data in that position is what you're looking for, then you've found it and you're done.

But if the data in that position is something other than what you're looking for, then the element might still be in the hash table, since there could have been a collision during insertion. In that case you resolve the collision, just as you would have done when inserting. In a hash table, you go through the same process to find an item that you went through to store it. This is what gives you random access to the item.
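Here is what the lookup looks like for linear probing (a sketch; with chaining you would instead walk the linked list at the hash index):

    // Probe from the home bucket until we find the name or hit a truly
    // empty slot. An empty slot means the name was never inserted.
    static int lookup(String[] table, String name) {
        int i = hash(name, table.length);
        while (table[i] != null) {
            if (name.equals(table[i])) {
                return i;                     // found it
            }
            i = (i + 1) % table.length;       // collision: keep probing
        }
        return -1;                            // not in the table
    }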

Deletes

How do you delete an item from a hash table?

First you perform a lookup and find the item. After you've found the item, if you're resolving collisions using chaining, the data can simply be removed from the linked list. But what if you used linear or quadratic probing to resolve collisions? Removing items outright would cause serious problems: if you simply empty the slot, probes that should continue past it will stop there, so you won't be able to find items that collided with it, and you'll end up with a mess.

So you don't literally delete items: you just mark that position in the array as deleted. These markers are sometimes called tombstones. Future inserts can overwrite these markers, but lookups treat them as collisions and keep probing. Without these markers, we might insert two elements with the same hash value, then remove the first one and leave the second one unreachable.
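A sketch of a tombstoned delete for the linear-probing table above (TOMBSTONE is my name for the marker; because the marker is non-null, the lookup sketch above naturally probes past it):

    static final String TOMBSTONE = "<deleted>";   // never a real name

    // Replace the item with a tombstone instead of emptying the slot,
    // so later probes continue past it instead of stopping early.
    static void delete(String[] table, String name) {
        int i = hash(name, table.length);
        while (table[i] != null) {
            if (name.equals(table[i])) {
                table[i] = TOMBSTONE;
                return;
            }
            i = (i + 1) % table.length;
        }
    }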