15-111 Lecture 22 (Monday, June 22, 2004)

For the next exam, here's a general skeleton of what you need to know:

Graphs:
  Intro - is a collection of edges and vertices, etc.  
  Terminologies
     - directed vs undirected
     - weighted vs unweighted
     - connected vs unconnected
     - cycles
     - spanning trees
     
  Representing them:
    - adjacency matrix
    - adjacency lists
    - node references
    - tradeoffs of each

Graph algorithms:
  - union-find algorithm for detecting cycles
  - creating spanning trees 
     - depth first approach
     - breadth first approach
     - minimum spanning trees
          - Prim's and Kruskal's algorithms 

  - Dijkstra's shortest path algorithm
  
Bucket algorithms:
  - bucket sort 
  - hashing
    - collision management
      - open-chaining
      - linear probing
      - double hashing
      - quadratic probing
  - tradeoffs/when to use each
  - idea behind hash functions - should provide uniform distribution
  - why the table should be prime in size - we take the key modulus the
    table size, and if the size shares a factor with the keys, the hash 
    values fold up onto only a few buckets
  - examples of using them - e.g. UPC codes and spell checkers
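The prime-table-size point above shows up in a quick experiment. This is a hypothetical demo (the keys and sizes are my own choices): keys that share a factor with the table size "fold up" onto a few buckets, while a prime size spreads them out.

```python
# Keys that are all multiples of 4. With table size 8 (shares the
# factor 4), only a couple of buckets are ever used; with the prime
# size 7, every bucket gets used.
keys = [4, 8, 12, 16, 20, 24, 28]

buckets_8 = {k % 8 for k in keys}   # table size 8: folds onto {0, 4}
buckets_7 = {k % 7 for k in keys}   # table size 7 (prime): all 7 buckets

print(sorted(buckets_8))  # [0, 4]
print(sorted(buckets_7))  # [0, 1, 2, 3, 4, 5, 6]
```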
 
Spanning tree overview
   d -----
  / \    |
 /   \   |  a-b = 3  b-c = 2
a --- c  |  a-c = 2  b-d = 4
 \   /   |  a-d = 6  c-d = 1
  \ /    |
   b -----

Now we want to create a spanning tree. Remember that a spanning tree has to have exactly n-1 edges, and that Kruskal's algorithm relies on the union-find algorithm to detect cycles.

You start out with each possible edge ordered from smallest to largest:

 c-d = 1
 a-c = 2
 b-c = 2
 a-b = 3
 b-d = 4
 a-d = 6 
 

Add the edge from c-d, since it's the smallest. (In the union-find array below, index 0 is unused, -1 marks a root, and any other value is the index of the vertex that entry was joined to: a=1, b=2, c=3, d=4.)

  0   1   2   3  4 
[ X, -1, -1, -1, 3]
      a   b   c   d
 

Add the edge from a-c, since it's the next smallest.

  0   1   2   3  4 
[ X, -1, -1,  1, 3]
      a   b   c  d
 

Now the one from b-c

  0   1   2   3  4 
[ X, -1,  3,  1, 3]
      a   b   c  d
 

Now you've added 3 edges, for a 4 node graph, so you created a spanning tree and thus are done.
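The walkthrough above can be sketched in code. This is a minimal version of Kruskal's algorithm with union-find, run on the example graph; the function names and structure are my own, not anything from lecture.

```python
def find(parent, v):
    # Follow parent pointers until we reach a root (marked -1).
    while parent[v] != -1:
        v = parent[v]
    return v

def kruskal(vertices, edges):
    parent = {v: -1 for v in vertices}          # -1 means "I am a root"
    edges = sorted(edges, key=lambda e: e[2])   # smallest weight first
    tree = []
    for u, v, w in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:                            # different sets: no cycle
            parent[ru] = rv                     # union the two sets
            tree.append((u, v, w))
        if len(tree) == len(vertices) - 1:      # n-1 edges: spanning tree
            break
    return tree

edges = [('a', 'b', 3), ('a', 'c', 2), ('a', 'd', 6),
         ('b', 'c', 2), ('b', 'd', 4), ('c', 'd', 1)]
mst = kruskal('abcd', edges)
print(mst)                        # [('c','d',1), ('a','c',2), ('b','c',2)]
print(sum(w for _, _, w in mst))  # total weight 5
```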

Prim's algorithm solves the same problem in a fairly different way: instead of using union-find, it keeps a table recording whether each vertex is known. Assuming we're starting at d, the smallest path to d is trivially d itself, so mark it as known, putting all of its neighbors into the table.

  known | min path  | length
  --------------------------
a   n   |     d     |    6
b   n   |     d     |    4 
c   n   |     d     |    1
d   y   |     d     |    0
  

The smallest unknown length is 1 (vertex c), so mark c as known, checking all of its neighbors for better paths. We see that from c, we can get to a and b both with a length of 2, which is much better than from d, so update those entries.

  known | min path  | length
  --------------------------
a   n   |     c     |    2
b   n   |     c     |    2 
c   y   |     d     |    1
d   y   |     d     |    0  
 

The next smallest is a, so mark it known and check its neighbors for better paths.

  known | min path  | length
  --------------------------
a   y   |     c     |    2
b   n   |     c     |    2 
c   y   |     d     |    1
d   y   |     d     |    0
 

The next smallest is b, which is the last vertex and already holds its smallest path, so mark it known and we're done.

 
  known | min path  | length 
  -------------------------- 
a   y   |     c     |    2 
b   y   |     c     |    2  
c   y   |     d     |    1 
d   y   |     d     |    0 
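The known/min-path/length table above can be sketched as code. This is a minimal version of Prim's algorithm on the example graph; the names are my own choices, not anything given in lecture.

```python
def prim(graph, start):
    # graph maps each vertex to {neighbor: edge weight}
    known = set()
    length = {v: float('inf') for v in graph}   # the "length" column
    path = {v: None for v in graph}             # the "min path" column
    length[start] = 0
    path[start] = start
    while len(known) < len(graph):
        # pick the smallest unknown entry in the table
        v = min((u for u in graph if u not in known), key=lambda u: length[u])
        known.add(v)
        # check v's neighbors for a cheaper single edge
        for u, w in graph[v].items():
            if u not in known and w < length[u]:
                length[u] = w
                path[u] = v
    return path, length

graph = {'a': {'b': 3, 'c': 2, 'd': 6},
         'b': {'a': 3, 'c': 2, 'd': 4},
         'c': {'a': 2, 'b': 2, 'd': 1},
         'd': {'a': 6, 'b': 4, 'c': 1}}
path, length = prim(graph, 'd')
print(path)    # {'a': 'c', 'b': 'c', 'c': 'd', 'd': 'd'}
print(length)  # {'a': 2, 'b': 2, 'c': 1, 'd': 0}
```

The final `path` and `length` values match the last table above.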
 

Dijkstra's algorithm is very similar to Prim's algorithm, except it compares total path lengths from the start rather than individual edge weights. So if we start at d, we can get to c in one hour, and then from c to b in another two hours. Collectively the best path to b is 3 hours, since you have to go through c before you can reach b.

We start off the same way as in Prim's algorithm. Assuming we're starting at d, the smallest path to d is trivially d itself, so mark it as known, putting all of its neighbors into the table.

  known | min path  | length
  --------------------------
a   n   |     d     |    6
b   n   |     d     |    4 
c   n   |     d     |    1
d   y   |     d     |    0
 

The smallest unknown length is 1 (vertex c), so mark c as known, checking all of its neighbors for better paths. From c we can reach a and b with an edge of length 2 each; adding the length of 1 to get from d to c gives a total distance of 3, which is still better than the direct edges from d, so update those entries.

  known | min path  | length
  --------------------------
a   n   |     c     |    3
b   n   |     c     |    3 
c   y   |     d     |    1
d   y   |     d     |    0  
 

The next smallest is a, so mark it known and check its neighbors for better paths.

 
  known | min path  | length 
  -------------------------- 
a   y   |     c     |    3
b   n   |     c     |    3  
c   y   |     d     |    1 
d   y   |     d     |    0 
 

The next smallest is b, which is the last vertex and already holds its smallest path, so mark it known and we're done.

  
  known | min path  | length  
  --------------------------  
a   y   |     c     |    3  
b   y   |     c     |    3   
c   y   |     d     |    1  
d   y   |     d     |    0  
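As the text says, Dijkstra's is almost the same as Prim's; in a sketch like the one above, the only change is comparing the *total* distance from the start, length[v] + w, instead of the single edge weight w. Again the names here are my own.

```python
def dijkstra(graph, start):
    known = set()
    length = {v: float('inf') for v in graph}
    path = {v: None for v in graph}
    length[start] = 0
    path[start] = start
    while len(known) < len(graph):
        v = min((u for u in graph if u not in known), key=lambda u: length[u])
        known.add(v)
        for u, w in graph[v].items():
            if u not in known and length[v] + w < length[u]:
                length[u] = length[v] + w   # total distance via v
                path[u] = v
    return path, length

graph = {'a': {'b': 3, 'c': 2, 'd': 6},
         'b': {'a': 3, 'c': 2, 'd': 4},
         'c': {'a': 2, 'b': 2, 'd': 1},
         'd': {'a': 6, 'b': 4, 'c': 1}}
path, length = dijkstra(graph, 'd')
print(length)  # {'a': 3, 'b': 3, 'c': 1, 'd': 0}
```

The totals of 3 for a and b (via c) match the final table above.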
 

Tradeoffs between adjacency lists and matrices:

An adjacency matrix for the above example would be a 4x4 matrix, since there are 4 nodes.

   1  2  3  4
  -----------
1| 0  3  2  6
2| 3  0  2  4
3| 2  2  0  1
4| 6  4  1  0
 

This method is extremely wasteful for very sparse graphs, because many entries are allocated but never used. Also, to get a list of a node's adjacencies you would have to visit every single entry in its row, which can be very time consuming.

For this reason, adjacency lists are often used. An adjacency list of the above example would give

1 -> [ (2, 3) -> (3, 2) -> (4, 6) -> null] 

2 -> [ (1, 3) -> (3, 2) -> (4, 4) -> null]

3 -> [ (1, 2) -> (2, 2) -> (4, 1) -> null]

4 -> [ (1, 6) -> (2, 4) -> (3, 1) -> null]
 

You need to store both the adjacent vertex and the weight of the edge to it, plus the node references that link each list together, so for dense graphs this wastes a lot of space in comparison to the adjacency matrix.

Another tradeoff to consider is how you will be using the graph. If you want to answer whether you can get directly from point a to point b, an adjacency matrix gives you the answer instantly; with an adjacency list you would instead have to search the whole list for a to see if b is adjacent, which can be extremely costly. If you want to quickly give a list of all adjacencies, an adjacency list is obviously favored, since one is already built for you; with an adjacency matrix you would have to go through the whole row and keep only the entries that do not have an infinite cost.
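Both queries can be sketched on the 4-node example. This is a hedged illustration (INF marking "no edge" and the variable names are my own; the diagonal in the table above used 0 instead):

```python
INF = float('inf')

# Same weights as the example; indices 0-3 stand for vertices 1-4.
matrix = [[INF, 3, 2, 6],
          [3, INF, 2, 4],
          [2, 2, INF, 1],
          [6, 4, 1, INF]]

adj_list = {0: [(1, 3), (2, 2), (3, 6)],
            1: [(0, 3), (2, 2), (3, 4)],
            2: [(0, 2), (1, 2), (3, 1)],
            3: [(0, 6), (1, 4), (2, 1)]}

# "Is there a direct edge 1-2?": one lookup in the matrix...
print(matrix[0][1] != INF)                  # True
# ...but a scan of vertex 1's whole list in the list representation.
print(any(v == 1 for v, _ in adj_list[0]))  # True

# "All neighbors of vertex 1": already sitting there in the list...
print([v for v, _ in adj_list[0]])          # [1, 2, 3]
# ...but a full row scan in the matrix.
print([j for j, w in enumerate(matrix[0]) if w != INF])  # [1, 2, 3]
```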

Hashing

The job of a hash function is to produce a repeatable number: it should give the same number for a particular object every time. You have to chew on different properties of the object and try to find a combination that creates a uniform distribution.
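One common style of hash function for strings (an illustration of the idea, not necessarily the function from lecture) mixes every character into the result and reduces modulo a prime table size:

```python
def hash_string(s, table_size=101):
    # Polynomial hash: every character of s affects the result, so
    # similar strings still tend to spread across the table.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h

# Repeatable: the same string always lands in the same bucket.
print(hash_string("apple") == hash_string("apple"))  # True
```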

Unfortunately, we often run into the problem of collisions, where many objects map to a single number. There are many ways of handling this. In memory, the best way is often what is called open-chaining: each bucket holds some sort of list, which you append to when there is a collision.
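A minimal open-chaining table might look like this sketch (the class and method names are my own): each bucket is a list, and colliding entries simply go onto the same list.

```python
class ChainedTable:
    def __init__(self, size=7):
        # One list per bucket; collisions share a bucket's list.
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: replace
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # empty slot or collision: append

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None

t = ChainedTable()
t.put("cat", 1)
t.put("dog", 2)
print(t.get("cat"), t.get("dog"))  # 1 2
```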

However, jumping around references on disk is very costly, so what is often used for disks instead is called probing. The basic kinds of probing are linear and quadratic. Linear probing tries to put an element in its given spot; if something is already there, it searches the subsequent slots for an open one and places it there. This can lead to a problem known as primary clustering, where large numbers of objects pile up into one group: one object fills the home bucket of another, which in turn fills the home bucket of the next, and so on, resulting in collision after collision.
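Linear probing insertion can be sketched in a few lines (a simplified version with integer keys; the names are my own). The example deliberately collides three keys to show them lining up in adjacent slots, which is primary clustering in miniature:

```python
def linear_probe_insert(table, key):
    size = len(table)
    home = key % size
    for step in range(size):
        i = (home + step) % size       # walk forward, wrapping around
        if table[i] is None:
            table[i] = key
            return i                   # slot the key ended up in
    raise RuntimeError("table is full")

table = [None] * 7
for k in [7, 14, 21]:                  # all three hash to bucket 0
    print(linear_probe_insert(table, k))   # 0, then 1, then 2
```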

For this reason, a method called quadratic probing is often used. This checks the given index, then the slot 1 away, then 4 away, 9 away, 16 away, ... n^2 away, and so on. The tradeoff is that you aren't guaranteed to visit every single slot, so you might miss a space the object could go in - but if you can't find a space, that's a sign you either have a really bad hash function or your table is much too small anyway.
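The quadratic variant only changes the step: probe home + 1, home + 4, home + 9, and so on (again a simplified sketch with my own names). Note how colliding keys spread out instead of lining up the way they did with linear probing:

```python
def quadratic_probe_insert(table, key):
    size = len(table)
    home = key % size
    for step in range(size):
        i = (home + step * step) % size    # offsets 0, 1, 4, 9, 16, ...
        if table[i] is None:
            table[i] = key
            return i
    # As noted above, the probe sequence need not cover every slot, so
    # this can fail even when the table still has room.
    raise RuntimeError("no open slot found along the probe sequence")

table = [None] * 7
for k in [7, 14, 21]:                      # all three hash to bucket 0
    print(quadratic_probe_insert(table, k))    # 0, then 1, then 4
```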