October 7, 2010 (Lecture 13)

Designing Generic Data Structures

Most of you probably learned to implement data structures in Java. Like most object-oriented languages, Java provides a straightforward mechanism for describing both the state and the methods associated with a data structure: the class and its instances, a.k.a. objects.
In C, we aren't going to have quite this much language support, but we are able to design our data structures in a similar way. So, let's walk through the steps. Today, we are going to talk our way through the interface for a hash set -- but we aren't actually going to implement the individual functions. We're just concerned about the interface.

Generic Pointers

It is desirable to implement the logic of a data structure once, and to have it work for any type of datum. In Java, we achieved this by making use of the "Object reference" that can reference any type of object. The C-equivalent is the void *, a.k.a., void pointer, or generic pointer. A void pointer can point to any type of object.
But, it is important to realize that, although one can declare a "void *" pointer, one cannot declare a variable of type "void": void is an incomplete type with no size and no values. And, for exactly this reason, one can't dereference a void pointer or ask for the size of its target, and standard C does not define pointer arithmetic on it.
But, void pointers do allow us to do what we want -- keep any type of object within our data structure.
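Here is a minimal sketch of that idea. The type "point", the array "store", and the helper functions are made up for this illustration and are not part of the hash set we design below. It stores an int and a struct in the same array of void pointers, and converts each pointer back to its real type before dereferencing it.

```c
/* Hypothetical record type, used only for this illustration. */
typedef struct {
  int x;
  int y;
} point;

/* A generic array: each element can point at any type of object. */
#define NSLOTS 2
static void *store[NSLOTS];

/* Put an int and a point into the same array of void pointers. */
void fill_store(int *n, point *p) {
  store[0] = n;   /* int * converts to void * implicitly */
  store[1] = p;   /* and so does point * */
}

/* The caller must remember the real type and convert back before use. */
int read_int(void) {
  int *n = store[0];   /* void * converts back without a cast in C */
  return *n;           /* we dereference the int *, never the void * */
}

int read_point_y(void) {
  point *p = store[1];
  return p->y;
}
```

Notice that nothing in the array records which type each slot holds. Keeping track of that is entirely the caller's job.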

Capturing the State: Structs Are Our Friend

In managing a hash set, each instance is going to need to maintain certain state: the underlying array, the size of the array, and the number of items presently in the set. In C, this state, which would constitute the instance variables of a Java implementation, should be captured within a struct.
About the only interesting thing is that we'll represent the slots of the hash set as a "void **". The hash set itself is an array of "void *" pointers. But, to allow flexibility, we are going to dynamically allocate it, rather than fixing its size in the struct. The upshot is that we'll put a pointer to the first "void *" of this array into our struct -- a "void **".
  typedef struct {
    void **slots;
    unsigned long numslots;
    unsigned long count;
  } hashset;
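To see why a "void **" is the right type, here is a small sketch of allocating such an array at run time. The helper name make_slots is an assumption made for this example. It is not part of the interface we develop in this lecture.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an array of n empty slots.
   Each slot is a void *, so a pointer to the first slot is a void **.
   Returns NULL if n is 0 or if the allocation fails. */
void **make_slots(unsigned long n) {
  void **slots;
  unsigned long i;

  if (n == 0)
    return NULL;

  slots = malloc(n * sizeof *slots);   /* room for n void pointers */
  if (slots == NULL)
    return NULL;

  for (i = 0; i < n; i++)
    slots[i] = NULL;                   /* NULL marks an empty slot */

  return slots;
}
```

Once allocated, slots[i] is an ordinary "void *", so the array can be indexed just like one declared with a fixed size.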
  

The Hash Function and Function Pointers

So, for those of you who are not familiar with hashing, it is a simple game that involves a little magic. We map from an object to a position. This mapping is performed by a hashing function that is specific to the type of data. So, each hash set is going to need this magic function every time it accesses the set. But, since it is specific to the type of data, we can't code it as part of the data structure. Instead, we need to have it handed to us by the designer of the data.
In order to allow this to happen, we are going to make our first use of function pointers. The idea is that we'll store the address of the hash function, a.k.a. a pointer to the hash function, in our struct and we'll use this function every time we need to perform the object to address mapping.
Declaring a function pointer involves specifying the name of the identifier as well as the arguments and the return type. The general form of this specification is as follows:
  return_type (*identifier_name) (argument_list)
  
The only thing that is a bit strange is the extra set of parentheses -- those that surround the "starred identifier". These parentheses are necessary so that the compiler knows that we are declaring a function pointer -- and not providing a forward reference to a function that returns a pointer.
Consider the following example. Notice that it returns a pointer to an int -- the *-asterisk associates with the return type.
  int * something(int x, int y);
  
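Now, consider the declaration "int (*something)(int x, int y);". It declares a variable, something, that is a pointer to a function that takes two ints and returns an int. Notice how the ()-parentheses shift the association of the *-asterisk away from the return type and onto the identifier, so that we create a pointer to a function rather than a function returning a pointer.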
Calling a function via a pointer is exactly the same as calling it otherwise. The pointer's identifier is used as an alias for the true function name. Other than using the name of the pointer in place of the true name, there is no difference.
Here's a quick example that illustrates both the passing of a function pointer and the calling of a function via a pointer:
#include <stdio.h>

int add (int x, int y) {
  return x + y;
}

int multiply (int x, int y) {
  return x * y;
}

int calculate (int a, int b, int (*math)(int, int)) {

  return math(a,b);
}

int main() {

  int a = 5;
  int b = 10;

  printf ("%d + %d = %d\n", a, b, calculate(a, b, add));
  printf ("%d * %d = %d\n", a, b, calculate(a, b, multiply));

  return  0;
}
  

Improving Our Struct

So, let's add a pointer to the hash function to our struct, so we have it with us and can easily call it -- otherwise the user would need to pass it in with each and every call. Our struct now looks like this:
  typedef struct {
    void **slots;
    unsigned long numslots;
    unsigned long count;

    int (*hashfn)(void *item, unsigned long *hashvalue);
  } hashset;
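As an illustration of what a user might hand us, here is a sketch of a hash function for C strings that matches this signature, along with the way the data structure might call it through the stored pointer. Several things here are assumptions made for the example: the names stringhash and slotindex, the convention of returning 0 on success and -1 on failure, and the choice of the well-known "times 33" (djb2-style) string hash.

```c
#include <stddef.h>

typedef struct {
  void **slots;
  unsigned long numslots;
  unsigned long count;

  int (*hashfn)(void *item, unsigned long *hashvalue);
} hashset;

/* Hypothetical user-supplied hash function for NUL-terminated strings.
   Assumed convention: returns 0 on success, -1 on failure, and hands
   the hash value back through *hashvalue. */
int stringhash(void *item, unsigned long *hashvalue) {
  const unsigned char *s = item;
  unsigned long h = 5381;

  if (item == NULL || hashvalue == NULL)
    return -1;

  while (*s != '\0')
    h = h * 33 + *s++;   /* unsigned arithmetic simply wraps around */

  *hashvalue = h;
  return 0;
}

/* Hypothetical helper: map an item to a slot number by calling the
   hash function through the pointer stored in the struct. */
int slotindex(hashset *hs, void *item, unsigned long *index) {
  unsigned long h;

  if (hs->numslots == 0 || hs->hashfn(item, &h) != 0)
    return -1;

  *index = h % hs->numslots;
  return 0;
}
```

The data structure never needs to know that the items are strings. It only knows that hs->hashfn turns an item into a number.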
  

The Initialization Method, a.k.a. The "Constructor"

In order to use our hash set, we'll need to create one of the proper size. So, let's consider how we'd like to do this:
  hashset hs;

  int hashfunction(void *item, unsigned long *hashcode) { ...}

  hashinit (&hs, hashfunction, 1009); /* Create a hashset of size 1009 */
  
Now, let's think about this initialization method's signature:
  int hashinit (hashset *, int (*hashfn)(void *, unsigned long *), unsigned long size); 
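Although we aren't implementing the functions today, a rough sketch of what such an init function might do helps to show why it needs each argument. Everything in the body is an assumption: the return convention (0 on success, -1 on failure), the use of NULL to mark an empty slot, and the toy inthash function included only so the example stands on its own.

```c
#include <stdlib.h>

typedef struct {
  void **slots;
  unsigned long numslots;
  unsigned long count;

  int (*hashfn)(void *item, unsigned long *hashvalue);
} hashset;

/* Sketch only. Assumed convention: 0 on success, -1 on failure. */
int hashinit(hashset *hs, int (*hashfn)(void *, unsigned long *),
             unsigned long size) {
  unsigned long i;

  if (hs == NULL || hashfn == NULL || size == 0)
    return -1;

  /* The caller owns the struct; we allocate only the slots array. */
  hs->slots = malloc(size * sizeof *hs->slots);
  if (hs->slots == NULL)
    return -1;

  for (i = 0; i < size; i++)
    hs->slots[i] = NULL;     /* assumed marker for an empty slot */

  hs->numslots = size;
  hs->count = 0;
  hs->hashfn = hashfn;       /* remember the user's hash function */
  return 0;
}

/* Toy hash function for items that are pointers to int (hypothetical). */
int inthash(void *item, unsigned long *hashvalue) {
  *hashvalue = (unsigned long) *(int *)item;
  return 0;
}
```

Notice that the function pointer is stored with a plain assignment, just like any other field.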
  
We might have gone a slightly different route and had the init method actually allocate the space for the struct. As you know, this technique would have required a "handle" to the object, a pointer to a pointer, rather than a single level of indirection. If we'd gone this way, it would have looked like this:
  int hashinit (hashset **hs, int (*hashfn)(void *, unsigned long *), unsigned long size);

  ...

  hashset *hs;

  int hashfunction(void *item, unsigned long *hashcode) { ...}

  hashinit (&hs, hashfunction, 1009); /* Create a hashset of size 1009 */
  
But, this technique really doesn't buy us anything, as the data structure itself is light-weight. It is just as easy for the user to hand it to us as it is for us to create it for them. And, as you'll soon see, it is more uniform with the rest of the interface to do it by reference.

The Basics of the Interface

So, what does the interface look like as a whole? At least in its essential parts?

  int hashinit (hashset *hs, int (*hashfn)(void *, unsigned long *), unsigned long size); 
  int hashadd (hashset *hs, void *item);
  int hashremove (hashset *hs, void *item);
  int hashcontains (hashset *hs, void *item);
  int hashdestroy (hashset *hs);

The "this" Pointer, Of Sorts

Every time we make a (non-static) method call in Java, our method is automatically passed the "this" pointer, a reference to the active instance of the class. Each method invocation or instance variable usage is either explicitly or implicitly accessed via "this".
In C, we certainly do not have that type of auto-magic. But, take a look at what we've built. What is the first argument to each of our methods? "hashset *hs". This is really an explicit "this" reference, isn't it? It identifies the active instance of the hashset. How 'bout that?
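To make the comparison concrete, here is a tiny sketch. The accessor hashcount is hypothetical and not part of the interface above. It is here only because it is about the smallest "method" one could write against the struct.

```c
typedef struct {
  void **slots;
  unsigned long numslots;
  unsigned long count;

  int (*hashfn)(void *item, unsigned long *hashvalue);
} hashset;

/* Hypothetical accessor. In Java this might read:
     long count() { return this.count; }
   In C, the instance arrives as an explicit first argument. */
unsigned long hashcount(hashset *hs) {
  return hs->count;   /* hs plays the role of "this" */
}
```

Where Java would write a.count() and b.count(), we write hashcount(&a) and hashcount(&b). The same function serves every instance, because the caller says which one is active.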

Thinking About hashdestroy()

So, let's recall the golden rules of memory management:

A library should provide a way to free any allocations it makes -- if it makes it, it should free it
A library should not free any memory it didn't allocate -- that is the responsibility of the allocating code

So, what should our hashset free? It should free exactly the "slots" array. This is the only thing that it allocated. It should not free the hashset struct itself. It didn't allocate it. And, it might not even have been dynamically allocated -- for all our library knows, it was a static allocation. And, similarly, it should not free the items stored in the hashset. The library didn't allocate them (as long as it didn't make a deep copy), so it shouldn't free them.

And, again, trying to free them is problematic. They might be static allocations. Or, they might have a complex structure with nested allocations. Or they might be aliased into another data structure that we'd destroy.
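
Following these rules, a sketch of hashdestroy might look like the code below. The return convention (0 on success, -1 on failure) and the resetting of the fields afterward are assumptions made for this example, not requirements of the interface.

```c
#include <stdlib.h>

typedef struct {
  void **slots;
  unsigned long numslots;
  unsigned long count;

  int (*hashfn)(void *item, unsigned long *hashvalue);
} hashset;

/* Sketch only. Assumed convention: 0 on success, -1 on failure. */
int hashdestroy(hashset *hs) {
  if (hs == NULL)
    return -1;

  /* The library allocated the slots array, so the library frees it. */
  free(hs->slots);

  /* Reset the fields so that an accidental second call is harmless
     (free(NULL) does nothing). */
  hs->slots = NULL;
  hs->numslots = 0;
  hs->count = 0;

  /* Deliberately NOT freed: the items the slots pointed to, and the
     hashset struct itself. The caller allocated those. */
  return 0;
}
```

After the call, the caller is still free to use, or to free, every item that was in the set, and the struct can be handed to an init function again.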