November 15, 2010 (Lecture 20)

Cassandra and HBase: The Data Model

Cassandra and HBase are both based on Google's BigTable model. Instead of mapping data into two-dimensional tables of rows and columns, as is done in an RDBMS, data is essentially mapped into multi-dimensional "maps of maps".
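
To make the "maps of maps" idea concrete, here is a minimal sketch using nested Python dictionaries. It is only an illustration, not how either system is implemented: the three-level nesting (row key, then column family, then column), the helper names put and get, and the sample data are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical illustration: row key -> column family -> column -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    """Store one cell, creating the intermediate maps as needed."""
    table[row_key][family][column] = value

def get(row_key, family, column, default=None):
    """Look up one cell; .get() avoids creating empty maps for missing rows."""
    return table.get(row_key, {}).get(family, {}).get(column, default)

put("alice", "contact", "email", "alice@example.com")
put("alice", "contact", "phone", "555-0100")
put("bob", "contact", "email", "bob@example.com")

alice_email = get("alice", "contact", "email")
bob_phone = get("bob", "contact", "phone")  # bob has no phone cell at all
```

Note that bob's row simply has no "phone" entry; nothing is stored for a cell that does not exist.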

A couple of the interesting assumptions in these systems are that the data will be sparse, that is, not all rows will have all columns, and that we are more likely to be interested in a particular property across a large number of rows than in all of the details of a particular row. As a result, the data is organized by column, and each row has a variable set of fields. There is also an assumption that data is going to be read more often than mutated, which allows for more efficient management of concurrency control.
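
The effect of organizing sparse data by column can be sketched as follows. This is a toy model under stated assumptions, not the storage format of Cassandra or HBase: the layout (column name mapped to a map of row key to value) and the names put, scan_column, and row_fields are hypothetical.

```python
# Hypothetical column-oriented layout: column name -> {row key -> value}.
columns = {}

def put(row_key, column, value):
    """Store one cell under its column; absent cells take no space."""
    columns.setdefault(column, {})[row_key] = value

def scan_column(column):
    """Read one property across all rows; touches only that column's map."""
    return sorted(columns.get(column, {}).items())

def row_fields(row_key):
    """Reassemble one row; this has to check every column."""
    return {c: cells[row_key] for c, cells in columns.items() if row_key in cells}

put("alice", "email", "alice@example.com")
put("alice", "age", 30)
put("bob", "email", "bob@example.com")
put("carol", "age", 41)

ages = scan_column("age")      # only rows that actually have an age
bob_row = row_fields("bob")    # bob's row has a single field
```

Scanning "age" never looks at the email data, while rebuilding a single row has to visit every column, which matches the access pattern the assumptions above favor.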

A really great article describing the systems can be found here:

The original Google paper might also be well worth reading:

The following provides a good overview specifically of the HBase architecture:

And, the following article provides a great introduction to Cassandra: