November 15, 2010 (Lecture 20)

Cassandra and HBase: The Data Model

Cassandra and HBase are both based on Google's BigTable model. Instead of mapping data into two-dimensional tables of rows and columns, as is done in an RDBMS, data is essentially mapped into multi-dimensional "maps of maps".
A couple of the interesting assumptions in these systems are that the data will be sparse, e.g. not all rows will have all columns, and that we are more likely to be interested in a particular property across a large number of rows than in all of the details of a particular row. As a result, the data is organized by column, and each row holds a variable set of fields. There is also an assumption that data is going to be read more often than mutated, which allows for more efficient management of concurrency control.
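To make the "maps of maps" idea concrete, here is a minimal sketch in Python. It is only an in-memory illustration, not the API of Cassandra or HBase: the names table, put, get, and scan_column are hypothetical, and the sample rows are made up. It leaves out the per-cell timestamps that BigTable-style systems also keep.

```python
# Hypothetical in-memory stand-in for a BigTable-style table (not a real client API):
# row key -> column family -> column name -> value
table = {}

def put(row_key, family, column, value):
    """Store one cell, creating the row and family maps on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(row_key, family, column, default=None):
    """Read one cell; an absent cell takes no space and returns the default."""
    return table.get(row_key, {}).get(family, {}).get(column, default)

def scan_column(family, column):
    """Yield (row_key, value) for each row that has the column, in key order."""
    for row_key in sorted(table):
        cells = table[row_key].get(family, {})
        if column in cells:
            yield row_key, cells[column]

# Sparse sample data: rows do not share a fixed set of columns.
put("alice", "info", "email", "alice@example.com")
put("alice", "info", "phone", "555-0100")
put("bob", "info", "email", "bob@example.com")   # bob has no phone column
put("carol", "stats", "logins", 42)              # carol has no "info" family

# One property across many rows, rather than every detail of one row.
emails = list(scan_column("info", "email"))
```

Note that the missing cells are simply not stored, which is what sparseness means here, and that scan_column reads a single property across all rows without touching the rest of each row.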
A really great article describing the systems can be found here:

Big Table Model With Cassandra and HBase by Ricky Ho

The original Google paper might also be well worth reading:

Bigtable: A Distributed Storage System for Structured Data (Chang et al.)

The following provides a good overview specifically of the HBase architecture:

Hadoop Wiki: HBase Architecture

And, the following article provides a great introduction to Cassandra:

Cassandra By Example (Eric Evans)