Internet services companies such as Google, Yahoo!, Amazon, and Facebook have pioneered systems that achieve unprecedented scale while still providing high availability and strong cost-performance. These systems differ from mainstream high-performance systems in fundamental ways.
They are data intensive rather than compute intensive: unlike mainstream supercomputers, they spend the bulk of their time performing data I/O and manipulation rather than computation.
They must be inherently scalable, and they typically carry high reliability and availability demands as well.
Because they often operate in the commercial space, their cost-performance must be good enough that the organizations that depend on them can turn a profit.
Designing and building these systems requires a specialized set of skills. This course will cover the topics needed to design and build data-intensive scalable systems. In this domain, engineers not only need to know how to architect systems that are inherently scalable, but must do so in a way that also supports high availability, reliability, and performance. Given the large, distributed nature of these systems, basic distributed systems concepts such as consistency, time, and synchronization are also important. These systems largely operate around the clock, placing an emphasis on operational concerns. The course will introduce students to these concerns with the intent that they understand the extent to which activities like deploying, monitoring, and upgrading affect the design.
The course will be a hands-on, project-oriented course. The basic concepts will be presented in the lectures and applied in the project. Students will gain exposure to the core concepts needed to design and build such systems, as well as to current technologies in this space. Class size will be limited.
Students in this class will learn:
Core distributed systems concepts. Students should understand topics such as:
More than understanding these topics independently, students should understand the relationships between these concepts, the design alternatives available, and the systemic properties each alternative promotes or inhibits.
Data in a distributed environment. Students will learn about scalable persistence options. We will cover the advantages and disadvantages of various data models, discuss modern distributed file systems and how to optimize for specific patterns of I/O, and introduce the notion of a parallel programming model, looking at MapReduce.
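To give a flavor of the parallel programming model, here is a minimal, single-process sketch of the MapReduce pattern applied to word counting. The function names (`map_fn`, `shuffle`, `reduce_fn`) are illustrative, not part of any particular framework's API:

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: combine the values for one key (here, sum the counts).
    return key, sum(values)

def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())

counts = map_reduce(["the quick fox", "the lazy dog"])
print(counts["the"])  # 2
```

In a real framework the map and reduce phases run in parallel across many machines, and the shuffle moves data over the network; the programming model, however, is exactly this three-phase structure.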
Engineering for systemic properties. Students will understand what options exist to support desired systemic properties such as scalability, performance, reliability, and availability. They will learn the general patterns and tactics for each, and will understand how to evaluate a given design with respect to these properties.
Operational concerns. Students will learn how to deploy, monitor, and test systems that require 24/7 availability and live in a dynamic environment. They will learn about practices like continuous integration and deployment, live testing, and other DevOps-related concerns.
The exercises consist of four individual assignments and a final project, completed by a small team of students, that incorporates all of the concepts in the course:
Assignment 1: The first assignment is to write a basic single-user application that accesses unstructured data in a file. The goal is that students appreciate the difficulty of doing basic things like retrieval, sorting, comparing values, and ensuring data integrity when working with a basic file system.
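As one illustration of the integrity concern (not the assignment specification), a flat file gives you no corruption detection for free; one simple approach is to store a checksum alongside each record and verify it on every read. The record format and function names below are assumptions for the sake of the sketch:

```python
import hashlib
import tempfile

def write_records(path, records):
    # Store each record with a SHA-256 checksum so corruption is detectable.
    with open(path, "w") as f:
        for rec in records:
            digest = hashlib.sha256(rec.encode()).hexdigest()
            f.write(f"{digest}\t{rec}\n")

def read_records(path):
    # Full scan: verify each record's checksum before returning it.
    out = []
    with open(path) as f:
        for line in f:
            digest, rec = line.rstrip("\n").split("\t", 1)
            if hashlib.sha256(rec.encode()).hexdigest() != digest:
                raise ValueError(f"corrupt record: {rec!r}")
            out.append(rec)
    return out

with tempfile.NamedTemporaryFile("w", delete=False) as f:
    path = f.name
write_records(path, ["alice,30", "bob,25"])
print(read_records(path))  # ['alice,30', 'bob,25']
```

Note that even this simple scheme forces a full scan for retrieval and sorting; that pain point is exactly what the assignment is meant to expose.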
Assignment 2: The second assignment has the same functionality as the first but adds multiple users. The system will be a client-server system with the client on a separate machine. Students will be required to provide availability even when there is no network connection to the file system. They will have to deal with concurrency in a shared-data system and learn how to manage consistency given multiple data caches.
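One of the simplest consistency mechanisms students might reach for is version-based cache validation: the server tags its state with a monotonically increasing version, and each client revalidates its cache against that version on read. The classes below are a deliberately coarse-grained sketch (one version for the whole store), not a prescribed design:

```python
class Server:
    # Authoritative copy of the data, tagged with a monotonic version number.
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.version += 1
        return self.version

    def read(self, key):
        return self.data.get(key), self.version

class CachingClient:
    # Keeps a local cache; revalidates against the server's version on read.
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.cached_version = -1

    def read(self, key):
        if self.cached_version == self.server.version and key in self.cache:
            return self.cache[key]               # cache hit, known fresh
        value, version = self.server.read(key)   # miss or stale: refetch
        self.cache[key] = value
        self.cached_version = version
        return value
```

Because any write bumps the global version, every client cache is invalidated on every write; recognizing that tradeoff (and refining it to per-key versions or leases) is the kind of reasoning the assignment asks for.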
Assignment 3: The third assignment distributes the files to multiple machines. Students will now need to figure out how to manage a rudimentary distributed file system. They will have to make concrete tradeoffs that limit their ability to achieve specific systemic properties. The goal of this assignment is that they recognize the relationship between their design decisions and the resulting systemic properties.
Assignment 4: In this assignment students add reliability and availability. They will need to consider a variety of faults, pushing them to think about their replication strategy, worry about reliable message delivery, deal with network disruption, and think about data integrity when working in an unreliable environment. Again, they need to understand the relationship between the choices available and the resulting properties.
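A common building block for such a replication strategy is the quorum write: acknowledge a write only once a majority of replicas have accepted it, so the system tolerates a minority of unreachable nodes. The sketch below simulates replica failure with an `up` flag rather than a real network; it is one possible strategy, not the required one:

```python
class Replica:
    # An in-memory replica of the data store; `up` simulates reachability.
    def __init__(self, up=True):
        self.up = up
        self.store = {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.store[key] = value
        return True

def quorum_write(replicas, key, value):
    # Succeed only if a majority of replicas acknowledge the write,
    # tolerating the replicas that are down.
    quorum = len(replicas) // 2 + 1
    acks = 0
    for r in replicas:
        try:
            if r.write(key, value):
                acks += 1
        except ConnectionError:
            continue
    return acks >= quorum
```

With three replicas, the write succeeds when one replica is down (2 of 3 acknowledge) and fails when two are down; pairing this with a matching read quorum is what makes reads see the latest acknowledged write.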
Final Project: The final project will be completed in small teams. The goal is that students put together all of the concepts they have learned and develop a scalable, data-intensive application that solves a problem given to them. They will deploy the system on Amazon Web Services (AWS) and provide a write-up describing the specific decisions they've made and the resulting tradeoffs.
Grading will be:
· Assignments: 50%
· Final Project: 40%
· Participation: 10%

