Engineering Data Intensive Scalable Systems
17-648


Overview

Internet services companies such as Google, Yahoo!, Amazon, and Facebook have pioneered systems that achieve unprecedented scale while still providing high availability and strong cost-performance.  These systems differ from mainstream high-performance systems in fundamental ways.  They are data intensive rather than compute intensive: unlike mainstream supercomputers, they spend the bulk of their time performing data I/O and manipulation rather than computation.  They must inherently support scalability, and they typically face high reliability and availability demands as well.  Because they often operate in the commercial space, their cost-performance must be good enough that the organizations that depend on them can turn a profit.

Designing and building these systems requires a specialized set of skills.  This course covers the topics needed to design and build data-intensive scalable systems.  In this domain, engineers need to know not only how to architect systems that are inherently scalable, but also how to do so in a way that supports high availability, reliability, and performance.  Given the large, distributed nature of these systems, basic distributed systems concepts such as consistency, time, and synchronization are also important.  These systems largely operate around the clock, placing an emphasis on operational concerns.  The course introduces students to these concerns with the intent that they understand the extent to which activities such as deploying, monitoring, and upgrading affect the design.

This is a hands-on, project-oriented course.  The basic concepts will be presented in the lectures and applied in the project.  Students will gain exposure to the core concepts needed to design and build such systems, as well as to current technologies in this space.  Class size will be limited.


Learning Objectives

Students in this class will learn:

Core distributed systems concepts.  Students should understand topics such as consistency, time, and synchronization.

Beyond understanding these topics independently, students should understand the relationships among these concepts, the design alternatives available, and the systemic properties promoted or inhibited by a given design.

Data in a distributed environment.  Students will learn about scalable persistence options and the advantages and disadvantages of various data models.  We will cover modern distributed file systems and how to optimize them for specific patterns of I/O, and we will introduce the notion of a parallel programming model by looking at MapReduce.
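
To make the MapReduce idea concrete, here is a minimal, single-process sketch of the programming model in Python.  It is an illustration only (the function names and toy data are invented for this example); real frameworks distribute the map, shuffle, and reduce phases across many machines.

    from collections import defaultdict

    def map_phase(documents):
        # Map: emit a (word, 1) pair for every word in every document.
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group intermediate values by key, as a framework would
        # do between the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: sum the counts for each word.
        return {word: sum(counts) for word, counts in groups.items()}

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}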

Engineering for systemic properties.  Students will understand what options exist to support desired systemic properties such as scalability, performance, reliability, and availability.  They will learn the general patterns and tactics for each and understand how to evaluate a given design with respect to these properties.

Operational concerns.  Students will learn how to deploy, monitor, and test systems that require 24/7 availability and live in a dynamic environment.  They will learn about continuous integration and deployment, live testing, and other DevOps-related concerns.


Assignments

There will be four individual assignments and a final project, completed by a small team of students, that incorporates all of the concepts in the course.

The exercises consist of individual assignments and a group final project:

Assignment 1: The first assignment is to write a basic single-user application that accesses unstructured data in a file.  The goal is for students to understand the difficulty of basic operations such as retrieval, sorting, comparing values, and ensuring the integrity of data when working directly with the file system.
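
As a rough illustration of why these basic operations become awkward with only a flat file, the sketch below (the record format and function names are hypothetical, not the actual assignment specification) shows retrieval by linear scan, in-memory sorting, and a whole-file checksum as a crude integrity check.

    import hashlib

    def load_records(path):
        # Assume one record per line with comma-separated fields.
        with open(path, "r", encoding="utf-8") as f:
            return [line.rstrip("\n").split(",") for line in f if line.strip()]

    def find_by_field(records, index, value):
        # Retrieval is a linear scan: a plain file has no indexes.
        return [r for r in records if len(r) > index and r[index] == value]

    def sort_by_field(records, index):
        # Sorting requires reading everything into memory first.
        return sorted(records, key=lambda r: r[index])

    def checksum(path):
        # A whole-file hash is a crude way to detect corruption.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()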

Assignment 2: The second assignment has the same functionality as the previous one but adds multiple users.  The system will be a client-server system, with the client on a separate machine.  Students will be required to provide availability even when there is no network connection to the file system.  They will have to deal with concurrency in a shared-data system and learn how to manage consistency across multiple data caches.
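
One way to picture the cache-consistency problem is the sketch below: a client-side cache that stamps each entry with a version number and trusts a cached value only if it still matches the server's current version.  The class and the protocol are purely illustrative, not the required design for the assignment.

    class VersionedCache:
        # Client-side cache: key -> (version, value).
        def __init__(self):
            self._entries = {}

        def put(self, key, version, value):
            self._entries[key] = (version, value)

        def get(self, key, current_server_version):
            # Return the cached value only if it is still the latest version.
            entry = self._entries.get(key)
            if entry is None:
                return None  # cache miss
            version, value = entry
            if version != current_server_version:
                # Stale: another client has written since this value was cached.
                del self._entries[key]
                return None
            return value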

Assignment 3: The third assignment distributes the files across multiple machines.  Students will now need to figure out how to manage a rudimentary distributed file system.  They will have to make concrete tradeoffs that limit their ability to achieve specific systemic properties.  The goal of this assignment is for students to recognize the relationship between the design options they choose and the resulting systemic properties.
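
A concrete example of such a tradeoff is file placement.  The sketch below (machine names and hash choice are arbitrary assumptions) assigns each file to a machine by hashing its name: simple and well balanced, but if the set of machines changes, most files suddenly map to a different machine, which shows how one placement decision constrains scalability and availability.

    import hashlib

    def owner(filename, machines):
        # Deterministically map a filename to one of the available machines.
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return machines[int(digest, 16) % len(machines)]

    hosts = ["node-a", "node-b", "node-c"]
    for name in ["logs/2024-01-01.txt", "users.csv", "images/cat.png"]:
        print(name, "->", owner(name, hosts))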

Assignment 4: In this assignment students will add reliability and availability.  They will need to consider a variety of faults, pushing them to think about their replication strategy, worry about reliable message delivery, deal with network disruption, and reason about data integrity in an unreliable environment.  Again, they need to understand the relationship between the available choices and the resulting properties.
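
To give a flavor of one such choice, the sketch below writes a value to every replica and declares success only if a majority acknowledge, so a single unreachable node does not lose the update.  The classes and the majority rule are illustrative assumptions, not the replication strategy students must use.

    class InMemoryReplica:
        # Stand-in for a remote storage node; real replicas live on other machines.
        def __init__(self, healthy=True):
            self.healthy = healthy
            self.data = {}

        def write(self, key, value):
            if not self.healthy:
                raise ConnectionError("replica unreachable")
            self.data[key] = value

    def replicated_write(replicas, key, value):
        # Attempt the write everywhere; succeed only if a majority acknowledge.
        acks = 0
        for replica in replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                continue  # tolerate an unreachable replica
        return acks > len(replicas) // 2

    nodes = [InMemoryReplica(), InMemoryReplica(healthy=False), InMemoryReplica()]
    print(replicated_write(nodes, "user:42", "alice"))  # True: 2 of 3 acknowledged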

Final Project: The final project will be completed in small teams.  The goal is for students to put together all of the concepts they have learned and develop a scalable, data-intensive application that solves a problem given to them.  They will deploy the system on Amazon AWS and provide a write-up describing the specific decisions they made and the resulting tradeoffs.


Grading

Grading will be:

·         Assignments: 50%

·         Final Project: 40%

·         Participation: 10%

Instructors: