 CLASS MEETS:
    Time: TUE/THU 4:30PM - 5:50PM
    Place: HBH 1202
 
 
 PEOPLE:
    Instructor: Leman Akoglu ( lakoglu @ andrew )
    Office: HBH 2118C, office ph 412-268-30 four three
    Office hour: THU 1-2 PM; also by appointment

    Teaching Assistants:
 
    
        
        Darshan Tina
            Email: invert (andrew.cmu.edu @ dtina)
            Office hour: WED 5-6 PM @ HBH A007B

        Kushagr Arora
            Email: invert (andrew.cmu.edu @ kushagra)
            Office hour: MON 5-6 PM @ HBH A007D

 COURSE DESCRIPTION:
The rate and amount of data being generated in today's world by both humans and machines are unprecedented. Being able to store, manage, and analyze large-scale data has a critical impact on business intelligence, scientific discovery, and social and environmental challenges.
 
The goal of this course is to equip students with the understanding, knowledge, and practical skills to develop big data / machine learning solutions with state-of-the-art tools, particularly those in the Spark environment, with a focus on the programming models in MLlib, GraphX, and SparkSQL. See the syllabus for more details. Students will also gain hands-on experience with MapReduce and Apache Spark using real-world datasets.
 
This course is designed to give graduate-level students a thorough grounding in the technologies and best practices used in big data machine learning. The course assumes that students have an understanding of basic data analysis and machine learning concepts, as well as basic programming knowledge (preferably in Python or Java). Previous experience with Hadoop, Spark, or distributed computing is NOT required.
 Learning Objectives
By the end of this class, students will
 
    gain an understanding of the MapReduce paradigm and the Hadoop ecosystem
    understand scalability challenges for common ML tasks
    study distributed machine learning algorithms
    understand details of SparkSQL, GraphX, and MLlib (Spark's ML library)
    implement distributed pipelines in Apache Spark using real datasets
 RECOMMENDED TEXTBOOKS:
There is no official textbook for the course. I will post all the lecture notes and several readings on the course website. Below you can find a list of recommended reading.
 
    Scaling up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press
        Ron Bekkerman, Mikhail Bilenko, John Langford
 
    Hadoop in Practice, Manning Publications Co.
        Alex Holmes
 
    Learning Spark, O'Reilly
        Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
 
    Advanced Analytics with Spark, O'Reilly
        Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
 
 BULLETIN BOARD and other info
    We will use Canvas for course materials, homework submissions, announcements, and grades.
    We will use Piazza for questions and discussions.
    Carnegie Mellon 2017-2018 official academic calendar
 MISC - FUN:
    Joke-1     Joke-2     Joke-3
 
 