94-842: Programming in R for Analytics, Fall 2019

Course Description

This course introduces students to R, a widely used statistical programming language. Students will learn to manipulate data objects, produce graphics, analyse data using common statistical methods, and generate reproducible statistical reports. They will also gain experience in applying these acquired skills in various public policy areas.

By the end of the class, students will learn to:
  • Use RStudio, read R documentation, and write R scripts.
  • Import, export and manipulate data.
  • Produce statistical summaries of continuous and categorical data.
  • Produce basic graphics using standard functions, and produce more advanced graphics using the ggplot2 library.
  • Perform common hypothesis tests, and run simple regression models in R.
  • Produce reports of statistical analyses in R Markdown.


All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


While there are no required textbooks for this class, the following references are highly recommended, particularly the Grolemund and Wickham text, which is freely available at the link provided. Students may find it useful to own a personal copy of one or two of the texts below.

Recommended textbooks

Helpful resources

There are many resources online that may help you to learn R. A few that are particularly relevant for this course are listed below.

Course Work

Your grade in this course will be determined by a series of 5 weekly homework assignments (35%), lab participation (10%), quizzes (10%) and a final project (45%).


Weekly assignments will take the form of a single R Markdown text file: namely, code snippets integrated with captions and other narrative. Except where otherwise noted, assignments are typically due on Thursdays at 2:50pm on the dates indicated on Canvas.

Your assignment score for the course will be calculated by averaging your four (4) highest homework scores. That is, your lowest homework score will not count toward your grade.
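
As a rough illustration of the drop-lowest rule, the calculation could be sketched in R as follows. The scores are made-up values and the helper name `homework_average` is hypothetical; this is not the official grading script.

```r
# Hypothetical helper: average the four highest of the five homework scores.
homework_average <- function(scores, n_keep = 4) {
  stopifnot(length(scores) >= n_keep)
  # Sort from highest to lowest and keep the top n_keep scores
  top_scores <- sort(scores, decreasing = TRUE)[seq_len(n_keep)]
  mean(top_scores)
}

# Made-up scores out of 10; the second entry represents a missed homework
example_scores <- c(9, 0, 8.5, 10, 7.5)
homework_average(example_scores)  # 8.75
```

In this example the missed homework (score 0) is the one dropped, so it does not affect the average.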

Each homework assignment will have 5 problems, each of which may have several parts. Your score for each assignment will be assigned according to the scheme outlined in the rubric below.

Homework rubric

Total: 10 points

Correctness: Each problem will be worth 2 points. Deductions will be made at the discretion of the grader.

Knitting: -0.5 deduction if the Rmd file you submit does not knit correctly (i.e., if there are errors and no HTML file is produced when the grader attempts to knit your Rmd file).

  • If your Rmd file fails to knit, the grader will contact you and give you 24 hours to resubmit your homework. You will need to trace the source of the error(s) and correct them.

Style: Coding style is very important. With the exception of Homework 1, you will receive a deduction of up to 1 point if you do not adhere to good coding style.

  • No deduction if your homework is submitted with:
    • good, consistent coding style
    • appropriate use of variables
    • appropriate use of functions
    • good commenting
    • good choice of variable names
    • appropriate use of inline code chunks
  • -0.5 if coding style is acceptable but fails on a couple of the criteria above.
  • -1 if coding style is poor overall and fails to adhere to many of the above criteria.
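
To make the arithmetic of the rubric concrete, here is a minimal sketch in R of how the three components might combine into a score out of 10. The function name `score_homework`, its arguments, and the floor at zero are assumptions made for illustration; actual deductions are at the discretion of the grader.

```r
# Hypothetical helper: combine the rubric components into one homework score.
score_homework <- function(problem_points, knits = TRUE, style_deduction = 0) {
  # Five problems worth up to 2 points each, as stated in the rubric
  stopifnot(length(problem_points) == 5,
            all(problem_points >= 0 & problem_points <= 2),
            style_deduction %in% c(0, 0.5, 1))
  # 0.5 is deducted when the submitted Rmd file does not knit
  knit_deduction <- if (knits) 0 else 0.5
  total <- sum(problem_points) - knit_deduction - style_deduction
  max(total, 0)  # assumption: a score cannot go below zero
}

# Made-up example: 8.5 points for correctness, acceptable but imperfect style
score_homework(c(2, 2, 1.5, 2, 1), knits = TRUE, style_deduction = 0.5)  # 8
```

For Homework 1, which is exempt from the style deduction, `style_deduction` would simply be left at 0.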


    Lab activities

    The lab session is scheduled for Fridays. Lab attendance is mandatory and counts toward your course grade. During the lab sessions, students will get hands-on practice with the week's material by working on assigned lab activities. Members of the teaching staff will be present to introduce the activities and to answer any questions you may have. Tasks may include, but are not limited to, running or modifying code from the lecture, pair coding, or completing short coding exercises. In weeks when Friday sessions are cancelled due to holidays, you are still expected to attempt and submit the labs.

    All thirteen (13) scheduled lectures will have an associated lab component. Your Lab participation score for the course will be calculated based on the number of labs that you submit, as indicated in the table below.


    There will be 4 short quizzes scheduled during the later weeks of class. Dates and times will be announced in advance. The purpose of these quizzes is to assess your understanding of various concepts that are central to the class. Your score on the quizzes will count for 10% of your final grade.

    Final project

    The final project for the class will ask you to explore a broad policy question using a large publicly available dataset. This project is intended to provide students with the complete experience of going from a study question and a rich data set to a full statistical report. Students will be expected to (a) explore the data to identify important variables; (b) perform statistical analyses to address the policy question; (c) produce tabular and graphical summaries to support their findings; and (d) write a report describing their methodological approach, findings, and limitations thereof.

    While students may work in small groups to decide on appropriate statistical methodology and graphical/tabular summaries, each student will be required to produce and submit their own code and final report.

    Regardless of grading basis, students must receive a score of at least 50% on the final project in order to pass the class.
    Course Grading

    Your final course grade will be calculated according to the following breakdown.

    Component           Weight
    Homework            35%
    Lab participation   10%
    Quizzes             10%
    Final project       45%
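
Using the weights stated under Course Work (35% homework, 10% lab participation, 10% quizzes, 45% final project), the overall course percentage is a weighted sum. The sketch below assumes each component has already been converted to a percentage between 0 and 100; the function name `course_percentage` and the example scores are hypothetical.

```r
# Weights taken from the syllabus; they sum to 1
weights <- c(homework = 0.35, labs = 0.10, quizzes = 0.10, project = 0.45)

# Hypothetical helper: weighted sum of component percentages (each 0 to 100)
course_percentage <- function(components, w = weights) {
  stopifnot(setequal(names(components), names(w)))
  # Reorder the components to match the weights before multiplying
  sum(components[names(w)] * w)
}

# Made-up component scores for illustration
example <- c(homework = 87.5, labs = 100, quizzes = 80, project = 90)
course_percentage(example)  # 89.125

# Separate rule from the syllabus: at least 50% on the final project to pass
passes_project_rule <- unname(example["project"] >= 50)
```

Note that the 50% final project requirement is a separate pass condition and is not captured by the weighted sum itself.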

    Late submission

    Homework is to be submitted by 2:50pm on Thursdays on the due date indicated, unless an alternate due date is announced.
    Late homework will not be accepted for credit.

    Note that your lowest homework score will not count toward your grade, so you can miss one homework without it counting toward your course grade.


    You are encouraged to discuss homework problems with your fellow students. However, the work you submit must be your own. You must acknowledge in your submission any help received on your assignments. That is, you must include a comment in your homework submission that clearly states the name of the student, book, or online reference from which you received assistance.

    Submissions that fail to properly acknowledge help from other students or non-class sources will receive no credit. Copied work will receive no credit. Any and all violations will be reported to Heinz College administration.

    All students are expected to comply with the CMU policy on academic integrity. This policy can be found online at http://www.cmu.edu/academic-integrity/.

    What constitutes plagiarism in a coding class?

    The course collaboration policy allows you to discuss the problems with other students, but requires that you complete the work on your own. Every line of text and line of code that you submit must be written by you personally. You may not refer to another student's code, or a "common set of code" while writing your own code. You may, of course, copy/modify lines of code that you saw in lecture or lab.

    The following discussion of code copying is taken from the Computer Science and Engineering Department at the University of Washington. You may find this discussion helpful in understanding the bounds of the collaboration policy.

    "[It is] important to make sure that the assistance you receive consists of general advice that does not cross the boundary into using code or answers written by someone else. It is fine to discuss ideas and strategies, but you should be careful to write your programs on your own."

    "You must not share actual program code with other students. In particular, you should not ask anyone to give you a copy of their code or, conversely, give your code to another student who asks you for it; nor should you post your solutions on the web, in public repositories, or any other publicly accessible place. [You may not work out a full communal solution on a whiteboard/blackboard/paper and then transcribe the communal code for your submission.] Similarly, you should not discuss your algorithmic strategies to such an extent that you and your collaborators end up turning in [essentially] the same code. Discuss ideas together, but do the coding on your own."

    "Modifying code or other artifacts does not make it your own. In many cases, students take deliberate measures -- rewriting comments, changing variable names, and so forth -- to disguise the fact that their work is copied from someone else. It is still not your work. Despite such cosmetic changes, similarities between student solutions are easy to detect. Programming style is highly idiosyncratic, and the chance that two submissions would be the same except for changes of the sort made easy by a text editor is vanishingly small. In addition to solutions from previous years or from other students, you may come across helpful code on the Internet or from other sources outside the class. Modifying it does not make it yours."

    "[I] allow exceptions in certain obvious instances. For example, you might be assigned to work with a project team. In that case, developing a solution as a team is expected. The instructor might also give you starter code, or permit use of local libraries. Anything which the instructor explicitly gives you doesn't normally need to be cited. Likewise, help you receive from course staff doesn't need to be cited."

    If you have any questions about any of the course policies, please don't hesitate to ask. You may post your questions on Piazza or ask me directly.



    The statistical computing package we will use in this course is R, which is available on many campus computers. You may download your own copy from http://www.r-project.org. You are required to complete your assignments in R Markdown, which is well supported by RStudio.

    Laptop Policy:

    Students are expected to participate in class, either on their own laptops or on the provided lab machines.


    Assignments and class information will be posted on Canvas and the class website.


    The Piazza forum should be used for general course-related questions that may be of interest to others in the class. For other types of questions (e.g., to report illness, request various permissions) please contact Prof. Chouldechova via email.
    Please include the course code 94842 in the subject line of your email.

    Disability Services:

    If you have a disability and need special accommodations in this class, please contact the instructor. You may also want to contact the Disability Resources office at 8-2013.

    Tentative Schedule

    Week 1: Introduction and Basics
    Lecture 1
    Introductions. Installing R on personal machines. Retrieving R packages.

    Basics of R, RStudio, R Markdown.

    Basic data types and operations: numbers, characters and composites.

    Vectors, creating sequences, common functions.

    Homework 0 assigned.

    Lecture 1 notes [Rpres] [slides]

    Lab 1 [Rmd] [html]

    Lab 1 Solutions [Rmd] [html]

    Lecture 2
    Importing tabular data.

    Simple summaries of categorical and continuous data.

    R style basics

    Lecture 2 notes [Rmd] [slides]

    Lab 2 [Rmd] [html]

    Lab 2 Solutions [Rmd] [html]
    Week 2: Data frames, functions, loops, if/else
    Lecture 3
    More on data frames and lists.

    Writing functions in R.

    If/else statements.

    Lecture 3 notes [slides] [Rmd]

    Lab 3 [Rmd] [html]
    Lecture 4
    A common data cleaning task.

    For/while loops.

    Using apply() to iterate over data.

    Using with() to specify environment.

    Lecture 4 notes [slides] [Rmd]

    An Introduction to Factors in R

    Lab 4 [Rmd] [html]
    Lab 4 Solutions [Rmd] [html]
    HW 1
    Week 3: Data summaries and Graphics
    Lecture 5
    Multivariate statistical summaries

    Introduction to ggplot2 graphics

    Lecture 5 notes [Rmd] [html]

    Lab 5 [Rmd] [html]

    Lab 5 Solutions [Rmd] [html]

    Lecture 6

    Lecture 6 notes [Rmd] [html]

    Lab 6 [Rmd] [html]

    Lab 6 Solutions [Rmd] [html]
    Homework 3 assigned.
    HW 2
    Week 4: Statistical tests and models
    Lecture 7
    Testing for differences in means between two groups

    QQ plots

    Tests for 2x2 tables

    Plotting confidence intervals

    Lecture 7 notes [Rmd] [html]

    Lab 7 [Rmd] [html]

    Lab 7 Solutions [Rmd] [html]

    Supplement: Statistical significance testing [Rmd] [html]

    Lecture 8

    Linear regression

    Assessing multicollinearity

    Diagnosing and interpreting regression

    Lecture 8 notes [Rmd] [html]

    Lab 8 [Rmd] [html]

    Lab 8 Solutions [Rmd] [html]

    Supplement: diagnostic plots for lm objects [Rmd] [html]

    Homework 4 assigned.
    HW 3
    Week 5: Linear regression
    Lecture 9
    More linear regression

    Lecture 9 notes [Rmd] [html]
    [proj.Rmd] [proj.html]

    Lab 9 [ISLE link]

    Lab 9 [Rmd] [html]

    Lab 9 Solutions [Rmd] [html]

    Supplement: Shiny apps
    Ordinary least squares
    Basic diagnostics

    Final project assigned.
    Lecture 10
    Interpreting categorical variables in regression

    Interaction terms in regression

    Lecture 10 notes [Rmd] [html]

    Lab 10 [ISLE link]
    HW 4
    Week 6: Final project, regression
    Lecture 11
    Stratified regressions

    More ggplot practice

    Lecture 11 notes [html] [Rmd]

    Lab 11 To be posted by EOD November 25
    Lecture 12

    Week 7: Interactive Graphics and Prediction
    Lecture 13
    Interactive graphics in R


    Lecture 13 notes [Rmd] [html] [shiny]

    [Rpres] [project slides]

    HW 5
    Lecture 14
    Predictive modeling

    Lecture 14 notes [Rmd] [html]