Lecture 1: Introduction and Basics
====
author: Prof. Alexandra Chouldechova
date: 94-842
font-family: Gill Sans
autosize: false
width:1920
height:1080

What are we trying to accomplish?
====

Here's a sample analysis.

> The analysis was shown only in class and is not viewable in this version of the notes.

Agenda
========================================================

- Course overview
- Introduction to R, RStudio and R Notebooks/R Markdown
- Programming basics

How this class will work
========================================================

- No programming knowledge presumed
- Some stats knowledge presumed. E.g.:
    - Hypothesis testing (t-tests, confidence intervals)
    - Linear regression
- Class attendance is mandatory
- Class will be _very_ cumulative

Mechanics
========================================================

- Two 80 minute lectures a week:
    - First 60-80 minutes: concepts, methods, examples
    - Last 0-20 minutes: short labs (time permitting)
- Class participation (10%)
- Quizzes (10%)
- Weekly homework (35%)
- Final project (2.5 weeks) (45%)
- **Disclaimer:** To pass the class, you must achieve a passing score on the final project (at least 23 / 45)

Mechanics
===

- __Class participation__ (10%)
    - **Labs**: Each lecture has an accompanying lab assignment.
    - Friday Lab sessions give you an opportunity to work on the labs
    - Course website shows how participation grade will be calculated
- __Quizzes__ (10%)
    - 4 quizzes in the second half of term. Dates TBA.
- __Homework assignments__ (35%)
    - There will be 5 weekly HW assignments
    - Single _lowest_ HW score will be dropped
    - HW assigned on Thursdays, **due Thursdays at 2:50pm**
    - Late homework __will not be accepted for credit__
- __Final project__ (45%)
    - You will write a report analysing a policy question using a publicly available data set

Course resources
========================================================

- Assignments, office hours, class notes, grading policies, useful references on R: http://www.andrew.cmu.edu/~achoulde/94842/
- Canvas for __gradebook__ and for __turning in homework__
- Piazza for __forum__
    - Please __post class/homework related questions on Piazza__ instead of emailing the teaching staff
- Check the class website for everything else
- No required textbook, but several are _recommended_:
    - Garrett Grolemund and Hadley Wickham, _R for Data Science_
    - Phil Spector, _Data Manipulation with R_
    - Winston Chang, _The R Graphics Cookbook_

Goal of this class
=====

> This class will teach you to use R to:

- Generate graphical and tabular data summaries
- Perform statistical analyses (e.g., hypothesis testing, regression modeling)
- Produce _reproducible_ statistical reports using R Markdown and R Notebooks
- Integrate R with other tools (e.g., databases, the web)

Why R?
=====

- Free (open-source)
- Programming language (not point-and-click)
- Excellent graphics
- Offers broadest range of statistical tools
- Easy to generate reproducible reports
- Easy to integrate with other tools

The R Console
====
left:30

Basic interaction with R is through typing in the **console**

This is the **terminal** or **command-line** interface

***

The R Console
====

- You type in commands, R gives back answers (or errors)
- Menus and other graphical interfaces are extras built on top of the console
- We will use **RStudio** in this class

1. Download R: http://lib.stat.cmu.edu/R/CRAN
2. Then download RStudio: http://www.rstudio.com/

======
left:30

**RStudio** is an IDE for R

RStudio has 4 main windows ('panes'):

- Source
- Console
- Workspace/History
- Files/Plots/Packages/Help

***

Console pane
========
left: 35

- Use the **Console** pane to type or paste commands to get output from R
- To look up the help file for a function or data set, type `?function` into the Console
    - E.g., try typing in `?mean`
- Use the `tab` key to auto-complete function and object names

***

Source pane
========
left: 35

- Use the **Source** pane to create and edit R and Rmd files
- The menu bar of this pane contains handy shortcuts for sending code to the **Console** for evaluation

***

Files/Plots/Packages/Help pane
========
left: 35

- By default, any figures you produce in R will be displayed in the **Plots** tab
    - Menu bar allows you to Zoom, Export, and Navigate back to older plots
- When you request a help file (e.g., `?mean`), the documentation will appear in the **Help** tab

***

RStudio: Panes overview
========

1. __Source__ pane: create a file that you can save and run later
2. __Console__ pane: type or paste in commands to get output from R
3. __Workspace/History__ pane: see a list of variables or previous commands
4. __Files/Plots/Packages/Help__ pane: see plots, help pages, and other items in this window.
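
As a quick illustration of the Console and Help workflow described above, the sketch below calls `mean()` on a few numbers. The input values are made up for illustration, and `c()`, which combines them into one object, is covered later in the lecture.

```{r}
# At the console, typing ?mean opens the documentation for mean() in the Help tab.
# Calling the function prints an answer in the Console pane:
mean(c(2, 4, 9))   # hypothetical input values, chosen only for illustration
```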
RStudio: Source and Console panes
========================================================

RStudio: Console
========================================================

RStudio: Toolbar
=======

R Markdown, R Notebooks
====

- R Markdown allows the user to integrate R code into a report
- When data changes or code changes, so does the report
- No more need to copy-and-paste graphics, tables, or numbers
- Creates __reproducible__ reports
    - Anyone who has your R Markdown (.Rmd) file and input data can re-run your analysis and get the exact same results (tables, figures, summaries)
- R Notebooks are R Markdown documents that allow you to execute code interactively and view the output in the notebook itself.
- Can output report in HTML (default), Microsoft Word, or PDF

R Markdown
====
left: 30

- This example shows an **R Markdown** (.Rmd) file opened in the Source pane of RStudio.
- To turn an Rmd file into a report, click the **Knit HTML** button in the Source pane menu bar
- The results will appear in a **Preview window**, as shown on the right
- You can knit into HTML (default), MS Word, and PDF format
- These lecture slides are also created in RStudio (R Presentation)

***

R Markdown
====
left: 30

- To integrate R output into your report, you need to use R code chunks
- All of the code that appears in between the "triple back-ticks" gets executed when you Knit

***

In-class exercise: Hello world!
====

1. Open **RStudio** on your machine
2. File > New File > R Markdown ...
3. Change `summary(cars)` in the first code block to `print("Hello world!")`
4. Click `Knit HTML` to produce an HTML file.
5. Save your Rmd file as `helloworld.Rmd`

> All of your Homework assignments and many of your Labs will take the form of a single Rmd file, which you will edit to include your solutions and then submit on Canvas.
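
After step 3 of the exercise, the first code chunk of `helloworld.Rmd` would look roughly like the sketch below. This assumes the rest of the template that RStudio generates is left unchanged.

```{r}
print("Hello world!")   # knitting runs this line and places its output in the report
```

In the knitted HTML, the output line `[1] "Hello world!"` appears directly beneath the chunk (knitr prefixes output lines with `##` by default).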
Basics: the class in a nutshell
=====

- Everything we'll do comes down to applying **functions** to **data**
- **Data**: things like 7, "seven", $7.000$, the matrix $\left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right]$
- **Functions**: things like $\log{}$, $+$ (two arguments), $<$ (two), $\mod{}$ (two), `mean` (one)

> A function is a machine which turns input objects (**arguments**) into an output object (**return value**), possibly with **side effects**, according to a definite rule

Data building blocks
====

You'll encounter different kinds of data types

- **Booleans**: Direct binary values: `TRUE` or `FALSE` in R
- **Integers**: whole numbers (positive, negative or zero)
- **Characters**: fixed-length blocks of bits, with special coding; **strings** = sequences of characters
- **Floating point numbers**: a fraction (with a finite number of bits) times an exponent, like $1.87 \times {10}^{6}$
- **Missing or ill-defined values**: `NA`, `NaN`, etc.

Operators (functions)
====

You can use R as a very, very fancy calculator

Command | Description
--------|-------------
`+, -, *, /` | add, subtract, multiply, divide
`^` | raise to the power of
`%%` | remainder after division (ex: `8 %% 3 = 2`)
`( )` | change the order of operations
`log(), exp()` | natural logarithm and exponential (ex: `log(10) = 2.302`)
`sqrt()` | square root
`round()` | round to the nearest whole number (ex: `round(2.3) = 2`)
`floor(), ceiling()` | round down or round up
`abs()` | absolute value

===

```{r}
7 + 5 # Addition
7 - 5 # Subtraction
7 * 5 # Multiplication
7 ^ 5 # Exponentiation
```

====

```{r}
7 / 5 # Division
7 %% 5 # Modulus
7 %/% 5 # Integer division
```

Operators cont'd.
===

**Comparisons** are also binary operators; they take two objects, like numbers, and give a Boolean

```{r}
7 > 5
7 < 5
7 >= 7
7 <= 5
```

===

```{r}
7 == 5
7 != 5
```

Boolean operators
===

Basically "and" and "or":

```{r}
(5 > 7) & (6*7 == 42)
(5 > 7) | (6*7 == 42)
```

(will see special doubled forms, `&&` and `||`, later)

More types
===

- `typeof()` function returns the type
- `is.`_foo_`()` functions return Booleans for whether the argument is of type _foo_
- `as.`_foo_`()` (tries to) "cast" its argument to type _foo_, i.e., to translate it sensibly into a _foo_-type value

**Special case**: `as.factor()` will be important later for telling R when numbers are actually encodings and not numeric values. (E.g., 1 = High school grad; 2 = College grad; 3 = Postgrad)

===

```{r}
typeof(7)
is.numeric(7)
is.na(7)
```

===

```{r}
is.character(7)
is.character("7")
is.character("seven")
is.na("seven")
```

Variables
===

We can give names to data objects; these give us **variables**

A few variables are built in:

```{r}
pi
```

Variables can be arguments to functions or operators, just like constants:

```{r}
pi*10
cos(pi)
```

Assignment operator
===

Most variables are created with the **assignment operator**, `<-` or `=`

```{r}
time.factor <- 12
time.factor
time.in.years = 2.5
time.in.years * time.factor
```

===

The assignment operator also changes values:

```{r}
time.in.months <- time.in.years * time.factor
time.in.months
time.in.months <- 45
time.in.months
```

===

- Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read
- Avoid "magic constants"; use named variables
- Use descriptive variable names
    - Good: `num.students <- 35`
    - Bad: `ns <- 35`

The workspace
===

What names have you defined values for?
```{r}
ls()
```

Getting rid of variables:

```{r}
rm("time.in.months")
ls()
```

First data structure: vectors
===

- Group related data values into one object, a **data structure**
- A **vector** is a sequence of values, all of the same type
- `c()` function returns a vector containing all its arguments in order

```{r}
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm <- c(80, 90, 93, 82, 95)
```

- Typing the variable name at the prompt causes it to display

```{r}
students
```

Indexing
====

- `vec[1]` is the first element, `vec[4]` is the 4th element of `vec`

```{r}
students
students[4]
```

- `vec[-4]` is a vector containing all but the fourth element

```{r}
students[-4]
```

Vector arithmetic
===

Operators apply to vectors "pairwise" or "elementwise":

```{r}
final <- c(78, 84, 95, 82, 91) # Final exam scores
midterm # Midterm exam scores
midterm + final # Sum of midterm and final scores
(midterm + final)/2 # Average exam score
course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades
```

Pairwise comparisons
===

Is the final score higher than the midterm score?
```{r}
midterm
final
final > midterm
```

Boolean operators can be applied elementwise:

```{r}
(final < midterm) & (midterm > 80)
```

Functions on vectors
===

Command | Description
--------|------------
`sum(vec)` | sums up all the elements of `vec`
`mean(vec)` | mean of `vec`
`median(vec)` | median of `vec`
`min(vec), max(vec)` | the smallest or largest element of `vec`
`sd(vec), var(vec)` | the standard deviation and variance of `vec`
`length(vec)` | the number of elements in `vec`
`pmax(vec1, vec2), pmin(vec1, vec2)` | example: `pmax(quiz1, quiz2)` returns the higher of quiz 1 and quiz 2 for each student
`sort(vec)` | returns `vec` in sorted order
`order(vec)` | returns the indices that would sort `vec`
`unique(vec)` | lists the unique elements of `vec`
`summary(vec)` | gives a five-number summary plus the mean (for numeric `vec`)
`any(vec), all(vec)` | `TRUE` if any (or all) elements of a Boolean vector are `TRUE`

Functions on vectors
===

```{r}
course.grades
mean(course.grades) # mean grade
median(course.grades) # median grade
sd(course.grades) # grade standard deviation
```

More functions on vectors
===

```{r}
sort(course.grades)
max(course.grades) # highest course grade
min(course.grades) # lowest course grade
```

Referencing elements of vectors
===

```{r}
students
```

Vector of indices:

```{r}
students[c(2,4)]
```

Vector of negative indices:

```{r}
students[c(-1,-3)]
```

More referencing
===

`which()` returns the indices of the `TRUE` elements of a Boolean vector:

```{r}
course.grades
a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans
a.students <- which(course.grades >= a.threshold) # Applying which()
a.students
students[a.students] # Names of A students
```

Named components
===

You can give names to elements or components of vectors

```{r}
students
names(course.grades) <- students # Assign names to the grades
names(course.grades)
course.grades[c("Sean", "Frank","Li")] # Get final grades for 3 students
```

Note the labels in what R prints; these are not actually part of the value

Useful RStudio tips
====

Keystroke | Description
----------|-------------
`Tab` | autocompletes commands and filenames, and lists arguments for functions. Highly useful!
`Up arrow` | cycle through previous commands in the console prompt
`Ctrl+Up arrow` (`Cmd` on macOS) | lists history of previous commands matching an unfinished one
`Ctrl+Enter` (`Cmd` on macOS) | execute current line
`Esc` | abort an unfinished command and get out of the + prompt
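
As material to practice these shortcuts on, here is a short sketch that uses a few of the vector functions listed earlier but not demonstrated in the examples (`pmax()`, `order()`, `any()`, `all()`). It redefines the lecture's `students`, `midterm`, and `final` vectors with the same values so that the chunk runs on its own. The names `best.exam` and `ranking` are introduced here for illustration only.

```{r}
# Same values as the lecture's running example, redefined so this chunk is self-contained
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm  <- c(80, 90, 93, 82, 95)
final    <- c(78, 84, 95, 82, 91)

best.exam <- pmax(midterm, final)            # higher of the two exam scores, per student
best.exam

ranking <- order(final, decreasing = TRUE)   # indices that sort final from high to low
students[ranking]                            # names ordered by final exam score

any(final > midterm)   # did at least one student improve on the final?
all(final >= 80)       # did every student score at least 80 on the final?
```

Note that `order()` returns positions rather than values, which is why it can be used to reorder a different vector (`students`) of the same length.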

**"Homework" 0**: Course survey

- You will receive a survey link after today's class
- Please complete the survey!
- Your (anonymized) responses will be used in Lecture 2.

**Lab 1**: http://www.andrew.cmu.edu/~achoulde/94842/

- Look under Tentative Schedule for today's lecture
- Submit modified .Rmd file on Canvas by end of day on Friday