---
title: 'Lecture 2: Importing data and more basics'
author: Prof. Alexandra Chouldechova
date: Fall 2020
output:
  ioslides_presentation:
    highlight: github
    widescreen: true
    smaller: true
---

## Agenda

- Wrapping up Lecture 1 content
- Importing data
- Simple summaries of categorical and continuous data
- Coding style
- Review homework grading rubric
- Lab 2

## Wrapping up Lecture 1 content

Let's go back to where we left off in the [lecture 1 slides](http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture01/lecture01-94842-Rpres.html#/35).

## Load tidyverse

- There are many different ways to perform the data import and manipulation tasks that we will cover in today's class
- We won't go full-tidyverse just yet, but we will start to preview a handful of basic functions
- Begin by loading tidyverse

```{r, message=FALSE}
# Load tidyverse
library(tidyverse)
```

- If you do not have `tidyverse` installed yet, run the following code in your Console:

```{r, eval = FALSE}
install.packages("tidyverse")
```

## "Base R" vs tidyverse

- In this class we'll learn about both **"base R"** and the **"tidyverse"**
- The "tidyverse" describes a set of packages developed by RStudio in an attempt to streamline and unify data import, export, manipulation, summarization, and visualization tasks
- To be able to write custom functions and work with packages outside of the tidyverse (most R packages are not "tidy"!),
  it's worth learning some base R

## Importing data

- Start with survey results from "Homework 0"
- To import tabular data into R, we use the **`read.table()`** command

```{r}
survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv",
                     header=TRUE, sep=",")
```

- Let's parse this command one component at a time
- The data is in a file called `survey_data2020.csv`, which is hosted on the course website
- The file contains a header as its first row, so `header=TRUE`
- The csv format means that the data is comma-separated, so `sep=","`
- Could've also used **`read.csv()`**, which is `read.table()` with presets suited to CSV files, including `header=TRUE` and `sep=","`

## Exploring the data

- R imports data into a **`data.frame`** object. This is the standard base R data object.

```{r}
class(survey)
```

- To view the first few rows of the data, use **`head()`**

```{r}
head(survey, 3)
```

- `head(data.frame, n)` returns the first `n` rows of the data frame
- In the Console, you can also use **`View(survey)`** to get a spreadsheet view

## Simple summary

- Use the **`str()`** function to get a simple summary of your data frame object

```{r}
str(survey)
```
- This says that TVhours is a numeric variable, while all the rest are factors (categorical)
- Note that in R 4.0 and later, `read.table()` defaults to `stringsAsFactors = FALSE`, so text columns show up as `chr` unless you ask for factors

## Another simple summary

```{r}
summary(survey)
```

## Data frame basics

- We will talk more about lists and data frames (and their "tidy" variants, tibbles) next week, but here are a few basics
- To see what an R object is made up of, you can use `attributes()`

```{r}
attributes(survey)
```

> An R **data frame** is a *list* whose columns you can refer to by *name* or *index*

- Those `$` symbols are what tell you it's a list of some kind

## Data frame dimensions

- We can use `nrow()` and `ncol()` to determine the number of survey responses and the number of survey questions

```{r}
nrow(survey) # Number of rows (responses)
ncol(survey) # Number of columns (questions)
```

- When writing reports, you will often want to say how large your sample size was
- To do this *inline*, use the syntax:

```{r, eval=FALSE}
`r nrow(survey)`
```

- This allows us to write "`r nrow(survey)` students responded to the survey", and have the number displayed automatically change when the value of `nrow(survey)` changes.

## Inline code chunks example

Here's a more complex example of inline code use.

```{r, eval = FALSE}
We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents.
Respondents represented `r length(unique(survey[["Program"]]))` CMU programs.
`r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.
```

Which results in:

> We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents. Respondents represented `r length(unique(survey[["Program"]]))` CMU programs. `r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.
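
To see what each inline expression above evaluates to, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical and only stand in for the survey data.

```{r}
# Hypothetical toy data standing in for the survey (values are made up)
toy <- data.frame(
  Program = c("PPM", "MISM", "PPM", "Other"),
  TVhours = c(3, 10, 0, 5)
)

ncol(toy)                          # number of columns: 2
nrow(toy)                          # number of rows: 4
length(unique(toy[["Program"]]))   # number of distinct programs: 3
sum(toy[["Program"]] == "PPM")     # number of PPM rows: 2
```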
**IMPORTANT**: You are expected to use inline code chunks instead of copying and pasting output whenever possible.

## Indexing data frames

- There are many different ways of indexing the same piece of a data frame
- Each vector below contains `r nrow(survey)` entries. For display purposes, the settings have been adjusted so that only the first 22 are shown below

```{r, eval = FALSE}
survey[["Program"]] # "Program" element
```

```{r, echo = FALSE}
head(survey[["Program"]], 22)
```

```{r, eval = FALSE}
survey$Program # "Program" element
```

```{r, echo = FALSE}
head(survey$Program, 22) # "Program" element
```

```{r, eval = FALSE}
survey[,1] # Data from 1st column
```

```{r, echo = FALSE}
head(survey[,1], 22) # "Program" element
```

## More indexing

- Note that single brackets and double brackets have different effects

```{r, eval = FALSE}
survey[["Program"]] # Returns the Program column as a vector
```

```{r, echo = FALSE}
head(survey[["Program"]], 22)
```

```{r}
survey["Program"] # single column data frame containing only "Program"
```

## Bar plot (categorical data)

Here we'll use **`qplot()`** from the `ggplot2` package (part of `tidyverse`)

```{r, fig.width = 4, fig.height = 4, fig.align='center'}
qplot(survey$Program)
```

## Histogram (continuous data)

```{r, fig.width = 4, fig.height = 4, fig.align='center'}
qplot(survey[["TVhours"]], binwidth = 3, fill = I("steelblue"))
```

## Indexing multiple columns

```{r}
# Data from 1st and 5th columns
survey[, c(1,5)]
```

```{r}
# Data from "Program" and "Editor"
survey[c("Program", "Editor")]
```
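
## Indexing multiple columns: a toy check

As a minimal sketch of the two approaches above, the chunk below builds a small made-up data frame (`toy` and its values are hypothetical, not the survey data) and checks that selecting columns by position and by name gives the same result. It also shows that single-bracket selection of one column keeps a data frame, while the `[rows, cols]` form drops to a vector.

```{r}
# Hypothetical toy data; the column positions here are an assumption
# and need not match the real survey file
toy <- data.frame(
  Program = c("PPM", "MISM", "Other"),
  TVhours = c(3, 10, 0),
  Editor  = c("Word", "LaTeX", "Word")
)

by.position <- toy[, c(1, 3)]            # 1st and 3rd columns
by.name <- toy[c("Program", "Editor")]   # same columns, by name

all.equal(by.position, by.name)   # TRUE
is.data.frame(toy["Program"])     # TRUE: still a data frame
is.data.frame(toy[, "Program"])   # FALSE: dropped to a vector
```

Selecting by name is usually the safer habit, since it keeps working if the column order of the file changes.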
## tidy column selection: select()

It is preferable to use the **`select()`** function to select subsets of columns

```{r}
select(survey, Program, Editor)
```

## Indexing rows and columns

- Data frames have two dimensions to index across
- You can use square bracket notation **`df[rows, cols]`** to extract specified `rows` and `cols` from a data frame `df`.
- There are also other approaches, as illustrated below

```{r}
survey[6, 5] # row 6, column 5
survey[6, "Program"] # Program of 6th survey respondent
survey[["Program"]][6] # Program of 6th survey respondent
```

## More indexing rows and columns

- If you leave, e.g., the `rows` value blank in **`df[rows, cols]`**, it will pull all of the rows for the specified `cols`
- Leaving `cols` blank pulls all the columns for the specified `rows`

```{r}
survey[6,] # 6th row
```

```{r}
survey[,2] # 2nd column
```

## More indexing

In Lab 1, you were introduced to the colon operator `:`. We can use this operator for indexing

```{r}
survey[1:3,] # equivalent to head(survey, 3)
survey[3:5, c(1,5)]
```

## Subsets of data

We are often interested in learning something about a specific subset of the data

```{r, eval = FALSE}
survey[survey$Program=="MISM", ] # Data from the MISM students
survey[which(survey$Program=="MISM"), ] # Does the same thing
```

```{r, echo = FALSE}
survey[survey$Program=="MISM", ] # Data from the MISM students
```

## More subset examples

Let's pull all of the PPM students who have never used R before

```{r}
survey[survey$Program=="PPM" & survey$Rexperience=="Never used", ]
```

## "tidy" subsetting with `filter()`

- In general, it is preferable to use the **`filter()`** function
- Here's an example of selecting all responses from students who are either in PPM or Other and who listed their R experience as "Basic competence".

```{r}
filter(survey, (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
```

## `filter()` allows you to split conditions across lines

- On the previous slide we had

```{r, eval = FALSE}
filter(survey, (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
```

- This is equivalent to the easier-to-parse call below, because `filter()` combines comma-separated conditions with `&`:

```{r}
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence")
```

## "tidy" selection of rows and columns

- What if we wanted to select a subset of rows and columns with a combination of `filter` and `select`?
- Here's one strategy

```{r}
# First, get the desired rows
row.subset <- filter(survey,
                     Program == "PPM" | Program == "Other",
                     Rexperience == "Basic competence")
# Then, get the right columns
select(row.subset, TVhours, Editor)
```

## Piping with `%>%`

- Here's a better strategy, which uses "piping" to supply the output of one computation as an argument into the next
- Most of our data processing/summarization pipelines in R will involve lots of piping
- The symbol `%>%` is pronounced "pipe"

```{r}
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
```

## Piping: preferred style

When piping, it is best to pipe right from the start

```{r, eval = FALSE}
# OK:
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

# Better:
survey %>%
  filter(Program == "PPM" | Program == "Other",
         Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
```

## Splitting a long expression

- As your function calls get longer and more complicated, you may find it useful to split them over multiple lines
- Suppose you had something like this:

```{r, eval = FALSE}
survey[(survey$Program == "PPM" | survey$Program == "Other") & survey$Rexperience == "Basic competence", ]
```

- You can split this across multiple lines by putting a line break **after** an operator

```{r, eval = FALSE}
survey[(survey$Program == "PPM" | survey$Program == "Other") &
         survey$Rexperience == "Basic competence", ]
```

- Note that the line break occurs after the `&` operator

## Some simple calculations

```{r}
mean(survey$TVhours[survey$Program == "PPM"]) # Average time PPM students spent watching TV
mean(survey$TVhours[survey$Program == "MISM"]) # Average time MISM students spent watching TV
mean(survey$TVhours[survey$Program == "Other"]) # Average time "Others" spent watching TV
```

## (Preview of) "tidy" data summaries with `group_by` and `summarize`

Here's a much easier and cleaner way of getting the average TV hours watched by students in each program. We use **`group_by`** and **`summarize`**

```{r}
survey %>%
  group_by(Program) %>%
  summarize(mean(TVhours))
```

## Defining variables

- If we wanted to focus on a particular column of the data frame, we could always define it as a new variable (e.g., if you want to easily experiment on a column)

```{r}
tv.hours <- survey$TVhours # Vector of TVhours watched
mean(tv.hours) # Average time spent watching TV
sd(tv.hours) # Standard deviation of TV watching time
```

```{r}
sum(tv.hours >= 5) # How many people watched 5 or more hours of TV?
```

## R coding style

- Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks
- Borrowing Hadley Wickham's words:
  *You don’t have to use my style, but you really should use a **consistent** style.*
- [This style guide](https://bookdown.org/dli/rguide/r-style-guide.html) is short and easy to follow
- We'll revisit the question of coding style several times over the course of the class

## Enforced style: assignment operator

**Assignment operator**. Use **`<-`** not **`=`**

- Style guides uniformly promote the use of `<-` instead of `=` as the assignment operator

```{r, eval=FALSE}
student.names <- c("Eric", "Hao", "Jennifer") # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad
```

- Use `=` when specifying function arguments:

```{r, eval=FALSE}
sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Works, but not what you want
```

## Enforced style: Spacing

- Binary operators should have spaces around them
- Commas should have a space after, but not before (just like in writing)

```{r, eval=FALSE}
3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad
```

- For specifying arguments, spacing around `=` is optional

```{r, eval=FALSE}
sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted
```

## Enforced style: Variable names

- To make code easy to read, debug, and maintain, you should use **concise** but **descriptive** variable names
- Terms in variable names should be separated by `_` or `.`

```{r, eval=FALSE}
# Accepted
day_one  day.one  day_1  day.1  day1

# Bad
d1  DayOne  dayone

# Can be made more concise:
first.day.of.the.month
```

- Avoid using variable names that are already defined as variables or functions in R

```{r, eval=FALSE}
# EXTREMELY bad:
c  T  pi  sum  mean
```

## Assignments

- **Homework 1** will be posted today
- **Due: Wednesday, November 4, 1:30PM ET**
- Submit your .Rmd and .html files on Canvas (do not zip them together)
- Course website contains grading rubric
- **Lab 2** is now available
- Remember: To earn a participation point for today's class, you must submit Lab 1 and Lab 2 by Friday evening (ET)
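
## Aside: `<-` inside a function call

Returning to the assignment-operator slide, here is a minimal sketch of why `sort(tv.hours, decreasing<-TRUE)` "works, but is not what you want". The vector below is made up for illustration and is not the survey data. The call does sort in decreasing order, but only because `TRUE` is passed by position, and as a side effect it creates a variable named `decreasing` in your workspace.

```{r}
# Hypothetical values, not taken from the survey
tv.hours <- c(3, 10, 0, 5)

# `decreasing <- TRUE` is evaluated as an ordinary assignment,
# and its value (TRUE) is passed as the second positional argument
result <- sort(tv.hours, decreasing<-TRUE)

result                 # 10 5 3 0
exists("decreasing")   # TRUE: a stray variable was created
```

Using `decreasing = TRUE` names the argument explicitly and leaves the workspace untouched.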