---
title: 'Lecture 2: Importing data and more basics'
author: Prof. Alexandra Chouldechova
date: Fall 2020
output:
  ioslides_presentation:
    highlight: github
    widescreen: true
    smaller: true
---
## Agenda
- Wrapping up Lecture 1 content
- Importing data
- Simple summaries of categorical and continuous data
- Coding style
- Review homework grading rubric
- Lab 2
## Wrapping up Lecture 1 content
Let's go back to where we left off in the [lecture 1 slides](http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture01/lecture01-94842-Rpres.html#/35).
## Load tidyverse
- There are many different ways to perform the data import and manipulation tasks that we will cover in today's class
- We won't go full-tidyverse just yet, but we will start to preview a handful of basic functions
- Begin by loading tidyverse
```{r, message=FALSE}
# Load tidyverse
library(tidyverse)
```
- If you do not have `tidyverse` installed yet, run the following code in your Console:
```{r, eval = FALSE}
install.packages("tidyverse")
```
## "Base R" vs tidyverse
- In this class we'll learn both about **"base R"** and the **"tidyverse"**
- The "tidyverse" describes a set of packages developed by RStudio in an attempt to streamline and unify data import, export, manipulation, summarization, and visualization tasks
- To be able to write custom functions and work with packages outside of the tidyverse (most R packages are not "tidy"!), it's worth learning some base R
## Importing data
- Start with survey results from "Homework 0"
- To import tabular data into R, we use the **`read.table()`** command
```{r}
survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv",
                     header=TRUE, sep=",")
```
- Let's parse this command one component at a time
    - The data is in a file called `survey_data2020.csv`, hosted on the course website
    - The file contains a `header` as its first row
    - The csv format means that the data is comma-separated, so `sep=","`
- Could've also used **`read.csv()`**, which is just `read.table()` with presets for CSV files, including `header=TRUE` and `sep=","`
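## Sketch: `read.table()` vs `read.csv()`

As a minimal sketch of the point above, the chunk below writes a small made-up CSV to a temporary file and reads it both ways. The file contents and the names `toy.path`, `toy.table`, and `toy.csv` are hypothetical and not part of the course data.

```{r}
# Hypothetical toy data written to a temporary file (not the course survey)
toy.path <- tempfile(fileext = ".csv")
writeLines(c("Program,TVhours", "PPM,2", "MISM,5.5", "Other,0"), toy.path)

# Spell out the header and separator with read.table()
toy.table <- read.table(toy.path, header = TRUE, sep = ",")

# Let read.csv() supply the same settings as defaults
toy.csv <- read.csv(toy.path)

# Compare the two imports
all.equal(toy.table, toy.csv)
```

For a simple file like this one, both calls should produce the same data frame; `read.csv()` just saves some typing.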
## Exploring the data
- R imports data into a **`data.frame`** object. This is the standard base R data object.
```{r}
class(survey)
```
- To view the first few rows of the data, use **`head()`**
```{r}
head(survey, 3)
```
- `head(data.frame, n)` returns the first `n` rows of the data frame
- In the Console, you can also use **`View(survey)`** to get a spreadsheet view
## Simple summary
- Use the **`str()`** function to get a simple summary of your data frame object
```{r}
str(survey)
```
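- This says that `TVhours` is a numeric variable, while all the rest are categorical. They appear as factors if strings were converted on import (the default before R 4.0.0, or with `stringsAsFactors = TRUE`), and as character vectors otherwise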
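## Sketch: factors vs character columns

Whether text columns show up as factors or as character vectors in `str()` depends on how strings were handled when the data frame was created. The chunk below is a small illustration on made-up data; `toy.chr` and `toy.fct` are hypothetical names, and `stringsAsFactors` is set explicitly so the result does not depend on the R version.

```{r}
# Hypothetical toy data; not the course survey
toy.chr <- data.frame(Program = c("PPM", "MISM", "PPM"),
                      TVhours = c(2, 5.5, 0),
                      stringsAsFactors = FALSE)
toy.fct <- data.frame(Program = c("PPM", "MISM", "PPM"),
                      TVhours = c(2, 5.5, 0),
                      stringsAsFactors = TRUE)

class(toy.chr$Program)   # strings kept as a character vector
class(toy.fct$Program)   # strings converted to a factor
levels(toy.fct$Program)  # the distinct categories of the factor
```

The same `stringsAsFactors` argument is accepted by `read.table()` and `read.csv()`.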
## Another simple summary
```{r}
summary(survey)
```
## Data frame basics
- We will talk more about lists and data frames (and their "tidy" variants, tibbles) next week, but here are a few basics
- To see what an R object is made up of, you can use `attributes()`
```{r}
attributes(survey)
```
> An R **data frame** is a *list* whose columns you can refer to by *name* or *index*
- Those `$` symbols are what tell you it's a list of some kind
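## Sketch: a data frame is a list

The statement that a data frame is a list of columns can be checked directly. This is a minimal sketch on a made-up data frame; `toy.df` is a hypothetical name and not the course survey.

```{r}
# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM"), TVhours = c(2, 5.5))

is.list(toy.df)      # a data frame is a list under the hood
names(toy.df)        # the list's element names are the column names
length(toy.df)       # number of list elements, i.e. number of columns
toy.df[["TVhours"]]  # refer to a column by name
toy.df[[2]]          # refer to the same column by index
```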
## Data frame dimensions
- We can use `nrow()` and `ncol()` to determine the number of survey responses and the number of survey questions
```{r}
nrow(survey) # Number of rows (responses)
ncol(survey) # Number of columns (questions)
```
- When writing reports, you will often want to say how large your sample size was
- To do this *inline*, use the syntax:
```{r, eval=FALSE}
`r nrow(survey)`
```
- This lets us write "`r nrow(survey)` students responded to the survey" and have the displayed number update automatically whenever the value of `nrow(survey)` changes.
## Inline code chunks example
Here's a more complex example of inline code use.
```{r, eval = FALSE}
We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents.
Respondents represented `r length(unique(survey[["Program"]]))` CMU programs.
`r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.
```
Which results in:
> We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents. Respondents represented `r length(unique(survey[["Program"]]))` CMU programs. `r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.
**IMPORTANT**: You are expected to use inline code chunks instead of copying and pasting output whenever possible.
## Indexing data frames
- There are many different ways of indexing the same piece of a data frame
- Each vector below contains `r nrow(survey)` entries. For display purposes, the settings have been adjusted so that only the first 22 are shown below
```{r, eval = FALSE}
survey[["Program"]] # "Program" element
```
```{r, echo = FALSE}
head(survey[["Program"]], 22)
```
```{r, eval = FALSE}
survey$Program # "Program" element
```
```{r, echo = FALSE}
head(survey$Program, 22) # "Program" element
```
```{r, eval = FALSE}
survey[,1] # Data from 1st column
```
```{r, echo = FALSE}
head(survey[,1], 22) # "Program" element
```
## More indexing
- Note that single brackets and double brackets have different effects
```{r, eval = FALSE}
survey[["Program"]] # Returns the Program column as a vector
```
```{r, echo = FALSE}
head(survey[["Program"]], 22)
```
```{r}
survey["Program"] # single column data frame containing only "Program"
```
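## Sketch: why the bracket type matters

The difference between single and double brackets matters as soon as the result is passed to another function. The chunk below is a small sketch on made-up data; `toy.df` is a hypothetical name and not the course survey.

```{r}
# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM", "Other"),
                     TVhours = c(2, 5.5, 0))

class(toy.df[["TVhours"]])  # double brackets: the column itself (a numeric vector)
class(toy.df["TVhours"])    # single brackets: a data frame with one column

mean(toy.df[["TVhours"]])   # works: mean() receives a numeric vector
# mean(toy.df["TVhours"])   # would warn and return NA: the argument is a data frame
```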
## Bar plot (categorical data)
Here we'll use **`qplot()`** from the `ggplot2` library (part of `tidyverse`)
```{r, fig.width = 4, fig.height = 4, fig.align='center'}
qplot(survey$Program)
```
## Histogram (continuous data)
```{r, fig.width = 4, fig.height = 4, fig.align='center'}
qplot(survey[["TVhours"]], binwidth = 3, fill = I("steelblue"))
```
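## Sketch: the same plots with `ggplot()`

Note that `qplot()` has been deprecated since ggplot2 3.4.0, so newer installations may print a deprecation warning. As a hedged alternative, the chunk below sketches how a bar plot and a histogram could be built with `ggplot()` directly. The data frame `toy.df` and the names `bar.plot` and `hist.plot` are made up for illustration.

```{r, fig.width = 4, fig.height = 4, fig.align='center'}
library(ggplot2)

# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM", "PPM", "Other"),
                     TVhours = c(2, 5.5, 0, 10))

# Bar plot of a categorical column
bar.plot <- ggplot(toy.df, aes(x = Program)) +
  geom_bar()

# Histogram of a continuous column, with a bin width of 3 and a steelblue fill
hist.plot <- ggplot(toy.df, aes(x = TVhours)) +
  geom_histogram(binwidth = 3, fill = "steelblue")

bar.plot
hist.plot
```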
## Indexing multiple columns
```{r}
# Data from 1st and 5th columns
survey[, c(1,5)]
```
```{r}
# Data from "Program" and "Editor"
survey[c("Program", "Editor")]
```
## tidy column selection: select()
It is preferable to use the **`select()`** function to select subsets of columns
```{r}
select(survey, Program, Editor)
```
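## Sketch: `select()` next to base R

To make the comparison with bracket indexing concrete, here is a minimal sketch on made-up data. `toy.df` and its values are hypothetical; only `select()` itself comes from the slides.

```{r, message=FALSE}
library(dplyr)

# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM"),
                     TVhours = c(2, 5.5),
                     Editor = c("Word", "LaTeX"))

select(toy.df, Program, Editor)  # tidy selection by unquoted column name
toy.df[c("Program", "Editor")]   # base R equivalent with quoted names
select(toy.df, -TVhours)         # drop a column instead of listing the ones to keep
```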
## Indexing rows and columns
- Data frames have two dimensions to index across
- You can use square bracket notation **`df[rows, cols]`** to extract specified `rows` and `cols` from a data frame `df`.
- There are also other approaches, as illustrated below
```{r}
survey[6, 5] # row 6, column 5
survey[6, "Program"] # Program of 6th survey respondent
survey[["Program"]][6] # Program of 6th survey respondent
```
## More indexing rows and columns
- If you leave, e.g., the `rows` value blank in **`df[rows, cols]`**, R pulls all of the rows for the specified `cols`
- Leaving `cols` blank pulls all the columns for the specified `rows`
```{r}
survey[6,] # 6th row
```
```{r}
survey[,2] # 2nd column
```
## More indexing
In Lab 1, you were introduced to the colon operator `:`. We can also use this operator for indexing.
```{r}
survey[1:3,] # equivalent to head(survey, 3)
survey[3:5, c(1,5)]
```
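## Sketch: vectors vs data frames when indexing

One detail worth knowing about `df[rows, cols]`: when a single column is requested, base R simplifies the result to a vector unless `drop = FALSE` is given. The sketch below uses a made-up data frame; `toy.df` is a hypothetical name.

```{r}
# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM", "Other"),
                     TVhours = c(2, 5.5, 0))

toy.df[2:3, ]              # rows 2 and 3, all columns
toy.df[, 2]                # one column: simplified to a vector
toy.df[, 2, drop = FALSE]  # one column, kept as a data frame
toy.df[2:3, "TVhours"]     # rows 2 and 3 of TVhours, as a vector
```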
## Subsets of data
We are often interested in learning something about a specific subset of the data
```{r, eval = FALSE}
survey[survey$Program=="MISM", ] # Data from the MISM students
survey[which(survey$Program=="MISM"), ] # Same rows here; which() also drops rows where Program is NA
```
```{r, echo = FALSE}
survey[survey$Program=="MISM", ] # Data from the MISM students
```
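## Sketch: logical indexing and missing values

Logical indexing and `which()` agree when the column has no missing values, but they differ when it does. The sketch below shows this on made-up data containing an `NA`; `toy.df`, `by.logical`, and `by.which` are hypothetical names.

```{r}
# Hypothetical toy data with one missing Program value
toy.df <- data.frame(Program = c("MISM", NA, "PPM"),
                     TVhours = c(4, 1, 6))

# The comparison is NA for the missing value, which yields an all-NA row
by.logical <- toy.df[toy.df$Program == "MISM", ]

# which() keeps only the positions that are TRUE, so the NA row is dropped
by.which <- toy.df[which(toy.df$Program == "MISM"), ]

nrow(by.logical)
nrow(by.which)
```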
## More subset examples
Let's pull all of the PPM students who have never used R before
```{r}
survey[survey$Program=="PPM" & survey$Rexperience=="Never used", ]
```
## "tidy" subsetting with `filter()`
- In general, it is preferable to use the **`filter()`** function
- Here's an example of selecting all responses from students who are either in PPM or Other and who listed their R experience as "Basic competence".
```{r}
filter(survey,
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
```
## `filter()` allows you to split conditions across lines
- On the previous slide we had
```{r, eval = FALSE}
filter(survey,
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
```
- This is equivalent to the easier-to-parse call:
```{r}
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence")
```
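## Sketch: commas in `filter()` act like `&`

The equivalence between comma-separated conditions and `&` can be checked on a small example. The data frame `toy.df` and the result names below are made up for illustration; the `%in%` variant is an optional shorthand, not something the slides require.

```{r, message=FALSE}
library(dplyr)

# Hypothetical toy data; not the course survey
toy.df <- data.frame(
  Program = c("PPM", "MISM", "Other", "PPM"),
  Rexperience = c("Basic competence", "Never used",
                  "Basic competence", "Never used")
)

# Conditions joined explicitly with &
with.and <- filter(toy.df,
                   (Program == "PPM" | Program == "Other") &
                     Rexperience == "Basic competence")

# Conditions separated by commas
with.comma <- filter(toy.df,
                     Program == "PPM" | Program == "Other",
                     Rexperience == "Basic competence")

# Shorthand for the "or" condition using %in%
with.in <- filter(toy.df,
                  Program %in% c("PPM", "Other"),
                  Rexperience == "Basic competence")

all.equal(with.and, with.comma)
all.equal(with.and, with.in)
```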
## "tidy" selection of rows and columns
- What if we wanted to select a subset of rows and columns with a combination of `filter` and `select`?
- Here's one strategy
```{r}
# First, get the desired rows
row.subset <- filter(survey,
                     Program == "PPM" | Program == "Other",
                     Rexperience == "Basic competence")
# Then, get the right columns
select(row.subset, TVhours, Editor)
```
## Piping with `%>%`
- Here's a better strategy, which uses "piping" to supply the output of one computation as an argument into the next
- Most of our data processing/summarization pipelines in R will involve lots of piping
- The symbol `%>%` is pronounced "pipe"
```{r}
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
```
## Piping: preferred style
When piping, it is best to pipe right from the start
```{r, eval = FALSE}
# OK:
filter(survey,
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

# Better:
survey %>%
  filter(Program == "PPM" | Program == "Other",
         Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
```
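## Sketch: what the pipe does

The pipe passes its left-hand side as the first argument of the function on its right, so `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below checks this on made-up data; `toy.df`, `nested`, and `piped` are hypothetical names.

```{r, message=FALSE}
library(dplyr)

# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM", "PPM"),
                     TVhours = c(2, 5.5, 0))

# Nested calls: read from the inside out
nested <- select(filter(toy.df, Program == "PPM"), TVhours)

# Piped calls: read from top to bottom
piped <- toy.df %>%
  filter(Program == "PPM") %>%
  select(TVhours)

all.equal(nested, piped)
```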
## Splitting a long expression
- As your function calls get longer and more complicated, you may find it useful to split them over multiple lines
- Suppose you had something like this:
```{r, eval = FALSE}
survey[(survey$Program == "PPM" | survey$Program == "Other") & survey$Rexperience == "Basic competence", ]
```
- You can split this across multiple lines by putting a line break **after** an operator
```{r, eval = FALSE}
survey[(survey$Program == "PPM" | survey$Program == "Other") &
         survey$Rexperience == "Basic competence", ]
```
- Note that the line break occurs after the `&` operator
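## Sketch: why the break goes after the operator

One reason for this rule shows up when an expression is not wrapped in brackets or parentheses: R stops reading as soon as a line forms a complete expression. The sketch below uses made-up variable names (`total.good`, `total.bad`) to illustrate.

```{r}
# Operator at the end of the line: the expression is unfinished,
# so R keeps reading onto the next line
total.good <- 1 +
  2

# Operator at the start of the next line: the first line is already complete,
# so the second line is evaluated on its own as "+ 2"
total.bad <- 1
  + 2

total.good  # 3
total.bad   # 1
```

Inside brackets or parentheses R keeps reading until they are closed, but ending the line with the operator is the habit that works everywhere.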
## Some simple calculations
```{r}
mean(survey$TVhours[survey$Program == "PPM"]) # Average TV hours for PPM students
mean(survey$TVhours[survey$Program == "MISM"]) # Average TV hours for MISM students
mean(survey$TVhours[survey$Program == "Other"]) # Average TV hours for "Other" students
```
## (Preview of) "tidy" data summaries with `group_by` and `summarize`
Here's a much easier and cleaner way of getting the average TV hours watched by students in each program. We use **`group_by`** and **`summarize`**.
```{r}
survey %>%
  group_by(Program) %>%
  summarize(mean(TVhours))
```
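## Sketch: naming the summary columns

Giving each summary a name makes the output easier to use later, and several summaries can be computed in one call. This is a minimal sketch on made-up data; `toy.df`, `toy.summary`, `mean.tv`, and `n.students` are hypothetical names.

```{r, message=FALSE}
library(dplyr)

# Hypothetical toy data; not the course survey
toy.df <- data.frame(Program = c("PPM", "MISM", "PPM", "MISM"),
                     TVhours = c(2, 4, 6, 10))

toy.summary <- toy.df %>%
  group_by(Program) %>%
  summarize(mean.tv = mean(TVhours),  # average TV hours per program
            n.students = n())         # number of rows per program

toy.summary
```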
## Defining variables
- If we wanted to focus on a particular column of the data frame, we could always define it as a new variable (e.g., if you want to easily experiment on a column)
```{r}
tv.hours <- survey$TVhours # Vector of TVhours watched
mean(tv.hours) # Average time spent watching TV
sd(tv.hours) # Standard deviation of TV watching time
```
```{r}
sum(tv.hours >= 5) # How many people watched 5 or more hours of TV?
```
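## Sketch: counting with logical vectors

The `sum(tv.hours >= 5)` idiom works because a comparison returns a logical vector, and `TRUE` counts as 1 and `FALSE` as 0 in arithmetic. The sketch below uses a made-up vector, `toy.hours`, to show both the count and the proportion.

```{r}
# Hypothetical toy data; not the course survey
toy.hours <- c(0, 2, 5, 7.5, 10)

toy.hours >= 5        # logical vector: one TRUE/FALSE per entry
sum(toy.hours >= 5)   # number of entries that are 5 or more
mean(toy.hours >= 5)  # proportion of entries that are 5 or more
```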
## R coding style
- Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks
- Borrowing Hadley Wickham's words:
*You don’t have to use my style, but you really should use a **consistent** style.*
- [This style guide](https://bookdown.org/dli/rguide/r-style-guide.html) is short and easy to follow
- We'll revisit the question of coding style several times over the course of the class
## Enforced style: assignment operator
**Assignment operator**. Use **`<-`** not **`=`**
- Style guides uniformly promote the use of `<-` instead of `=` as the assignment operator
```{r, eval=FALSE}
student.names <- c("Eric", "Hao", "Jennifer") # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad
```
- Use `=` when specifying function arguments:
```{r, eval=FALSE}
sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Runs, but also creates a variable named decreasing
```
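## Sketch: what `<-` does inside a function call

To see why `<-` inside a function call is "not what you want", the sketch below runs both versions on a made-up vector. `toy.hours`, `sorted.good`, and `sorted.odd` are hypothetical names.

```{r}
# Hypothetical toy data; not the course survey
toy.hours <- c(2, 10, 5)

# = names the argument and has no other effect
sorted.good <- sort(toy.hours, decreasing = TRUE)

# <- first assigns TRUE to a variable called decreasing,
# then passes that value to sort() by position
sorted.odd <- sort(toy.hours, decreasing <- TRUE)

identical(sorted.good, sorted.odd)  # the sorted results match
decreasing                          # but a stray variable now sits in the workspace
```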
## Enforced style: Spacing
- Binary operators should have spaces around them
- Commas should have a space after, but not before (just like in writing)
```{r, eval=FALSE}
3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad
```
- For specifying arguments, spacing around `=` is optional
```{r, eval=FALSE}
sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted
```
## Enforced style: Variable names
- To make code easy to read, debug, and maintain, you should use **concise** but **descriptive** variable names
- Terms in variable names should be separated by `_` or `.`
```{r, eval=FALSE}
# Accepted
day_one day.one day_1 day.1 day1
# Bad
d1 DayOne dayone
# Can be made more concise:
first.day.of.the.month
```
- Avoid using variable names that are already pre-defined variables or functions in R
```{r, eval=FALSE}
# EXTREMELY bad:
c T pi sum mean
```
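## Sketch: what goes wrong when you reuse a built-in name

As a small illustration of the risk, the sketch below overwrites `T`, which is an ordinary variable that happens to hold `TRUE` by default. It assumes `T` has not already been redefined in your session.

```{r}
isTRUE(T)  # T is a built-in shorthand for TRUE

T <- 0     # legal, because T is a variable (unlike the reserved word TRUE)
isTRUE(T)  # any code that relies on T now gets 0 instead of TRUE

rm(T)      # remove our copy so the built-in T is visible again
isTRUE(T)
```

This is also a good reason to write `TRUE` and `FALSE` in full in your own code.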
## Assignments
- **Homework 1** will be posted today
    - **Due: Wednesday, November 4, 1:30PM ET**
    - Submit your .Rmd and .html files on Canvas (do not zip them together)
    - Course website contains grading rubric
- **Lab 2** is now available
- Remember: To earn a participation point for today's class, you must submit Lab 1 and Lab 2 by Friday evening (ET)