---
title: 'Lecture 4: Basic cleaning, loops, and alternatives'
author: "Prof. Alexandra Chouldechova"
date: "Fall 2020"
output:
  ioslides_presentation:
    highlight: tango
    widescreen: true
    smaller: true
---
## Agenda
- Lists
- A common data cleaning task
- Factor variables, and when they're useful
- Functions
- If-else statements
- For/while loops to iterate over data
- R coding style
- Rather than picking up where Lecture 3 left off, I've woven the Lecture 3 content we haven't yet covered into the Lecture 4 notes
## Package loading
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
Cars93 <- MASS::Cars93 # For Cars93 data again
```
## Basics of lists
> A list is a **data structure** that can be used to store **different kinds** of data
- Recall: a vector is a data structure for storing *similar kinds of data*
- To better understand the difference, consider the following example.
```{r}
my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1
typeof(my.vector.1) # All the elements are now character strings!
```
## Lists vs. vectors
```{r}
my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
my.vector.2
typeof(my.vector.2)
```
- Vectors expect elements to be all of the same type (e.g., `logical`, `numeric`, `character`)
- When data of different types are put into a vector, R converts everything to a common type
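The conversion follows a fixed ordering of types: logical, then integer, then double, then character. Here is a minimal sketch of that rule, using made-up values:

```{r}
# Coercion goes to the most flexible type present:
# logical < integer < double < character
typeof(c(TRUE, 2L))        # "integer"
typeof(c(TRUE, 2L, 3.5))   # "double"
typeof(c(TRUE, 2L, "a"))   # "character"
c(FALSE, TRUE, 27)         # FALSE and TRUE become 0 and 1
```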
## Lists
- To store data of different types in the same object, we use lists
- Simple way to construct lists: use the **`list()`** function
- (We'll learn about functions like `map` and `map_chr` soon.)
```{r}
my.list <- list("Michael", 165, TRUE)
my.list
map_chr(my.list, typeof)
```
## Named elements
```{r}
patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
```
## Referencing elements of a list (similar to data frames)
```{r}
patient.1$name # Get "name" element (returns a string)
patient.1[["name"]] # Get "name" element (returns a string)
patient.1["name"] # Get "name" slice (returns a list)
c(typeof(patient.1$name), typeof(patient.1["name"]))
```
## A common problem
- One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns
- For a simple example, let's look at the `TVhours` column in a messy version of the survey data from Lecture 2
```{r}
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020_messy.csv",
                         header=TRUE, stringsAsFactors = FALSE)
# Print out first 20 elements
head(survey.messy$TVhours, 20)
```
- **NOTE**: If you've installed R within the past few months (version 4.0.0 or later), your version will automatically default to `stringsAsFactors = FALSE`. My version of R is older and still has the old `stringsAsFactors = TRUE` default, a convention that dates back to 1998.
- For a thrilling read, take a look at [this blog post](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/) by the R development team
## What's happening?
```{r}
str(survey.messy)
```
- Several of the entries have non-numeric values in them (they contain strings)
- As a result, `TVhours` is being imported as a `character` vector
## A look at the TVhours column
```{r}
survey.messy$TVhours
```
## Partial fix
- In Lecture 1 we saw that there exists a family of `as.`*type* functions that will try to convert objects from one data type to the specified *type*
- We want `TVhours` to be numeric, so let's try `as.numeric`
```{r}
as.numeric(survey.messy$TVhours)
```
## We can do a bit better
```{r}
as.numeric(survey.messy$TVhours)
```
- All the corrupted cells now appear as `NA`, which is R's missing indicator
- We can do a little better by looking at the corrupted entries and seeing if we can recover more information from the cells that contained non-numeric values
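One way to look at the corrupted entries is to pull out exactly those values that `as.numeric()` could not convert. This is only a sketch: `tv.raw` is a made-up vector standing in for `survey.messy$TVhours`.

```{r}
# Hypothetical messy entries, made up for illustration
tv.raw <- c("2", "3.5", "4 hours", "none", "10+", "1")
tv.num <- suppressWarnings(as.numeric(tv.raw))  # silence the coercion warning
# Which original entries failed to convert?
tv.raw[is.na(tv.num)]
```

The same indexing idea should carry over to the real column, where it shows which cells might still hold recoverable numbers.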
## Deleting non-numeric (or .) characters
- Here we'll use the **`gsub()`** function (global substitution) to clean up more of the corruption
```{r}
head(survey.messy$TVhours, 40)
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", survey.messy$TVhours)
```
- As a last step, we should go through and figure out if any of the `NA` values should really be `0`.
- This step is not shown here.
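To see what the pattern `[^0-9.]` does and does not handle, it can help to run it on a few small cases. The entries below are made up for illustration and are not taken from the survey file.

```{r}
# Hypothetical entries, made up for illustration
tv.raw <- c("4 hours", "10+", "none", "2-3", "1.5hrs")
tv.clean <- gsub("[^0-9.]", "", tv.raw)  # keep only digits and '.'
tv.clean
as.numeric(tv.clean)  # the empty string "" converts to NA
```

Note that a range like `"2-3"` collapses to `"23"`, so it is worth checking the cleaned values for anything implausible.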
## One-line cleanup
- Let's clean up the `TVhours` column and cast it to numeric all in one command
```{r}
survey <- mutate(survey.messy,
                 TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
```
## Another common problem
- On Homework 2 you'll learn how to deal with another common problem
- When data is entered manually, misspellings and case changes are very common
- E.g., a column showing Program information may look like,
```{r}
program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA", "MSHCA", "MSPMM-DA", "PPM")
table(program)
```
##
```{r}
table(program)
```
- This vector has a lot of redundant unique values that we won't want to carry through our entire analysis
- E.g., hca and HCA, mism and MISM, ppm and PPM should certainly be combined. We might even want to combine PPM and PPM-DA together.
- On HW 2 you'll see a quick way to fix capitalization issues. For other forms of redundancy, you'll likely want to use a function like `recode()` introduced in Lecture 3.
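As a rough sketch of how such a cleanup might look, the chunk below standardizes case with `toupper()` and then merges the remaining variants with `recode()`. The mapping of `MSHCA` and `MSPMM-DA` onto other programs is an assumption made for illustration, not something the data confirms.

```{r}
library(dplyr)  # for recode()

program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA",
             "MSHCA", "MSPMM-DA", "PPM")
program.upper <- toupper(program)  # one way to standardize case
# The mapping below is an assumption about what each entry was meant to be
program.clean <- recode(program.upper,
                        "MSHCA" = "HCA",
                        "MSPMM-DA" = "PPM-DA")
table(program.clean)
```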
## When are factor variables useful?
- Factor variables are handy when it's important to have control over the ordering of the variable values.
- E.g., What happens when we plot everyone's prior programming experience?
```{r, fig.height = 3, fig.width = 5, fig.align='center'}
qplot(survey$PriorExp)
```
- The x-axis values appear in alphabetical order. Not always desirable.
- What if we wanted the values to appear in ascending order of experience?
## Factor variables
- We can mutate `PriorExp` into a factor with levels in a specified order using the `factor()` command, specifying the `levels` of the variable in the order we want them to appear
```{r}
survey <- survey %>%
  mutate(PriorExp = factor(PriorExp,
                           levels = c("Never programmed before",
                                      "Some experience",
                                      "Extensive experience")))
head(survey$PriorExp)
```
- Now `PriorExp` is a factor variable, with values ordered from "Never programmed before" to "Extensive experience"
## Reconstructing the plot
- Here's what we get if we run the exact same plotting command again
```{r, fig.height = 3, fig.width = 5, fig.align='center'}
qplot(survey$PriorExp)
```
- Better! This more clearly communicates the distribution of prior programming experience among survey respondents.
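The level order affects tables as well as plots. Here is a small sketch with made-up responses (`exp.chr` and `exp.fct` are hypothetical names):

```{r}
# Hypothetical responses, made up for illustration
exp.chr <- c("Some experience", "Never programmed before",
             "Extensive experience", "Some experience")
exp.fct <- factor(exp.chr,
                  levels = c("Never programmed before",
                             "Some experience",
                             "Extensive experience"))
table(exp.chr)   # alphabetical order
table(exp.fct)   # level order
# Values that don't match a level exactly become NA
factor("some experience", levels = levels(exp.fct))
```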
## Functions
- We have used a lot of built-in functions: `mean()`, `subset()`, `plot()`, `read.table()`...
- An important part of programming and data analysis is to write custom functions
- Functions help make code **modular**
- Functions make debugging easier
- Remember: this entire class is about applying *functions* to *data*
## What is a function?
> A function is a machine that turns **input objects** (arguments) into an **output object** (return value) according to a definite rule.
- Let's look at a really simple function
```{r}
addOne <- function(x) {
  x + 1
}
```
- `x` is the **argument** or **input**
- The function **output** is the input `x` incremented by 1
```{r}
addOne(12)
```
## More interesting example
- Here's a function that returns a % given a numerator, denominator, and desired number of decimal places
```{r}
calculatePercentage <- function(x, y, d) {
  decimal <- x / y          # Calculate decimal value
  round(100 * decimal, d)   # Convert to % and round to d digits
}
calculatePercentage(27, 80, 1)
```
- If you're calculating several %'s for your report, you should use this kind of function instead of repeatedly copying and pasting code
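Two features make a function like this more convenient to reuse: default argument values, and the fact that arithmetic in R works elementwise on vectors. The variant below is a sketch; the name `calculatePct` and the default `d = 1` are assumptions made for illustration.

```{r}
# Hypothetical variant with a default value for d
calculatePct <- function(x, y, d = 1) {
  decimal <- x / y
  round(100 * decimal, d)
}
calculatePct(1, 4)            # d defaults to 1
calculatePct(1, 3, d = 2)     # override the default
calculatePct(c(1, 2, 3), 4)   # works elementwise on a vector of numerators
```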
## Function returning a list
- Here's a function that takes a person's full name (FirstName LastName), weight in lb and height in inches, and returns a list with the person's first name, last name, weight in kg, height in m, and BMI.
```{r}
createPatientRecord <- function(full.name, weight, height) {
  name.vec <- strsplit(full.name, split=" ")[[1]]  # split "First Last" on the space
  first.name <- name.vec[1]
  last.name <- name.vec[2]
  weight.in.kg <- weight / 2.2                     # lb to kg (roughly 2.2 lb per kg)
  height.in.m <- height * 0.0254                   # inches to meters
  bmi <- weight.in.kg / (height.in.m ^ 2)          # BMI = kg / m^2
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
       bmi=bmi)
}
```
## Trying out the function
```{r}
createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
```
## Another example: 3 number summary
- Calculate mean, median and standard deviation
```{r}
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
```
## If-else statements
- Oftentimes we want our code to have different effects depending on the features of the input
- Example: Calculating a student's letter grade
    - If grade >= 90, assign A
    - Otherwise, if grade >= 80, assign B
    - Otherwise, if grade >= 70, assign C
    - In all other cases, assign F
- To code this up, we use if-else statements
## If-else Example: Letter grades
```{r}
calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}
course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
```
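`if()` expects a single `TRUE`/`FALSE` value, which is why the function above is applied one grade at a time with `map_chr()`. One vectorized alternative, shown here only as a sketch, is base R's `cut()`, which bins a whole vector at once using the same cutoffs:

```{r}
course.grades <- c(92, 78, 87, 91, 62)
# right = FALSE makes intervals closed on the left: [70, 80), [80, 90), ...
letter <- cut(course.grades,
              breaks = c(-Inf, 70, 80, 90, Inf),
              labels = c("F", "C", "B", "A"),
              right = FALSE)
as.character(letter)
```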
## `return()`
- In the previous examples we specified the output simply by writing the output variable as the last line of the function
- More explicitly, we can use the **`return()`** function
```{r}
addOne <- function(x) {
  return(x + 1)
}
addOne(12)
```
- We will generally avoid the `return()` function, but you can use it if necessary or if it makes writing a particular function easier.
- Google's style guide suggests explicit returns. Most do not.
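One situation where `return()` is genuinely handy is exiting a function early. The helper below is hypothetical (`safeSqrt` is a made-up name) and is meant only to illustrate the pattern:

```{r}
# Hypothetical helper: square root that returns NA for negative input
safeSqrt <- function(x) {
  if(x < 0) {
    return(NA_real_)  # exit early; the line below is never reached
  }
  sqrt(x)
}
safeSqrt(16)
safeSqrt(-4)
```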
## More programming basics: loops
- We'll now learn about loops and some more efficient/syntactically simple loop alternatives
- **Loops** are ways of iterating over data
## For loops: a pair of examples
```{r}
for(i in 1:4) {
  print(i)
}
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
```
## For loops: syntax
> A **for loop** executes a chunk of code for every value of an **index variable** in an **index set**
- The basic syntax takes the form
```{r, eval=FALSE}
for(index.variable in index.set) {
  # code to be repeated at every value of index.variable
}
```
- The index set is often a vector of integers, but can be more general
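When the index set is meant to be the positions of a vector, `seq_along()` is a common choice. The sketch below, using made-up vectors, also shows why it tends to be safer than `1:length(x)` when the vector might be empty:

```{r}
x <- c(10, 20, 30)
for(i in seq_along(x)) {
  print(x[i] * 2)
}

empty <- numeric(0)
1:length(empty)     # 1 0: a loop over this would run twice
seq_along(empty)    # integer(0): a loop over this runs zero times
```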
## Example
```{r}
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
```
## Example: Calculate sum of each column
```{r}
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
colSums(fake.data) # A better approach (see also colMeans())
```
## while loops
- **while loops** repeat a chunk of code while the specified condition remains true
```{r, eval=FALSE}
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
```
- We won't really be using while loops in this class
- Just be aware that they exist, and that they may become useful to you at some point in your analytics career
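While loops are most natural when the number of iterations isn't known in advance. As a made-up illustration (the balance and growth rate are hypothetical), here is how many years it takes a balance to double at 5% annual growth:

```{r}
# Hypothetical example: years for a balance to double at 5% annual growth
balance <- 100
years <- 0
while(balance < 200) {
  balance <- balance * 1.05
  years <- years + 1
}
years
```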
## Loop alternatives
Command | Description
--------|------------
`apply(X, MARGIN, FUN)` | Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`map(.x, .f, ...)` | Obtain a *list* by applying `.f` to every element of a list or atomic vector `.x`
`map_<type>(.x, .f, ...)` | For `<type>` given by `lgl` (logical), `int` (integer), `dbl` (double) or `chr` (character), return a *vector* of this type obtained by applying `.f` to each element of `.x`
`map_at(.x, .at, .f)` | Obtain a *list* by applying `.f` to the elements of `.x` specified by name or index given in `.at`
`map_if(.x, .p, .f)` | Obtain a *list* by applying `.f` to the elements of `.x` selected by `.p` (a predicate function, or a logical vector)
`mutate_all/_at/_if` | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
`summarize_all/_at/_if` | Summarize all variables, specified variables, or those selected by a predicate (if)
- These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
- The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
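To make the connection to loops concrete, here is a sketch of the same computation written first as a for loop and then with `map_dbl()`. The list `values` is made up for illustration.

```{r}
library(purrr)

values <- list(a = 1:3, b = c(2, 4, 6, 8), c = 10)  # made-up data

# For loop: preallocate the output, then fill it in
means.loop <- numeric(length(values))
for(i in seq_along(values)) {
  means.loop[i] <- mean(values[[i]])
}
means.loop

# Same computation with map_dbl(), which also keeps the names
map_dbl(values, mean)
```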
## Example: apply()
```{r}
colMeans(fake.data)
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion of vector elements that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
```
## Example: `map()`, `map_<type>()`
```{r}
map(survey, is.numeric) # Returns a list
map_lgl(survey, is.numeric) # Returns a logical vector with named elements
```
## Example: `apply()`, `map()`, `map_<type>()`
```{r}
apply(cars, 2, FUN=mean) # apply() first converts the data frame to a matrix
map(cars, mean) # Data frames are lists of columns
map_dbl(cars, mean) # map output as a double vector
```
## Example: mutate_if
Let's convert all factor variables in Cars93 to lowercase
```{r}
head(Cars93$Type)
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
```
- Note: this produces a copy of the `Cars93` data in which every factor variable has been replaced with a lowercase version. Because `tolower()` returns a character vector, the replaced columns are character rather than factor.
## Example: mutate_if, adding instead of replacing columns
If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables
```{r}
Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
head(Cars93.lower$Type_lower)
```
## Example: mutate_at
Let's convert from MPG to KMPL, but this time using `mutate_at`
```{r}
Cars93.metric <- Cars93 %>%
  mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
```
Here, `~ 0.425 * .x` is an example of specifying a "lambda" (anonymous) function. It is shorthand for
```{r, eval = FALSE}
function(.x){0.425 * .x}
```
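A quick way to convince yourself that the two forms are interchangeable is to pass each to `map_dbl()` and compare the results. This is a sketch with made-up fuel economy values:

```{r}
library(purrr)

mpg <- c(10, 20, 40)  # made-up values
kmpl.formula  <- map_dbl(mpg, ~ 0.425 * .x)
kmpl.function <- map_dbl(mpg, function(.x) { 0.425 * .x })
identical(kmpl.formula, kmpl.function)
```

The `~` shorthand is understood by purrr and dplyr functions; base R functions such as `apply()` need an ordinary `function(...)`.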
## Example: summarize_if
Let's get the mean of every numeric column in Cars93
```{r}
Cars93 %>% summarize_if(is.numeric, mean)
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
```
## Example: summarize_at
Let's get the average fuel economy of all vehicles, grouped by their Type
```{r}
Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
```
## Another approach
We'll learn about a bunch of select helper functions like `contains()` and `starts_with()`.
Here's one way of performing the previous operation with the help of these functions, and appending `_mean` to the resulting output.
```{r}
Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
```
## More than one grouping variable
```{r}
Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
```
## R coding style
Let's return to the [last few slides of lecture 2](http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture02/lecture02-94842.html#33)
## Assignments
- **Homework 2** will be posted today
    - **Due: Wednesday, November 11, 1:30pm ET**
    - Submit your .Rmd and .html files on Canvas
- **Lab 4** is available on Canvas and the course website
    - You have until Friday evening to complete it
- Friday's lab session will go over this week's material and help you complete the labs