---
title: 'Lecture 4:
Basic cleaning, loops, and alternatives'
author: "Prof. Alexandra Chouldechova"
date: "Fall 2019"
output:
ioslides_presentation:
highlight: tango
widescreen: true
smaller: true
---
## Agenda
- A common data cleaning task
- For/while loops to iterate over data
- Helpful variants of `map`, `mutate` and `summarize`
## Package loading
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
Cars93 <- MASS::Cars93 # For Cars93 data again
```
## A common problem
- One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns
- For a simple example, let's look at `TVhours` column in a messy version of the survey data from Lecture 2
```{r}
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv",
header=TRUE)
survey.messy$TVhours
```
## What's happening?
```{r}
str(survey.messy)
```
- Several of the entries have non-numeric values in them (they contain strings)
- As a result, `TVhours` is being imported as `factor`
## A look at the TVhours column
```{r}
survey.messy$TVhours
```
## Attempt at a fix
- What if we just try to cast it back to numeric?
```{r}
tv.hours.messy <- survey.messy$TVhours
tv.hours.messy
as.numeric(tv.hours.messy)
```
## That didn't work...
```{r, eval = FALSE}
tv.hours.messy
as.numeric(tv.hours.messy)
```
```{r, echo = FALSE}
# Show this output, but not this code
head(tv.hours.messy, 40)
head(as.numeric(tv.hours.messy), 40)
```
- This just converted all the values into the integer-coded levels of the factor
- Not what we wanted!
## Something that does work
- Consider the following simple example
```{r}
num.vec <- c(3.1, 2.5)
as.factor(num.vec)
as.numeric(as.factor(num.vec))
as.numeric(as.character(as.factor(num.vec)))
```
> If we take a number that's being coded as a factor and *first* turn it into a `character` string, *then* converting the string to a numeric gets back the number
## Back to the corrupted TVhours column
```{r}
as.character(tv.hours.messy)
as.numeric(as.character(tv.hours.messy))
typeof(as.numeric(as.character(tv.hours.messy))) # Success!! (Almost...)
```
## A small improvement
- All the corrupted cells now appear as `NA`, which is R's missing indicator
- We can do a little better by cleaning up the vector once we get it to `character` form
```{r}
tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings
```
## Deleting non-numeric (or .) characters
```{r}
tv.hours.strings
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)
```
## The final product
```{r}
tv.hours.messy[1:30]
tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean
```
- As a last step, we should go through and figure out if any of the `NA` values should really be `0`.
- This step is not shown here.
## Rebuilding our data
```{r}
survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)
```
- **Success!**
## A different approach
- We can also handle this problem by setting `stringsAsFactors = FALSE` when importing our data.
```{r}
survey.meayssy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv",
header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
```
- Now everything is a `character` instead of a `factor`
## One-line cleanup
- Let's clean up the `TVhours` column and cast it to numeric all in one command
```{r}
survey <- mutate(survey.messy,
TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
```
## What about all those other `character` variables?
```{r}
table(survey[["Program"]])
table(as.factor(survey[["Program"]]))
```
- Having factors coded as characters may be OK for many parts of our analysis
## If we wanted to, here's how we could fix things
`mutate_if(.tbl, .predicate, .funs)` applies a functions `.funs` to the elements of `.tbl` for which the predicate (condition) `.predicate` holds.
Here is how we can use `mutate_if` to convert all character columns to factors.
```{r}
survey <- survey %>% mutate_if(is.character, as.factor)
# Outcome:
str(survey)
```
- **Success!**
## Another common problem
- On Homework 2 you'll learn how to wrangle with another common problem
- When data is entered manually, misspellings and case changes are very common
- E.g., a column showing treatment information may look like,
```{r}
treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)
```
##
```{r}
summary(treatment)
```
- This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
- We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
- There's a faster way to fix just the capitalization issue (this is an exercise on Homework 2)
## What are all these `map<*>` functions?
- These are all efficient ways of applying a function to margins of an array or elements of a list
- Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops
- **loops** are ways of iterating over data
- The `map<*>` functions and their `<*>apply` base-R ancestors can be thought of as good *alternatives* to loops
## For loops: a pair of examples
```{r}
for(i in 1:4) {
print(i)
}
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
phrase <- paste(phrase, word)
print(phrase)
}
```
## For loops: syntax
> A **for loop** executes a chunk of code for every value of an **index variable** in an **index set**
- The basic syntax takes the form
```{r, eval=FALSE}
for(index.variable in index.set) {
code to be repeated at every value of index.variable
}
```
- The index set is often a vector of integers, but can be more general
## Example
```{r}
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
print(c(i, typeof(i)))
}
```
## Example: Calculate sum of each column
```{r}
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
colSums(fake.data) # A better approach (see also colMeans())
```
## while loops
- **while loops** repeat a chunk of code while the specified condition remains true
```{r, eval=FALSE}
day <- 1
num.days <- 365
while(day <= num.days) {
day <- day + 1
}
```
- We won't really be using while loops in this class
- Just be aware that they exist, and that they may become useful to you at some point in your analytics career
## Loop alternatives
Command | Description
--------|------------
`apply(X, MARGIN, FUN)` | Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`map(.x, .f, ...)` | Obtain a *list* by applying `.f` to every element of a list or atomic vector `.x`
`map_(.x, .f, ...)` | For `` given by `lgl` (logical), `int` (integer), `dbl` (double) or `chr` (character), return a *vector* of this type obtained by applying `.f` to each element of `.x`
`map_at(.x, .at, .f)` | Obtain a *list* by applying `.f` to the elements of `.x` specified by name or index given in `.at`
`map_if(.x, .p, .f)` | Obtain a *list* `.f` to the elements of `.x` specified by `.p` (a predicate function, or a logical vector)
`mutate_all/_at/_if` | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
`summarize_all/_at/_if` | Summarize all variables, specified variables, or those selected by a predicate (if)
- These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
- The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
## Example: apply()
```{r}
colMeans(fake.data)
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
```
## Example: map, map_()
```{r}
map(survey, is.factor) # Returns a list
map_lgl(survey, is.factor) # Returns a logical vector with named elements
```
## Example: apply(), map(), map_()
```{r}
apply(cars, 2, FUN=mean) # Data frames are arrays
map(cars, mean) # Data frames are also lists
map_dbl(cars, mean) # map output as a double vector
```
## Example: mutate_if
Let's convert all factor variables in Cars93 to lowercase
```{r}
head(Cars93$Type)
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
```
Note: this operation is performed "in place", replacing prior versions of the variables
## Example: mutate_if, adding instead of replacing columns
If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables
```{r}
Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
head(Cars93.lower$Type_lower)
```
## Example: mutate_at
Let's convert from MPG to KPML but this time using `mutate_at`
```{r}
Cars93.metric <- Cars93 %>%
mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
```
Here, `~ 0.425 * .x` is an example of specifying a "lambda" (anonymous) function. It is permitted short-hand for
```{r, eval = FALSE}
function(.x){0.425 * .x}
```
## Example: summarize_if
Let's get the mean of every numeric column in Cars93
```{r}
Cars93 %>% summarize_if(is.numeric, mean)
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
```
## Example: summarize_at
Let's get the average fuel economy of all vehicles, grouped by their Type
We'll use the helper function `contains()` to indicate that we want all variables that contain MPG.
```{r}
Cars93 %>%
group_by(Type) %>%
summarize_at(c("MPG.city", "MPG.highway"), mean)
```
## Another approach
We'll learn about a bunch of helper functions like `contains()` and `starts_with()`. Here's one way of performing the previous operation with the help of these functions, and appending `_mean` to the resulting output.
```{r}
Cars93 %>%
group_by(Type) %>%
summarize_at(vars(contains("MPG")), list(mean = mean))
```
## More than one grouping variable
```{r}
Cars93 %>%
group_by(Origin, AirBags) %>%
summarize_at(vars(contains("MPG")), list(mean = mean))
```
## Assignments
- **Homework 2** will be posted soon
- **Due: Thursday, November 7, 2:50pm**
- Submit your .Rmd and .html files on Canvas
- **Lab 4** is available on Canvas and the course website
- You have until Friday evening to complete it
- Tomorrow's lab session will give you more time to work on this week's labs