---
title: 'Lecture 4: Basic cleaning, loops, and alternatives'
author: "Prof. Alexandra Chouldechova"
date: "Fall 2020"
output:
  ioslides_presentation:
    highlight: tango
    widescreen: true
    smaller: true
---

## Agenda

- Lists
- A common data cleaning task
- Factor variables, and when they're useful
- Functions
- If-else statements
- For/while loops to iterate over data
- R coding style
- Rather than picking up where Lecture 3 left off, I've woven the Lecture 3 content we haven't yet covered into the Lecture 4 notes

## Package loading

```{r, message=FALSE, warning=FALSE}
library(tidyverse)
Cars93 <- MASS::Cars93 # For Cars93 data again
```

## Basics of lists

> A list is a **data structure** that can be used to store **different kinds** of data

- Recall: a vector is a data structure for storing *similar kinds of data*

- To better understand the difference, consider the following example.

```{r}
my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1
typeof(my.vector.1)  # All the elements are now character strings!
```

## Lists vs. vectors

```{r}
my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
my.vector.2
typeof(my.vector.2)
```

- Vectors expect elements to be all of the same type (e.g., `logical`, `numeric`, `character`)

- When data of different types are put into a vector, R converts everything to a common type

## Lists

- To store data of different types in the same object, we use lists

- Simple way to construct lists: use the **`list()`** function

- (We'll learn about functions like `map` and `map_chr` soon.)

```{r}
my.list <- list("Michael", 165, TRUE)
my.list
map_chr(my.list, typeof)
```

## Named elements

```{r}
patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
```

## Referencing elements of a list (similar to data frames)

```{r}
patient.1$name # Get "name" element (returns a string)
patient.1[["name"]] # Get "name" element (returns a string)
patient.1["name"] # Get "name" slice (returns a list)
c(typeof(patient.1$name), typeof(patient.1["name"]))
```

## A common problem

- One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns

- For a simple example, let's look at the `TVhours` column in a messy version of the survey data from Lecture 2

```{r}
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020_messy.csv",
                         header=TRUE, stringsAsFactors = FALSE)
# Print out first 20 elements
head(survey.messy$TVhours, 20)
```

- **NOTE**: If you've installed R within the past few months (version 4.0.0 or later), your version will automatically default to `stringsAsFactors = FALSE`. My version of R is older and still has the old `stringsAsFactors = TRUE` default, a convention that dates back to 1998.

- For a thrilling read, take a look at [this blog post](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/) by the R development team

## What's happening?

```{r}
str(survey.messy)
```

- Several of the entries have non-numeric values in them (they contain strings)

- As a result, `TVhours` is being imported as a `character` vector

## A look at the TVhours column

```{r}
survey.messy$TVhours
```

## Partial fix

- In Lecture 1 we saw that there exists a family of `as.`*type* functions that will try to convert objects from one data type to the specified *type*

- We want `TVhours` to be numeric, so let's try `as.numeric`

```{r}
as.numeric(survey.messy$TVhours)
```

## We can do a bit better

```{r}
as.numeric(survey.messy$TVhours)
```

- All the corrupted cells now appear as `NA`, which is R's missing indicator

- We can do a little better by looking at the corrupted entries and seeing if we can recover more information from the cells that contained non-numeric values

## Deleting non-numeric (or .) characters

- Here we'll use the **`gsub()`** function (global substitution) to clean up more of the corruption

```{r}
head(survey.messy$TVhours, 40)
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", survey.messy$TVhours)
```

- As a last step, we should go through and figure out if any of the `NA` values should really be `0`.
    - This step is not shown here.

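
To see what the `gsub()` step does on its own, here is a minimal sketch on a small made-up vector. The entries in `tv.raw` are invented for illustration and are not taken from the survey file, so the real column may contain other kinds of corruption.

```{r}
# Hypothetical messy entries, invented for illustration (not from the survey file)
tv.raw <- c("3", "2.5 hours", "~4", "none", "10+")

suppressWarnings(as.numeric(tv.raw))     # entries containing text are coerced to NA
tv.clean <- gsub("[^0-9.]", "", tv.raw)  # delete everything except digits and '.'
tv.clean
as.numeric(tv.clean)                     # "none" has become "" and is still NA
```

An entry like `"none"` is the kind of `NA` that probably should be recorded as `0`, which is why the final manual check is still needed.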
## One-line cleanup

- Let's clean up the `TVhours` column and cast it to numeric all in one command

```{r}
survey <- mutate(survey.messy, TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
```

## Another common problem

- On Homework 2 you'll learn how to deal with another common problem
    - When data is entered manually, misspellings and case changes are very common
    - E.g., a column showing Program information may look like,

```{r}
program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA", "MSHCA", "MSPMM-DA", "PPM")
table(program)
```

##

```{r}
table(program)
```

- This vector has a lot of redundant unique values that we won't want to carry through our entire analysis

- E.g., hca and HCA, mism and MISM, ppm and PPM should certainly be combined. We might even want to combine PPM and PPM-DA together.

- On HW 2 you'll see a quick way to fix capitalization issues. For other forms of redundancy, you'll likely want to use a function like `recode()` introduced in Lecture 3.

## When are factor variables useful?

- Factor variables are handy when it's important to have control over the ordering of the variable values.

- E.g., What happens when we plot everyone's prior programming experience?

```{r, fig.height = 3, fig.width = 5, fig.align='center'}
qplot(survey$PriorExp)
```

- The x-axis values appear in alphabetical order. Not always desirable.

- What if we wanted the values to appear in ascending order of experience?

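
The ordering issue can also be seen without a plot. The sketch below uses a toy vector, `exp.raw`, that is invented for illustration (it is not drawn from the survey data) and compares how `table()` orders the categories before and after the levels are set explicitly.

```{r}
# Toy responses, invented for illustration (not from the survey data)
exp.raw <- c("Some experience", "Never programmed before",
             "Extensive experience", "Some experience")

table(exp.raw)   # character vector: categories are tabulated in alphabetical order

exp.fct <- factor(exp.raw,
                  levels = c("Never programmed before", "Some experience",
                             "Extensive experience"))
levels(exp.fct)  # levels follow the order we specified
table(exp.fct)   # counts now appear in ascending order of experience
```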
## Factor variables

- We can mutate `PriorExp` into a factor using the `factor()` command, specifying the `levels` of the variable in the order we want them to appear

```{r}
survey <- survey %>%
  mutate(PriorExp = factor(PriorExp,
                           levels = c("Never programmed before", "Some experience", "Extensive experience")))
head(survey$PriorExp)
```

- Now `PriorExp` is a factor variable, with values ordered from "Never programmed before" to "Extensive experience"

## Reconstructing the plot

- Here's what we get if we run the exact same plotting command again

```{r, fig.height = 3, fig.width = 5, fig.align='center'}
qplot(survey$PriorExp)
```

- Better! This more clearly communicates the distribution of prior programming experience among survey respondents.

## Functions

- We have used a lot of built-in functions: `mean()`, `subset()`, `plot()`, `read.table()`...

- An important part of programming and data analysis is to write custom functions

- Functions help make code **modular**

- Functions make debugging easier

- Remember: this entire class is about applying *functions* to *data*

## What is a function?

> A function is a machine that turns **input objects** (arguments) into an **output object** (return value) according to a definite rule.

- Let's look at a really simple function

```{r}
addOne <- function(x) {
  x + 1
}
```

- `x` is the **argument** or **input**

- The function **output** is the input `x` incremented by 1

```{r}
addOne(12)
```

## More interesting example

- Here's a function that returns a % given a numerator, denominator, and desired number of decimal places

```{r}
calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)
```

- If you're calculating several %'s for your report, you should use this kind of function instead of repeatedly copying and pasting code

## Function returning a list

- Here's a function that takes a person's full name (FirstName LastName), weight in lb and height in inches and converts it into a list with the person's first name, last name, weight in kg, height in m, and BMI.

```{r}
createPatientRecord <- function(full.name, weight, height) {
  name.vec <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.vec[1]
  last.name <- name.vec[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg,
       height=height.in.m, bmi=bmi)
}
```

## Trying out the function

```{r}
createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
```

## Another example: 3 number summary

- Calculate mean, median and standard deviation

```{r}
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
```

## If-else statements

- Oftentimes we want our code to have different effects depending on the features of the input

- Example: Calculating a student's letter grade
    - If grade >= 90, assign A
    - Otherwise, if grade >= 80, assign B
    - Otherwise, if grade >= 70, assign C
    - In all other cases, assign F

- To code this up, we use
if-else statements

## If-else Example: Letter grades

```{r}
calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
```

## `return()`

- In the previous examples we specified the output simply by writing the output variable as the last line of the function

- More explicitly, we can use the **`return()`** function

```{r}
addOne <- function(x) {
  return(x + 1)
}

addOne(12)
```

- We will generally avoid the `return()` function, but you can use it if necessary or if it makes writing a particular function easier.
    - Google's style guide suggests explicit returns. Most do not.

## More programming basics: loops

- We'll now learn about loops and some more efficient/syntactically simple loop alternatives

- **loops** are ways of iterating over data

## For loops: a pair of examples

```{r}
for(i in 1:4) {
  print(i)
}

phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
```

## For loops: syntax

> A **for loop** executes a chunk of code for every value of an **index variable** in an **index set**

- The basic syntax takes the form

```{r, eval=FALSE}
for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}
```

- The index set is often a vector of integers, but can be more general

## Example

```{r}
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
```

## Example: Calculate sum of each column

```{r}
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows

col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums

colSums(fake.data) # A better approach (see also colMeans())
```

## while loops

- **while loops** repeat a chunk of code while the specified condition remains true

```{r, eval=FALSE}
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
```

- We won't really be using while loops in this class

- Just be aware that they exist, and that they may become useful to you at some point in your analytics career

## Loop alternatives

Command | Description
--------|------------
`apply(X, MARGIN, FUN)` | Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`map(.x, .f, ...)` | Obtain a *list* by applying `.f` to every element of a list or atomic vector `.x`
`map_<type>(.x, .f, ...)` | For `<type>` given by `lgl` (logical), `int` (integer), `dbl` (double) or `chr` (character), return a *vector* of this type obtained by applying `.f` to each element of `.x`
`map_at(.x, .at, .f)` | Obtain a *list* by applying `.f` to the elements of `.x` specified by name or index given in `.at`
`map_if(.x, .p, .f)` | Obtain a *list* by applying `.f` to the elements of `.x` specified by `.p` (a predicate function, or a logical vector)
`mutate_all/_at/_if` | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
`summarize_all/_at/_if` | Summarize all variables, specified (at) variables, or those selected by a predicate (if)

- These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

- The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.

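
As a quick illustration of the `map` rows in the table, here is a minimal sketch on a small list. The `record` list is hypothetical and invented for this example; the calls themselves are standard `purrr` functions.

```{r}
library(purrr)  # part of the tidyverse, which these notes load at the start

# Hypothetical record, invented for illustration
record <- list(name = "Michael", weight = 185, height = 73)

map(record, class)                             # a list with one entry per element
map_chr(record, class)                         # same information as a named character vector
map_if(record, is.numeric, function(v) v * 2)  # doubles only the numeric elements
map_at(record, "name", toupper)                # changes only the element named "name"
```

In both `map_if` and `map_at`, the elements that are not selected are returned unchanged, so the output list always has the same length and names as the input.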
## Example: apply()

```{r}
colMeans(fake.data)
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion of vector elements that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
```

## Example: map, map_<type>()

```{r}
map(survey, is.numeric) # Returns a list
map_lgl(survey, is.numeric) # Returns a logical vector with named elements
```

## Example: apply(), map(), map_<type>()

```{r}
apply(cars, 2, FUN=mean) # apply() coerces the data frame to a matrix
map(cars, mean) # Data frames are also lists
map_dbl(cars, mean) # map output as a double vector
```

## Example: mutate_if

Let's convert all factor variables in Cars93 to lowercase

```{r}
head(Cars93$Type)
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
```

- Note: this has the effect of producing a copy of the `Cars93` data where all of the factor variables have been replaced with versions containing lowercase values

## Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

```{r}
Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
head(Cars93.lower$Type_lower)
```

## Example: mutate_at

Let's convert from MPG to KMPL but this time using `mutate_at`

```{r}
Cars93.metric <- Cars93 %>%
  mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
```

Here, `~ 0.425 * .x` is an example of specifying a "lambda" (anonymous) function.
It is a permitted shorthand for

```{r, eval = FALSE}
function(.x){0.425 * .x}
```

## Example: summarize_if

Let's get the mean of every numeric column in Cars93

```{r}
Cars93 %>% summarize_if(is.numeric, mean)
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
```

## Example: summarize_at

Let's get the average fuel economy of all vehicles, grouped by their Type

```{r}
Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
```

## Another approach

We'll learn about a bunch of select helper functions like `contains()` and `starts_with()`. Here's one way of performing the previous operation with the help of these functions, and appending `_mean` to the resulting output.

```{r}
Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
```

## More than one grouping variable

```{r}
Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
```

## R coding style

Let's return to the [last few slides of lecture 2](http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture02/lecture02-94842.html#33)

## Assignments

- **Homework 2** will be posted today
    - **Due: Wednesday, November 11, 1:30pm ET**
    - Submit your .Rmd and .html files on Canvas

- **Lab 4** is available on Canvas and the course website
    - You have until Friday evening to complete it
    - Friday's lab session will go over this week's material and help you complete the labs