Fall 2020

## Agenda

• Lists
• A common data cleaning task
• Factor variables, and when they’re useful
• Functions
• If-else statements
• For/while loops to iterate over data
• R coding style

• Rather than picking up where Lecture 3 left off I’ve woven the Lecture 3 content we haven’t yet covered into the Lecture 4 notes

library(tidyverse)
Cars93 <- MASS::Cars93  # For Cars93 data again

## Basics of lists

A list is a data structure that can be used to store different kinds of data

• Recall: a vector is a data structure for storing similar kinds of data

• To better understand the difference, consider the following example.

my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1 
## [1] "Michael" "165"     "TRUE"
typeof(my.vector.1)  # All the elements are now character strings!
## [1] "character"

## Lists vs. vectors

my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
my.vector.2
## [1]  0  1 27
typeof(my.vector.2)
## [1] "double"
• Vectors expect elements to be all of the same type (e.g., Boolean, numeric, character)

• When data of different types are put into a vector, R converts everything to a common type
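
As a rough illustration of that conversion rule (the values here are made up), R settles on the most flexible type present, in the order logical, integer, double, character:

```r
# Each vector is coerced to the most flexible type among its elements
mixed.1 <- c(TRUE, 2L)            # logical + integer
mixed.2 <- c(TRUE, 2L, 3.5)       # logical + integer + double
mixed.3 <- c(TRUE, 2L, 3.5, "a")  # logical + integer + double + character

types <- c(typeof(mixed.1), typeof(mixed.2), typeof(mixed.3))
print(types)
```

The three results are "integer", "double" and "character", matching the two vector examples above.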

## Lists

• To store data of different types in the same object, we use lists

• Simple way to construct lists: use list() function

• (We’ll learn about functions like map and map_chr soon.)

my.list <- list("Michael", 165, TRUE)
my.list
## [[1]]
## [1] "Michael"
##
## [[2]]
## [1] 165
##
## [[3]]
## [1] TRUE
map_chr(my.list, typeof)
## [1] "character" "double"    "logical"

## Named elements

patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
## $name
## [1] "Michael"
##
## $weight
## [1] 165
##
## $is.male
## [1] TRUE

## Referencing elements of a list (similar to data frames)

patient.1$name # Get "name" element (returns a string)
## [1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
## [1] "Michael"
patient.1["name"] # Get "name" slice (returns a list)
## $name
## [1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
## [1] "character" "list"
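
The same distinction holds when indexing by position. A small sketch, using a fresh list called patient and a made-up height value (both are assumptions for illustration):

```r
patient <- list(name = "Michael", weight = 165, is.male = TRUE)

second.element <- patient[[2]]  # the stored value itself (a number)
second.slice <- patient[2]      # a list of length 1 that keeps the element's name

# Assigning to a name that does not exist yet appends a new element
patient$height <- 70            # 70 is an illustrative value
print(names(patient))
```

Double brackets pull out the contents; single brackets always hand back a (smaller) list.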

## A common problem

• One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns

• For a simple example, let’s look at the TVhours column in a messy version of the survey data from Lecture 2

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020_messy.csv")
# Print out first 20 elements
head(survey.messy$TVhours, 20)
##  [1] "10.5"  "3"     "0"     "10"    "~4"    "0"     "2"     "20ish"
##  [9] "4"     "0"     "15"    "5"     ">20"   "10"    "5"     "2"
## [17] "14"    "10"    "4"     "3"

• NOTE: If you’ve installed R within the past few months (version 4.0.0 or later), your version will automatically default to stringsAsFactors = FALSE. My version of R is older and still has the old stringsAsFactors = TRUE default, a convention that dates back to 1998.

• For a thrilling read, take a look at this blog post by the R development team

## What’s happening?

str(survey.messy)
## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : chr  "PPM" "Other" "MISM" "PPM" ...
##  $ PriorExp       : chr  "Some experience" "Extensive experience" "Never programmed before" "Never programmed before" ...
##  $ Rexperience    : chr  "Never used" "Basic competence" "Basic competence" "Never used" ...
##  $ OperatingSystem: chr  "Windows" "Mac OS X" "Windows" "Windows" ...
##  $ TVhours        : chr  "10.5" "3" "0" "10" ...
##  $ Editor         : chr  "Other" "Microsoft Word" "Microsoft Word" "Excel" ...

• Several of the entries have non-numeric values in them (they contain text such as "~4" or "20ish")

• As a result, TVhours is being imported as a character vector

## A look at the TVhours column

survey.messy$TVhours
##  [1] "10.5"      "3"         "0"         "10"        "~4"
##  [6] "0"         "2"         "20ish"     "4"         "0"
## [11] "15"        "5"         ">20"       "10"        "5"
## [16] "2"         "14"        "10"        "4"         "3"
## [21] "6"         ">10"       "2"         "3"         "3"
## [26] "1"         "<1"        "3"         "5"         "10"
## [31] "20"        "none"      "9"         "3"         "4"
## [36] "8"         "7"         "8"         "10"        "10"
## [41] "4"         "10"        "4"         "0"         "1"
## [46] "7"         "2"         "15"        "8"         "10"
## [51] "2"         "3"         "4"         "21"        "10"
## [56] "approx 20" "0"

## Partial fix

• In Lecture 1 we saw that there exists a family of as.type functions that will try to convert objects from one data type to the specified type

• We want TVhours to be numeric, so let’s try as.numeric

as.numeric(survey.messy$TVhours)
## Warning: NAs introduced by coercion
##  [1] 10.5  3.0  0.0 10.0   NA  0.0  2.0   NA  4.0  0.0 15.0  5.0   NA 10.0
## [15]  5.0  2.0 14.0 10.0  4.0  3.0  6.0   NA  2.0  3.0  3.0  1.0   NA  3.0
## [29]  5.0 10.0 20.0   NA  9.0  3.0  4.0  8.0  7.0  8.0 10.0 10.0  4.0 10.0
## [43]  4.0  0.0  1.0  7.0  2.0 15.0  8.0 10.0  2.0  3.0  4.0 21.0 10.0   NA
## [57]  0.0

## We can do a bit better

as.numeric(survey.messy$TVhours)
## Warning: NAs introduced by coercion
##  [1] 10.5  3.0  0.0 10.0   NA  0.0  2.0   NA  4.0  0.0 15.0  5.0   NA 10.0
## [15]  5.0  2.0 14.0 10.0  4.0  3.0  6.0   NA  2.0  3.0  3.0  1.0   NA  3.0
## [29]  5.0 10.0 20.0   NA  9.0  3.0  4.0  8.0  7.0  8.0 10.0 10.0  4.0 10.0
## [43]  4.0  0.0  1.0  7.0  2.0 15.0  8.0 10.0  2.0  3.0  4.0 21.0 10.0   NA
## [57]  0.0
• All the corrupted cells now appear as NA, which is R’s missing indicator

• We can do a little better by looking at the corrupted entries and seeing if we can recover more information from the cells that contained non-numeric values
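
One way to look at the corrupted entries is to pull out exactly those values that as.numeric() could not convert. This is only a sketch: tv.hours is a hypothetical stand-in holding a handful of the values from the TVhours column.

```r
# A few of the raw entries from the TVhours column (illustrative subset)
tv.hours <- c("10.5", "3", "~4", "20ish", ">20", "none", "approx 20")

# Convert quietly, then use the resulting NAs to locate the problem entries
converted <- suppressWarnings(as.numeric(tv.hours))
corrupted <- tv.hours[is.na(converted)]
print(corrupted)
```

Seeing the offending strings side by side makes it easier to decide which ones still carry usable numbers.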

## Deleting non-numeric (or .) characters

• Here we’ll use the gsub() function (global substitution) to clean up more of the corruption
head(survey.messy$TVhours, 40)
##  [1] "10.5"  "3"     "0"     "10"    "~4"    "0"     "2"     "20ish"
##  [9] "4"     "0"     "15"    "5"     ">20"   "10"    "5"     "2"
## [17] "14"    "10"    "4"     "3"     "6"     ">10"   "2"     "3"
## [25] "3"     "1"     "<1"    "3"     "5"     "10"    "20"    "none"
## [33] "9"     "3"     "4"     "8"     "7"     "8"     "10"    "10"
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", survey.messy$TVhours)
##  [1] "10.5" "3"    "0"    "10"   "4"    "0"    "2"    "20"   "4"    "0"
## [11] "15"   "5"    "20"   "10"   "5"    "2"    "14"   "10"   "4"    "3"
## [21] "6"    "10"   "2"    "3"    "3"    "1"    "1"    "3"    "5"    "10"
## [31] "20"   ""     "9"    "3"    "4"    "8"    "7"    "8"    "10"   "10"
## [41] "4"    "10"   "4"    "0"    "1"    "7"    "2"    "15"   "8"    "10"
## [51] "2"    "3"    "4"    "21"   "10"   "20"   "0"
• As a last step, we should go through and figure out if any of the NA values should really be 0.
• This step is not shown here.
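
For what it’s worth, here is one possible version of that last step. It assumes that an entry of "none" means 0 hours, which is a judgment call rather than something the data tells us; tv.hours is again a hypothetical subset of the column.

```r
tv.hours <- c("10.5", "~4", "none", "<1", "approx 20")

# Strip everything except digits and '.', as before
cleaned <- gsub("[^0-9.]", "", tv.hours)

# Assumption: a response of "none" should count as 0 rather than missing
cleaned[tolower(trimws(tv.hours)) == "none"] <- "0"

tv.numeric <- as.numeric(cleaned)
print(tv.numeric)
```

Note that "<1" ends up as 1 under this approach, so the cleaned values are approximations of what respondents meant.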

## One-line cleanup

• Let’s clean up the TVhours column and cast it to numeric all in one command
survey <- mutate(survey.messy,
TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : chr  "PPM" "Other" "MISM" "PPM" ...
##  $ PriorExp       : chr  "Some experience" "Extensive experience" "Never programmed before" "Never programmed before" ...
##  $ Rexperience    : chr  "Never used" "Basic competence" "Basic competence" "Never used" ...
##  $ OperatingSystem: chr  "Windows" "Mac OS X" "Windows" "Windows" ...
##  $ TVhours        : num  10.5 3 0 10 4 0 2 20 4 0 ...
##  $ Editor         : chr  "Other" "Microsoft Word" "Microsoft Word" "Excel" ...

## Another common problem

• On Homework 2 you’ll learn how to wrangle with another common problem

• When data is entered manually, misspellings and case changes are very common

• E.g., a column showing Program information may look like,

program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA", "MSHCA", "MSPMM-DA", "PPM")

table(program)
## program
##      hca      HCA     mism     MISM    MSHCA MSPMM-DA      ppm      PPM
##        1        1        1        1        1        1        1        2
##   PPM-DA
##        2
• This vector has a lot of redundant unique values that we won’t want to carry through our entire analysis

• E.g., hca and HCA, mism and MISM, ppm and PPM should certainly be combined. We might even want to combine PPM and PPM-DA together.

• On HW 2 you’ll see a quick way to fix capitalization issues. For other forms of redundancy, you’ll likely want to use a function like recode() introduced in Lecture 3.
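
As a small preview, and not necessarily the approach used on the homework, base R’s toupper() already collapses the case-only duplicates in the program vector from above:

```r
program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA",
             "MSHCA", "MSPMM-DA", "PPM")

# Put every value in upper case so that "ppm" and "PPM" count as the same program
program.upper <- toupper(program)
program.counts <- table(program.upper)
print(program.counts)
```

This takes us from 9 distinct values down to 6. The remaining variants (e.g., MSHCA vs. HCA) differ by more than case, so they still need to be recoded by hand.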

## When are factor variables useful?

• Factor variables are handy when it’s important to have control over the ordering of the variable values.

• E.g., What happens when we plot everyone’s prior programming experience?

qplot(survey$PriorExp)

• The x-axis values appear in alphabetical order. Not always desirable.

• What if we wanted the values to appear in ascending order of experience?

## Factor variables

• We can mutate PriorExp into a factor with levels in a specified order using the factor() command, specifying the levels of the variable in the order we want them to appear

survey <- survey %>%
  mutate(PriorExp = factor(PriorExp, levels = c("Never programmed before",
                                                "Some experience",
                                                "Extensive experience")))
head(survey$PriorExp)
## [1] Some experience         Extensive experience    Never programmed before
## [4] Never programmed before Never programmed before Some experience
## 3 Levels: Never programmed before ... Extensive experience
• Now PriorExp is a factor variable, with values ordered from “Never programmed before” to “Extensive experience”
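
To see the effect of the levels argument in isolation, here is a sketch on a short made-up vector (prior.exp is hypothetical, not the survey column):

```r
prior.exp <- c("Some experience", "Extensive experience",
               "Never programmed before", "Some experience")

# Without levels, factor() sorts the distinct values alphabetically
default.factor <- factor(prior.exp)
print(levels(default.factor))

# With levels, we control the order used by tables and plots
ordered.levels <- c("Never programmed before", "Some experience", "Extensive experience")
ordered.factor <- factor(prior.exp, levels = ordered.levels)
print(table(ordered.factor))
```

The counts are the same either way; only the order in which the categories are reported changes.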

## Reconstructing the plot

• Here’s what we get if we run the exact same plotting command again
qplot(survey$PriorExp)

• Better! This more clearly communicates the distribution of prior programming experience among survey respondents.

## Functions

• We have used a lot of built-in functions: mean(), subset(), plot(), read.table()

• An important part of programming and data analysis is to write custom functions

• Functions help make code modular

• Functions make debugging easier

• Remember: this entire class is about applying functions to data

## What is a function?

A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

• Let’s look at a really simple function

addOne <- function(x) {
  x + 1
}

• x is the argument or input

• The function output is the input x incremented by 1

addOne(12)
## [1] 13

## More interesting example

• Here’s a function that returns a % given a numerator, denominator, and desired number of decimal places

calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}
calculatePercentage(27, 80, 1)
## [1] 33.8

• If you’re calculating several %’s for your report, you should use this kind of function instead of repeatedly copying and pasting code

## Function returning a list

• Here’s a function that takes a person’s full name (FirstName LastName), weight in lb and height in inches and converts it into a list with the person’s first name, person’s last name, weight in kg, height in m, and BMI.

createPatientRecord <- function(full.name, weight, height) {
  name.vec <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.vec[1]
  last.name <- name.vec[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
       bmi=bmi)
}

## Trying out the function

createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
## $first.name
## [1] "Michael"
##
## $last.name
## [1] "Smith"
##
## $weight
## [1] 84.09091
##
## $height
## [1] 1.8542
##
## $bmi
## [1] 24.45884
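
Because the function returns a list, the result can be stored and its pieces pulled out by name. The sketch below uses a trimmed-down helper, toMetric(), which is a hypothetical name that keeps only the numeric conversions from createPatientRecord():

```r
# Same unit conversions as above, restricted to the numeric fields
toMetric <- function(weight, height) {
  weight.in.kg <- weight / 2.2    # lb to kg (approximate factor, as above)
  height.in.m <- height * 0.0254  # inches to metres
  list(weight = weight.in.kg,
       height = height.in.m,
       bmi = weight.in.kg / (height.in.m ^ 2))
}

record <- toMetric(185, 12 * 6 + 1)
bmi.rounded <- round(record$bmi, 1)  # pull out one element and post-process it
print(bmi.rounded)
```

Returning a named list is a convenient way for one function call to hand back several related values.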

## Another example: 3 number summary

• Calculate mean, median and standard deviation
threeNumberSummary <- function(x) {
c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
##     mean   median       sd
## 4.957200 5.267708 2.233534

## If-else statements

• Oftentimes we want our code to have different effects depending on the features of the input

• Example: Calculating a student’s letter grade
• If grade >= 90, assign A
• Otherwise, if grade >= 80, assign B
• Otherwise, if grade >= 70, assign C
• In all other cases, assign F
• To code this up, we use if-else statements

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"
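
An if() condition expects a single TRUE or FALSE, so a function built this way handles one grade at a time; that is why it gets applied element by element. As a sketch, here is the same idea with base R’s vapply() standing in for map_chr() (a substitution made only so the example has no package dependencies):

```r
calculateLetterGrade <- function(x) {
  if (x >= 90) {
    "A"
  } else if (x >= 80) {
    "B"
  } else if (x >= 70) {
    "C"
  } else {
    "F"
  }
}

course.grades <- c(92, 78, 87, 91, 62)

# Apply the function to each grade; character(1) says each call returns one string
letter.grades <- vapply(course.grades, calculateLetterGrade, character(1))
print(letter.grades)
```

Here the value of the if-else chain is itself the last expression in the function, so it becomes the return value.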

## return()

• In the previous examples we specified the output simply by writing the output variable as the last line of the function

• More explicitly, we can use the return() function

addOne <- function(x) {
return(x + 1)
}

addOne(12)
## [1] 13
• We will generally avoid the return() function, but you can use it if necessary or if it makes writing a particular function easier.
• Google’s style guide suggests explicit returns. Most do not.
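
One situation where return() is genuinely handy is exiting a function early. A minimal sketch, with safeDivide() as a hypothetical example function:

```r
safeDivide <- function(x, y) {
  if (y == 0) {
    return(NA)  # leave the function immediately; nothing below runs
  }
  x / y         # otherwise the last expression is the output, as usual
}

ok.result <- safeDivide(6, 3)
zero.result <- safeDivide(1, 0)
print(c(ok.result, zero.result))
```

Without the early return, the special case would have to be wrapped in an if-else around the rest of the function body.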

## More programming basics: loops

• We’ll now learn about loops and some more efficient/syntactically simple loop alternatives

• Loops are ways of iterating over data

## For loops: a pair of examples

for(i in 1:4) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
phrase <- paste(phrase, word)
print(phrase)
}
## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"

## For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

• The basic syntax takes the form
for(index.variable in index.set) {
  # code to be repeated at every value of index.variable
}
• The index set is often a vector of integers, but can be more general
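
A common pattern is to loop over the positions of a vector and fill in a results vector of the same length. The sketch below uses made-up values and seq_along(), which produces the positions 1, 2, ..., length(values):

```r
values <- c(3, 1, 4)
squares <- numeric(length(values))  # pre-allocate space for the results

for (i in seq_along(values)) {
  squares[i] <- values[i] ^ 2       # store the ith result
}
print(squares)
```

seq_along(values) is a little safer than 1:length(values): for an empty vector it gives an empty index set, whereas 1:0 would loop over 1 and 0.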

## Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
print(c(i, typeof(i)))
}
## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

## Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
##            [,1]      [,2]      [,3]       [,4]      [,5]
## [1,] -1.0963542  1.268921 -1.287129 -0.5779126 0.5140325
## [2,]  0.1758044 -1.301273  1.097273  0.7481089 0.2204204
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1] -10.4499437  18.7544425  -0.9722741  -8.9584812  -4.7780701
colSums(fake.data) # A better approach (see also colMeans())
## [1] -10.4499437  18.7544425  -0.9722741  -8.9584812  -4.7780701

## while loops

• while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
day <- day + 1
}
• We won’t really be using while loops in this class

• Just be aware that they exist, and that they may become useful to you at some point in your analytics career
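
while loops are most natural when we don’t know in advance how many iterations are needed. A small sketch with made-up numbers: how many years does it take a balance growing at 5% per year to double?

```r
balance <- 100
years <- 0

# Keep going until the balance has at least doubled
while (balance < 200) {
  balance <- balance * 1.05  # add one year of 5% growth
  years <- years + 1
}
print(years)
```

The loop stops as soon as the condition fails, which here happens after 15 iterations.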

## Loop alternatives

| Command | Description |
|---|---|
| apply(X, MARGIN, FUN) | Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X |
| map(.x, .f, ...) | Obtain a list by applying .f to every element of a list or atomic vector .x |
| map_<type>(.x, .f, ...) | For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x |
| map_at(.x, .at, .f) | Obtain a list by applying .f to the elements of .x specified by name or index given in .at |
| map_if(.x, .p, .f) | Obtain a list by applying .f to the elements of .x selected by .p (a predicate function, or a logical vector) |
| mutate_all/_at/_if | Mutate all variables, specified (at) variables, or those selected by a predicate (if) |
| summarize_all/_at/_if | Summarize all variables, specified (at) variables, or those selected by a predicate (if) |
• These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

• The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
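
Since map_if() and map_at() don’t come up again in these notes, here is a quick sketch of both. It assumes the purrr package is installed (it is part of the tidyverse); the record list and its values are made up.

```r
library(purrr)

record <- list(name = "Michael", weight = 185, height = 73)

# map_if: apply the function only to elements for which the predicate is TRUE
doubled <- map_if(record, is.numeric, ~ .x * 2)

# map_at: apply the function only to the elements named in .at
in.kg <- map_at(record, "weight", ~ .x / 2.2)

str(doubled)
str(in.kg)
```

In both cases the elements that aren’t selected are passed through unchanged, and the result is a list with the same names as the input.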

## Example: apply()

colMeans(fake.data)
## [1] -0.104499437  0.187544425 -0.009722741 -0.089584812 -0.047780701
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1] -0.104499437  0.187544425 -0.009722741 -0.089584812 -0.047780701
# Function that calculates the proportion of vector elements that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
## [1] 0.46 0.64 0.50 0.47 0.54
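
To make the role of MARGIN concrete, here is a sketch on a tiny matrix where the answers can be checked by hand (the matrix m is made up for illustration):

```r
m <- matrix(1:6, nrow = 2)  # row 1 is (1, 3, 5), row 2 is (2, 4, 6)

row.sums <- apply(m, MARGIN = 1, FUN = sum)  # one result per row
col.sums <- apply(m, MARGIN = 2, FUN = sum)  # one result per column

print(row.sums)
print(col.sums)
```

With 2 rows and 3 columns, MARGIN = 1 returns 2 values and MARGIN = 2 returns 3.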

## Example: map, map_()

map(survey, is.numeric) # Returns a list
## $Program
## [1] FALSE
##
## $PriorExp
## [1] FALSE
##
## $Rexperience
## [1] FALSE
##
## $OperatingSystem
## [1] FALSE
##
## $TVhours
## [1] TRUE
##
## $Editor
## [1] FALSE
map_lgl(survey, is.numeric) # Returns a logical vector with named elements
##         Program        PriorExp     Rexperience OperatingSystem
##           FALSE           FALSE           FALSE           FALSE
##         TVhours          Editor
##            TRUE           FALSE

## Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # apply() treats the data frame as a matrix
## speed  dist
## 15.40 42.98
map(cars, mean) # Data frames are lists of columns
## $speed
## [1] 15.4
##
## $dist
## [1] 42.98
map_dbl(cars, mean) # map output as a double vector
## speed  dist
## 15.40 42.98

## Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"
• Note: this has the effect of producing a copy of the Cars93 data where all of the factor variables have been replaced with versions containing lowercase values

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
head(Cars93.lower$Type_lower)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

## Example: mutate_at

Let’s convert from MPG to KMPL, but this time using mutate_at

Cars93.metric <- Cars93 %>%
mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
## [1] "Luggage.room"     "Weight"           "Origin"
## [4] "Make"             "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for

function(.x){0.425 * .x}
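
To check that the two forms really do the same thing, here is a sketch that applies each to a couple of made-up MPG values with map_dbl(). It assumes the purrr package is installed.

```r
library(purrr)

mpg <- c(20, 30)  # illustrative fuel economy values

with.formula <- map_dbl(mpg, ~ 0.425 * .x)                    # shorthand
with.function <- map_dbl(mpg, function(.x) { 0.425 * .x })    # written out in full

print(with.formula)
print(identical(with.formula, with.function))
```

The shorthand saves typing for one-off transformations; for anything longer, a named function is usually easier to read.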

## Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)
##   Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
##        RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
## 1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
##      Width Turn.circle Rear.seat.room Luggage.room   Weight
## 1 69.37634    38.95699             NA           NA 3072.903
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
##   Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
## 1       17.12581   19.50968       21.89892      22.36559         29.08602
##   EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
## 1        2.667742         143.828 5280.645          2332.204
##   Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean
## 1                16.66452        5.086022    183.2043       103.9462
##   Width_mean Turn.circle_mean Rear.seat.room_mean Luggage.room_mean
## 1   69.37634         38.95699            27.82967          13.89024
##   Weight_mean
## 1    3072.903
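
The NA values in the first summary come from how mean() treats missing data: a single NA makes the whole mean NA unless na.rm = TRUE is supplied. A sketch with made-up numbers (luggage.room here is a hypothetical vector, not the Cars93 column):

```r
luggage.room <- c(11, NA, 14)

mean.default <- mean(luggage.room)               # NA, because one value is missing
mean.na.rm <- mean(luggage.room, na.rm = TRUE)   # drops the NA, then averages 11 and 14

print(c(mean.default, mean.na.rm))
```

In the second summarize_if call above, na.rm=TRUE is passed along to mean() for every column, which is why Rear.seat.room and Luggage.room now have values.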

## Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

Cars93 %>%
group_by(Type) %>%
summarize_at(c("MPG.city", "MPG.highway"), mean)
## # A tibble: 6 x 3
##   Type    MPG.city MPG.highway
##   <fct>      <dbl>       <dbl>
## 1 Compact     22.7        29.9
## 2 Large       18.4        26.7
## 3 Midsize     19.5        26.7
## 4 Small       29.9        35.5
## 5 Sporty      21.8        28.8
## 6 Van         17          21.9

## Another approach

We’ll learn about a bunch of select helper functions like contains() and starts_with().

Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
group_by(Type) %>%
summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 3
##   Type    MPG.city_mean MPG.highway_mean
##   <fct>           <dbl>            <dbl>
## 1 Compact          22.7             29.9
## 2 Large            18.4             26.7
## 3 Midsize          19.5             26.7
## 4 Small            29.9             35.5
## 5 Sporty           21.8             28.8
## 6 Van              17               21.9

## More than one grouping variable

Cars93 %>%
group_by(Origin, AirBags) %>%
summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 4
## # Groups:   Origin [2]
##   Origin  AirBags            MPG.city_mean MPG.highway_mean
##   <fct>   <fct>                      <dbl>            <dbl>
## 1 USA     Driver & Passenger          19               27.2
## 2 USA     Driver only                 20.2             27.5
## 3 USA     None                        23.1             29.6
## 4 non-USA Driver & Passenger          20.3             27
## 5 non-USA Driver only                 23.2             29.4
## 6 non-USA None                        25.9             32

## Assignments

• Homework 2 will be posted today
• Due: Wednesday, November 11, 1:30pm ET
• Submit your .Rmd and .html files on Canvas
• Lab 4 is available on Canvas and the course website
• You have until Friday evening to complete it
• Friday’s lab session will go over this week’s material and help you complete the labs