Fall 2020

Agenda

  • Lists
  • A common data cleaning task
  • Factor variables, and when they’re useful
  • Functions
  • If-else statements
  • For/while loops to iterate over data
  • R coding style

  • Rather than picking up where Lecture 3 left off, I’ve woven the Lecture 3 content we haven’t yet covered into the Lecture 4 notes

Package loading

library(tidyverse)
Cars93 <- MASS::Cars93  # For Cars93 data again

Basics of lists

A list is a data structure that can be used to store different kinds of data

  • Recall: a vector is a data structure for storing similar kinds of data

  • To better understand the difference, consider the following example.

my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1 
## [1] "Michael" "165"     "TRUE"
typeof(my.vector.1)  # All the elements are now character strings!
## [1] "character"

Lists vs. vectors

my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
my.vector.2
## [1]  0  1 27
typeof(my.vector.2)
## [1] "double"
  • Vectors expect elements to be all of the same type (e.g., Boolean, numeric, character)

  • When data of different types are put into a vector, R converts everything to a common type
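The two examples above follow a fixed ordering of types. As a small sketch (the variable is.citizen is made up for illustration), each step below adds an element of a more flexible type, and the whole vector is converted to match:

```r
# The result always takes the most flexible type present:
# logical < integer < double < character
typeof(c(TRUE, FALSE))            # "logical"
typeof(c(TRUE, 2L))               # "integer"
typeof(c(TRUE, 2L, 3.5))          # "double"
typeof(c(TRUE, 2L, 3.5, "four"))  # "character"

# Coercion can be useful: logicals become 0/1, so sum() counts the TRUEs
is.citizen <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical data
sum(is.citizen)                   # 3
```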

Lists

  • To store data of different types in the same object, we use lists

  • Simple way to construct lists: use list() function

  • (We’ll learn about functions like map and map_chr soon.)

my.list <- list("Michael", 165, TRUE)
my.list
## [[1]]
## [1] "Michael"
## 
## [[2]]
## [1] 165
## 
## [[3]]
## [1] TRUE
map_chr(my.list, typeof)
## [1] "character" "double"    "logical"

Named elements

patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
## $name
## [1] "Michael"
## 
## $weight
## [1] 165
## 
## $is.male
## [1] TRUE

Referencing elements of a list (similar to data frames)

patient.1$name # Get "name" element (returns a string)
## [1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
## [1] "Michael"
patient.1["name"] # Get "name" slice (returns a list)
## $name
## [1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
## [1] "character" "list"
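The same operators can also be used with positions, and to add or change elements. A minimal sketch, where the height element is a made-up addition:

```r
patient <- list(name = "Michael", weight = 165, is.male = TRUE)

patient[[2]]                          # Get 2nd element by position: 165
patient$height <- 70                  # Add a new element (hypothetical field)
patient$weight <- patient$weight + 5  # Modify an existing element

names(patient)
# "name" "weight" "is.male" "height"

# Single brackets with several names return a smaller list
patient.sub <- patient[c("name", "height")]
length(patient.sub)                   # 2
```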

A common problem

  • One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns

  • For a simple example, let’s look at the TVhours column in a messy version of the survey data from Lecture 2

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020_messy.csv", 
                         header=TRUE, stringsAsFactors = FALSE)
# Print out first 20 elements
head(survey.messy$TVhours, 20) 
##  [1] "10.5"  "3"     "0"     "10"    "~4"    "0"     "2"     "20ish"
##  [9] "4"     "0"     "15"    "5"     ">20"   "10"    "5"     "2"    
## [17] "14"    "10"    "4"     "3"
  • NOTE: If you’ve installed R within the past few months (version 4.0.0 or later), your version will automatically default to stringsAsFactors = FALSE. My version of R is older and still has the old stringsAsFactors = TRUE default, a convention that dates back to 1998.
    • For a thrilling read, take a look at this blog post by the R development team

What’s happening?

str(survey.messy)
## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : chr  "PPM" "Other" "MISM" "PPM" ...
##  $ PriorExp       : chr  "Some experience" "Extensive experience" "Never programmed before" "Never programmed before" ...
##  $ Rexperience    : chr  "Never used" "Basic competence" "Basic competence" "Never used" ...
##  $ OperatingSystem: chr  "Windows" "Mac OS X" "Windows" "Windows" ...
##  $ TVhours        : chr  "10.5" "3" "0" "10" ...
##  $ Editor         : chr  "Other" "Microsoft Word" "Microsoft Word" "Excel" ...
  • Several of the entries contain non-numeric characters (e.g., "~4", "20ish", ">20")

  • As a result, TVhours is being imported as a character vector

A look at the TVhours column

survey.messy$TVhours
##  [1] "10.5"      "3"         "0"         "10"        "~4"       
##  [6] "0"         "2"         "20ish"     "4"         "0"        
## [11] "15"        "5"         ">20"       "10"        "5"        
## [16] "2"         "14"        "10"        "4"         "3"        
## [21] "6"         ">10"       "2"         "3"         "3"        
## [26] "1"         "<1"        "3"         "5"         "10"       
## [31] "20"        "none"      "9"         "3"         "4"        
## [36] "8"         "7"         "8"         "10"        "10"       
## [41] "4"         "10"        "4"         "0"         "1"        
## [46] "7"         "2"         "15"        "8"         "10"       
## [51] "2"         "3"         "4"         "21"        "10"       
## [56] "approx 20" "0"

Partial fix

  • In Lecture 1 we saw that there exists a family of as.type functions that will try to convert objects from one data type to the specified type

  • We want TVhours to be numeric, so let’s try as.numeric

as.numeric(survey.messy$TVhours)
## Warning: NAs introduced by coercion
##  [1] 10.5  3.0  0.0 10.0   NA  0.0  2.0   NA  4.0  0.0 15.0  5.0   NA 10.0
## [15]  5.0  2.0 14.0 10.0  4.0  3.0  6.0   NA  2.0  3.0  3.0  1.0   NA  3.0
## [29]  5.0 10.0 20.0   NA  9.0  3.0  4.0  8.0  7.0  8.0 10.0 10.0  4.0 10.0
## [43]  4.0  0.0  1.0  7.0  2.0 15.0  8.0 10.0  2.0  3.0  4.0 21.0 10.0   NA
## [57]  0.0

We can do a bit better

  • All the corrupted cells now appear as NA, which is R’s missing indicator

  • We can do a little better by looking at the corrupted entries and seeing if we can recover more information from the cells that contained non-numeric values

Deleting non-numeric (or .) characters

  • Here we’ll use the gsub() function (global substitution) to clean up more of the corruption
head(survey.messy$TVhours, 40)
##  [1] "10.5"  "3"     "0"     "10"    "~4"    "0"     "2"     "20ish"
##  [9] "4"     "0"     "15"    "5"     ">20"   "10"    "5"     "2"    
## [17] "14"    "10"    "4"     "3"     "6"     ">10"   "2"     "3"    
## [25] "3"     "1"     "<1"    "3"     "5"     "10"    "20"    "none" 
## [33] "9"     "3"     "4"     "8"     "7"     "8"     "10"    "10"
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", survey.messy$TVhours) 
##  [1] "10.5" "3"    "0"    "10"   "4"    "0"    "2"    "20"   "4"    "0"   
## [11] "15"   "5"    "20"   "10"   "5"    "2"    "14"   "10"   "4"    "3"   
## [21] "6"    "10"   "2"    "3"    "3"    "1"    "1"    "3"    "5"    "10"  
## [31] "20"   ""     "9"    "3"    "4"    "8"    "7"    "8"    "10"   "10"  
## [41] "4"    "10"   "4"    "0"    "1"    "7"    "2"    "15"   "8"    "10"  
## [51] "2"    "3"    "4"    "21"   "10"   "20"   "0"
  • As a last step, we should go through and figure out if any of the NA values should really be 0.
    • This step is not shown here.
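A minimal sketch of what that last check could look like, using a small made-up vector (tv.raw) rather than the survey data. The rule that "none" means 0 hours is an assumption for illustration:

```r
# Hypothetical raw entries, similar in spirit to TVhours
tv.raw <- c("10.5", "~4", "20ish", ">20", "none", "approx 20", "0")

# Strip everything except digits and '.', then convert
tv.num <- as.numeric(gsub("[^0-9.]", "", tv.raw))

# Which original entries are still missing after cleaning?
tv.raw[is.na(tv.num)]
# "none"

# Assumption: "none" means zero hours
tv.num[tolower(tv.raw) == "none"] <- 0
tv.num
# 10.5  4.0 20.0 20.0  0.0 20.0  0.0
```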

One-line cleanup

  • Let’s clean up the TVhours column and cast it to numeric all in one command
survey <- mutate(survey.messy, 
                 TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : chr  "PPM" "Other" "MISM" "PPM" ...
##  $ PriorExp       : chr  "Some experience" "Extensive experience" "Never programmed before" "Never programmed before" ...
##  $ Rexperience    : chr  "Never used" "Basic competence" "Basic competence" "Never used" ...
##  $ OperatingSystem: chr  "Windows" "Mac OS X" "Windows" "Windows" ...
##  $ TVhours        : num  10.5 3 0 10 4 0 2 20 4 0 ...
##  $ Editor         : chr  "Other" "Microsoft Word" "Microsoft Word" "Excel" ...

Another common problem

  • On Homework 2 you’ll learn how to deal with another common problem

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing Program information may look like,

program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA", "PPM-DA", "MSHCA", "MSPMM-DA", "PPM")

table(program)
## program
##      hca      HCA     mism     MISM    MSHCA MSPMM-DA      ppm      PPM 
##        1        1        1        1        1        1        1        2 
##   PPM-DA 
##        2

  • This vector has a lot of redundant unique values that we won’t want to carry through our entire analysis

  • E.g., hca and HCA, mism and MISM, ppm and PPM should certainly be combined. We might even want to combine PPM and PPM-DA together.

  • On HW 2 you’ll see a quick way to fix capitalization issues. For other forms of redundancy, you’ll likely want to use a function like recode() introduced in Lecture 3.
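As a rough base-R sketch of the idea (one possible approach, not necessarily the one used in the homework), we can standardize case with toupper() and then merge categories by direct replacement. Folding PPM-DA into PPM is the optional step mentioned above:

```r
program <- c("ppm", "PPM", "MISM", "HCA", "hca", "mism", "PPM-DA",
             "PPM-DA", "MSHCA", "MSPMM-DA", "PPM")

length(unique(program))         # 9 distinct values before cleaning

# Fix capitalization: "hca" and "HCA" become the same value
program.clean <- toupper(program)
length(unique(program.clean))   # 6 distinct values

# Optionally fold PPM-DA into PPM
program.clean[program.clean == "PPM-DA"] <- "PPM"
table(program.clean)
```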

When are factor variables useful?

  • Factor variables are handy when it’s important to have control over the ordering of the variable values.

  • E.g., What happens when we plot everyone’s prior programming experience?

qplot(survey$PriorExp)

  • The x-axis values appear in alphabetical order. Not always desirable.
  • What if we wanted the values to appear in ascending order of experience?

Factor variables

  • We can mutate PriorExp into a factor with levels in a specified order using the factor() command, specifying the levels of the variable in the order we want them to appear
survey <- survey %>%
  mutate(PriorExp = factor(PriorExp,
                           levels = c("Never programmed before",
                                      "Some experience",
                                      "Extensive experience")))
head(survey$PriorExp)
## [1] Some experience         Extensive experience    Never programmed before
## [4] Never programmed before Never programmed before Some experience        
## 3 Levels: Never programmed before ... Extensive experience
  • Now PriorExp is a factor variable, with values ordered from “Never programmed before” to “Extensive experience”
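To see what the level ordering changes, here is a small sketch on a made-up vector (prior.exp) instead of the survey data. Functions such as table() and sort() follow the level order rather than alphabetical order:

```r
prior.exp <- c("Some experience", "Extensive experience",
               "Never programmed before", "Some experience")  # made-up data

prior.exp.f <- factor(prior.exp,
                      levels = c("Never programmed before",
                                 "Some experience",
                                 "Extensive experience"))

levels(prior.exp.f)      # Levels in the order we specified
table(prior.exp.f)       # Counts appear in level order: 1, 2, 1
as.integer(prior.exp.f)  # Underlying integer codes: 2 3 1 2
sort(prior.exp.f)        # Sorted by level order, not alphabetically
```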

Reconstructing the plot

  • Here’s what we get if we run the exact same plotting command again
qplot(survey$PriorExp)

  • Better! This more clearly communicates the distribution of prior programming experience among survey respondents.

Functions

  • We have used a lot of built-in functions: mean(), subset(), plot(), read.table()

  • An important part of programming and data analysis is to write custom functions

  • Functions help make code modular

  • Functions make debugging easier

  • Remember: this entire class is about applying functions to data

What is a function?

A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

  • Let’s look at a really simple function
addOne <- function(x) {
  x + 1
}
  • x is the argument or input

  • The function output is the input x incremented by 1

addOne(12)
## [1] 13

More interesting example

  • Here’s a function that returns a % given a numerator, denominator, and desired number of decimal values
calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)
## [1] 33.8
  • If you’re calculating several %’s for your report, you should use this kind of function instead of repeatedly copying and pasting code
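One possible refinement, shown here as a sketch: give d a default value so that callers only need to specify it when they want something other than 1 decimal place. The default of 1 is an arbitrary choice for illustration.

```r
calculatePercentage <- function(x, y, d = 1) {
  decimal <- x / y         # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(1, 3)          # Uses default d = 1: 33.3
calculatePercentage(1, 3, d = 0)   # Override the default: 33
calculatePercentage(y = 4, x = 1)  # Named arguments can go in any order: 25
```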

Function returning a list

  • Here’s a function that takes a person’s full name (FirstName LastName), weight in lb, and height in inches, and converts them into a list with the person’s first name, last name, weight in kg, height in m, and BMI.
createPatientRecord <- function(full.name, weight, height) {
  name.vec <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.vec[1]
  last.name <- name.vec[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
       bmi=bmi)
}

Trying out the function

createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
## $first.name
## [1] "Michael"
## 
## $last.name
## [1] "Smith"
## 
## $weight
## [1] 84.09091
## 
## $height
## [1] 1.8542
## 
## $bmi
## [1] 24.45884

Another example: 3 number summary

  • Calculate mean, median and standard deviation
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
##     mean   median       sd 
## 4.957200 5.267708 2.233534
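Real data often contain missing values, and mean(), median() and sd() all return NA when given one. A sketch of one way to handle this is to add an na.rm argument and pass it through (this extension is an illustration, not part of the original function):

```r
threeNumberSummary <- function(x, na.rm = FALSE) {
  c(mean   = mean(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm))
}

y <- c(1, 2, 3, NA)
threeNumberSummary(y)                # All three values are NA
threeNumberSummary(y, na.rm = TRUE)  # mean = 2, median = 2, sd = 1
```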

If-else statements

  • Oftentimes we want our code to have different effects depending on the features of the input

  • Example: Calculating a student’s letter grade
    • If grade >= 90, assign A
    • Otherwise, if grade >= 80, assign B
    • Otherwise, if grade >= 70, assign C
    • In all other cases, assign F
  • To code this up, we use if-else statements

If-else Example: Letter grades

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"
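An if() condition expects a single TRUE/FALSE value, which is why the function is mapped over the vector one grade at a time. As an alternative sketch, base R’s cut() can assign all the grades in one vectorized call; the breaks below are chosen to match the cutoffs above:

```r
course.grades <- c(92, 78, 87, 91, 62)

# right = FALSE makes the intervals [70, 80), [80, 90), etc.,
# so a score of exactly 90 is assigned an A
letter.grades <- cut(course.grades,
                     breaks = c(-Inf, 70, 80, 90, Inf),
                     labels = c("F", "C", "B", "A"),
                     right = FALSE)
as.character(letter.grades)
# "A" "C" "B" "A" "F"
```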

return()

  • In the previous examples we specified the output simply by writing the output variable as the last line of the function

  • More explicitly, we can use the return() function

addOne <- function(x) {
  return(x + 1)
}

addOne(12)
## [1] 13
  • We will generally avoid the return() function, but you can use it if necessary or if it makes writing a particular function easier.
  • Google’s style guide suggests explicit returns. Most do not.
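One case where return() does make a function easier to write is exiting early, before the rest of the function body runs. A small sketch, with describeSign as a made-up example function:

```r
describeSign <- function(x) {
  # Exit early if the input is not something we can handle
  if (!is.numeric(x)) {
    return("not a number")
  }
  # Only reached for numeric input; the last expression is the output
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

describeSign("abc")  # "not a number"
describeSign(-3)     # "negative"
describeSign(0)      # "zero"
```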

More programming basics: loops

  • We’ll now learn about loops and some more efficient/syntactically simple loop alternatives

  • Loops are ways of iterating over data

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

  • The basic syntax takes the form
for(index.variable in index.set) {
  # code to be repeated at every value of index.variable
}
  • The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
##            [,1]      [,2]      [,3]       [,4]      [,5]
## [1,] -1.0963542  1.268921 -1.287129 -0.5779126 0.5140325
## [2,]  0.1758044 -1.301273  1.097273  0.7481089 0.2204204
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1] -10.4499437  18.7544425  -0.9722741  -8.9584812  -4.7780701
colSums(fake.data) # A better approach (see also colMeans())
## [1] -10.4499437  18.7544425  -0.9722741  -8.9584812  -4.7780701
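The loop above uses 1:nrow(fake.data) as its index set, which works because the matrix has at least one row. One caveat worth knowing: when the length is 0, 1:0 counts down (1, 0) instead of being empty. The sketch below uses an empty vector to show the difference, and why seq_along() and seq_len() are often the safer choice:

```r
empty <- numeric(0)

count.colon <- 0
for (i in 1:length(empty)) {    # 1:0 is c(1, 0), so the body runs twice
  count.colon <- count.colon + 1
}
count.colon                     # 2

count.seq <- 0
for (i in seq_along(empty)) {   # seq_along() is empty, so the body never runs
  count.seq <- count.seq + 1
}
count.seq                       # 0
```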

while loops

  • while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
  • We won’t really be using while loops in this class

  • Just be aware that they exist, and that they may become useful to you at some point in your analytics career
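While loops are most natural when we don’t know ahead of time how many iterations are needed. A made-up illustration: how many years does it take for a balance to double at 5% annual growth?

```r
balance <- 100  # Hypothetical starting balance
years <- 0

while (balance < 200) {
  balance <- balance * 1.05  # Grow by 5%
  years <- years + 1
}

years  # 15
```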

Loop alternatives

Command | Description
apply(X, MARGIN, FUN) | Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
map(.x, .f, ...) | Obtain a list by applying .f to every element of a list or atomic vector .x
map_<type>(.x, .f, ...) | For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x
map_at(.x, .at, .f) | Obtain a list by applying .f to the elements of .x specified by name or index given in .at
map_if(.x, .p, .f) | Obtain a list by applying .f to the elements of .x selected by .p (a predicate function, or a logical vector)
mutate_all/_at/_if | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
summarize_all/_at/_if | Summarize all variables, specified (at) variables, or those selected by a predicate (if)
  • These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
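To make the connection to loops concrete, here is a sketch that computes the same thing three ways: a for loop, base R’s vapply(), and map_dbl(). The list values is made up for illustration.

```r
library(purrr)

values <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)  # made-up data

# 1. For loop: pre-allocate the output, then fill it in
means.loop <- numeric(length(values))
for (i in seq_along(values)) {
  means.loop[i] <- mean(values[[i]])
}
names(means.loop) <- names(values)

# 2. Base R: vapply() with the expected output type stated up front
means.vapply <- vapply(values, mean, numeric(1))

# 3. purrr: map_dbl() returns a double vector
means.map <- map_dbl(values, mean)

means.map
#  a  b  c 
#  2 15  5
```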

Example: apply()

colMeans(fake.data)
## [1] -0.104499437  0.187544425 -0.009722741 -0.089584812 -0.047780701
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1] -0.104499437  0.187544425 -0.009722741 -0.089584812 -0.047780701
# Function that calculates proportion of vector elements that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
## [1] 0.46 0.64 0.50 0.47 0.54
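Since fake.data is random, here is a small deterministic sketch showing both values of MARGIN, along with an anonymous function passed as FUN:

```r
m <- matrix(1:6, nrow = 2)  # 2 x 3 matrix; rows are (1, 3, 5) and (2, 4, 6)

apply(m, MARGIN = 1, FUN = sum)  # Row sums: 9 12
apply(m, MARGIN = 2, FUN = sum)  # Column sums: 3 7 11

# FUN can be defined on the spot: range (max - min) of each column
apply(m, MARGIN = 2, FUN = function(col) max(col) - min(col))  # 1 1 1
```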

Example: map, map_()

map(survey, is.numeric) # Returns a list
## $Program
## [1] FALSE
## 
## $PriorExp
## [1] FALSE
## 
## $Rexperience
## [1] FALSE
## 
## $OperatingSystem
## [1] FALSE
## 
## $TVhours
## [1] TRUE
## 
## $Editor
## [1] FALSE
map_lgl(survey, is.numeric) # Returns a logical vector with named elements
##         Program        PriorExp     Rexperience OperatingSystem 
##           FALSE           FALSE           FALSE           FALSE 
##         TVhours          Editor 
##            TRUE           FALSE

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # apply() first converts the data frame to a matrix
## speed  dist 
## 15.40 42.98
map(cars, mean) # Data frames are also lists
## $speed
## [1] 15.4
## 
## $dist
## [1] 42.98
map_dbl(cars, mean) # map output as a double vector
## speed  dist 
## 15.40 42.98
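apply() works on cars because every column is numeric. As a sketch of what can go wrong otherwise, the made-up data frame below mixes a numeric and a character column. Converting it to a matrix turns everything into character strings, while column-wise tools that treat the data frame as a list keep each column’s own type:

```r
df <- data.frame(x = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)  # made-up data

# apply() goes through as.matrix(), so both columns become character
apply(df, 2, typeof)
#           x       label 
# "character" "character"

# sapply() (like map_chr()) visits each column as a list element
sapply(df, typeof)
#           x       label 
#   "integer" "character"
```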

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"
  • Note: this has the effect of producing a copy of the Cars93 data where all of the factor variables have been replaced with lowercase versions (which are now character vectors rather than factors)

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
head(Cars93.lower$Type_lower)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from MPG to KMPL but this time using mutate_at

Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
## [1] "Luggage.room"     "Weight"           "Origin"          
## [4] "Make"             "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is shorthand for

function(.x){0.425 * .x}
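The same shorthand works in the map functions. A quick sketch on a made-up vector of MPG values, confirming that the two forms give the same result:

```r
library(purrr)

mpg <- c(25, 31, 18)  # made-up MPG values

kmpl.formula  <- map_dbl(mpg, ~ 0.425 * .x)                # Formula shorthand
kmpl.function <- map_dbl(mpg, function(.x) {0.425 * .x})   # Full function

kmpl.formula
# 10.625 13.175  7.650
all.equal(kmpl.formula, kmpl.function)  # TRUE
```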

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)
##   Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
##        RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
## 1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
##      Width Turn.circle Rear.seat.room Luggage.room   Weight
## 1 69.37634    38.95699             NA           NA 3072.903
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
##   Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
## 1       17.12581   19.50968       21.89892      22.36559         29.08602
##   EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
## 1        2.667742         143.828 5280.645          2332.204
##   Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean
## 1                16.66452        5.086022    183.2043       103.9462
##   Width_mean Turn.circle_mean Rear.seat.room_mean Luggage.room_mean
## 1   69.37634         38.95699            27.82967          13.89024
##   Weight_mean
## 1    3072.903

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
## # A tibble: 6 x 3
##   Type    MPG.city MPG.highway
##   <fct>      <dbl>       <dbl>
## 1 Compact     22.7        29.9
## 2 Large       18.4        26.7
## 3 Midsize     19.5        26.7
## 4 Small       29.9        35.5
## 5 Sporty      21.8        28.8
## 6 Van         17          21.9

Another approach

We’ll learn about a bunch of select helper functions like contains() and starts_with().

Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 3
##   Type    MPG.city_mean MPG.highway_mean
##   <fct>           <dbl>            <dbl>
## 1 Compact          22.7             29.9
## 2 Large            18.4             26.7
## 3 Midsize          19.5             26.7
## 4 Small            29.9             35.5
## 5 Sporty           21.8             28.8
## 6 Van              17               21.9

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 4
## # Groups:   Origin [2]
##   Origin  AirBags            MPG.city_mean MPG.highway_mean
##   <fct>   <fct>                      <dbl>            <dbl>
## 1 USA     Driver & Passenger          19               27.2
## 2 USA     Driver only                 20.2             27.5
## 3 USA     None                        23.1             29.6
## 4 non-USA Driver & Passenger          20.3             27  
## 5 non-USA Driver only                 23.2             29.4
## 6 non-USA None                        25.9             32
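If you have dplyr 1.0.0 or later, the _if and _at variants used above have a newer alternative: across(), used inside summarize() or mutate(). The sketch below assumes that version of dplyr is installed and reproduces two of the earlier summaries:

```r
library(dplyr)
Cars93 <- MASS::Cars93

# Like summarize_if(is.numeric, mean, na.rm = TRUE)
num.means <- Cars93 %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Like summarize_at(vars(contains("MPG")), list(mean = mean)), grouped by Type
mpg.by.type <- Cars93 %>%
  group_by(Type) %>%
  summarize(across(contains("MPG"), mean, .names = "{.col}_mean"))

colnames(mpg.by.type)
# "Type" "MPG.city_mean" "MPG.highway_mean"
```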

R coding style

Assignments

  • Homework 2 will be posted today
    • Due: Wednesday, November 11, 1:30pm ET
    • Submit your .Rmd and .html files on Canvas
  • Lab 4 is available on Canvas and the course website
    • You have until Friday evening to complete it
    • Friday’s lab session will go over this week’s material and help you complete the labs