Lecture 4: Basic cleaning, loops, apply functions

Prof. Alexandra Chouldechova
94842

Agenda

  • A common data cleaning task
  • If-else statements
  • For/while loops to iterate over data
  • apply(), lapply(), sapply(), tapply()
  • with() to specify scope

Package loading

library(plyr)
library(dplyr)

A common problem

  • One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns

  • For a simple example, let's look at TVhours column in a messy version of the survey data from Lecture 2

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE)
survey.messy$TVhours
 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

What's happening?

str(survey.messy)
'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : Factor w/ 18 levels "0","1","10","14",..: 8 6 7 1 9 12 1 11 1 1 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
  • Several of the entries have non-numeric values in them (they contain strings)

  • As a result, TVhours is being imported as factor

A look at the TVhours column

survey.messy$TVhours
 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

Attempt at a fix

  • What if we just try to cast it back to numeric?
tv.hours.messy <- survey.messy$TVhours
tv.hours.messy
 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
as.numeric(tv.hours.messy)
 [1]  8  6  7  1  9 12  1 11  1  1  4 15  3  5 11 11  3 14 10 16 16 17  1
[24]  1  1 18 13  2  1  1 11

That didn't work...

tv.hours.messy
as.numeric(tv.hours.messy)
 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
 [1]  8  6  7  1  9 12  1 11  1  1  4 15  3  5 11 11  3 14 10 16 16 17  1
[24]  1  1 18 13  2  1  1 11
  • This just converted all the values into the integer-coded levels of the factor

  • Not what we wanted!

Something that does work

  • Consider the following simple example
num.vec <- c(3.1, 2.5)
as.factor(num.vec)
[1] 3.1 2.5
Levels: 2.5 3.1
as.numeric(as.factor(num.vec))
[1] 2 1
as.numeric(as.character(as.factor(num.vec)))
[1] 3.1 2.5

If we take a number that's being coded as a factor and first turn it into a character string, then converting the string to a numeric gets back the number

Back to the corrupted TVhours column

as.character(tv.hours.messy)
 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"             
as.numeric(as.character(tv.hours.messy))
 [1]  2 NA 16  0 NA NA  0  4  0  0 14 NA 10 15  4  4 10  7 NA  8  8 NA  0
[24]  0  0 NA  6  1  0  0  4
typeof(as.numeric(as.character(tv.hours.messy)))  # Success!! (Almost...)
[1] "double"

A small improvement

  • All the corrupted cells now appear as NA, which is R's missing indicator

  • We can do a little better by cleaning up the vector once we get it to character form

tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings
 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"             

Deleting non-numeric (or .) characters

tv.hours.strings
 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"             
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings) 
 [1] "2"  "15" "16" "0"  "2"  "5"  "0"  "4"  "0"  "0"  "14" "7"  "10" "15"
[15] "4"  "4"  "10" "7"  "33" "8"  "8"  ""   "0"  "0"  "0"  ""   "6"  "1" 
[29] "0"  "0"  "4" 

The final product

tv.hours.messy[1:30]
 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none
tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean
 [1]  2 15 16  0  2  5  0  4  0  0 14  7 10 15  4  4 10  7 33  8  8 NA  0
[24]  0  0 NA  6  1  0  0  4
  • As a last step, we should go through and figure out if any of the NA values should really be 0.
    • This step is not shown here.

Rebuilding our data

survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)
'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
  • Success!

A different approach

  • We can also handle this problem by setting stringsAsFactors = FALSE when importing our data.
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
'data.frame':   31 obs. of  6 variables:
 $ Program        : chr  "PPM" "PPM" "PPM" "Other" ...
 $ PriorExp       : chr  "Some experience" "Never programmed before" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Never used" "Never used" "Basic competence" "Basic competence" ...
 $ OperatingSystem: chr  "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "2" "15 incl movies" "16" "0" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...
  • Now everything is a character instead of a factor

One-line cleanup

  • Let's clean up the TVhours column and cast it to numeric all in one command
survey <- mutate(survey.messy, 
                    TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
'data.frame':   31 obs. of  6 variables:
 $ Program        : chr  "PPM" "PPM" "PPM" "Other" ...
 $ PriorExp       : chr  "Some experience" "Never programmed before" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Never used" "Never used" "Basic competence" "Basic competence" ...
 $ OperatingSystem: chr  "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...

What about all those other `character` variables?

table(survey[["Program"]])

 MISM Other   PPM 
    4     8    19 
table(as.factor(survey[["Program"]]))

 MISM Other   PPM 
    4     8    19 
  • Having factors coded as characters may be OK for many parts of our analysis

To be safe, let's fix things

# Figure out which columns are coded as characters
chr.indexes <- sapply(survey, FUN = is.character)
chr.indexes
        Program        PriorExp     Rexperience OperatingSystem 
           TRUE            TRUE            TRUE            TRUE 
        TVhours          Editor 
          FALSE            TRUE 
# Re-code all of the character columns to factors
survey[chr.indexes] <- lapply(survey[chr.indexes], FUN = as.factor)

Here's the outcome

str(survey)
'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...
  • Success!

Another common problem

  • On Homework 2 you'll learn how to wrangle with another common problem

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing treatment information may look like,

life.support <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(life.support)
   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1 
summary(life.support)
   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1 
  • This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)

  • We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later

  • There's a faster way to fix just the capitalization issue (this is an exercise on Homework 2)

What are all these [l/s/t/]apply() functions?

  • These are all efficient ways of applying a function to margins of an array or elements of a list

  • Before we talk about the details of apply() and its relatives, we should first understand loops

  • loops are ways of iterating over data

  • The apply() functions can be thought of as good alternatives to loops

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

  • The basic syntax takes the form
for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}
  • The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
[1] "Michael"   "character"
[1] "185"    "double"
[1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
          [,1]       [,2]     [,3]       [,4]        [,5]
[1,] -1.012349  1.6522842 1.627709 -1.0352775 -1.21955105
[2,] -1.426333 -0.7209012 1.163312 -0.4282115 -0.05492043
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
[1]  8.998670 -6.301823 -6.981004 -3.153687 12.620116
colSums(fake.data) # A better approach (see also colMeans())
[1]  8.998670 -6.301823 -6.981004 -3.153687 12.620116

while loops

  • while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
  • We won't really be using while loops in this class

  • Just be aware that they exist, and that they may become useful to you at some point in your analytics career

The various apply() functions

Command Description
apply(X, MARGIN, FUN) Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
lapply(X, FUN) Obtain a list by applying FUN to the elements of a list X
sapply(X, FUN) Simplified version of lapply. Returns a vector/array instead of list.
tapply(X, INDEX, FUN) Obtain a table by applying FUN to each combination of the factors given in INDEX
  • These functions are (good!) alternatives to loops

  • They are typically more efficient than loops (often run considerably faster on large data sets)

  • Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • You can always type example(function) to get code examples (E.g., example(apply))

Example: apply()

colMeans(fake.data)
[1]  0.08998670 -0.06301823 -0.06981004 -0.03153687  0.12620116
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
[1]  0.08998670 -0.06301823 -0.06981004 -0.03153687  0.12620116
# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
[1] 0.53 0.45 0.46 0.49 0.52

Example: lapply(), sapply()

lapply(survey, is.factor) # Returns a list
$Program
[1] TRUE

$PriorExp
[1] TRUE

$Rexperience
[1] TRUE

$OperatingSystem
[1] TRUE

$TVhours
[1] FALSE

$Editor
[1] TRUE
sapply(survey, FUN = is.factor) # Returns a vector with named elements
        Program        PriorExp     Rexperience OperatingSystem 
           TRUE            TRUE            TRUE            TRUE 
        TVhours          Editor 
          FALSE            TRUE 

Example: apply(), lapply(), sapply()

apply(cars, 2, FUN=mean) # Data frames are arrays
speed  dist 
15.40 42.98 
lapply(cars, FUN=mean) # Data frames are also lists
$speed
[1] 15.4

$dist
[1] 42.98
sapply(cars, FUN=mean) # sapply() is just simplified lapply()
speed  dist 
15.40 42.98 

Example: tapply()

  • Think of tapply() as a generalized form of the table() function
library(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)

          4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7
# Calculate average MPG.City, broken down by Origin and Drivetrain
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean)
         DriveTrain
Origin     4WD    Front     Rear
  USA     17.6 22.14706 18.33333
  non-USA 23.4 24.93939 19.14286

Example: tapply()

  • Let's get the average horsepower by car Origin and Type
tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)
         Type
Origin     Compact    Large  Midsize    Small   Sporty    Van
  USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
  non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25
  • What's that NA doing there?
any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
[1] FALSE
  • None of the non-USA manufacturers produced Large cars!

Example: using tapply() to mimic table()

  • Here's how one can use tapply() to produce the same output as the table() function
library(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)

          4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7
# This one may take a moment to figure out...
tapply(rep(1, nrow(Cars93)), INDEX = Cars93[c("Origin", "DriveTrain")], FUN=sum) 
         DriveTrain
Origin    4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7

with()

  • Thus far we've repeatedly typed out the data frame name when referencing its columns

  • This is because the data variables don't exist in our working environment

  • Using with(data, expr) lets us specify that the code in expr should be evaluated in an environment that contains the elements of data as variables

with(Cars93, table(Origin, Type))
         Type
Origin    Compact Large Midsize Small Sporty Van
  USA           7    11      10     7      8   5
  non-USA       9     0      12    14      6   4

Example: with()

any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
[1] FALSE
with(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!
[1] FALSE
with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN=mean))
         Compact    Large  Midsize    Small   Sporty    Van
USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25
  • Using with() makes code simpler, easier to read, and easier to debug

Assignments

  • Homework 2 will be posted later today

    • Due: Thursday, February 1, 1:20pm
    • Submit your .Rmd file on Blackboard
  • Lab 4 link is available on the course website

    • You have until Friday evening to complete it
    • Tomorrow's recitation session will give you more time to work on this week's labs