Lecture 4: Basic cleaning, loops, apply functions

Prof. Alexandra Chouldechova
94842

Agenda

A common data cleaning task
If-else statements
For/while loops to iterate over data
apply(), lapply(), sapply(), tapply()
with() to specify scope

Package loading

library(plyr)
library(dplyr)

A common problem

One of the most common problems you'll encounter when importing manually-entered data is inconsistent data types within columns
For a simple example, let's look at TVhours column in a messy version of the survey data from Lecture 2

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE)
survey.messy$TVhours

 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

What's happening?

str(survey.messy)

'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : Factor w/ 18 levels "0","1","10","14",..: 8 6 7 1 9 12 1 11 1 1 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...

Several of the entries have non-numeric values in them (they contain strings)
As a result, TVhours is being imported as factor

A look at the TVhours column

survey.messy$TVhours

 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

Attempt at a fix

What if we just try to cast it back to numeric?

tv.hours.messy <- survey.messy$TVhours
tv.hours.messy

 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

as.numeric(tv.hours.messy)

 [1]  8  6  7  1  9 12  1 11  1  1  4 15  3  5 11 11  3 14 10 16 16 17  1
[24]  1  1 18 13  2  1  1 11

That didn't work...

tv.hours.messy
as.numeric(tv.hours.messy)

 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0              4             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

 [1]  8  6  7  1  9 12  1 11  1  1  4 15  3  5 11 11  3 14 10 16 16 17  1
[24]  1  1 18 13  2  1  1 11

This just converted all the values into the integer-coded levels of the factor
Not what we wanted!

Something that does work

Consider the following simple example

num.vec <- c(3.1, 2.5)
as.factor(num.vec)

[1] 3.1 2.5
Levels: 2.5 3.1

as.numeric(as.factor(num.vec))

[1] 2 1

as.numeric(as.character(as.factor(num.vec)))

[1] 3.1 2.5

If we take a number that's being coded as a factor and first turn it into a character string, then converting the string to a numeric gets back the number

Back to the corrupted TVhours column

as.character(tv.hours.messy)

 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"

as.numeric(as.character(tv.hours.messy))

 [1]  2 NA 16  0 NA NA  0  4  0  0 14 NA 10 15  4  4 10  7 NA  8  8 NA  0
[24]  0  0 NA  6  1  0  0  4

typeof(as.numeric(as.character(tv.hours.messy)))  # Success!! (Almost...)

[1] "double"

A small improvement

All the corrupted cells now appear as NA, which is R's missing indicator
We can do a little better by cleaning up the vector once we get it to character form

tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings

 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"

Deleting non-numeric (or .) characters

tv.hours.strings

 [1] "2"              "15 incl movies" "16"             "0"             
 [5] "2 hours"        "5ish"           "0"              "4"             
 [9] "0"              "0"              "14"             "7-Jun"         
[13] "10"             "15"             "4"              "4"             
[17] "10"             "7"              "33 (Netflix)"   "8"             
[21] "8"              "falkjklj"       "0"              "0"             
[25] "0"              "none"           "6"              "1"             
[29] "0"              "0"              "4"

# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)

 [1] "2"  "15" "16" "0"  "2"  "5"  "0"  "4"  "0"  "0"  "14" "7"  "10" "15"
[15] "4"  "4"  "10" "7"  "33" "8"  "8"  ""   "0"  "0"  "0"  ""   "6"  "1" 
[29] "0"  "0"  "4"

The final product

tv.hours.messy[1:30]

 [1] 2              15 incl movies 16             0             
 [5] 2 hours        5ish           0              4             
 [9] 0              0              14             7-Jun         
[13] 10             15             4              4             
[17] 10             7              33 (Netflix)   8             
[21] 8              falkjklj       0              0             
[25] 0              none           6              1             
[29] 0              0             
18 Levels: 0 1 10 14 15 15 incl movies 16 2 2 hours 33 (Netflix) ... none

tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean

 [1]  2 15 16  0  2  5  0  4  0  0 14  7 10 15  4  4 10  7 33  8  8 NA  0
[24]  0  0 NA  6  1  0  0  4

As a last step, we should go through and figure out if any of the NA values should really be 0.
- This step is not shown here.

Rebuilding our data

survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)

'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...

Success!

A different approach

We can also handle this problem by setting stringsAsFactors = FALSE when importing our data.

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_messy.csv", header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)

'data.frame':   31 obs. of  6 variables:
 $ Program        : chr  "PPM" "PPM" "PPM" "Other" ...
 $ PriorExp       : chr  "Some experience" "Never programmed before" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Never used" "Never used" "Basic competence" "Basic competence" ...
 $ OperatingSystem: chr  "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "2" "15 incl movies" "16" "0" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...

Now everything is a character instead of a factor

One-line cleanup

Let's clean up the TVhours column and cast it to numeric all in one command

survey <- mutate(survey.messy, 
                    TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)

'data.frame':   31 obs. of  6 variables:
 $ Program        : chr  "PPM" "PPM" "PPM" "Other" ...
 $ PriorExp       : chr  "Some experience" "Never programmed before" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Never used" "Never used" "Basic competence" "Basic competence" ...
 $ OperatingSystem: chr  "Mac OS X" "Windows" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "LaTeX" ...

What about all those other `character` variables?

table(survey[["Program"]])


 MISM Other   PPM 
    4     8    19

table(as.factor(survey[["Program"]]))


 MISM Other   PPM 
    4     8    19

Having factors coded as characters may be OK for many parts of our analysis

To be safe, let's fix things

# Figure out which columns are coded as characters
chr.indexes <- sapply(survey, FUN = is.character)
chr.indexes

        Program        PriorExp     Rexperience OperatingSystem 
           TRUE            TRUE            TRUE            TRUE 
        TVhours          Editor 
          FALSE            TRUE

# Re-code all of the character columns to factors
survey[chr.indexes] <- lapply(survey[chr.indexes], FUN = as.factor)

Here's the outcome

str(survey)

'data.frame':   31 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 3 3 2 3 3 3 2 3 2 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 2 3 3 2 2 3 3 3 2 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 3 3 1 1 3 3 1 1 1 3 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 1 2 1 1 2 2 1 2 2 2 ...
 $ TVhours        : num  2 15 16 0 2 5 0 4 0 0 ...
 $ Editor         : Factor w/ 4 levels "Excel","LaTeX",..: 3 3 3 2 3 3 3 3 3 3 ...

Success!

Another common problem

On Homework 2 you'll learn how to wrangle with another common problem
When data is entered manually, misspellings and case changes are very common
E.g., a column showing treatment information may look like,

life.support <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(life.support)

   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1

summary(life.support)

   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1

This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There's a faster way to fix just the capitalization issue (this is an exercise on Homework 2)

What are all these [l/s/t/]apply() functions?

These are all efficient ways of applying a function to margins of an array or elements of a list
Before we talk about the details of apply() and its relatives, we should first understand loops
loops are ways of iterating over data
The apply() functions can be thought of as good alternatives to loops

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4

phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}

[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

The basic syntax takes the form

for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}

The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}

[1] "Michael"   "character"
[1] "185"    "double"
[1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows

          [,1]       [,2]     [,3]       [,4]        [,5]
[1,] -1.012349  1.6522842 1.627709 -1.0352775 -1.21955105
[2,] -1.426333 -0.7209012 1.163312 -0.4282115 -0.05492043

col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums

[1]  8.998670 -6.301823 -6.981004 -3.153687 12.620116

colSums(fake.data) # A better approach (see also colMeans())

[1]  8.998670 -6.301823 -6.981004 -3.153687 12.620116

while loops

while loops repeat a chunk of code while the specified condition remains true

day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}

We won't really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career

The various apply() functions

Command	Description
`apply(X, MARGIN, FUN)`	Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`lapply(X, FUN)`	Obtain a list by applying `FUN` to the elements of a list `X`
`sapply(X, FUN)`	Simplified version of `lapply`. Returns a vector/array instead of list.
`tapply(X, INDEX, FUN)`	Obtain a table by applying `FUN` to each combination of the factors given in `INDEX`

These functions are (good!) alternatives to loops
They are typically more efficient than loops (often run considerably faster on large data sets)
Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
You can always type example(function) to get code examples (E.g., example(apply))

Example: apply()

colMeans(fake.data)

[1]  0.08998670 -0.06301823 -0.06981004 -0.03153687  0.12620116

apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns

[1]  0.08998670 -0.06301823 -0.06981004 -0.03153687  0.12620116

# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)

[1] 0.53 0.45 0.46 0.49 0.52

Example: lapply(), sapply()

lapply(survey, is.factor) # Returns a list

$Program
[1] TRUE

$PriorExp
[1] TRUE

$Rexperience
[1] TRUE

$OperatingSystem
[1] TRUE

$TVhours
[1] FALSE

$Editor
[1] TRUE

sapply(survey, FUN = is.factor) # Returns a vector with named elements

        Program        PriorExp     Rexperience OperatingSystem 
           TRUE            TRUE            TRUE            TRUE 
        TVhours          Editor 
          FALSE            TRUE

Example: apply(), lapply(), sapply()

apply(cars, 2, FUN=mean) # Data frames are arrays

speed  dist 
15.40 42.98

lapply(cars, FUN=mean) # Data frames are also lists

$speed
[1] 15.4

$dist
[1] 42.98

sapply(cars, FUN=mean) # sapply() is just simplified lapply()

speed  dist 
15.40 42.98

Example: tapply()

Think of tapply() as a generalized form of the table() function

library(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)


          4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7

# Calculate average MPG.City, broken down by Origin and Drivetrain
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean)

         DriveTrain
Origin     4WD    Front     Rear
  USA     17.6 22.14706 18.33333
  non-USA 23.4 24.93939 19.14286

Example: tapply()

Let's get the average horsepower by car Origin and Type

tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)

         Type
Origin     Compact    Large  Midsize    Small   Sporty    Van
  USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
  non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25

What's that NA doing there?

any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")

[1] FALSE

None of the non-USA manufacturers produced Large cars!

Example: using tapply() to mimic table()

Here's how one can use tapply() to produce the same output as the table() function

library(MASS)
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)


          4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7

# This one may take a moment to figure out...
tapply(rep(1, nrow(Cars93)), INDEX = Cars93[c("Origin", "DriveTrain")], FUN=sum)

         DriveTrain
Origin    4WD Front Rear
  USA       5    34    9
  non-USA   5    33    7

with()

Thus far we've repeatedly typed out the data frame name when referencing its columns
This is because the data variables don't exist in our working environment
Using with(data, expr) lets us specify that the code in expr should be evaluated in an environment that contains the elements of data as variables

with(Cars93, table(Origin, Type))

         Type
Origin    Compact Large Midsize Small Sporty Van
  USA           7    11      10     7      8   5
  non-USA       9     0      12    14      6   4

Example: with()

any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")

[1] FALSE

with(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!

[1] FALSE

with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN=mean))

         Compact    Large  Midsize    Small   Sporty    Van
USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25

Using with() makes code simpler, easier to read, and easier to debug

Assignments

Homework 2 will be posted later today
- Due: Thursday, February 1, 1:20pm
- Submit your .Rmd file on Blackboard
Lab 4 link is available on the course website
- You have until Friday evening to complete it
- Tomorrow's recitation session will give you more time to work on this week's labs