Fall 2019

Agenda

  • A common data cleaning task
  • For/while loops to iterate over data
  • Helpful variants of map, mutate and summarize

Package loading

library(tidyverse)
Cars93 <- MASS::Cars93  # For Cars93 data again

A common problem

  • One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns

  • For a simple example, let’s look at TVhours column in a messy version of the survey data from Lecture 2

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE)
survey.messy$TVhours
##  [1] 20        6         10        2         none      10        15       
##  [8] 3         0         0         5         2         10        40       
## [15] zero      5         3         20        0         10        2        
## [22] 10        8         8         2         12        5         6        
## [29] 4         4         0         5         0         0         0        
## [36] 4         2         3         14        0         3         7        
## [43] 0         3         7         4         5         1.5       4        
## [50] approx 10 0         0         4         2         0         0        
## [57] 1         0         2         2         0         0.5       6ish     
## [64] 3         6         1         5        
## 21 Levels: 0 0.5 1 1.5 10 12 14 15 2 20 3 4 40 5 6 6ish 7 8 ... zero

What’s happening?

str(survey.messy)
## 'data.frame':    67 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
##  $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
##  $ TVhours        : Factor w/ 21 levels "0","0.5","1",..: 10 15 5 9 20 5 8 11 1 1 ...
##  $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
  • Several of the entries have non-numeric values in them (they contain strings)

  • As a result, TVhours is being imported as factor

A look at the TVhours column

survey.messy$TVhours
##  [1] 20        6         10        2         none      10        15       
##  [8] 3         0         0         5         2         10        40       
## [15] zero      5         3         20        0         10        2        
## [22] 10        8         8         2         12        5         6        
## [29] 4         4         0         5         0         0         0        
## [36] 4         2         3         14        0         3         7        
## [43] 0         3         7         4         5         1.5       4        
## [50] approx 10 0         0         4         2         0         0        
## [57] 1         0         2         2         0         0.5       6ish     
## [64] 3         6         1         5        
## 21 Levels: 0 0.5 1 1.5 10 12 14 15 2 20 3 4 40 5 6 6ish 7 8 ... zero

Attempt at a fix

  • What if we just try to cast it back to numeric?
tv.hours.messy <- survey.messy$TVhours
tv.hours.messy
##  [1] 20        6         10        2         none      10        15       
##  [8] 3         0         0         5         2         10        40       
## [15] zero      5         3         20        0         10        2        
## [22] 10        8         8         2         12        5         6        
## [29] 4         4         0         5         0         0         0        
## [36] 4         2         3         14        0         3         7        
## [43] 0         3         7         4         5         1.5       4        
## [50] approx 10 0         0         4         2         0         0        
## [57] 1         0         2         2         0         0.5       6ish     
## [64] 3         6         1         5        
## 21 Levels: 0 0.5 1 1.5 10 12 14 15 2 20 3 4 40 5 6 6ish 7 8 ... zero
as.numeric(tv.hours.messy)
##  [1] 10 15  5  9 20  5  8 11  1  1 14  9  5 13 21 14 11 10  1  5  9  5 18
## [24] 18  9  6 14 15 12 12  1 14  1  1  1 12  9 11  7  1 11 17  1 11 17 12
## [47] 14  4 12 19  1  1 12  9  1  1  3  1  9  9  1  2 16 11 15  3 14

That didn’t work…

tv.hours.messy
as.numeric(tv.hours.messy)
##  [1] 20   6    10   2    none 10   15   3    0    0    5    2    10   40  
## [15] zero 5    3    20   0    10   2    10   8    8    2    12   5    6   
## [29] 4    4    0    5    0    0    0    4    2    3    14   0   
## 21 Levels: 0 0.5 1 1.5 10 12 14 15 2 20 3 4 40 5 6 6ish 7 8 ... zero
##  [1] 10 15  5  9 20  5  8 11  1  1 14  9  5 13 21 14 11 10  1  5  9  5 18
## [24] 18  9  6 14 15 12 12  1 14  1  1  1 12  9 11  7  1
  • This just converted all the values into the integer-coded levels of the factor

  • Not what we wanted!

Something that does work

  • Consider the following simple example
num.vec <- c(3.1, 2.5)
as.factor(num.vec)
## [1] 3.1 2.5
## Levels: 2.5 3.1
as.numeric(as.factor(num.vec))
## [1] 2 1
as.numeric(as.character(as.factor(num.vec)))
## [1] 3.1 2.5

If we take a number that’s being coded as a factor and first turn it into a character string, then converting the string to a numeric gets back the number

Back to the corrupted TVhours column

as.character(tv.hours.messy)
##  [1] "20"        "6"         "10"        "2"         "none"     
##  [6] "10"        "15"        "3"         "0"         "0"        
## [11] "5"         "2"         "10"        "40"        "zero"     
## [16] "5"         "3"         "20"        "0"         "10"       
## [21] "2"         "10"        "8"         "8"         "2"        
## [26] "12"        "5"         "6"         "4"         "4"        
## [31] "0"         "5"         "0"         "0"         "0"        
## [36] "4"         "2"         "3"         "14"        "0"        
## [41] "3"         "7"         "0"         "3"         "7"        
## [46] "4"         "5"         "1.5"       "4"         "approx 10"
## [51] "0"         "0"         "4"         "2"         "0"        
## [56] "0"         "1"         "0"         "2"         "2"        
## [61] "0"         "0.5"       "6ish"      "3"         "6"        
## [66] "1"         "5"
as.numeric(as.character(tv.hours.messy))
## Warning: NAs introduced by coercion
##  [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0
## [15]   NA  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0
## [29]  4.0  4.0  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0
## [43]  0.0  3.0  7.0  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0
## [57]  1.0  0.0  2.0  2.0  0.0  0.5   NA  3.0  6.0  1.0  5.0
typeof(as.numeric(as.character(tv.hours.messy)))  # Success!! (Almost...)
## Warning in typeof(as.numeric(as.character(tv.hours.messy))): NAs introduced
## by coercion
## [1] "double"

A small improvement

  • All the corrupted cells now appear as NA, which is R’s missing indicator

  • We can do a little better by cleaning up the vector once we get it to character form

tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings
##  [1] "20"        "6"         "10"        "2"         "none"     
##  [6] "10"        "15"        "3"         "0"         "0"        
## [11] "5"         "2"         "10"        "40"        "zero"     
## [16] "5"         "3"         "20"        "0"         "10"       
## [21] "2"         "10"        "8"         "8"         "2"        
## [26] "12"        "5"         "6"         "4"         "4"        
## [31] "0"         "5"         "0"         "0"         "0"        
## [36] "4"         "2"         "3"         "14"        "0"        
## [41] "3"         "7"         "0"         "3"         "7"        
## [46] "4"         "5"         "1.5"       "4"         "approx 10"
## [51] "0"         "0"         "4"         "2"         "0"        
## [56] "0"         "1"         "0"         "2"         "2"        
## [61] "0"         "0.5"       "6ish"      "3"         "6"        
## [66] "1"         "5"

Deleting non-numeric (or .) characters

tv.hours.strings
##  [1] "20"        "6"         "10"        "2"         "none"     
##  [6] "10"        "15"        "3"         "0"         "0"        
## [11] "5"         "2"         "10"        "40"        "zero"     
## [16] "5"         "3"         "20"        "0"         "10"       
## [21] "2"         "10"        "8"         "8"         "2"        
## [26] "12"        "5"         "6"         "4"         "4"        
## [31] "0"         "5"         "0"         "0"         "0"        
## [36] "4"         "2"         "3"         "14"        "0"        
## [41] "3"         "7"         "0"         "3"         "7"        
## [46] "4"         "5"         "1.5"       "4"         "approx 10"
## [51] "0"         "0"         "4"         "2"         "0"        
## [56] "0"         "1"         "0"         "2"         "2"        
## [61] "0"         "0.5"       "6ish"      "3"         "6"        
## [66] "1"         "5"
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings) 
##  [1] "20"  "6"   "10"  "2"   ""    "10"  "15"  "3"   "0"   "0"   "5"  
## [12] "2"   "10"  "40"  ""    "5"   "3"   "20"  "0"   "10"  "2"   "10" 
## [23] "8"   "8"   "2"   "12"  "5"   "6"   "4"   "4"   "0"   "5"   "0"  
## [34] "0"   "0"   "4"   "2"   "3"   "14"  "0"   "3"   "7"   "0"   "3"  
## [45] "7"   "4"   "5"   "1.5" "4"   "10"  "0"   "0"   "4"   "2"   "0"  
## [56] "0"   "1"   "0"   "2"   "2"   "0"   "0.5" "6"   "3"   "6"   "1"  
## [67] "5"

The final product

tv.hours.messy[1:30]
##  [1] 20   6    10   2    none 10   15   3    0    0    5    2    10   40  
## [15] zero 5    3    20   0    10   2    10   8    8    2    12   5    6   
## [29] 4    4   
## 21 Levels: 0 0.5 1 1.5 10 12 14 15 2 20 3 4 40 5 6 6ish 7 8 ... zero
tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean
##  [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0
## [15]   NA  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0
## [29]  4.0  4.0  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0
## [43]  0.0  3.0  7.0  4.0  5.0  1.5  4.0 10.0  0.0  0.0  4.0  2.0  0.0  0.0
## [57]  1.0  0.0  2.0  2.0  0.0  0.5  6.0  3.0  6.0  1.0  5.0
  • As a last step, we should go through and figure out if any of the NA values should really be 0.
    • This step is not shown here.

Rebuilding our data

survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)
## 'data.frame':    67 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
##  $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
##  $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
##  $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
  • Success!

A different approach

  • We can also handle this problem by setting stringsAsFactors = FALSE when importing our data.
survey.meayssy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
## 'data.frame':    67 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
##  $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
##  $ TVhours        : Factor w/ 21 levels "0","0.5","1",..: 10 15 5 9 20 5 8 11 1 1 ...
##  $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
  • Now everything is a character instead of a factor

One-line cleanup

  • Let’s clean up the TVhours column and cast it to numeric all in one command
survey <- mutate(survey.messy, 
                 TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
## 'data.frame':    67 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
##  $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
##  $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
##  $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...

What about all those other character variables?

table(survey[["Program"]])
## 
##  MISM Other   PPM 
##     8    17    42
table(as.factor(survey[["Program"]]))
## 
##  MISM Other   PPM 
##     8    17    42
  • Having factors coded as characters may be OK for many parts of our analysis

If we wanted to, here’s how we could fix things

mutate_if(.tbl, .predicate, .funs) applies a functions .funs to the elements of .tbl for which the predicate (condition) .predicate holds.

Here is how we can use mutate_if to convert all character columns to factors.

survey <- survey %>% mutate_if(is.character, as.factor)
# Outcome:
str(survey)
## 'data.frame':    67 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
##  $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
##  $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
##  $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
  • Success!

Another common problem

  • On Homework 2 you’ll learn how to wrangle with another common problem

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing treatment information may look like,

treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)
##    dialysis    Dialysis    dyalysis       nnone        none        None 
##           3           1           1           1           2           1 
## ventilation Ventilation 
##           1           1

summary(treatment)
##    dialysis    Dialysis    dyalysis       nnone        none        None 
##           3           1           1           1           2           1 
## ventilation Ventilation 
##           1           1
  • This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)

  • We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later

  • There’s a faster way to fix just the capitalization issue (this is an exercise on Homework 2)

What are all these map<*> functions?

  • These are all efficient ways of applying a function to margins of an array or elements of a list

  • Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops

  • loops are ways of iterating over data

  • The map<*> functions and their <*>apply base-R ancestors can be thought of as good alternatives to loops

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

  • The basic syntax takes the form
for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}
  • The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
##           [,1]       [,2]       [,3]       [,4]      [,5]
## [1,] 0.6764355 -0.5059319  0.5638624 -0.6844998 0.7464268
## [2,] 0.7038234 -1.2086823 -0.3606734 -0.2522732 0.9815616
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1]   4.241824 -12.692596  -3.264017  -8.326689   9.278035
colSums(fake.data) # A better approach (see also colMeans())
## [1]   4.241824 -12.692596  -3.264017  -8.326689   9.278035

while loops

  • while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
  • We won’t really be using while loops in this class

  • Just be aware that they exist, and that they may become useful to you at some point in your analytics career

Loop alternatives

Command Description
apply(X, MARGIN, FUN) Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
map(.x, .f, ...) Obtain a list by applying .f to every element of a list or atomic vector .x
map_<type>(.x, .f, ...) For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x
map_at(.x, .at, .f) Obtain a list by applying .f to the elements of .x specified by name or index given in .at
map_if(.x, .p, .f) Obtain a list .f to the elements of .x specified by .p (a predicate function, or a logical vector)
mutate_all/_at/_if Mutate all variables, specified (at) variables, or those selected by a predicate (if)
summarize_all/_at/_if Summarize all variables, specified variables, or those selected by a predicate (if)
  • These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.

Example: apply()

colMeans(fake.data)
## [1]  0.04241824 -0.12692596 -0.03264017 -0.08326689  0.09278035
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1]  0.04241824 -0.12692596 -0.03264017 -0.08326689  0.09278035
# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
## [1] 0.52 0.43 0.53 0.47 0.55

Example: map, map_()

map(survey, is.factor) # Returns a list
## $Program
## [1] TRUE
## 
## $PriorExp
## [1] TRUE
## 
## $Rexperience
## [1] TRUE
## 
## $OperatingSystem
## [1] TRUE
## 
## $TVhours
## [1] FALSE
## 
## $Editor
## [1] TRUE
map_lgl(survey, is.factor) # Returns a logical vector with named elements
##         Program        PriorExp     Rexperience OperatingSystem 
##            TRUE            TRUE            TRUE            TRUE 
##         TVhours          Editor 
##           FALSE            TRUE

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # Data frames are arrays
## speed  dist 
## 15.40 42.98
map(cars, mean) # Data frames are also lists
## $speed
## [1] 15.4
## 
## $dist
## [1] 42.98
map_dbl(cars, mean) # map output as a double vector
## speed  dist 
## 15.40 42.98

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Note: this operation is performed “in place”, replacing prior versions of the variables

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
head(Cars93.lower$Type_lower)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from MPG to KPML but this time using mutate_at

Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
## [1] "Luggage.room"     "Weight"           "Origin"          
## [4] "Make"             "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for

function(.x){0.425 * .x}

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)
##   Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
##        RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
## 1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
##      Width Turn.circle Rear.seat.room Luggage.room   Weight
## 1 69.37634    38.95699             NA           NA 3072.903
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
##   Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
## 1       17.12581   19.50968       21.89892      22.36559         29.08602
##   EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
## 1        2.667742         143.828 5280.645          2332.204
##   Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean
## 1                16.66452        5.086022    183.2043       103.9462
##   Width_mean Turn.circle_mean Rear.seat.room_mean Luggage.room_mean
## 1   69.37634         38.95699            27.82967          13.89024
##   Weight_mean
## 1    3072.903

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

We’ll use the helper function contains() to indicate that we want all variables that contain MPG.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
## # A tibble: 6 x 3
##   Type    MPG.city MPG.highway
##   <fct>      <dbl>       <dbl>
## 1 Compact     22.7        29.9
## 2 Large       18.4        26.7
## 3 Midsize     19.5        26.7
## 4 Small       29.9        35.5
## 5 Sporty      21.8        28.8
## 6 Van         17          21.9

Another approach

We’ll learn about a bunch of helper functions like contains() and starts_with(). Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 3
##   Type    MPG.city_mean MPG.highway_mean
##   <fct>           <dbl>            <dbl>
## 1 Compact          22.7             29.9
## 2 Large            18.4             26.7
## 3 Midsize          19.5             26.7
## 4 Small            29.9             35.5
## 5 Sporty           21.8             28.8
## 6 Van              17               21.9

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 4
## # Groups:   Origin [2]
##   Origin  AirBags            MPG.city_mean MPG.highway_mean
##   <fct>   <fct>                      <dbl>            <dbl>
## 1 USA     Driver & Passenger          19               27.2
## 2 USA     Driver only                 20.2             27.5
## 3 USA     None                        23.1             29.6
## 4 non-USA Driver & Passenger          20.3             27  
## 5 non-USA Driver only                 23.2             29.4
## 6 non-USA None                        25.9             32

Assignments

  • Homework 2 will be posted soon
    • Due: Thursday, November 7, 2:50pm
    • Submit your .Rmd and .html files on Canvas
  • Lab 4 is available on Canvas and the course website
    • You have until Friday evening to complete it
    • Tomorrow’s lab session will give you more time to work on this week’s labs