Prof. Alexandra Chouldechova
dplyr
and plyr
librariesmutate
function from dplyr
and mapvalues
from plyr
library(plyr)
library(dplyr)
# Loading dplyr before plyr typically breaks things. Don't do it.
Note Remember to run
install.packages("dplyr")
and similarly forplyr
before trying to load the libraries for the first time.
library(MASS)
head(Cars93, 3)
Manufacturer Model Type Min.Price Price Max.Price MPG.city
1 Acura Integra Small 12.9 15.9 18.8 25
2 Acura Legend Midsize 29.2 33.9 38.7 18
3 Audi 90 Compact 25.9 29.1 32.3 20
MPG.highway AirBags DriveTrain Cylinders EngineSize
1 31 None Front 4 1.8
2 25 Driver & Passenger Front 6 3.2
3 26 Driver only Front 6 2.8
Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
1 140 6300 2890 Yes 13.2
2 200 5500 2335 Yes 18.0
3 172 5500 2280 Yes 16.9
Passengers Length Wheelbase Width Turn.circle Rear.seat.room
1 5 177 102 68 37 26.5
2 5 195 115 71 38 30.0
3 5 180 102 67 37 28.0
Luggage.room Weight Origin Make
1 11 2705 non-USA Acura Integra
2 15 3560 non-USA Acura Legend
3 14 3375 non-USA Audi 90
mutate()
returns a new data frame with columns modified or added as specified by the function callCars93.metric <- mutate(Cars93,
KMPL.city = 0.425 * MPG.city,
KMPL.highway = 0.425 * MPG.highway)
tail(names(Cars93.metric))
[1] "Luggage.room" "Weight" "Origin" "Make"
[5] "KMPL.city" "KMPL.highway"
# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
[1] "Weight" "Origin" "Make" "KMPL.city"
[5] "KMPL.highway" "KMPL.city.2"
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
[1] TRUE
manufacturer <- Cars93$Manufacturer
head(manufacturer, 10)
[1] Acura Acura Audi Audi BMW Buick Buick
[8] Buick Buick Cadillac
32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo
We'll use the mapvalues(x, from, to)
function from the plyr
library.
# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- mapvalues(manufacturer,
from = c("Chevrolet", "Pontiac", "Buick"),
to = rep("GM", 3))
head(manufacturer.combined, 10)
[1] Acura Acura Audi Audi BMW GM GM
[8] GM GM Cadillac
30 Levels: Acura Audi BMW GM Cadillac Chrylser Chrysler Dodge ... Volvo
A lot of data comes with integer encodings of levels
You may want to convert the integers to more meaningful values for the purpose of your analysis
Let's pretend that in the class survey 'Program' was coded as an integer with 1 = MISM, 2 = Other, 3 = PPM
survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data.csv", header=TRUE, sep=",")
survey <- mutate(survey, Program=as.numeric(Program))
head(survey)
Program PriorExp Rexperience OperatingSystem TVhours
1 3 Some experience Never used Mac OS X 2
2 3 Never programmed before Never used Windows 15
3 3 Some experience Basic competence Mac OS X 16
4 2 Some experience Basic competence Mac OS X 0
5 3 Never programmed before Never used Windows 2
6 3 Never programmed before Never used Windows 5
Editor
1 Microsoft Word
2 Microsoft Word
3 Microsoft Word
4 LaTeX
5 Microsoft Word
6 Microsoft Word
mutate()
, as.factor()
and mapvalues()
functionssurvey <- mutate(survey,
Program = as.factor(mapvalues(Program,
c(1, 2, 3),
c("MISM", "Other", "PPM")))
)
head(survey)
Program PriorExp Rexperience OperatingSystem TVhours
1 PPM Some experience Never used Mac OS X 2
2 PPM Never programmed before Never used Windows 15
3 PPM Some experience Basic competence Mac OS X 16
4 Other Some experience Basic competence Mac OS X 0
5 PPM Never programmed before Never used Windows 2
6 PPM Never programmed before Never used Windows 5
Editor
1 Microsoft Word
2 Microsoft Word
3 Microsoft Word
4 LaTeX
5 Microsoft Word
6 Microsoft Word
Let's revisit the Cars93 dataset
The table()
function builds contingency tables showing counts at each combination of factor levels
table(Cars93$AirBags)
Driver & Passenger Driver only None
16 43 34
table(Cars93$Origin)
USA non-USA
48 45
table(Cars93$AirBags, Cars93$Origin)
USA non-USA
Driver & Passenger 9 7
Driver only 23 20
None 16 18
Looks like US and non-US cars had about the same distribution of AirBag types
Later in the class we'll learn how to do a hypothesis tests on this kind of data
table()
is supplied a data frame, it produces contingency tables for all combinations of factors head(Cars93[c("AirBags", "Origin")], 3)
AirBags Origin
1 None non-USA
2 Driver & Passenger non-USA
3 Driver only non-USA
table(Cars93[c("AirBags", "Origin")])
Origin
AirBags USA non-USA
Driver & Passenger 9 7
Driver only 23 20
None 16 18
A list is a data structure that can be used to store different kinds of data
Recall: a vector is a data structure for storing similar kinds of data
To better understand the difference, consider the following example.
my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1
[1] "Michael" "165" "TRUE"
typeof(my.vector.1) # All the elements are now character strings!
[1] "character"
my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
typeof(my.vector.2)
[1] "double"
Vectors expect elements to be all of the same type (e.g., Boolean
, numeric
, character
)
When data of different types are put into a vector, the R converts everything to a common type
To store data of different types in the same object, we use lists
Simple way to build lists: use list()
function
my.list <- list("Michael", 165, TRUE)
my.list
[[1]]
[1] "Michael"
[[2]]
[1] 165
[[3]]
[1] TRUE
sapply(my.list, typeof)
[1] "character" "double" "logical"
patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
$name
[1] "Michael"
$weight
[1] 165
$is.male
[1] TRUE
patient.1$name # Get "name" element (returns a string)
[1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
[1] "Michael"
patient.1["name"] # Get "name" slice (returns a sub-list)
$name
[1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
[1] "character" "list"
We have used a lot of built-in functions: mean()
, subset()
, plot()
, read.table()
…
An important part of programming and data analysis is to write custom functions
Functions help make code modular
Functions make debugging easier
Remember: this entire class is about applying functions to data
A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.
addOne <- function(x) {
x + 1
}
x
is the argument or input
The function output is the input x
incremented by 1
addOne(12)
[1] 13
calculatePercentage <- function(x, y, d) {
decimal <- x / y # Calculate decimal value
round(100 * decimal, d) # Convert to % and round to d digits
}
calculatePercentage(27, 80, 1)
[1] 33.8
createPatientRecord <- function(full.name, weight, height) {
name.list <- strsplit(full.name, split=" ")[[1]]
first.name <- name.list[1]
last.name <- name.list[2]
weight.in.kg <- weight / 2.2
height.in.m <- height * 0.0254
bmi <- weight.in.kg / (height.in.m ^ 2)
list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
bmi=bmi)
}
createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
$first.name
[1] "Michael"
$last.name
[1] "Smith"
$weight
[1] 84.09091
$height
[1] 1.8542
$bmi
[1] 24.45884
threeNumberSummary <- function(x) {
c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
mean median sd
5.060500 4.968122 1.787651
Oftentimes we want our code to have different effects depending on the features of the input
Example: Calculating a student's letter grade
To code this up, we use if-else statements
calculateLetterGrade <- function(x) {
if(x >= 90) {
grade <- "A"
} else if(x >= 80) {
grade <- "B"
} else if(x >= 70) {
grade <- "C"
} else {
grade <- "F"
}
grade
}
course.grades <- c(92, 78, 87, 91, 62)
sapply(course.grades, FUN=calculateLetterGrade)
[1] "A" "C" "B" "A" "F"
In the previous examples we specified the output simply by writing the output variable as the last line of the function
More explicitly, we can use the return()
function
addOne <- function(x) {
return(x + 1)
}
addOne(12)
[1] 13
return()
function, but you can use it if necessary or if it makes writing a particular function easier.Homework 1 due 1:20PM on Thursday
Lab 3
If you have questions, feel free to post on the Piazza Discussion Forum or attend office hours