Lecture 3: Manipulating data; lists; functions; if-else statements ==== author: Prof. Alexandra Chouldechova date: font-family: Gill Sans autosize: false width:1920 height:1080 Agenda ==== - More on data frames - Basic dplyr command - Lists - Writing functions in R - If-else statements Load dplyr ==== - In the coming weeks we'll learn a lot more about the `dplyr` and `plyr` libraries - Today we'll just need the `mutate` function from `dplyr` and `mapvalues` from `plyr` ```{r} library(plyr) library(dplyr) # Loading dplyr before plyr typically breaks things. Don't do it. ``` > **Note** Remember to run `install.packages("dplyr")` and similarly for `plyr` before trying to load the libraries for the first time. More on data frames ==== ```{r} library(MASS) head(Cars93, 3) ``` Adding a column: `mutate()` function from `dplyr` ==== - `mutate()` returns a new data frame with columns modified or added as specified by the function call ```{r} Cars93.metric <- mutate(Cars93, KMPL.city = 0.425 * MPG.city, KMPL.highway = 0.425 * MPG.highway) tail(names(Cars93.metric)) ``` - Our data frame has two new columns, giving the fuel consumption in km/l Another approach ==== ```{r} # Add a new column called KMPL.city.2 Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city tail(names(Cars93.metric)) ``` - Let's check that both approaches did the same thing ```{r} identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2) ``` Changing levels of a factor: mapvalues ==== ```{r} manufacturer <- Cars93$Manufacturer head(manufacturer, 10) ``` We'll use the `mapvalues(x, from, to)` function from the `plyr` library. ```{r} # Map Chevrolet, Pontiac and Buick to GM manufacturer.combined <- mapvalues(manufacturer, from = c("Chevrolet", "Pontiac", "Buick"), to = rep("GM", 3)) head(manufacturer.combined, 10) ``` Another example ==== - A lot of data comes with integer encodings of levels - You may want to convert the integers to more meaningful values for the purpose of your analysis - Let's pretend that in the class survey 'Program' was coded as an integer with 1 = MISM, 2 = Other, 3 = PPM ```{r} survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data.csv", header=TRUE, sep=",") survey <- mutate(survey, Program=as.numeric(Program)) head(survey) ``` Example continued ==== - Here's how we would get back the program codings using the `mutate()`, `as.factor()` and `mapvalues()` functions ```{r} survey <- mutate(survey, Program = as.factor(mapvalues(Program, c(1, 2, 3), c("MISM", "Other", "PPM"))) ) head(survey) ``` Some more data frame summaries: `table()` function ==== - Let's revisit the Cars93 dataset - The `table()` function builds **contingency tables** showing counts at each combination of factor levels ```{r} table(Cars93$AirBags) ``` ==== ```{r} table(Cars93$Origin) table(Cars93$AirBags, Cars93$Origin) ``` - Looks like US and non-US cars had about the same distribution of AirBag types - Later in the class we'll learn how to do a hypothesis tests on this kind of data Alternative syntax ==== - When `table()` is supplied a data frame, it produces contingency tables for all combinations of factors ```{r} head(Cars93[c("AirBags", "Origin")], 3) table(Cars93[c("AirBags", "Origin")]) ``` Basics of lists ==== > A list is a **data structure** that can be used to store **different kinds** of data - Recall: a vector is a data structure for storing *similar kinds of data* - To better understand the difference, consider the following example. ```{r} my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male) my.vector.1 typeof(my.vector.1) # All the elements are now character strings! ``` Lists vs. vectors ==== ```{r} my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age) typeof(my.vector.2) ``` - Vectors expect elements to be all of the same type (e.g., `Boolean`, `numeric`, `character`) - When data of different types are put into a vector, the R converts everything to a common type Lists ==== - To store data of different types in the same object, we use lists - Simple way to build lists: use `list()` function ```{r} my.list <- list("Michael", 165, TRUE) my.list sapply(my.list, typeof) ``` Named elements ==== ```{r} patient.1 <- list(name="Michael", weight=165, is.male=TRUE) patient.1 ``` Referencing elements of a list (similar to data frames) ==== ```{r} patient.1$name # Get "name" element (returns a string) patient.1[["name"]] # Get "name" element (returns a string) patient.1["name"] # Get "name" slice (returns a sub-list) c(typeof(patient.1$name), typeof(patient.1["name"])) ``` Functions ==== - We have used a lot of built-in functions: `mean()`, `subset()`, `plot()`, `read.table()`... - An important part of programming and data analysis is to write custom functions - Functions help make code **modular** - Functions make debugging easier - Remember: this entire class is about applying *functions* to *data* What is a function? ==== > A function is a machine that turns **input objects** (arguments) into an **output object** (return value) according to a definite rule. - Let's look at a really simple function ```{r} addOne <- function(x) { x + 1 } ``` - `x` is the **argument** or **input** - The function **output** is the input `x` incremented by 1 ```{r} addOne(12) ``` More interesting example ==== - Here's a function that returns a % given a numerator, denominator, and desired number of decimal values ```{r} calculatePercentage <- function(x, y, d) { decimal <- x / y # Calculate decimal value round(100 * decimal, d) # Convert to % and round to d digits } calculatePercentage(27, 80, 1) ``` - If you're calculating several %'s for your report, you should use this kind of function instead of repeatedly copying and pasting code Function returning a list ==== - Here's a function that takes a person's full name (FirstName LastName), weight in lb and height in inches and converts it into a list with the person's first name, person's last name, weight in kg, height in m, and BMI. ```{r} createPatientRecord <- function(full.name, weight, height) { name.list <- strsplit(full.name, split=" ")[[1]] first.name <- name.list[1] last.name <- name.list[2] weight.in.kg <- weight / 2.2 height.in.m <- height * 0.0254 bmi <- weight.in.kg / (height.in.m ^ 2) list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m, bmi=bmi) } ``` Trying out the function ==== ```{r} createPatientRecord("Michael Smith", 185, 12 * 6 + 1) ``` Another example: 3 number summary ==== - Calculate mean, median and standard deviation ```{r} threeNumberSummary <- function(x) { c(mean=mean(x), median=median(x), sd=sd(x)) } x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2 threeNumberSummary(x) ``` If-else statements ==== - Oftentimes we want our code to have different effects depending on the features of the input - Example: Calculating a student's letter grade - If grade >= 90, assign A - Otherwise, if grade >= 80, assign B - Otherwise, if grade >= 70, assign C - In all other cases, assign F - To code this up, we use if-else statements If-else Example: Letter grades ==== ```{r} calculateLetterGrade <- function(x) { if(x >= 90) { grade <- "A" } else if(x >= 80) { grade <- "B" } else if(x >= 70) { grade <- "C" } else { grade <- "F" } grade } course.grades <- c(92, 78, 87, 91, 62) sapply(course.grades, FUN=calculateLetterGrade) ``` `return()` ==== - In the previous examples we specified the output simply by writing the output variable as the last line of the function - More explicitly, we can use the `return()` function ```{r} addOne <- function(x) { return(x + 1) } addOne(12) ``` - We will generally avoid the `return()` function, but you can use it if necessary or if it makes writing a particular function easier. Reminders ==== - Homework 1 **due 1:20PM on Thursday** - Lab 3 - If you have questions, feel free to post on the Piazza Discussion Forum or attend office hours