Lecture 5: plyr and introduction to graphics
====
author: Prof. Alexandra Chouldechova
date: 94842
font-family: Gill Sans
autosize: false
width:1920
height:1080
Agenda
====
- split-apply-combine (plyr)
- Introduction to R graphics (ggplot2)
Packages
====
```{r}
library(MASS)
library(plyr)
library(dplyr)
library(tibble)
library(ggplot2)
```
Getting started: birthwt dataset
====
- We're going to start by operating on the `birthwt` dataset from the `MASS` package
- Let's get it loaded and see what we're working with
```{r}
str(birthwt)
```
tibbles
====
- tibbles are nicer data frames
- You may find it more convenient to work with tibbles instead of data frames
- In particular, they have nicer and more informative default print settings
```{r}
birthwt <- as_tibble(birthwt)
birthwt
```
Renaming the variables
====
- The dataset doesn't come with very descriptive variable names
- Let's get better column names (use `help(birthwt)` to understand the variables and come up with better names)
```{r}
colnames(birthwt)
# The default names are not very descriptive
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
                       "mother.weight", "race", "mother.smokes",
                       "previous.prem.labor", "hypertension",
                       "uterine.irr", "physician.visits", "birthwt.grams")
# Better names!
```
Renaming the factors
====
- All the factors are currently represented as integers
- Let's use the `mutate()` and `mapvalues()` functions to convert variables to factors and give the factors more meaningful levels
```{r}
birthwt <- mutate(birthwt,
  race = as.factor(mapvalues(race, c(1, 2, 3),
                             c("white", "black", "other"))),
  mother.smokes = as.factor(mapvalues(mother.smokes,
                                      c(0, 1), c("no", "yes"))),
  hypertension = as.factor(mapvalues(hypertension,
                                     c(0, 1), c("no", "yes"))),
  uterine.irr = as.factor(mapvalues(uterine.irr,
                                    c(0, 1), c("no", "yes"))),
  birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
                                           c(0, 1), c("no", "yes")))
)
```
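
- For comparison, a minimal base R sketch of the same recoding idea using `factor()` with explicit `levels` and `labels`. It works on a fresh copy of the data (called `raw` here, a name chosen just for this illustration) with the original `MASS` column names `race` and `smoke`. Unlike `as.factor()` on a character vector, which sorts levels alphabetically, `factor()` keeps the level order you specify.

```{r}
# Fresh copy with the original column names (race coded 1/2/3, smoke coded 0/1)
raw <- MASS::birthwt

# Recode integers to labelled factors in one step
race.f  <- factor(raw$race,  levels = c(1, 2, 3),
                  labels = c("white", "black", "other"))
smoke.f <- factor(raw$smoke, levels = c(0, 1),
                  labels = c("no", "yes"))

levels(race.f)          # level order follows `labels`, not the alphabet
table(race.f, smoke.f)  # cross-tabulate the recoded variables
```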
Summary of the data
====
- Now that things are coded correctly, we can look at an overall summary
```{r}
summary(birthwt)
```
A simple table
====
- Let's use the `tapply()` function to see what the average birthweight looks like when broken down by race and smoking status
```{r}
with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
```
- Questions you should be asking yourself:
    - Does smoking status appear to have an effect on birth weight?
    - Does the effect of smoking status appear to be consistent across racial groups?
    - What is the association between race and birth weight?
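
- One way to start on the second question, as a rough sketch: subtract the non-smoker column of the table from the smoker column to get a smoker-minus-non-smoker difference for each race. The code uses a fresh copy of the data (`raw`, a name chosen for this illustration) with the original `MASS` column names and integer codes, so rows are labelled 1/2/3 and columns 0/1.

```{r}
# Fresh copy with the original column names: bwt, race (1/2/3), smoke (0/1)
raw <- MASS::birthwt

# Mean birth weight by race (rows) and smoking status (columns)
tbl <- with(raw, tapply(bwt, INDEX = list(race = race, smoke = smoke),
                        FUN = mean))

# Difference in mean birth weight, smokers minus non-smokers, for each race
smoke.diff <- tbl[, "1"] - tbl[, "0"]
round(smoke.diff)
```

- These are raw differences in group means; they describe the sample and do not by themselves establish an effect.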
What if we wanted nicer looking output?
====
- Let's use the chunk header `{r, results='asis'}`, along with the `kable()` function from the `knitr` package
```{r, results='asis'}
library(knitr)
# Construct table (rounded to 0 decimal places)
tbl.round <- with(birthwt, round(tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean)))
# Print nicely
kable(tbl.round, format = "markdown")
```
- `kable()` outputs the table in a way that Markdown can read and nicely display
- Note: changing the CSS changes the table appearance
plyr: more general split-apply-combine operations
====
- We've seen several examples of split-apply-combine in action
- The basic principle to keep in mind is as follows:
1. **Split** a data set into pieces (e.g., according to some factor)
2. **Apply** a function to each piece (e.g., mean)
3. **Combine** all the pieces into a single output (e.g., a table or data frame)
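
- The three steps can be sketched by hand in base R. The data frame `toy` and its values are made up purely for illustration.

```{r}
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# 1. Split: one vector of values per group
pieces <- split(toy$value, toy$group)

# 2. Apply: compute the mean of each piece
piece.means <- lapply(pieces, mean)

# 3. Combine: collect the results into a single data frame
result <- data.frame(
  group = names(piece.means),
  mean.value = unlist(piece.means),
  row.names = NULL
)
result  # group a has mean 2, group b has mean 20
```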
split-apply-combine: a simple example
====
![A simple example](http://nicercode.github.io/2014-02-13-UNSW/lessons/40-repeating/splitapply.png)
What does plyr do?
====
plyr introduces a family of functions of the form `XYply`, where `X` specifies the input type and `Y` specifies the output type
| X/Y option | Type |
|:------------:|:------------|
| `a` | array (e.g., vector or matrix) |
| `d` | data.frame |
| `l` | list |
| `_` | no output (valid only for `Y`, useful when plotting) |
> **Usage:** `XYply(.data, .variables, .fun)`
> **Effect:** Take input of type `X`, and apply `.fun` to `.data` split up according to `.variables`, combining the answer into output of type `Y`
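
- A small sketch of how the second letter changes the output type. The same per-group mean is computed three times from a data frame input (`d`), returning a data frame (`ddply`), an array (`daply`), and a list (`dlply`). The data frame `toy` and the helper `piece.mean()` are made up for this illustration.

```{r}
library(plyr)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Hypothetical helper: mean of the `value` column of one piece
piece.mean <- function(df) mean(df$value)

# d -> d: data frame in, data frame out
res.d <- ddply(toy, ~ group,
               function(df) data.frame(mean.value = piece.mean(df)))

# d -> a: data frame in, array out (one entry per group)
res.a <- daply(toy, ~ group, piece.mean)

# d -> l: data frame in, list out (one element per group)
res.l <- dlply(toy, ~ group, piece.mean)

res.d
res.a
res.l$a
```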
Example: Average birthweight by mother's race
====
```{r}
# ddply
ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))
# Compare to tapply:
with(birthwt,
     tapply(birthwt.grams, race, mean))
```
data frames are much nicer for plotting
====
```{r, echo = FALSE}
mean.bwt <- ddply(birthwt, ~ race, summarize,
                  mean.bwt = mean(birthwt.grams))
```
```{r}
mean.bwt
```
***
```{r, fig.height = 3, fig.width = 5, fig.align = 'center', dpi = 300}
ggplot(data = mean.bwt,
       aes(x = race,
           y = mean.bwt)) +
  geom_bar(stat = "identity")
```
Multiple calculated fields
====
- Let's look at the average birthweight and proportion of babies with low birthweight broken down by smoking status
```{r}
ddply(birthwt, ~ mother.smokes, summarize,
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))
```
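
- Why does `mean(... == "yes")` give a proportion? The comparison produces a logical vector, and `mean()` treats `TRUE` as 1 and `FALSE` as 0. A tiny sketch with a made-up vector:

```{r}
# Toy vector (hypothetical values, for illustration only)
low.bwt <- c("yes", "no", "no", "yes")

is.low <- low.bwt == "yes"   # logical vector: TRUE FALSE FALSE TRUE
prop.low <- mean(is.low)     # TRUE counts as 1, FALSE as 0
prop.low                     # 2 of 4 values are "yes", so 0.5
```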
Multiple splitting factors
====
- We can add additional splitting variables by including additional terms in the formula
```{r}
ddply(birthwt, ~ race + mother.smokes, summarize,
      mean.age = mean(mother.age),
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))
```
Example: Association between mother's age and birth weight?
====
- Is the mother's age correlated with birth weight?
```{r}
with(birthwt, cor(birthwt.grams, mother.age)) # Calculate correlation
```
- Does the correlation vary with smoking status?
- `tapply()` can't help us here: it applies a function to a single vector, while `cor()` needs two. But `ddply()` still works!
```{r}
ddply(birthwt, ~ mother.smokes, summarize,
      cor.bwt.age = cor(birthwt.grams, mother.age))
```
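
- To see roughly what happens under the hood, here is a base R sketch of the same calculation: split the whole data frame by smoking status, then compute the correlation within each piece. It uses a fresh copy of the data (`raw`, a name chosen for this illustration) with the original `MASS` column names `bwt`, `age`, and `smoke` (coded 0/1).

```{r}
# Fresh copy with the original column names
raw <- MASS::birthwt

# Split: one data frame per smoking status, so both columns stay together
pieces <- split(raw, raw$smoke)

# Apply + combine: correlation within each piece, returned as a named vector
cor.by.smoke <- sapply(pieces, function(df) cor(df$bwt, df$age))
cor.by.smoke
```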
Graphics in R
====
- We now know a lot about how to tabulate data
- It's often easier to look at plots instead of tables
- We'll now talk about some of the standard plotting options
- Easier to do this in a live demo...
- Please refer to the **.Rmd** version of the lecture notes for the graphics material