Lecture 5: plyr and introduction to graphics
====
author: Prof. Alexandra Chouldechova
date: 94842
font-family: Gill Sans
autosize: false
width:1920
height:1080

Agenda
====

- split-apply-combine (plyr)
- Introduction to R graphics (ggplot2)

Packages
====

```{r}
library(MASS)
library(plyr)
library(dplyr)
library(tibble)
library(ggplot2)
```

Getting started: birthwt dataset
====

- We're going to start by operating on the `birthwt` dataset from the MASS library
- Let's get it loaded and see what we're working with

```{r}
str(birthwt)
```

tibbles
====

- tibbles are nicer data frames
- You may find it more convenient to work with tibbles instead of data frames
- In particular, they have nicer and more informative default print settings

```{r}
birthwt <- as_tibble(birthwt)
birthwt
```

Renaming the variables
====

- The dataset doesn't come with very descriptive variable names
- Let's get better column names (use `help(birthwt)` to understand the variables and come up with better names)

```{r}
colnames(birthwt)

# The default names are not very descriptive

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight",
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr",
    "physician.visits", "birthwt.grams")

# Better names!
```

Renaming the factors
====

- All the factors are currently represented as integers
- Let's use the `mutate()` and `mapvalues()` functions to convert variables to factors and give the factors more meaningful levels

```{r}
birthwt <- mutate(birthwt,
  race = as.factor(mapvalues(race, c(1, 2, 3), c("white", "black", "other"))),
  mother.smokes = as.factor(mapvalues(mother.smokes, c(0, 1), c("no", "yes"))),
  hypertension = as.factor(mapvalues(hypertension, c(0, 1), c("no", "yes"))),
  uterine.irr = as.factor(mapvalues(uterine.irr, c(0, 1), c("no", "yes"))),
  birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500, c(0, 1), c("no", "yes")))
)
```

Summary of the data
====

- Now that things are coded correctly, we can look at an overall summary

```{r}
summary(birthwt)
```

A simple table
====

- Let's use the `tapply()` function to see what the average birthweight looks like when broken down by race and smoking status

```{r}
with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
```

- Questions you should be asking yourself:
    - Does smoking status appear to have an effect on birth weight?
    - Does the effect of smoking status appear to be consistent across racial groups?
    - What is the association between race and birth weight?

What if we wanted nicer looking output?
====

- Let's use the header `{r, results='asis'}`, along with the `kable()` function from the `knitr` library

```{r, results='asis'}
library(knitr)

# Construct table (rounded to 0 decimal places)
tbl.round <- with(birthwt, round(tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean)))

# Print nicely
kable(tbl.round, format = "markdown")
```

- `kable()` outputs the table in a way that Markdown can read and nicely display
- Note: changing the CSS changes the table appearance

plyr: more general split-apply-combine operations
====

- We've seen several examples of split-apply-combine in action
- The basic principle to keep in mind is as follows:

1. **Split** a data set into pieces (e.g., according to some factor)
2. **Apply** a function to each piece (e.g., mean)
3. **Combine** all the pieces into a single output (e.g., a table or data frame)

split-apply-combine: a simple example
====

![A simple example](http://nicercode.github.io/2014-02-13-UNSW/lessons/40-repeating/splitapply.png)
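
The three steps can also be sketched by hand in base R. This is only a minimal illustration on a made-up data frame (`toy`, with hypothetical columns `group` and `value`); it is not the data shown in the figure.

```{r}
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 3, 10, 20, 5),
  stringsAsFactors = FALSE
)

# 1. Split: one vector of values per group
pieces <- split(toy$value, toy$group)

# 2. Apply: compute the mean of each piece
piece.means <- sapply(pieces, mean)

# 3. Combine: assemble the per-group results into a single data frame
toy.result <- data.frame(
  group = names(piece.means),
  mean.value = as.numeric(piece.means),
  stringsAsFactors = FALSE
)
toy.result
```

With plyr loaded, a call such as `ddply(toy, ~ group, summarize, mean.value = mean(value))` should give the same table in a single step.
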
What does plyr do?
====

plyr introduces a family of functions of the form `XYply`, where `X` specifies the input type and `Y` specifies the output type


| X/Y option | Type |
|:------------:|:------------|
| `a` | array (e.g., vector or matrix) |
| `d` | data.frame |
| `l` | list |
| `_` | no output (valid only for `Y`, useful when plotting) |

> **Usage:** `XYply(.data, .variables, .fun)`

> **Effect:** Take input of type `X`, and apply `.fun` to `.data` split up according to `.variables`, combining the answer into output of type `Y`

Example: Average birthweight by mother's race
====

```{r}
# ddply
ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))

# Compare to tapply:
with(birthwt, tapply(birthwt.grams, race, mean))
```

data frames are much nicer for plotting
====

```{r, echo = FALSE}
mean.bwt <- ddply(birthwt, ~ race, summarize, mean.bwt = mean(birthwt.grams))
```

```{r}
mean.bwt
```

***

```{r, fig.height = 3, fig.width = 5, fig.align = 'center', dpi = 300}
ggplot(data = mean.bwt, aes(x = race, y = mean.bwt)) +
  geom_bar(stat = "identity")
```

Multiple calculated fields
====

- Let's look at the average birthweight and proportion of babies with low birthweight broken down by smoking status

```{r}
ddply(birthwt, ~ mother.smokes, summarize,
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))
```

Multiple splitting factors
====

- We can add additional splitting variables by including additional terms in the formula

```{r}
ddply(birthwt, ~ race + mother.smokes, summarize,
      mean.age = mean(mother.age),
      mean.bwt = mean(birthwt.grams),
      low.bwt.prop = mean(birthwt.below.2500 == "yes"))
```

Example: Association between mother's age and birth weight?
====

- Is the mother's age correlated with birth weight?

```{r}
with(birthwt, cor(birthwt.grams, mother.age))  # Calculate correlation
```

- Does the correlation vary with smoking status?
    - `tapply` can't help us here. But `ddply` still works!

```{r}
ddply(birthwt, ~ mother.smokes, summarize, cor.bwt.age = cor(birthwt.grams, mother.age))
```

Graphics in R
====

- We now know a lot about how to tabulate data
- It's often easier to look at plots instead of tables
- We'll now talk about some of the standard plotting options
- Easier to do this in a live demo...
- Please refer to **.Rmd** version of lecture notes for the graphics material