Problems

We’ll begin by doing all the same data processing as in lecture.

library(MASS)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Assign more descriptive variable names
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", 
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", 
    "physician.visits", "birthwt.grams")

# Assign more descriptive factor levels and convert variables to factors as needed
birthwt <- transform(birthwt, 
            race = as.factor(mapvalues(race, c(1, 2, 3), 
                              c("white","black", "other"))),
            mother.smokes = as.factor(mapvalues(mother.smokes, 
                              c(0,1), c("no", "yes"))),
            hypertension = as.factor(mapvalues(hypertension, 
                              c(0,1), c("no", "yes"))),
            uterine.irr = as.factor(mapvalues(uterine.irr, 
                              c(0,1), c("no", "yes"))),
            birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
                              c(0,1), c("no", "yes")))
            )

1. ddply() vs tapply()

One of the advantages of ddply() is that it makes it easier to view summary tables when grouping on more than two factors.
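To make the difference in output shape concrete, here is a minimal sketch on R's built-in mtcars data. The dataset and the names mpg.array and mpg.table are chosen only for illustration and are not part of this lab. With three grouping factors, tapply() returns a three-dimensional array, while ddply() returns a single data frame with one row per group.

```r
library(plyr)

# Illustration only: mtcars stands in for the lab data.
# tapply() with three grouping factors returns a 3-D array, which
# prints as one two-way table per level of the third factor.
mpg.array <- tapply(mtcars$mpg, mtcars[c("cyl", "gear", "am")], mean)
dim(mpg.array)

# ddply() returns a flat data frame: one row per group, with the
# grouping variables as ordinary columns.
mpg.table <- plyr::ddply(mtcars, c("cyl", "gear", "am"), plyr::summarize,
                         mean.mpg = mean(mpg))
head(mpg.table)
```

The plyr:: prefix is used here because dplyr, loaded above, masks summarize.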

(a) Use the tapply() function to calculate mean birthwt.grams grouped by race, mother’s smoking status, and hypertension.

# Edit me

One of the cells in the tapply output is equal to NA. Explain why.

Replace this text with your solution.

(b) Repeat part (a), this time using the ddply() function.

# Edit me

Do you see an NA result? Explain.

Replace this text with your solution.


(c) Repeat part (b), this time adding the argument .drop = FALSE as part of your ddply call. What happens now?

# Edit me

2. Plotting the diamonds data

(a) Construct a violin plot showing how the distribution of diamond prices varies by diamond cut.
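As a reminder of the geom_violin() syntax, here is a minimal sketch that deliberately uses different variables from the ones requested (carat by clarity, chosen only for illustration). The grouping factor goes on the x axis and the numeric variable on the y axis.

```r
library(ggplot2)

# Illustration only: carat by clarity, not the price-by-cut plot requested above.
# Each violin is a mirrored density estimate of the y variable within one x group.
carat.violin <- ggplot(diamonds, aes(x = clarity, y = carat)) +
  geom_violin() +
  labs(x = "Clarity", y = "Carat")
carat.violin
```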

# Edit me

(b) Use facet_grid with geom_histogram to construct 7 histograms showing the distribution of price within each category of diamond color.
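The faceting pattern can be sketched on a different pair of variables (carat faceted by clarity, chosen only for illustration). In facet_grid(), the formula has the form rows ~ columns, and a dot means "no faceting in this direction".

```r
library(ggplot2)

# Illustration only: carat by clarity, not the price-by-color plot requested above.
# "clarity ~ ." puts one histogram panel per clarity level in its own row.
carat.hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 30) +
  facet_grid(clarity ~ .)
carat.hist
```

Writing the formula as . ~ clarity instead would lay the panels out side by side in columns.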

# Edit me