---
title: "Lab 8 Solutions"
author: "Alexandra Chouldechova"
date: ""
output: 
  html_document:
    fig_width: 7
    fig_height: 5
---

##### Remember to change the `author: ` field on this Rmd file to your own name.

We'll begin by loading some packages.
```{r}
library(MASS)
library(plyr)
library(ggplot2)
```

Let's form our favourite birthwt data set.

```{r}
# Rename the columns to have more descriptive names
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", 
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", 
    "physician.visits", "birthwt.grams")

# Transform variables to factors with descriptive levels
birthwt <- transform(birthwt, 
            race = as.factor(mapvalues(race, c(1, 2, 3), 
                              c("white","black", "other"))),
            mother.smokes = as.factor(mapvalues(mother.smokes, 
                              c(0,1), c("no", "yes"))),
            hypertension = as.factor(mapvalues(hypertension, 
                              c(0,1), c("no", "yes"))),
            uterine.irr = as.factor(mapvalues(uterine.irr, 
                              c(0,1), c("no", "yes")))
            )
```

### ANOVA with birthwt data

**(a)** Create a new factor that categorizes the number of physician visits into three levels: 0, 1, 2, 3 or more.  

```{r}
phys.visit.binned <- birthwt$physician.visits
phys.visit.binned[phys.visit.binned >= 3] <- "3.or.more"
birthwt <- transform(birthwt, phys.visit.binned = as.factor(phys.visit.binned))

birthwt$phys.visit.binned
```

**Hint**: One way of doing this is with mapvalues, by mapping all instances of 3, 4,... etc, to "3 or more".

**(b)** Run an ANOVA to determine whether the average birth weight varies across number of physician visits.

```{r}
summary(aov(birthwt.grams ~ phys.visit.binned, data = birthwt))
```

The p-value is greater than 0.05, so the variation in birthweight across number of physician visits is not statistically significant.  

### Linear regression with Cars93 data

Below is figure showing how Price varies with EngineSize in the Cars93, with accompanying regression lines.  There are two plots, one for USA cars, and one for non-USA cars.

```{r, fig.align='center', fig.height = 4}
qplot(data = Cars93, x = EngineSize, y = Price, colour = Origin) + facet_wrap("Origin") + stat_smooth(method = "lm") + theme(legend.position="none")
```

**(a)** Use the `lm()` function to regress Price on EngineSize and Origin

```{r}
# Fit linear regression model
price.lm <- lm(Price ~ EngineSize + Origin, data = Cars93)
# summary of linear model
summary(price.lm)
```

**(b)** Run `plot()` on your `lm` object.  Do you see any problems?

```{r}
plot(price.lm)
```

The residual plot shows a clear sign of non-constant variance.  (The plot looks like a funnel, with variance increasing with fitted value.)

**(c)** Try running a linear regression with log(Price) as your outcome.

```{r}
Cars93.log <- transform(Cars93, log.price = log(Price))
price.log.lm <- lm(log.price ~ EngineSize + Origin, data = Cars93.log)
```

**(d)** Run `plot()` on your new `lm` object.  Do you see any problems?

```{r}
plot(price.log.lm)
```

The variance now looks pretty constant across the range of fitted values, and there don't appear to be any clear trends in the plots.  All of the diagnostic plots seem pretty good.  It looks like the log transformation helped here.

**(e)** Construct the same plot as in the beginning of this problem, this time showing log(Price) on the y-axis.

```{r}
qplot(data = Cars93.log, x = EngineSize, y = log.price, colour = Origin) + facet_wrap("Origin") + stat_smooth(method = "lm") + theme(legend.position="none")
```

**(f)** The figure indicates that the slopes are different between USA and non-USA origin cars.  To model this difference in slopes, we need to include an **interaction term**.  To do this, simply write `~ EngineSize*Price` in the formula when running `lm` instead of `EngineSize + Price`.  Is the interaction term statistically significant?

```{r}
price.log.interact <- price.log.lm <- lm(log.price ~ EngineSize*Origin, data = Cars93.log)
summary(price.log.interact)
```

The interaction term (`EngineSize:Originnon-USA`) is positive and highly significant.  This indicates that Price increases more steeply in EngineSize for non-USA cars as compared to USA cars.