Remember to change the author: field on this Rmd file to your own name.

We’ll begin by loading some packages.

library(MASS)
library(plyr)
library(ggplot2)

Let’s format our favourite birthwt data set.

# Rename the columns to have more descriptive names
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", 
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", 
    "physician.visits", "birthwt.grams")

# Transform variables to factors with descriptive levels
birthwt <- transform(birthwt, 
            race = as.factor(mapvalues(race, c(1, 2, 3), 
                              c("white","black", "other"))),
            mother.smokes = as.factor(mapvalues(mother.smokes, 
                              c(0,1), c("no", "yes"))),
            hypertension = as.factor(mapvalues(hypertension, 
                              c(0,1), c("no", "yes"))),
            uterine.irr = as.factor(mapvalues(uterine.irr, 
                              c(0,1), c("no", "yes")))
            )

ANOVA with birthwt data

(a) Create a new factor that categorizes the number of physician visits into four levels: 0, 1, 2, and 3 or more.

# Copy the raw visit counts, then collapse all counts of 3 or more into a
# single "3.or.more" category before converting to a factor
phys.visit.binned <- birthwt$physician.visits
phys.visit.binned[phys.visit.binned >= 3] <- "3.or.more"
birthwt <- transform(birthwt, phys.visit.binned = as.factor(phys.visit.binned))

birthwt$phys.visit.binned
##   [1] 0         3.or.more 1         2         0         0         1        
##   [8] 1         1         0         0         1         0         2        
##  [15] 0         0         0         3.or.more 0         1         2        
##  [22] 3.or.more 1         0         2         0         0         2        
##  [29] 0         1         1         1         1         1         0        
##  [36] 2         2         0         2         1         2         2        
##  [43] 1         0         0         0         3.or.more 0         2        
##  [50] 0         1         0         0         2         0         0        
##  [57] 0         0         0         0         0         2         0        
##  [64] 0         0         1         2         3.or.more 1         2        
##  [71] 0         2         1         0         0         0         1        
##  [78] 3.or.more 0         0         1         0         0         0        
##  [85] 0         0         0         0         0         1         0        
##  [92] 2         0         0         0         1         1         0        
##  [99] 0         1         1         0         0         1         0        
## [106] 0         1         0         2         3.or.more 2         1        
## [113] 2         1         0         1         0         0         2        
## [120] 1         1         0         1         0         2         2        
## [127] 1         0         1         1         0         2         0        
## [134] 0         0         0         1         1         0         1        
## [141] 0         0         0         1         0         2         2        
## [148] 0         0         0         1         2         0         0        
## [155] 0         0         3.or.more 1         0         0         0        
## [162] 1         0         0         0         0         3.or.more 0        
## [169] 1         0         1         0         0         0         0        
## [176] 0         1         3.or.more 0         2         1         3.or.more
## [183] 0         0         2         2         0         0         3.or.more
## Levels: 0 1 2 3.or.more

Hint: One way of doing this is with mapvalues, by mapping all instances of 3, 4, and so on to “3.or.more”. (A sketch of this approach is shown below.)
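For reference, here is a minimal sketch of the mapvalues approach the hint describes, as an alternative to the indexing approach used above. The name phys.visit.binned.alt is just illustrative, and the result should match the factor constructed earlier.

# Alternative binning with mapvalues: map every count from 3 up to the
# maximum observed count to the single level "3.or.more"
visit.counts <- birthwt$physician.visits
phys.visit.binned.alt <- as.factor(mapvalues(visit.counts, 
    from = 3:max(visit.counts), 
    to = rep("3.or.more", length(3:max(visit.counts)))))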

(b) Run an ANOVA to determine whether the average birth weight varies across number of physician visits.

summary(aov(birthwt.grams ~ phys.visit.binned, data = birthwt))
##                    Df   Sum Sq Mean Sq F value Pr(>F)
## phys.visit.binned   3  2259057  753019   1.426  0.237
## Residuals         185 97710599  528165

The p-value (0.237) is greater than 0.05, so the variation in average birth weight across the number of physician visits is not statistically significant at the 0.05 level.
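For context, here is a small sketch of how you could inspect the group means that the F-test is comparing (these are not part of the original output):

# Mean birth weight (grams) within each binned physician-visit category
aggregate(birthwt.grams ~ phys.visit.binned, data = birthwt, FUN = mean)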

Linear regression with Cars93 data

Below is a figure showing how Price varies with EngineSize in the Cars93 data, with accompanying regression lines. There are two panels, one for USA cars and one for non-USA cars.

qplot(data = Cars93, x = EngineSize, y = Price, colour = Origin) + 
    facet_wrap("Origin") + 
    stat_smooth(method = "lm") + 
    theme(legend.position = "none")

(a) Use the lm() function to regress Price on EngineSize and Origin.

# Fit linear regression model
price.lm <- lm(Price ~ EngineSize + Origin, data = Cars93)
# summary of linear model
summary(price.lm)
## 
## Call:
## lm(formula = Price ~ EngineSize + Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.474  -4.362  -1.051   2.743  34.626 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.0892     2.5419  -1.215    0.227    
## EngineSize      7.0637     0.7617   9.274 9.24e-15 ***
## Originnon-USA   7.7596     1.5725   4.935 3.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.948 on 90 degrees of freedom
## Multiple R-squared:  0.4939, Adjusted R-squared:  0.4826 
## F-statistic: 43.91 on 2 and 90 DF,  p-value: 4.925e-14

(b) Run plot() on your lm object. Do you see any problems?

plot(price.lm)

The residual plot shows a clear sign of non-constant variance. (The plot looks like a funnel, with variance increasing with fitted value.)
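If you would rather see all four of the default lm diagnostic plots at once instead of paging through them, one small sketch using base graphics settings is:

# Arrange the four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(price.lm)
par(mfrow = c(1, 1))  # reset the plotting layout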

(c) Try running a linear regression with log(Price) as your outcome.

Cars93.log <- transform(Cars93, log.price = log(Price))
price.log.lm <- lm(log.price ~ EngineSize + Origin, data = Cars93.log)

(d) Run plot() on your new lm object. Do you see any problems?

plot(price.log.lm)

The variance now looks roughly constant across the range of fitted values, and there don’t appear to be any clear trends in any of the diagnostic plots. It looks like the log transformation helped here.

(e) Construct the same plot as in the beginning of this problem, this time showing log(Price) on the y-axis.

qplot(data = Cars93.log, x = EngineSize, y = log.price, colour = Origin) + 
    facet_wrap("Origin") + 
    stat_smooth(method = "lm") + 
    theme(legend.position = "none")

(f) The figure indicates that the slopes are different between USA and non-USA origin cars. To model this difference in slopes, we need to include an interaction term. To do this, simply write ~ EngineSize*Origin in the formula when running lm instead of EngineSize + Origin. Is the interaction term statistically significant?

# Fit linear model with an EngineSize-by-Origin interaction
price.log.interact <- lm(log.price ~ EngineSize * Origin, data = Cars93.log)
summary(price.log.interact)
## 
## Call:
## lm(formula = log.price ~ EngineSize * Origin, data = Cars93.log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44551 -0.19216 -0.01735  0.14870  1.13597 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.99650    0.11636  17.158  < 2e-16 ***
## EngineSize                0.27657    0.03563   7.763 1.33e-11 ***
## Originnon-USA            -0.39954    0.17982  -2.222   0.0288 *  
## EngineSize:Originnon-USA  0.29904    0.06832   4.377 3.27e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2773 on 89 degrees of freedom
## Multiple R-squared:  0.6401, Adjusted R-squared:  0.628 
## F-statistic: 52.77 on 3 and 89 DF,  p-value: < 2.2e-16

The interaction term (EngineSize:Originnon-USA) is positive and highly significant. This indicates that log(Price) increases more steeply with EngineSize for non-USA cars than for USA cars.
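To make the difference in slopes concrete, here is a small sketch that computes the implied EngineSize slope for each Origin from the fitted coefficients (coefficient names taken from the summary above):

# Slope of log(Price) in EngineSize for USA cars is the EngineSize coefficient;
# for non-USA cars it is that coefficient plus the interaction term
est <- coef(price.log.interact)
c(USA = unname(est["EngineSize"]), 
  non.USA = unname(est["EngineSize"] + est["EngineSize:Originnon-USA"]))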