(a) Create a new factor that categorizes the number of physician visits into three levels: 0, 1, 2, 3 or more.
phys.visit.binned <- birthwt$physician.visits
phys.visit.binned[phys.visit.binned >= 3] <- "3.or.more"
birthwt <- transform(birthwt, phys.visit.binned = as.factor(phys.visit.binned))
birthwt$phys.visit.binned
## [1] 0 3.or.more 1 2 0 0 1
## [8] 1 1 0 0 1 0 2
## [15] 0 0 0 3.or.more 0 1 2
## [22] 3.or.more 1 0 2 0 0 2
## [29] 0 1 1 1 1 1 0
## [36] 2 2 0 2 1 2 2
## [43] 1 0 0 0 3.or.more 0 2
## [50] 0 1 0 0 2 0 0
## [57] 0 0 0 0 0 2 0
## [64] 0 0 1 2 3.or.more 1 2
## [71] 0 2 1 0 0 0 1
## [78] 3.or.more 0 0 1 0 0 0
## [85] 0 0 0 0 0 1 0
## [92] 2 0 0 0 1 1 0
## [99] 0 1 1 0 0 1 0
## [106] 0 1 0 2 3.or.more 2 1
## [113] 2 1 0 1 0 0 2
## [120] 1 1 0 1 0 2 2
## [127] 1 0 1 1 0 2 0
## [134] 0 0 0 1 1 0 1
## [141] 0 0 0 1 0 2 2
## [148] 0 0 0 1 2 0 0
## [155] 0 0 3.or.more 1 0 0 0
## [162] 1 0 0 0 0 3.or.more 0
## [169] 1 0 1 0 0 0 0
## [176] 0 1 3.or.more 0 2 1 3.or.more
## [183] 0 0 2 2 0 0 3.or.more
## Levels: 0 1 2 3.or.more
Hint: One way of doing this is with mapvalues, by mapping all instances of 3, 4,… etc, to “3 or more”.
(b) Run an ANOVA to determine whether the average birth weight varies across number of physician visits.
summary(aov(birthwt.grams ~ phys.visit.binned, data = birthwt))
## Df Sum Sq Mean Sq F value Pr(>F)
## phys.visit.binned 3 2259057 753019 1.426 0.237
## Residuals 185 97710599 528165
The p-value is greater than 0.05, so the variation in birthweight across number of physician visits is not statistically significant.
Below is figure showing how Price varies with EngineSize in the Cars93, with accompanying regression lines. There are two plots, one for USA cars, and one for non-USA cars.
qplot(data = Cars93, x = EngineSize, y = Price, colour = Origin) + facet_wrap("Origin") + stat_smooth(method = "lm") + theme(legend.position="none")
(a) Use the lm()
function to regress Price on EngineSize and Origin
# Fit linear regression model
price.lm <- lm(Price ~ EngineSize + Origin, data = Cars93)
# summary of linear model
summary(price.lm)
##
## Call:
## lm(formula = Price ~ EngineSize + Origin, data = Cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.474 -4.362 -1.051 2.743 34.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.0892 2.5419 -1.215 0.227
## EngineSize 7.0637 0.7617 9.274 9.24e-15 ***
## Originnon-USA 7.7596 1.5725 4.935 3.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.948 on 90 degrees of freedom
## Multiple R-squared: 0.4939, Adjusted R-squared: 0.4826
## F-statistic: 43.91 on 2 and 90 DF, p-value: 4.925e-14
(b) Run plot()
on your lm
object. Do you see any problems?
plot(price.lm)
The residual plot shows a clear sign of non-constant variance. (The plot looks like a funnel, with variance increasing with fitted value.)
(c) Try running a linear regression with log(Price) as your outcome.
Cars93.log <- transform(Cars93, log.price = log(Price))
price.log.lm <- lm(log.price ~ EngineSize + Origin, data = Cars93.log)
(d) Run plot()
on your new lm
object. Do you see any problems?
plot(price.log.lm)
The variance now looks pretty constant across the range of fitted values, and there don’t appear to be any clear trends in the plots. All of the diagnostic plots seem pretty good. It looks like the log transformation helped here.
(e) Construct the same plot as in the beginning of this problem, this time showing log(Price) on the y-axis.
qplot(data = Cars93.log, x = EngineSize, y = log.price, colour = Origin) + facet_wrap("Origin") + stat_smooth(method = "lm") + theme(legend.position="none")
(f) The figure indicates that the slopes are different between USA and non-USA origin cars. To model this difference in slopes, we need to include an interaction term. To do this, simply write ~ EngineSize*Price
in the formula when running lm
instead of EngineSize + Price
. Is the interaction term statistically significant?
price.log.interact <- price.log.lm <- lm(log.price ~ EngineSize*Origin, data = Cars93.log)
summary(price.log.interact)
##
## Call:
## lm(formula = log.price ~ EngineSize * Origin, data = Cars93.log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44551 -0.19216 -0.01735 0.14870 1.13597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.99650 0.11636 17.158 < 2e-16 ***
## EngineSize 0.27657 0.03563 7.763 1.33e-11 ***
## Originnon-USA -0.39954 0.17982 -2.222 0.0288 *
## EngineSize:Originnon-USA 0.29904 0.06832 4.377 3.27e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2773 on 89 degrees of freedom
## Multiple R-squared: 0.6401, Adjusted R-squared: 0.628
## F-statistic: 52.77 on 3 and 89 DF, p-value: < 2.2e-16
The interaction term (EngineSize:Originnon-USA
) is positive and highly significant. This indicates that Price increases more steeply in EngineSize for non-USA cars as compared to USA cars.