---
title: "Lab 11 Solutions"
author: "Alexandra Chouldechova"
date: 
output: html_document
---

##### Remember to change the `author: ` field on this Rmd file to your own name.

### Learning objectives

> In today's Lab you will gain practice with the following concepts from today's class:
>
> - Interpreting linear regression coefficients of numeric covariates
> - Interpreting linear regression coefficients of categorical variables
> - Fitting linear regression models with interaction terms
> - Interpreting linear regression coefficients of interaction terms

We'll begin by loading some packages and importing the data.
```{r}
library(tidyverse)
```

```{r, echo = FALSE}
# Load data from MASS into a tibble and do pre-processing
birthwt <- as_tibble(MASS::birthwt) %>%
  rename(birthwt.below.2500 = low, 
         mother.age = age,
         mother.weight = lwt,
         mother.smokes = smoke,
         previous.prem.labor = ptl,
         hypertension = ht,
         uterine.irr = ui,
         physician.visits = ftv,
         birthwt.grams = bwt) %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
  mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))
```


### Interaction terms in regression

**(a)** Run a linear regression to better understand how birthweight varies with the mother's age and smoking status (do not include interaction terms).

```{r}
# Run regression model
birthwt.lm <- lm(birthwt.grams ~ mother.age + mother.smokes, data = birthwt)

# Output coefficients table
summary(birthwt.lm)
```

**(b)** What is the coefficient of mother.age in your regression?  How do you interpret this coefficient?

```{r}
coef(birthwt.lm)["mother.age"]

age.coef <- round(coef(birthwt.lm)["mother.age"], 1)
```

**Note: This solution uses inline code chunks.** The coefficient is `r age.coef`.  This means that among mothers with the same smoking status, each additional year of age is on average associated with a `r age.coef`g increase in birthweight. *However, this coefficient is not statistically significant at the 0.05 level, so the data are also consistent with there being no association between mother's age and birthweight.*
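One way to check the significance statement is to pull the p-value and a 95% confidence interval for the age coefficient directly from the fitted model. The chunk below is a minimal sketch: it rebuilds the data under the illustrative names `bw.demo` and `demo.lm` (not part of the lab) so that it runs on its own.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

# Same additive model as in part (a)
demo.lm <- lm(birthwt.grams ~ mother.age + mother.smokes, data = bw.demo)

# p-value for the age coefficient, taken from the coefficients table
age.pval <- coef(summary(demo.lm))["mother.age", "Pr(>|t|)"]

# 95% confidence interval for the age coefficient
age.ci <- confint(demo.lm, "mother.age", level = 0.95)

age.pval
age.ci
```

If the interval contains 0, the data do not rule out a zero slope with age at the 0.05 level.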

**(c)** How many coefficients are estimated for the mother's smoking status variable?  How do you interpret these coefficients?

```{r}
coef(birthwt.lm)["mother.smokesyes"]

smoke.coef <- abs(round(coef(birthwt.lm)["mother.smokesyes"], 1))
```

**Note: This solution uses inline code chunks.**  There is just one coefficient estimated, because smoking status has only two levels and `no` serves as the baseline.  This coefficient gives us the average difference in birthweight between babies born to mothers who smoke and babies born to mothers who don't, in a model that adjusts for mother's age.  That is, after we adjust for age, smoking is associated with an average `r smoke.coef`g decrease in birthweight.
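To see why a two-level factor produces a single coefficient, it can help to look at how R codes the factor in the design matrix. The sketch below rebuilds the data under the illustrative name `bw.demo` (an assumption made so the chunk is self-contained) and inspects the dummy coding.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

# The first level ("no") is the reference category
levels(bw.demo$mother.smokes)

# Treatment contrasts: a single 0/1 indicator column for "yes"
contrasts(bw.demo$mother.smokes)

# Design matrix used by lm(): intercept, age, and one smoking indicator
X <- model.matrix(birthwt.grams ~ mother.age + mother.smokes, data = bw.demo)
colnames(X)
```

A factor with k levels contributes k - 1 indicator columns under this default coding, so here there is exactly one.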

**(d)** What does the intercept mean in this model?

The intercept is the (extrapolated) estimated birth weight of a baby born to a *non-smoking* mother who is *0 years of age*.  Note that this value doesn't really make sense because mothers can't be 0 years of age when they give birth.  
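As a quick check of this interpretation, the intercept should equal the model's prediction for a non-smoking mother of age 0, which lies well outside the observed age range. The chunk below is a minimal sketch using the illustrative names `bw.demo`, `demo.lm`, and `pred0`.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

demo.lm <- lm(birthwt.grams ~ mother.age + mother.smokes, data = bw.demo)

# Prediction for a hypothetical non-smoking mother aged 0
newdata <- data.frame(
  mother.age    = 0,
  mother.smokes = factor("no", levels = c("no", "yes"))
)
pred0 <- predict(demo.lm, newdata = newdata)

# The prediction matches the intercept
all.equal(unname(pred0), unname(coef(demo.lm)["(Intercept)"]))

# Observed ages, for comparison with age 0
range(bw.demo$mother.age)
```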

**(e)** Using ggplot, construct a scatterplot with birthweight on the y-axis and mother's age on the x-axis.  Color the points by mother's smoking status, and add smoking status-specific linear regression lines using the `stat_smooth` layer.

```{r}
# ggplot2 is already attached by library(tidyverse)

# Note: fullrange = TRUE is used here to extend the 'mother.smokes = yes' line beyond the maximum age (35) in this group
ggplot(birthwt, aes(x = mother.age, y = birthwt.grams, colour = mother.smokes)) +
  geom_point() +
  stat_smooth(method = "lm", fullrange = TRUE)
```

**(f)** Do the regression lines plotted in part (e) correspond to the model you fit in part (a)?  How can you tell?

The regression lines shown here *do not correspond to the model in part (a)*.  The lines in the plot have different slopes depending on whether the mother smokes or not.  This is an interaction effect between mother's smoking status and mother's age.  By contrast, the model in part (a) allows only for different intercepts, and assumes a common slope for both smoking categories.
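The difference can also be seen numerically. When the points are grouped by colour, `stat_smooth(method = "lm")` fits a separate line to each group, which amounts to running one regression per smoking category. The sketch below (with illustrative names `bw.demo`, `fit.no`, `fit.yes`) compares those group-specific slopes with the single common slope from the additive model.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

# One regression per smoking category, as in the plot
fit.no  <- lm(birthwt.grams ~ mother.age, data = subset(bw.demo, mother.smokes == "no"))
fit.yes <- lm(birthwt.grams ~ mother.age, data = subset(bw.demo, mother.smokes == "yes"))

slopes.by.group <- c(
  no  = unname(coef(fit.no)["mother.age"]),
  yes = unname(coef(fit.yes)["mother.age"])
)

# Additive model from part (a): one slope shared by both groups
common.slope <- coef(lm(birthwt.grams ~ mother.age + mother.smokes, data = bw.demo))["mother.age"]

slopes.by.group
common.slope
```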

**(g)** Fit a linear regression model that now models potential interactions between mother's age and smoking status in their effect on birthweight.

```{r}
# Note: this is shorthand for ~ mother.smokes + mother.age + mother.smokes:mother.age
birthwt.lm.interact <- lm(birthwt.grams ~ mother.smokes*mother.age, data = birthwt)

summary(birthwt.lm.interact)
```
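In R's formula syntax, `a*b` expands to both main effects plus the `a:b` interaction. The following sketch (using the illustrative names `bw.demo`, `fit.star`, and `fit.long`) fits the model both ways and confirms that the coefficients agree.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

# Compact form: main effects and interaction via '*'
fit.star <- lm(birthwt.grams ~ mother.smokes * mother.age, data = bw.demo)

# Expanded form: main effects written out, interaction via ':'
fit.long <- lm(birthwt.grams ~ mother.smokes + mother.age + mother.smokes:mother.age,
               data = bw.demo)

# Both specifications give the same coefficients
all.equal(coef(fit.star), coef(fit.long))
names(coef(fit.star))
```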

**(h)** Interpret your model.  Is the interaction term statistically significant?  What does it mean?

```{r}
# p-value for interaction coefficient
pval.interact <- coef(summary(birthwt.lm.interact))["mother.smokesyes:mother.age", "Pr(>|t|)"]

slope.nosmoke <- coef(birthwt.lm.interact)["mother.age"]

slope.smoke <- coef(birthwt.lm.interact)["mother.age"] + coef(birthwt.lm.interact)["mother.smokesyes:mother.age"]
```

The estimated interaction term between mother's age and smoking status (estimated for mother.smokes = yes) is negative, and statistically significant at the 0.05 level (p-value = `r round(pval.interact, 3)`).  This means that the slope of birthweight with respect to age is much lower among mothers who smoke than among mothers who don't.  Indeed, among mothers who don't smoke, each additional year of age is associated with an average increase of `r round(slope.nosmoke, 1)`g in birthweight.  Among mothers who do smoke, each additional year of age is associated with an average **decrease** of `r round(abs(slope.smoke), 1)`g in birthweight.
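Another way to assess the interaction is to compare the additive model with the interaction model using `anova()`. Because the two models differ by a single coefficient, the F-test p-value should match the t-test p-value of the interaction term. The chunk below is a minimal sketch using the illustrative names `bw.demo`, `fit.add`, and `fit.int`.

```{r}
# Illustrative copy of the data, using the lab's variable names
bw.demo <- transform(
  MASS::birthwt,
  mother.smokes = factor(smoke, levels = c(0, 1), labels = c("no", "yes")),
  mother.age    = age,
  birthwt.grams = bwt
)

fit.add <- lm(birthwt.grams ~ mother.age + mother.smokes, data = bw.demo)
fit.int <- lm(birthwt.grams ~ mother.smokes * mother.age, data = bw.demo)

# F-test comparing the nested models
f.test <- anova(fit.add, fit.int)
f.pval <- f.test[2, "Pr(>F)"]

# t-test p-value for the interaction coefficient
t.pval <- coef(summary(fit.int))["mother.smokesyes:mother.age", "Pr(>|t|)"]

f.test
all.equal(f.pval, t.pval)
```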