Essentially you're saying that your data is broken down into 3 racial groups, and you want to model your data as having the same slope governing how birthweight changes with mother's age, but potentially different intercepts. Here's a picture of what's happening.

```{r}
# Calculate race-specific intercepts
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["racewhite"])

# One row per race: its own intercept, and the slope shared by all groups
lines.df <- data.frame(intercepts = intercepts,
                       slopes = rep(coef(birthwt.lm)["mother.age"], 3),
                       race = levels(birthwt$race))

qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df)
```

How do we interpret the 2 race coefficients? For categorical variables, the interpretation is relative to the given baseline. The baseline is just whatever level comes first (here, "black"). E.g., the estimate of `raceother` means that the estimated intercept is `r round(coef(birthwt.lm)["raceother"], 1)` higher among "other" race mothers compared to black mothers. Similarly, the estimated intercept is `r round(coef(birthwt.lm)["racewhite"], 1)` higher for white mothers than black mothers.

> Another way of putting it: Among mothers of the same age, babies of white mothers are born on average weighing `r round(coef(birthwt.lm)["racewhite"], 1)`g more than babies of black mothers.

##### Why is one of the levels missing in the regression?

As you've already noticed, there is no coefficient called "raceblack" in the estimated model. This is because this coefficient gets absorbed into the overall (Intercept) term.

Let's peek under the hood. Using the `model.matrix()` function on our linear model object, we can get the data matrix that underlies our regression. Here are the first 20 rows.

```{r}
head(model.matrix(birthwt.lm), 20)
```

Even though we think of the regression `birthwt.grams ~ race + mother.age` as being a regression on two variables (and an intercept), it's actually a regression on 3 variables (and an intercept). This is because the `race` variable gets represented as two dummy variables: one for `race == other` and the other for `race == white`.

Why isn't there a column for representing the indicator of `race == black`? This gets back to our collinearity issue. By definition, we have that

$$\text{raceblack} + \text{raceother} + \text{racewhite} = 1 = \text{(Intercept)}$$

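A quick numerical check of this identity, as a sketch on a small made-up data frame (the `toy` data and its values are assumptions for illustration, not the birthweight data):

```{r}
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  race = factor(c("black", "other", "white", "white", "black", "other")),
  mother.age = c(19, 24, 31, 27, 22, 35)
)

# Build all three indicator columns by hand
raceblack <- as.numeric(toy$race == "black")
raceother <- as.numeric(toy$race == "other")
racewhite <- as.numeric(toy$race == "white")

# Every row belongs to exactly one level, so the indicators sum to 1
dummy.sum <- raceblack + raceother + racewhite
all(dummy.sum == 1)

# A design matrix holding the intercept plus all three dummies is rank-deficient:
# its rank is smaller than its number of columns
X.full <- cbind(Intercept = rep(1, nrow(toy)), raceblack, raceother, racewhite)
qr(X.full)$rank
ncol(X.full)

# model.matrix() avoids the problem by leaving out the first level's dummy
X.default <- model.matrix(~ race + mother.age, data = toy)
colnames(X.default)
```
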
This is because for every observation, one and only one of the race dummy variables will equal 1. Thus the group of 4 variables {raceblack, raceother, racewhite, (Intercept)} is perfectly collinear, and we can't include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, raceblack), and to keep the rest.

#### Interaction terms

Let's go back to the regression line plot we generated above.

```{r}
qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df)
```

We have seen similar plots before by using the `geom_smooth` or `stat_smooth` commands in `ggplot`. Compare the plot above to the following.

```{r}
qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
```

In this case we have not only race-specific intercepts, but also **race-specific slopes**. The plot above corresponds to the model:

$$\text{birthwt.grams} = \beta_0 + \beta_{\text{other}}\,\text{raceother} + \beta_{\text{white}}\,\text{racewhite} + \beta_{\text{age}}\,\text{mother.age} + \beta_{\text{other:age}}\,(\text{raceother} \times \text{mother.age}) + \beta_{\text{white:age}}\,(\text{racewhite} \times \text{mother.age}) + \varepsilon$$

To specify this interaction model in R, we use the following syntax:

```{r}
birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt)
summary(birthwt.lm.interact)
```

We now have new terms appearing. Terms like `racewhite:mother.age` are deviations from the baseline slope (the coefficient of `mother.age` in the model) in the same way that terms like `racewhite` are deviations from the baseline intercept.

This model says that:

> On average among black mothers, every additional year of age is associated with a `r abs(round(coef(birthwt.lm.interact)["mother.age"], 1))`g decrease in the birthweight of the baby.

To get the slope for white mothers, we need to add the interaction term to the baseline.

slope(racewhite) = slope(raceblack) + `racewhite:mother.age` = `mother.age` + `racewhite:mother.age` = `r round(coef(birthwt.lm.interact)["mother.age"] + coef(birthwt.lm.interact)["racewhite:mother.age"], 1)`

This slope estimate is positive, which agrees with the regression plot above.
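
The same baseline-plus-deviation arithmetic gives the slope for every level of the factor. Here is a minimal sketch on simulated data; the `sim` data frame, its sample size, and the `true.slopes` values are made up for illustration and are not the birthweight data. As a sanity check, the slopes recovered from the interaction model should match the slopes from fitting a separate regression within each group.

```{r}
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n <- 150
sim <- data.frame(
  race = factor(rep(c("black", "other", "white"), each = n / 3)),
  mother.age = runif(n, 15, 40)
)
true.slopes <- c(black = -10, other = 5, white = 20)
sim$birthwt.grams <- 3000 +
  unname(true.slopes[as.character(sim$race)]) * sim$mother.age +
  rnorm(n, sd = 300)

# Interaction model: race-specific intercepts and race-specific slopes
sim.lm <- lm(birthwt.grams ~ race * mother.age, data = sim)
b <- coef(sim.lm)

# Baseline slope, plus the matching interaction term for non-baseline levels
slopes <- c(
  black = unname(b["mother.age"]),
  other = unname(b["mother.age"] + b["raceother:mother.age"]),
  white = unname(b["mother.age"] + b["racewhite:mother.age"])
)
slopes

# Cross-check: slope from a separate regression within each race group
separate <- sapply(split(sim, sim$race), function(d) {
  unname(coef(lm(birthwt.grams ~ mother.age, data = d))["mother.age"])
})
all.equal(unname(slopes), unname(separate))
```

The point estimates agree because a model with a full set of `race` by `mother.age` interactions fits the same lines as separate per-group regressions. The standard errors generally differ, since the interaction model pools the residual variance across groups.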