2x2 tables, j x k tables

ANOVA

- Linear regression
- Fitting linear regression models in R
- Diagnostic plots
- Interpreting regression coefficients
- Testing significance of factor variables

Letâ€™s begin by loading the packages weâ€™ll need to get started

`## â”€â”€ Attaching packages â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ tidyverse 1.2.1 â”€â”€`

```
## âœ” ggplot2 3.2.1 âœ” purrr 0.3.3
## âœ” tibble 2.1.3 âœ” dplyr 0.8.3
## âœ” tidyr 1.0.0 âœ” stringr 1.4.0
## âœ” readr 1.3.1 âœ” forcats 0.4.0
```

```
## â”€â”€ Conflicts â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ tidyverse_conflicts() â”€â”€
## âœ– dplyr::filter() masks stats::filter()
## âœ– dplyr::lag() masks stats::lag()
```

Weâ€™ll begin by doing all the same data processing as in previous lectures

```
# Load data from MASS into a tibble
birthwt <- as_tibble(MASS::birthwt)
# Rename variables
birthwt <- birthwt %>%
rename(birthwt.below.2500 = low,
mother.age = age,
mother.weight = lwt,
mother.smokes = smoke,
previous.prem.labor = ptl,
hypertension = ht,
uterine.irr = ui,
physician.visits = ftv,
birthwt.grams = bwt)
# Change factor level names
birthwt <- birthwt %>%
mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
~ recode_factor(.x, `0` = "no", `1` = "yes"))
```

Hereâ€™s an example of a 2 x 2 table that we might want to run a test on. This one looks at low birthweight broken down by motherâ€™s smoking status. You can think of it as another approach to the t-test problem, this time looking at indicators of low birth weight instead of the actual weights.

First, letâ€™s build our table using the `table()`

function (we did this back in Lecture 5)

```
## mother.smokes
## birthwt.below.2500 no yes
## no 86 44
## yes 29 30
```

When we have this type of tabular data we are interested in testing: Is there an association between the row variables and the column variables? i.e., If I tell you whether the mother smoked during pregnancy, does that inform you as to whether the baby is more or less likely to be born with low birthweight? Conversely, if I tell you the baby was both with low birthweight, does that inform you as to the motherâ€™s likely smoking status during pregnancy?

To test for significance, we just need to pass our 2 x 2 table into the appropriate function. Hereâ€™s the result of using fisherâ€™s exact test by calling `fisher.test`

```
##
## Fisher's Exact Test for Count Data
##
## data: weight.smoke.tbl
## p-value = 0.03618
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.028780 3.964904
## sample estimates:
## odds ratio
## 2.014137
```

```
## $names
## [1] "p.value" "conf.int" "estimate" "null.value" "alternative"
## [6] "method" "data.name"
##
## $class
## [1] "htest"
```

The fisher test is a formal test of independence. The null is that the row variable and column variable are statistically independent. Instead of testing differences in means, the natural thing to test in 2x2 tables is whether the **odds ratio** is equal to 1. Let \(p_{s} = P(\text{low birthweight} \mid \text{smoke})\) (the *probability* of low birthweight among smoking mothers) and let \(p_{n} = P(\text{low birthweight} \mid \text{doesn't smoke})\) (the probability of low birthweight among nonsmoking mothers). The **odds** of low weight among smoking mothers is defined as:

\[
\mathrm{odds}_s = \frac{p_s}{1 - p_s}
\] The **odd ratio** (OR) summarizes the association between smoking and low birthweight. It is defined as the odds of low birthweight under smoking divided by the odds under not smoking.

\[ \mathrm{OR} = \frac{p_s}{1 - p_s} \Big/ \frac{p_n}{1 - p_n} \]

**Interpretation**: The odds of low birth weight are 2.01 times greater when the mother smokes than when the mother does not smoke.

You can also use the chi-squared test via the `chisq.test`

function. This is the test that you may be more familiar with from your statistics class.

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: weight.smoke.tbl
## X-squared = 4.2359, df = 1, p-value = 0.03958
```

You get essentially the same answer by running the chi-squared test, but the output isnâ€™t as useful. In particular, youâ€™re not getting an estimate or confidence interval for the odds ratio. This is why I prefer `fisher.exact()`

for testing 2 x 2 tables.

Hereâ€™s a small data set on party affiliation broken down by gender.

```
# Manually enter the data
politics <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(politics) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
politics # display the data
```

```
## party
## gender Democrat Independent Republican
## F 762 327 468
## M 484 239 477
```

We may be interested in asking whether men and women have different party affiliations.

The answer will be easier to guess at if we convert the rows to show proportions instead of counts. Hereâ€™s one way of doing this.

```
## party
## gender Democrat Independent Republican
## F 0.4894027 0.2100193 0.3005780
## M 0.4033333 0.1991667 0.3975000
```

By looking at the table we see that Female are more likely to be Democrats and less likely to be Republicans.

We still want to know if this difference is significant. To assess this we can use the chi-squared test (on the counts table, not the proportions table!).

```
##
## Pearson's Chi-squared test
##
## data: politics
## X-squared = 30.07, df = 2, p-value = 2.954e-07
```

There isnâ€™t really a good one-number summary for general \(j\) x \(k\) tables the way there is for 2 x 2 tables. One thing that we may want to do at this stage is to ignore the Independent category and just look at the 2 x 2 table showing the counts for the Democrat and Republican categories.

```
## party
## gender Democrat Republican
## F 762 468
## M 484 477
```

```
##
## Fisher's Exact Test for Count Data
##
## data: politics.dem.rep
## p-value = 6.806e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.347341 1.910944
## sample estimates:
## odds ratio
## 1.604345
```

We see that women have significantly higher odds of being Democrat compared to men.

It may be useful to represent the data graphically. Hereâ€™s one way of doing so with the `ggplot2`

package. Note that we plot the **proportions** not the counts.

**1.** Convert the table into something ggplot2 can process

```
## party
## gender Democrat Independent Republican
## F 0.4894027 0.2100193 0.3005780
## M 0.4033333 0.1991667 0.3975000
```

```
## # A tibble: 6 x 3
## gender party n
## <chr> <chr> <dbl>
## 1 F Democrat 0.489
## 2 M Democrat 0.403
## 3 F Independent 0.210
## 4 M Independent 0.199
## 5 F Republican 0.301
## 6 M Republican 0.398
```

**2.** Create a ggplot2 object, and plot with `geom_barplot()`

```
ggplot(politics.df, aes(x=party, y=prop, fill=gender)) +
geom_bar(position="dodge", stat="identity")
```

This figure is a nice alternative to displaying a table. One thing we might want to add is a way of gauging the statistical significance of the differences in height. Weâ€™ll do so by adding error bars.

Remember, ggplot wants everything you plot to be sitting nicely in a data frame. Hereâ€™s some code that will calculate the relevant values and do the plotting.

**1.** Get the data into a form thatâ€™s easy to work with.

```
## # A tibble: 6 x 3
## gender party n
## <chr> <chr> <dbl>
## 1 F Democrat 762
## 2 M Democrat 484
## 3 F Independent 327
## 4 M Independent 239
## 5 F Republican 468
## 6 M Republican 477
```

```
# Add a column of marginal counts
politics.df.count <- politics.df.count %>%
group_by(gender) %>%
mutate(totals = sum(n))
# print
politics.df.count
```

```
## # A tibble: 6 x 4
## # Groups: gender [2]
## gender party n totals
## <chr> <chr> <dbl> <dbl>
## 1 F Democrat 762 1557
## 2 M Democrat 484 1200
## 3 F Independent 327 1557
## 4 M Independent 239 1200
## 5 F Republican 468 1557
## 6 M Republican 477 1200
```

**2.** Calculate confidence intervals.

To calculate confidence intervals for the proportions, we can use `prop.test`

or `binom.test`

. Essentially you call these functions the numerator and denominator for the proportion.

Weâ€™ll do this with a ddply call. First, letâ€™s look at an example of how the `prop.test`

works.

```
# Suppose that we had 80 coin flips, 27 of which came up heads.
# How can we get a 95% CI for the true probability that this coin comes up H?
coin.test <- prop.test(27, 80)
coin.test
```

```
##
## 1-sample proportions test with continuity correction
##
## data: 27 out of 80, null probability 0.5
## X-squared = 7.8125, df = 1, p-value = 0.005189
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2379394 0.4528266
## sample estimates:
## p
## 0.3375
```

```
## [1] 0.2379394 0.4528266
## attr(,"conf.level")
## [1] 0.95
```

```
politics.toplot <- politics.df.count %>%
group_by(gender, party) %>%
summarize(prop = n / totals,
lower = prop.test(n, totals)$conf.int[1],
upper = prop.test(n, totals)$conf.int[2])
politics.toplot
```

```
## # A tibble: 6 x 5
## # Groups: gender [2]
## gender party prop lower upper
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 F Democrat 0.489 0.464 0.515
## 2 F Independent 0.210 0.190 0.231
## 3 F Republican 0.301 0.278 0.324
## 4 M Democrat 0.403 0.376 0.432
## 5 M Independent 0.199 0.177 0.223
## 6 M Republican 0.398 0.370 0.426
```

**3.** Combine the confidence intervals into the data frame

**4.** Use `ggplot()`

, `geom_bar()`

and `geom_errorbar()`

to construct the plots

You can think of ANOVA (analysis of variance) as a more general version of the t-test, or a special case of linear regression in which all covariates are factors.

Letâ€™s go back to our favourite birthwt data set from the MASS library.

**Question: Is there a significant association between race and birthweight?**

Hereâ€™s a table showing the mean and standard error of birthweight by race.

```
birthwt %>%
group_by(race) %>%
summarize(mean.bwt = round(mean(birthwt.grams), 0),
se.bwt = round(sd(birthwt.grams) / sqrt(n()), 0))
```

```
## # A tibble: 3 x 3
## race mean.bwt se.bwt
## <fct> <dbl> <dbl>
## 1 white 3103 74
## 2 black 2720 125
## 3 other 2805 88
```

It looks like thereâ€™s some association, but we donâ€™t yet know if itâ€™s statistically significant. Note that if we had just two racial categories in our data, we could run a t-test. Since we have more than 2, we need to run a 1-way analysis of variance (ANOVA).

**Terminology**: a \(k\)-way ANOVA is used to assess whether the mean of an outcome variable is constant across all combinations of \(k\) factors. The most common examples are 1-way ANOVA (looking at a single factor), and 2-way ANOVA (looking at two factors).

Weâ€™ll use the `aov()`

function. For convenience, `aov()`

allows you to specify a formula.

```
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 5015725 2507863 4.913 0.00834 **
## Residuals 186 94953931 510505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The p-value is significant at the 0.05 level, so the data suggests that there is an association between birthweight and race. In other words, average birthweight varies across the three racial groups considered in the data.

Linear regression is just a more general form of ANOVA. It allows you to model effects of continuous variables.

linear regressionis used to model linear relationship between an outcome variable, \(y\), and a set ofcovariatesorpredictor variables\(x_1, x_2, \ldots, x_p\).

For our first example weâ€™ll look at a small data set in which weâ€™re interested in predicting the crime rate per million population based on socio-economic and demographic information at the state level.

Letâ€™s first import the data set and see what weâ€™re working with.

```
# Import data set
crime <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", delim = "\t")
```

```
## Parsed with column specification:
## cols(
## R = col_double(),
## Age = col_double(),
## S = col_double(),
## Ed = col_double(),
## Ex0 = col_double(),
## Ex1 = col_double(),
## LF = col_double(),
## M = col_double(),
## N = col_double(),
## NW = col_double(),
## U1 = col_double(),
## U2 = col_double(),
## W = col_double(),
## X = col_double()
## )
```

**The variable names that this data set comes with are very confusing, and even misleading.**

R: Crime rate: # of offenses reported to police per million population

Age: The number of males of age 14-24 per 1000 population

S: Indicator variable for Southern states (0 = No, 1 = Yes)

Ed: Mean # of years of schooling x 10 for persons of age 25 or older

Ex0: 1960 per capita expenditure on police by state and local government

Ex1: 1959 per capita expenditure on police by state and local government

LF: Labor force participation rate per 1000 civilian urban males age 14-24

M: The number of males per 1000 females

N: State population size in hundred thousands

NW: The number of non-whites per 1000 population

U1: Unemployment rate of urban males per 1000 of age 14-24

U2: Unemployment rate of urban males per 1000 of age 35-39

W: Median value of transferable goods and assets or family income in tens of $

X: The number of families per 1000 earning below 1/2 the median income

**We really need to give these variables better names**

```
# Assign more meaningful variable names, also
# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- crime %>%
rename(crime.per.million = R,
young.males = Age,
is.south = S,
average.ed = Ed,
exp.per.cap.1960 = Ex0,
exp.per.cap.1959 = Ex1,
labour.part = LF,
male.per.fem = M,
population = N,
nonwhite = NW,
unemp.youth = U1,
unemp.adult = U2,
median.assets = W,
num.low.salary = X) %>%
mutate(is.south = as.factor(is.south),
average.ed = average.ed / 10,
median.assets = median.assets / 100)
# print summary of the data
summary(crime)
```

```
## crime.per.million young.males is.south average.ed
## Min. : 34.20 Min. :119.0 0:31 Min. : 8.70
## 1st Qu.: 65.85 1st Qu.:130.0 1:16 1st Qu.: 9.75
## Median : 83.10 Median :136.0 Median :10.80
## Mean : 90.51 Mean :138.6 Mean :10.56
## 3rd Qu.:105.75 3rd Qu.:146.0 3rd Qu.:11.45
## Max. :199.30 Max. :177.0 Max. :12.20
## exp.per.cap.1960 exp.per.cap.1959 labour.part male.per.fem
## Min. : 45.0 Min. : 41.00 Min. :480.0 Min. : 934.0
## 1st Qu.: 62.5 1st Qu.: 58.50 1st Qu.:530.5 1st Qu.: 964.5
## Median : 78.0 Median : 73.00 Median :560.0 Median : 977.0
## Mean : 85.0 Mean : 80.23 Mean :561.2 Mean : 983.0
## 3rd Qu.:104.5 3rd Qu.: 97.00 3rd Qu.:593.0 3rd Qu.: 992.0
## Max. :166.0 Max. :157.00 Max. :641.0 Max. :1071.0
## population nonwhite unemp.youth unemp.adult
## Min. : 3.00 Min. : 2.0 Min. : 70.00 Min. :20.00
## 1st Qu.: 10.00 1st Qu.: 24.0 1st Qu.: 80.50 1st Qu.:27.50
## Median : 25.00 Median : 76.0 Median : 92.00 Median :34.00
## Mean : 36.62 Mean :101.1 Mean : 95.47 Mean :33.98
## 3rd Qu.: 41.50 3rd Qu.:132.5 3rd Qu.:104.00 3rd Qu.:38.50
## Max. :168.00 Max. :423.0 Max. :142.00 Max. :58.00
## median.assets num.low.salary
## Min. :2.880 Min. :126.0
## 1st Qu.:4.595 1st Qu.:165.5
## Median :5.370 Median :176.0
## Mean :5.254 Mean :194.0
## 3rd Qu.:5.915 3rd Qu.:227.5
## Max. :6.890 Max. :276.0
```

You can start by feeding everything into a regression, but itâ€™s often a better idea to construct some simple plots (e.g., scatterplots and boxplots) and summary statistics to get some sense of how the data behaves.

```
# Scatter plot of outcome (crime.per.million) against average.ed
qplot(average.ed, crime.per.million, data = crime)
```

`## [1] 0.3228349`

This seems to suggest that higher levels of average education are associated with higher crime rates. *Can you come up with an explanation for this phenomenon?*

```
# Scatter plot of outcome (crime.per.million) against median.assets
qplot(median.assets, crime.per.million, data = crime)
```

`## [1] 0.4413199`

There also appears to be a positive association between median assets and crime rates.