---
title: "Statistical significance testing"
author: "Alexandra Chouldechova"
date: ""
output: 
  html_document:
    toc: true
    toc_depth: 5
    fig_width: 5
    fig_height: 5
---

```{r}
library(knitr)
library(ggplot2)
```


#### Introduction

R makes it really easy to perform statistical and data analytic tasks. More often than not, the hardest part is not the code itself, but rather figuring out what model to fit and how to interpret the output.  This is certainly the case with hypothesis testing.  Running a t-test is easy (just type `t.test(...)`), but interpeting the output can be tricky.

The purpose of these notes is to fill in various common gaps that you may currently have in your understanding of hypothesis testing.  You should think of these notes not as *everything you need to know* but rather, *everything you need to know for the purpose of this class*.  

#### Working example: the t-test

Suppose that someone tells you they've come up with a miraculous IQ-boosting drug.  They even have the data to prove it!

```{r, echo = FALSE}
set.seed(1111)
n1 = 23
n2 = 19
# Function to generate data
generateIQData <- function(n1, n2, mean.shift = 0) {
  y <- round(110 + 10 * rnorm(n1 + n2) + c(rep(0, n1), rep(mean.shift, n2)))
  groups <- c(rep("control", n1), rep("treatment", n2))
  data.frame(iq = y, groups = groups)
}
drug.data <- generateIQData(n1, n2)
```

Here's the data they show you

```{r}
aggregate(iq ~ groups, drug.data, function(x) round(mean(x), 1))
```

Interesting... it looks like the average IQ in the group that took the drug is 4 points higher than in the control (placebo) group.  Let's think about things statistically.

First, how many people were in each group.  Sample size matters a lot.

```{r}
table(drug.data$groups)
```

That's not a very big sample size.  Let's run a t-test to assess whether the observed difference in average IQ is statistically significant.

```{r}
ttest.iq <- t.test(iq ~ groups, data = drug.data)
ttest.iq
```

We get a t-statistic of `r round(ttest.iq$statistic, 2)` and a p-value of `r round(ttest.iq$p.value, 4)`. 

#### How should we think about the results of the t-test?

**(1) Do we reject the null hypothesis?**

- We observe the p-value is greater than the standard 0.05 cutoff.  Thus we conclude that *the difference in average IQ observed between the treatment and control groups is* **not** *statistically significant at the 0.05 level.*

**(2) What does the p-value actually mean?**

- In Lecture 7 I showed you a simulation to illustrate the key property of the p-value: When the null hypothesis is true, the p-value follows the Uniform[0,1] distribution.  This is a very useful fact to keep in mind, but it seems rather abstract.  

- For the purpose of data analysis, it's more helpful to think about the p-value as you've likely seen it defined in your statistics classes:  The p-value is the probability of observing a value of the test statistic at least as large/extreme/surprising as the one we saw, assuming the null hypothesis is true.  In the IQ example, the p-value is just the probability of observing a t-statistic at least as large as $|t| =$ `r round(ttest.iq$statistic, 2)` if the drug actually had no effect on IQ.  

**(3) Why do we calculate a t-statistic?  Why can't I just look at the difference in means directly?**

- The t-statistic is a so-called "pivotal quantity": Under the null, its distribution doesn't depend on unknown quantities such as (population) means or standard deviations.  The distribution of the difference in sample averages $\hat\Delta = \bar{IQ}_{treat} - \bar{IQ}_{control}$ **does** depend on unknown parameters, even under the null.  The distribution of $\hat\Delta$ depends on the (population) standard deviation of treatment and control outcomes.  The distribution of the t-statistic depends only on the sample size, which is observable.

- That said, the t-statistic is just a normalized version of $\hat\Delta$, so in interpreting the p-value we can say:

> - The p-value is the probability that we would observe a difference in average IQ between the treatment and control group at least as large we did if the drug actually had no effect.

**(4) Can we say that the probability the drug had no effect is `r round(ttest.iq$p.value, 3)`.**

- **NO**.  In order to make statements about quantities like $P$(null is true | data), we would need to have some kind of prior probability distribution on the likelihood of the null being true.  
    - For instance, suppose that someone claimed to have a drug that raised IQ by 20 points, and they showed you data that when passed through a t-test resulted in a p-value of 0.00004.  Would you believe them?  No.  It's just not possible for such a drug to exist, no matter what they claim the data says.  In cases like this, I'd be more willing to believe that they faked the data than to believe in the existence of such a drug.  In other words, just because the p-value $P$(observe 20 point IQ difference | drug has no effect) is small, it doesn't necessarily mean that $P$(drug has no effect | observed 20 point IQ difference) is small.
    - In mathematical terms, this is just saying that $P(A \mid B) \neq P(B \mid A)$.
    - Indeed, Bayes' rule tells us that:
    $$ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
    - So if A = "drug has no effect" and B = "observe $\hat\Delta  \ge 20$", Bayes' rule tells us that
    $$ P(\text{drug has no effect} \mid \hat\Delta  \ge 20) = \frac{P(\hat\Delta  \ge 20 \mid \text{drug has no effect}) P(\text{drug has no effect})}{P(\hat\Delta  \ge 20)}$$
    - Prior experience tells me that the probability of such a drug existing is essentially 0.  Let's say... 1 in 10 million, which is 0.0000001.  Since we don't really believe that there are any drugs that could boost IQ by 20 points, we're essentially saying that $P(\hat\Delta  \ge 20) \approx  P(\hat\Delta  \ge 20 \mid \text{drug has no effect}) = 0.00004 )$.  Putting this all together, we would get that
    $$ P (\text{drug has no effect} \mid \hat \Delta \ge 20) \approx \frac{0.00004 \times (1 - 0.000000001)}{0.00004} = 0.9999999$$
    - Hence, even though our p-value is small, when we combine it with our very strong prior belief that such a drug couldn't possibly exist, we're led to the conclusion it's still overwhelmingly likely that the drug had no effect.