library(ggplot2)

Introduction

This set of supplementary notes provides further discussion of the diagnostic plots that R produces when you run the plot() function on a linear model (lm) object.

1. Residual vs. Fitted plot

The ideal case

Let’s begin by looking at the Residual-Fitted plot from a linear model fit to data that perfectly satisfy all of the standard assumptions of linear regression. What are those assumptions? In the ideal case, we expect the \(i\)th data point to be generated as:

\[y_i = \beta_0 + \beta_1x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\]

where the \(\epsilon_i\) are independent “error” terms, each with mean 0 and the same variance \(\sigma^2\).

To create an example of this type of model, let’s generate data according to

\[y_i = 3 + 0.1 x_i + \epsilon_i,\]

for \(i = 1, 2, \ldots, 1000\), where the \(x_i\) are drawn uniformly from the interval \([0, 100]\) and the \(\epsilon_i\) are independent Normal variables with mean 0 and standard deviation 3.
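
As a quick check on what this setup implies (a worked restatement of the model above, not an additional assumption): because each \(\epsilon_i\) has mean 0 and standard deviation 3, conditioning on the predictor gives

\[E[y_i \mid x_i] = 3 + 0.1 x_i, \qquad \mathrm{Var}(y_i \mid x_i) = 3^2 = 9.\]

So the points should scatter around the line \(3 + 0.1x\) with the same spread at every value of \(x\).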

Here’s code to generate this data and then regress y on x.

n <- 1000      # sample size
x <- runif(n, min = 0, max = 100)
y.good <- 3 + 0.1 * x + rnorm(n, sd = 3)

# Scatterplot of the data with regression line overlaid
qplot(x, y.good, ylab = "y", main = "Ideal regression setup") + stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# Run regression and display residual-fitted plot
lm.good <- lm(y.good ~ x)
plot(lm.good, which = 1)
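
To make it clearer what this plot displays, here is a minimal sketch that rebuilds a similar picture by hand in base R. It regenerates the simulated data so that it can run on its own. The seed value and the names fit and res are assumptions added for illustration, and the lowess() curve is only intended to approximate the smoother that plot() overlays.

```r
# Assumption: a fixed seed, added only so this sketch is reproducible
set.seed(1)

# Regenerate the ideal-case data and refit the model
n <- 1000
x <- runif(n, min = 0, max = 100)
y.good <- 3 + 0.1 * x + rnorm(n, sd = 3)
lm.good <- lm(y.good ~ x)

# The two quantities shown on the axes of the residual-fitted plot
fit <- fitted(lm.good)      # fitted values (horizontal axis)
res <- residuals(lm.good)   # residuals (vertical axis)

# Scatter the residuals against the fitted values
plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (by hand)")
abline(h = 0, lty = 2)                   # zero reference line
lines(lowess(fit, res), col = "red")     # smoothed trend of the residuals

# Numerical companions to the picture
mean(res)        # zero up to rounding error, since the model has an intercept
sigma(lm.good)   # residual standard error; should be close to the true sd of 3
```

In the ideal case the smoothed curve should stay close to the dashed zero line across the whole range of fitted values, and the vertical spread of the points should look roughly constant.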