###Packages

Let’s begin by loading the packages we’ll need to get started

library(tidyverse)
## ── Attaching packages ────────
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(knitr)

Linear regression

Linear regression is just a more general form of ANOVA, which itself is a generalized t-test. In each case, we’re assessing if and how the mean of our outcome \(y\) varies with other variables. Unlike t-tests and ANOVA, which are restricted to the case where the factors of interest are all categorical, regression allows you to also model the effects of continuous variables.

linear regression is used to model linear relationship between an outcome variable, \(y\), and a set of covariates or predictor variables \(x_1, x_2, \ldots, x_p\).

For our first example we’ll look at a small data set in which we’re interested in predicting the crime rate per million population based on socio-economic and demographic information at the state level.

Let’s first import the data set and see what we’re working with.

# Import data set
crime <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", delim = "\t")
## Parsed with column specification:
## cols(
##   R = col_double(),
##   Age = col_double(),
##   S = col_double(),
##   Ed = col_double(),
##   Ex0 = col_double(),
##   Ex1 = col_double(),
##   LF = col_double(),
##   M = col_double(),
##   N = col_double(),
##   NW = col_double(),
##   U1 = col_double(),
##   U2 = col_double(),
##   W = col_double(),
##   X = col_double()
## )

The variable names that this data set comes with are very confusing, and even misleading.

R: Crime rate: # of offenses reported to police per million population

Age: The number of males of age 14-24 per 1000 population

S: Indicator variable for Southern states (0 = No, 1 = Yes)

Ed: Mean # of years of schooling x 10 for persons of age 25 or older

Ex0: 1960 per capita expenditure on police by state and local government

Ex1: 1959 per capita expenditure on police by state and local government

LF: Labor force participation rate per 1000 civilian urban males age 14-24

M: The number of males per 1000 females

N: State population size in hundred thousands

NW: The number of non-whites per 1000 population

U1: Unemployment rate of urban males per 1000 of age 14-24

U2: Unemployment rate of urban males per 1000 of age 35-39

W: Median value of transferable goods and assets or family income in tens of $

X: The number of families per 1000 earning below 1/2 the median income

We really need to give these variables better names

# Assign more meaningful variable names, also
# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- crime %>%
  rename(crime.per.million = R,
         young.males = Age,
         is.south = S,
         average.ed = Ed,
         exp.per.cap.1960 = Ex0,
         exp.per.cap.1959 = Ex1,
         labour.part = LF,
         male.per.fem = M,
         population = N,
         nonwhite = NW,
         unemp.youth = U1,
         unemp.adult = U2,
         median.assets = W,
         num.low.salary = X) %>%
  mutate(is.south = as.factor(is.south),
         average.ed = average.ed / 10,
         median.assets = median.assets / 100)
# print summary of the data
summary(crime)
##  crime.per.million  young.males    is.south   average.ed   
##  Min.   : 34.20    Min.   :119.0   0:31     Min.   : 8.70  
##  1st Qu.: 65.85    1st Qu.:130.0   1:16     1st Qu.: 9.75  
##  Median : 83.10    Median :136.0            Median :10.80  
##  Mean   : 90.51    Mean   :138.6            Mean   :10.56  
##  3rd Qu.:105.75    3rd Qu.:146.0            3rd Qu.:11.45  
##  Max.   :199.30    Max.   :177.0            Max.   :12.20  
##  exp.per.cap.1960 exp.per.cap.1959  labour.part     male.per.fem   
##  Min.   : 45.0    Min.   : 41.00   Min.   :480.0   Min.   : 934.0  
##  1st Qu.: 62.5    1st Qu.: 58.50   1st Qu.:530.5   1st Qu.: 964.5  
##  Median : 78.0    Median : 73.00   Median :560.0   Median : 977.0  
##  Mean   : 85.0    Mean   : 80.23   Mean   :561.2   Mean   : 983.0  
##  3rd Qu.:104.5    3rd Qu.: 97.00   3rd Qu.:593.0   3rd Qu.: 992.0  
##  Max.   :166.0    Max.   :157.00   Max.   :641.0   Max.   :1071.0  
##    population        nonwhite      unemp.youth      unemp.adult   
##  Min.   :  3.00   Min.   :  2.0   Min.   : 70.00   Min.   :20.00  
##  1st Qu.: 10.00   1st Qu.: 24.0   1st Qu.: 80.50   1st Qu.:27.50  
##  Median : 25.00   Median : 76.0   Median : 92.00   Median :34.00  
##  Mean   : 36.62   Mean   :101.1   Mean   : 95.47   Mean   :33.98  
##  3rd Qu.: 41.50   3rd Qu.:132.5   3rd Qu.:104.00   3rd Qu.:38.50  
##  Max.   :168.00   Max.   :423.0   Max.   :142.00   Max.   :58.00  
##  median.assets   num.low.salary 
##  Min.   :2.880   Min.   :126.0  
##  1st Qu.:4.595   1st Qu.:165.5  
##  Median :5.370   Median :176.0  
##  Mean   :5.254   Mean   :194.0  
##  3rd Qu.:5.915   3rd Qu.:227.5  
##  Max.   :6.890   Max.   :276.0

First step: some plotting and summary statistics

You can start by feeding everything into a regression, but it’s often a better idea to construct some simple plots (e.g., scatterplots and boxplots) and summary statistics to get some sense of how the data behaves.

# Scatter plot of outcome (crime.per.million) against average.ed
qplot(average.ed, crime.per.million, data = crime)

# correlation between education and crime
with(crime, cor(average.ed, crime.per.million))
## [1] 0.3228349

This seems to suggest that higher levels of average education are associated with higher crime rates. Can you come up with an explanation for this phenomenon?

# Scatter plot of outcome (crime.per.million) against median.assets
qplot(median.assets, crime.per.million, data = crime)

# correlation between education and crime
with(crime, cor(median.assets, crime.per.million))
## [1] 0.4413199

There also appears to be a positive association between median assets and crime rates.

# Boxplots showing crime rate broken down by southern vs non-southern state
qplot(is.south, crime.per.million, geom = "boxplot", data = crime)

Constructing a regression model

To construct a linear regression model in R, we use the lm() function. You can specify the regression model in various ways. The simplest is often to use the formula specification.

The first model we fit is a regression of the outcome (crimes.per.million) against all the other variables in the data set. You can either write out all the variable names. or use the shorthand y ~ . to specify that you want to include all the variables in your regression.

crime.lm <- lm(crime.per.million ~ ., data = crime)
# Summary of the linear regression model
crime.lm
## 
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
## 
## Coefficients:
##      (Intercept)       young.males         is.south1        average.ed  
##       -6.918e+02         1.040e+00        -8.308e+00         1.802e+01  
## exp.per.cap.1960  exp.per.cap.1959       labour.part      male.per.fem  
##        1.608e+00        -6.673e-01        -4.103e-02         1.648e-01  
##       population          nonwhite       unemp.youth       unemp.adult  
##       -4.128e-02         7.175e-03        -6.017e-01         1.792e+00  
##    median.assets    num.low.salary  
##        1.374e+01         7.929e-01
summary(crime.lm)
## 
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.884 -11.923  -1.135  13.495  50.560 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.918e+02  1.559e+02  -4.438 9.56e-05 ***
## young.males       1.040e+00  4.227e-01   2.460  0.01931 *  
## is.south1        -8.308e+00  1.491e+01  -0.557  0.58117    
## average.ed        1.802e+01  6.497e+00   2.773  0.00906 ** 
## exp.per.cap.1960  1.608e+00  1.059e+00   1.519  0.13836    
## exp.per.cap.1959 -6.673e-01  1.149e+00  -0.581  0.56529    
## labour.part      -4.103e-02  1.535e-01  -0.267  0.79087    
## male.per.fem      1.648e-01  2.099e-01   0.785  0.43806    
## population       -4.128e-02  1.295e-01  -0.319  0.75196    
## nonwhite          7.175e-03  6.387e-02   0.112  0.91124    
## unemp.youth      -6.017e-01  4.372e-01  -1.376  0.17798    
## unemp.adult       1.792e+00  8.561e-01   2.093  0.04407 *  
## median.assets     1.374e+01  1.058e+01   1.298  0.20332    
## num.low.salary    7.929e-01  2.351e-01   3.373  0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.94 on 33 degrees of freedom
## Multiple R-squared:  0.7692, Adjusted R-squared:  0.6783 
## F-statistic: 8.462 on 13 and 33 DF,  p-value: 3.686e-07

R’s default is to output values in scientific notation. This can make it hard to interpret the numbers. Here’s some code that can be used to force full printout of numbers.

options(scipen=4)  # Set scipen = 0 to get back to default
summary(crime.lm)
## 
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.884 -11.923  -1.135  13.495  50.560 
## 
## Coefficients:
##                     Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)      -691.837588  155.887918  -4.438 0.0000956 ***
## young.males         1.039810    0.422708   2.460   0.01931 *  
## is.south1          -8.308313   14.911588  -0.557   0.58117    
## average.ed         18.016011    6.496504   2.773   0.00906 ** 
## exp.per.cap.1960    1.607818    1.058667   1.519   0.13836    
## exp.per.cap.1959   -0.667258    1.148773  -0.581   0.56529    
## labour.part        -0.041031    0.153477  -0.267   0.79087    
## male.per.fem        0.164795    0.209932   0.785   0.43806    
## population         -0.041277    0.129516  -0.319   0.75196    
## nonwhite            0.007175    0.063867   0.112   0.91124    
## unemp.youth        -0.601675    0.437154  -1.376   0.17798    
## unemp.adult         1.792263    0.856111   2.093   0.04407 *  
## median.assets      13.735847   10.583028   1.298   0.20332    
## num.low.salary      0.792933    0.235085   3.373   0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.94 on 33 degrees of freedom
## Multiple R-squared:  0.7692, Adjusted R-squared:  0.6783 
## F-statistic: 8.462 on 13 and 33 DF,  p-value: 0.0000003686

Looking at the p-values, it looks like num.low.salary (number of families per 1000 earning below 1/2 the median income), unemp.adult (Unemployment rate of urban males per 1000 of age 35-39), average.ed (Mean # of years of schooling 25 or older), and young.males (number of males of age 14-24 per 1000 population) are all statistically significant predictors of crime rate.

The coefficients for these predictors are all positive, so crime rates are positively associated with wealth inequality, adult unemployment rates, average education levels, and high rates of young males in the population.

Exploring the lm object

What kind of output do we get when we run a linear model (lm) in R?

# List all attributes of the linear model
attributes(crime.lm)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"
# coefficients
crime.lm$coef
##      (Intercept)      young.males        is.south1       average.ed 
##   -691.837587905      1.039809653     -8.308312889     18.016010601 
## exp.per.cap.1960 exp.per.cap.1959      labour.part     male.per.fem 
##      1.607818377     -0.667258285     -0.041031047      0.164794968 
##       population         nonwhite      unemp.youth      unemp.adult 
##     -0.041276887      0.007174688     -0.601675298      1.792262901 
##    median.assets   num.low.salary 
##     13.735847285      0.792932786

None of the attributes seem to give you p-values. Here’s what you can do to get a table that allows you to extract p-values.

# Pull coefficients element from summary(lm) object
round(summary(crime.lm)$coef, 3)
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -691.838    155.888  -4.438    0.000
## young.males         1.040      0.423   2.460    0.019
## is.south1          -8.308     14.912  -0.557    0.581
## average.ed         18.016      6.497   2.773    0.009
## exp.per.cap.1960    1.608      1.059   1.519    0.138
## exp.per.cap.1959   -0.667      1.149  -0.581    0.565
## labour.part        -0.041      0.153  -0.267    0.791
## male.per.fem        0.165      0.210   0.785    0.438
## population         -0.041      0.130  -0.319    0.752
## nonwhite            0.007      0.064   0.112    0.911
## unemp.youth        -0.602      0.437  -1.376    0.178
## unemp.adult         1.792      0.856   2.093    0.044
## median.assets      13.736     10.583   1.298    0.203
## num.low.salary      0.793      0.235   3.373    0.002

If you want a particular p-value, you can get it by doing the following

# Pull the coefficients table from summary(lm)
crime.lm.coef <- round(summary(crime.lm)$coef, 3)
# See what this gives
class(crime.lm.coef)
## [1] "matrix"
attributes(crime.lm.coef)
## $dim
## [1] 14  4
## 
## $dimnames
## $dimnames[[1]]
##  [1] "(Intercept)"      "young.males"      "is.south1"       
##  [4] "average.ed"       "exp.per.cap.1960" "exp.per.cap.1959"
##  [7] "labour.part"      "male.per.fem"     "population"      
## [10] "nonwhite"         "unemp.youth"      "unemp.adult"     
## [13] "median.assets"    "num.low.salary"  
## 
## $dimnames[[2]]
## [1] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"
crime.lm.coef["average.ed", "Pr(>|t|)"]
## [1] 0.009

The coefficients table is a matrix with named rows and columns. You can therefore access particular cells either by numeric index, or by name (as in the example above).

Plotting the lm object
plot(crime.lm)