glm
command to fit a linear regression of shares
on all the other variables in the news
data set. Print the names of the predictors whose coefficient estimates are statistically significant at the 0.05 level. Are any of the “noise” predictors statistically significant? (Recall that “noise” predictors all have variable names of the form X#.)cv.glm
command with 10-fold cross-validation to estimate the test error of the model you fit in part (a). Repeat with number of folds K = 2, 5, 10, 20, 50, 100
(use a loop).summary(news.subset)
or coef(news.subset, id = )
object to determine this.) Are the models all nested? That is, does the best model of size k-1 always a subset of the best model of size k? Do any “noise predictors” appear in any of the models?summary()
on a regsubsets object you get a bunch of useful values. Construct a plot showing R-squared on the y-axis and model size on the x-axis. Use appropriate axis labels. Does R-squared always increase as we increase the model size? Explain.news
data? Explain.news.x
and news.y
were pre-constructed in the preamble to this assignment. Use the glmnet
command to fit a Lasso to this data. Call the result news.lasso
.news.lasso
contains model fits for an entire sequence of \(\lambda\) values. Look at the news.lasso$lambda
attribute. How many \(\lambda\) values do we have model fits for?coef(news.lasso, s = )
will print out the estimated coefficients for our model at any lambda value s
. Display the coefficient estimates for the 25th value of \(\lambda\) from part (l). How many coefficients in this model are non-zero? How many of the non-zero coefficients come from “noise predictors”?plot
command on your news.lasso
object to get a regularization plot. Review the help file for plot.glmnet
to figure out how to set “norm” as the x-axis variable option, and how to add labels to the curves. In this parameterization of the x-axis, is the model fit getting more complex or less complex as the x-axis variable increases?cv.glmnet(x, y)
will perform 10-fold cross-validation on the entire sequence of models fit by glmnet(x, y)
. Use the cv.glmnet
command to perform cross-validation. Save the results in a variable called news.lasso.cv
. Run the plot()
command to get a CV error plot.news.lasso.cv
object to figure out the value of \(\lambda\) that minimizes CV error. Which value of \(\lambda\) does the 1-SE rule tell us to use? How many non-zero variables are selected by the min-CV rule and the 1-SE rule? What is the estimated CV error for both of these models? How many “noise predictors” get included in each model?Download the homework3.Rmd
file from Canvas or the course website.
Open homework3.Rmd
in RStudio.
Replace the “Your Name Here” text in the author:
field with your own name.
Supply your solutions to the homework by editing homework3.Rmd
.
When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML
, rename the R Markdown file to homework3_YourNameHere.Rmd
, and submit both the .Rmd
file and the .html
output file on Canvas (YourNameHere should be changed to your own name.)
DO NOT CHANGE ANYTHING ABOUT THIS CODE!
library(ggplot2)
library(ISLR)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(leaps) # needed for regsubsets
library(boot) # needed for cv.glm
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(scipen = 4)
# Online news share count data
set.seed(20120180)
online.news <- read.csv("http://www.andrew.cmu.edu/user/achoulde/95791/data/online_news.csv")
# subsample the data to reduce data size
num.noise <- 50
news <- data.frame(online.news,
matrix(rnorm(num.noise * nrow(online.news)),
nrow = nrow(online.news))
)
# Extract covariates matrix (for lasso)
news.x <- as.matrix(news[, -which(names(news) == "shares")])
# Extract response variable (for lasso)
news.y <- news$shares
If you want to learn more about this data set, you can have a look at the data dictionary provided here: Data dictionary for news share data.
This question walks you through a comparison of three variable selection procedures we discussed in class.
cv.glm
command with 10-fold cross-validation to estimate the test error of the model you fit in part (a). Repeat with number of folds K = 2, 5, 10, 20, 50, 100
(use a loop).# Edit me
Note: This loop may take a few minutes to run. I have supplied the argument cache = TRUE in the header to prevent the code from needing to re-execute every time you knit. This code chunk will re-execute only if the code it contains gets changed.
####
## Modify the code below as necessary to answer the question.
####
# Form a random split
rand.split <- sample(cut(1:nrow(news), breaks = 2, labels = FALSE))
# Fit model on first part
news.glm.train <- glm(shares ~ ., data = news[rand.split == 1,])
# Predict on the second part
news.glm.pred <- predict(news.glm.train, newdata = news[rand.split == 2, ])
# Calculate MSE on the second part
mean((news$shares[rand.split == 2] - news.glm.pred)^2)
## [1] 125708178
summary(news.subset)
or coef(news.subset, id = )
object to determine this.) Are the models all nested? That is, does the best model of size k-1 always a subset of the best model of size k? Do any “noise predictors” appear in any of the models?set.seed(12310)
# Get a smaller subset of the data to work with
# Use this ONLY for problem (d).
news.small <- news[sample(nrow(news), 2000), ]
# Best subset selection
news.subset <- regsubsets(shares ~ .,
data = news.small,
nbest = 1, # 1 best model for each number of predictors
nvmax = 5, # NULL for no limit on number of variables
method = "exhaustive", really.big = TRUE)
# Add code below to answer the question
NOTE: You will need to swap out news.small
for the full news
data. You should not use news.small
for anything other than part (d)
# Edit me
Note: Parts (f) - (k) all refer to the results produced by Forward stepwise selection.
# Edit me
summary()
on a regsubsets object you get a bunch of useful values. Construct a plot showing R-squared on the y-axis and model size on the x-axis. Use appropriate axis labels. Does R-squared always increase as we increase the model size? Explain.# Edit me
# Edit me
# Edit me
# Edit me
news
data? Explain.For the Lasso problems below, you may find it helpful to review the code examples in the Linear regression with glmnet vignette. Running the glmnet command
glmnet(x = X, y = y)
wherey
is your response vector andX
is your covariates matrix will fit a Lasso.
news.x
and news.y
were pre-constructed in the preamble to this assignment. Use the glmnet
command to fit a Lasso to this data. Call the result news.lasso
.# Edit me
news.lasso
contains model fits for an entire sequence of \(\lambda\) values. Look at the news.lasso$lambda
attribute. How many \(\lambda\) values do we have model fits for?# Edit me
coef(news.lasso, s = )
will print out the estimated coefficients for our model at any lambda value s
. Display the coefficient estimates for the 25th value of \(\lambda\) from part (l). How many coefficients in this model are non-zero? How many of the non-zero coefficients come from “noise predictors”?# Edit me
plot
command on your news.lasso
object to get a regularization plot. Review the help file for plot.glmnet
to figure out how to set “norm” as the x-axis variable option, and how to add labels to the curves. In this parameterization of the x-axis, is the model fit getting more complex or less complex as the x-axis variable increases?# Edit me
cv.glmnet(x, y)
will perform 10-fold cross-validation on the entire sequence of models fit by glmnet(x, y)
. Use the cv.glmnet
command to perform cross-validation. Save the results in a variable called news.lasso.cv
. Run the plot()
command to get a CV error plot.# Edit me
news.lasso.cv
object to figure out the value of \(\lambda\) that minimizes CV error. Which value of \(\lambda\) does the 1-SE rule tell us to use? How many non-zero variables are selected by the min-CV rule and the 1-SE rule? What is the estimated CV error for both of these models? How many “noise predictors” get included in each model?# Edit me