- Download the `homework2.Rmd` file from Canvas or the course website.
- Open `homework2.Rmd` in RStudio.
- Replace the "Your Name Here" text in the `author:` field with your own name.
- Supply your solutions to the homework by editing `homework2.Rmd`.
- When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to `homework2_YourNameHere.Rmd`, and submit both the `.Rmd` file and the `.html` output file on Canvas. (`YourNameHere` should be changed to your own name.)
library(ggplot2)
library(plyr)
library(ISLR)
library(MASS)
library(knitr)
library(splines)
library(gam)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(scipen = 4)
# Load bikes data
bikes <- read.csv("http://www.andrew.cmu.edu/user/achoulde/95791/data/bikes.csv", header = TRUE)
# Transform temp and atemp to degrees C instead of [0,1] scale
# Transform humidity to %
# Transform wind speed (multiply by 67, the normalizing value)
bikes <- transform(bikes,
                   temp = 47 * temp - 8,
                   atemp = 66 * atemp - 16,
                   hum = 100 * hum,
                   windspeed = 67 * windspeed)
# The mapvalues() command from the plyr library allows us to easily
# rename values in our variables. Below we use this command to change season
# from numeric codings to season names.
bikes <- transform(bikes,
                   season = mapvalues(season, c(1, 2, 3, 4),
                                      c("Winter", "Spring", "Summer", "Fall")))
This question is intended to provide you with practice on manual knot placement, and to improve your understanding of effective degrees of freedom selection for smoothing splines.
The following command loads a data frame called `spline.data` into your workspace. This question gets you to analyse `spline.data`.
con <- url("http://www.andrew.cmu.edu/user/achoulde/95791/data/splines.Rdata")
load(con)
close(con)
(a) Use `qplot` to plot the data, and use `stat_smooth` to overlay a cubic spline with 9 degrees of freedom.

# Edit me
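As a sketch of the plotting syntax only (it assumes the `ggplot2` and `splines` packages, and uses simulated stand-in data in place of the real `spline.data` loaded above):

```r
library(ggplot2)
library(splines)

# Simulated stand-in data; in the homework you would use spline.data
set.seed(1)
demo.data <- data.frame(x = seq(0, 10, length.out = 100))
demo.data$y <- sin(demo.data$x) + rnorm(100, sd = 0.3)

# Scatterplot with a 9-df cubic regression spline overlaid via stat_smooth
p <- qplot(x, y, data = demo.data) +
  stat_smooth(method = "lm", formula = y ~ bs(x, df = 9))
p
```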
(b) Use the `bs()` function together with the `lm` command to fit a cubic spline with 9 degrees of freedom. Explore the resulting object to figure out how many knots are placed, and where they are located. How many knots are placed? Where are they placed?

basis.obj <- with(spline.data, bs(x, 9))
# Edit me
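One way to inspect a `bs()` basis object is through its attributes (a sketch on stand-in data; with 9 df and cubic degree 3, `bs()` places 9 - 3 = 6 interior knots, by default at quantiles of `x`):

```r
library(splines)

# Stand-in for spline.data$x
set.seed(1)
x.demo <- runif(200)

basis.obj <- bs(x.demo, df = 9)

attr(basis.obj, "degree")          # polynomial degree (3 = cubic)
knots <- attr(basis.obj, "knots")  # interior knot locations
length(knots)                      # number of interior knots
knots                              # placed at quantiles of x by default
```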
(c) Instead of specifying the degrees of freedom to the `bs()` function, now try manually selecting knots. You should supply a `knots` vector containing 6 values. Try to pick the knots as optimally as possible. Use `qplot` and `stat_smooth` to show a plot of the data with the cubic spline with your choice of knots overlaid. Explain your choice of knot locations.

# Edit me
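The syntax for supplying knots manually looks like the following (a sketch only; the knot values here are placeholders, not a recommended choice, and the data are simulated):

```r
library(splines)

# Placeholder knot locations -- pick yours based on where the function
# appears to change behavior (e.g., sharp bends or kinks in the data)
my.knots <- c(1, 2, 3, 5, 7, 9)

# Stand-in data
set.seed(1)
x.demo <- seq(0, 10, length.out = 200)
y.demo <- sin(x.demo) + rnorm(200, sd = 0.3)

# Fit via lm; the same formula can be passed to stat_smooth()
fit.manual <- lm(y.demo ~ bs(x.demo, knots = my.knots))
```

With 6 interior knots and a cubic basis, the fit has 6 + 3 = 9 basis columns plus an intercept.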
(d) Use the `lm` function to fit two models: one using the model from part (a), and another using the model you settled on in part (c). Compare the R-squared values. Which model better fits the data?

# Edit me
(e) Use the `smooth.spline` command with `cv = TRUE` to fit a smoothing spline to the data. What degrees of freedom does the CV routine select for the smoothing spline? How does this compare to the degrees of freedom of your model from part (c)?

# Edit me
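A sketch of the `smooth.spline` call and where to find the selected effective degrees of freedom (using simulated stand-in data):

```r
# smooth.spline() is in the base stats package
set.seed(1)
x.demo <- seq(0, 10, length.out = 200)
y.demo <- sin(x.demo) + rnorm(200, sd = 0.3)

# cv = TRUE requests ordinary leave-one-out CV (rather than the
# default generalized CV) for choosing the smoothing parameter
fit.ss <- smooth.spline(x.demo, y.demo, cv = TRUE)
fit.ss$df   # effective degrees of freedom chosen by CV
```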
(f) Use the `smooth.spline` command with `cv = TRUE` to fit a smoothing spline to the first half of the data (`x <= 5.0`). What degrees of freedom does the CV routine select for this smoothing spline?

# Edit me
(g) Repeat part (f), this time fitting a smoothing spline to the second half of the data (`x > 5.0`). How does the optimal choice for the second half of the data compare to the optimal choice for the first half? Are they very different? Can you explain what's happening?

# Edit me
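The subsetting for the two halves can be sketched as follows (stand-in data; in the homework you would subset `spline.data`):

```r
# Stand-in data frame with the same column names assumed for spline.data
set.seed(1)
demo.data <- data.frame(x = seq(0, 10, length.out = 200))
demo.data$y <- sin(demo.data$x) + rnorm(200, sd = 0.3)

# Fit a CV-tuned smoothing spline separately to each half of the x range
fit.lo <- with(subset(demo.data, x <= 5.0), smooth.spline(x, y, cv = TRUE))
fit.hi <- with(subset(demo.data, x > 5.0),  smooth.spline(x, y, cv = TRUE))
c(fit.lo$df, fit.hi$df)   # compare the selected effective df
```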
This problem asks you to code up your own cross-validation routine that will produce \(K\)-fold CV error estimates for polynomial regression, regression splines, and smoothing splines.
You should code up a function called `smoothCV` that takes the following inputs.
Inputs:
| Argument | Description |
|---|---|
| `x` | a vector giving the values of a predictor variable |
| `y` | a vector giving the values of the response variable |
| `K` | the number of folds to use in the validation routine |
| `df.min` | the smallest number of degrees of freedom to consider |
| `df.max` | the largest number of degrees of freedom to consider |
`smoothCV` should return the following output.

Output:

Your function should return a `data.frame` object giving the \(K\)-fold error estimates for polynomial regression, cubic splines, and smoothing splines, with the degrees of freedom ranging from `df.min` to `df.max`. The data frame should have three columns: `df`, `method`, and `cv.error`.
Sample output:
df method cv.error
1 poly 25.4
1 cubic.spline NA
1 smoothing.spline NA
2 poly 21.1
2 cubic.spline NA
2 smoothing.spline 20.0
3 poly 15.2
3 cubic.spline 15.2
3 smoothing.spline 16.1
Note: In the example above, we had `df.min = 1` and `df.max = 3`. We saw in lecture that a cubic spline with \(K\) interior knots has \(K + 3\) degrees of freedom. Thus we cannot form a cubic spline with `df` of 1 or 2. Similarly, the `smooth.spline()` fitting function in R requires that `df > 1`. If the given method cannot be fit at the specified degrees of freedom, you should report the `cv.error` as `NA`, as shown above.
Note: When \(n\) is not divisible by \(K\), it will not be possible to partition the sample into \(K\) equally sized groups. You should make the groups as equally sized as possible. When the groups are of unequal size, the preferred way of calculating the average MSE is by using a weighted average. More precisely, if \(n_k\) is the number of observations in fold \(k\) and \(MSE_k\) is the MSE estimated from fold \(k\), the weighted average estimate of MSE is:
\[ CV_{K} = \sum_{k = 1}^K \frac{n_k}{n} MSE_k \]
It’s easy to check that if \(n\) is evenly divisible by \(K\) then each \(n_k = n/K\), and so the above expression reduces to the formula you saw in class: \(CV_{K} = \frac{1}{K}\sum_{k = 1}^K MSE_k\)
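The fold assignment and weighted averaging can be sketched as follows (the variable names `fold.id`, `mse.k`, `n.k` are illustrative helpers, not part of the required interface, and the per-fold MSEs are placeholder values):

```r
# Assign n observations to K nearly equal folds, then combine
# per-fold MSEs with weights n_k / n.
n <- 103
K <- 10
set.seed(1)
fold.id <- sample(rep(1:K, length.out = n))  # fold sizes differ by at most 1

# Suppose mse.k[k] holds the MSE estimated on fold k (placeholder values here)
mse.k <- runif(K, min = 10, max = 20)
n.k <- as.vector(table(fold.id))             # n_k for each fold
cv.error <- sum(n.k / n * mse.k)             # weighted average CV estimate
```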
(a) Code up `smoothCV()` according to the specification above. A function header is provided to get you started.

smoothCV <- function(x, y, K = 10, df.min = 1, df.max = 10) {
  # Your code here
}
(b) Code up a function called `plot.smoothCV()` that plots the output of `smoothCV()`.

Inputs:

| Argument | Description |
|---|---|
| `smoothcv.err` | a data frame obtained by running the `smoothCV` function |
| `K` | the number of folds used in the CV routine |
| `title.text` | the desired title for the plot |
| `y.scale.factor` | if provided, a relative upper bound on the upper y-axis limit |
Additional details:

- `smoothcv.err`: This data frame has the exact structure of the `smoothCV()` output illustrated in the preamble of this problem.
- `y.scale.factor`: You can use the `is.null(y.scale.factor)` command to test whether the user provided a value of `y.scale.factor`. If this value is non-null, you should set the y-axis limits of your plot to (`lower`, `upper`), where `lower` is the smallest CV error of any method for any choice of `df`, and `upper` is `y.scale.factor * lower`. With `ggplot2` graphics, you may use syntax such as `p + ylim(100, 2 * 100)` to set the y-axis limits to (100, 200).

Output: For the example above, if we had `K = 5`, the plot would look something like this:
plot.smoothCV <- function(smoothcv.err, K, title.text = "", y.scale.factor = NULL) {
# Your code here
}
(c) You worked with the `bikes` data on Homework 1. Use your `smoothCV` function with 10-fold cross-validation to determine the best choice of model and degrees of freedom for modeling the relationship between `cnt` and each of these inputs: `mnth`, `atemp`, `hum`, and `windspeed`. Rely on your `plot.smoothCV` plotting routine to support your choice of model for each of the inputs.

Hint: Use the `y.scale.factor` argument of your `plot.smoothCV` function wisely. If you see that a particular model's error starts to blow up as `df` increases, you should set `y.scale.factor` appropriately to prevent the extremely large error estimates from misleading you in your assessment of which model to use.
# Edit me
(d) Use the `gam` library and the models you selected in part (c) to fit an additive model of `cnt` on `mnth`, `atemp`, `hum` and `windspeed`. Use the `plot` command on your fitted `gam` object with the arguments `se = TRUE, col = 'steelblue', lwd = 3` to produce plots of the fitted curves. (See `?plot.gam` for details.)

# Edit me
# Ensure that all 4 model fits appear in the same figure
par(mfrow = c(1,4))
# Write your plot() command below this comment
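A sketch of what the fitting and plotting calls might look like. The smoother choices and df values below are placeholders (substitute the models you selected in part (c)), and a simulated stand-in data frame replaces the real `bikes` data:

```r
library(gam)      # provides gam() and the s() smoothing-spline term
library(splines)  # provides ns()

# Stand-in data frame with the same column names as bikes
set.seed(1)
demo <- data.frame(mnth = sample(1:12, 300, replace = TRUE),
                   atemp = runif(300, 0, 35),
                   hum = runif(300, 20, 100),
                   windspeed = runif(300, 0, 40))
demo$cnt <- 50 * sin(demo$mnth / 2) + 3 * demo$atemp -
  0.5 * demo$hum + rnorm(300, sd = 10)

# Placeholder smoothers/df -- use your part (c) selections instead
gam.fit <- gam(cnt ~ ns(mnth, 5) + s(atemp, 4) + s(hum, 4) + s(windspeed, 4),
               data = demo)

par(mfrow = c(1, 4))
plot(gam.fit, se = TRUE, col = "steelblue", lwd = 3)
```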
The "% deviance explained" is the Generalized Additive Model analog of R-squared. It is exactly equal to the R-squared for regression models that can be fit with both the `gam` and `lm` functions. If you have a fitted `gam` model called, say, `gam.fake`, you can calculate the % deviance explained with the following syntax:

1 - gam.fake$deviance / gam.fake$null.deviance
# Edit me
(e) Fit a linear regression model of `cnt` on the same input variables. Does the Additive Model considerably outperform the linear regression model?

# Edit me
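For a linear model, the deviance-based formula above reduces to the usual R-squared, which can be checked directly (a sketch with simulated data):

```r
# For lm fits, deviance() is the RSS and the null model's deviance
# is the TSS, so 1 - deviance/null.deviance equals R-squared.
set.seed(1)
demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
demo$y <- 2 * demo$x1 - demo$x2 + rnorm(100)

lm.fit <- lm(y ~ x1 + x2, data = demo)
dev.expl <- 1 - deviance(lm.fit) / deviance(lm(y ~ 1, data = demo))
all.equal(dev.expl, summary(lm.fit)$r.squared)   # TRUE
```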