This homework is due by 2:50PM on Thursday, January 25.
To complete this assignment, follow these steps:
  1. Download the homework1.Rmd file from Blackboard or the course website.

  2. Open homework1.Rmd in RStudio.

  3. Replace the “Your Name Here” text in the author: field with your own name.

  4. Supply your solutions to the homework by editing homework1.Rmd.

  5. When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework1_YourNameHere.Rmd, and submit both the .Rmd file and the .html output file on Blackboard. (YourNameHere should be changed to your own name.)

Homework tips:
  1. Recall the following useful RStudio hotkeys.
Keystroke      Description
<tab>          Autocompletes commands and filenames, and lists arguments for functions.
<up>           Cycles through previous commands at the console prompt.
<ctrl-up>      Lists the history of previous commands matching an unfinished one.
<ctrl-enter>   Runs the current line from the source window in the Console. Good for trying out ideas from a source file.
<ESC>          Aborts an unfinished command and gets you out of the + prompt.

Note: Shown above are the Windows/Linux keys. For Mac OS X, the <ctrl> key should be substituted with the <command> (⌘) key.

  2. Instead of sending code line-by-line with <ctrl-enter>, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the menu of the Source panel.

  3. Run your code in the Console and Knit HTML frequently to check for errors.

  4. You may find it easier to solve a problem by interacting only with the Console at first, or by creating a separate .R source file that contains only R code and no Markdown.

Introduction: Bikeshare data

library(ggplot2)
library(plyr)
library(ISLR)
library(MASS)
library(knitr)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

options(scipen = 4)

For this problem we’ll be working with two years of bikeshare data from the Capital Bikeshare system in Washington, DC. The dataset contains daily bikeshare counts, along with daily measurements of environmental and seasonal variables that may affect ridership.

Data pre-processing

Let’s start by loading the data.

bikes <- read.csv("http://www.andrew.cmu.edu/user/achoulde/95791/data/bikes.csv", header = TRUE)

# Transform temp and atemp to degrees C instead of [0,1] scale
# Transform humidity to %
# Transform wind speed (multiply by 67, the normalizing value)

bikes <- transform(bikes,
                   temp = 47 * temp - 8,
                   atemp = 66 * atemp - 16,
                   hum = 100 * hum,
                   windspeed = 67 * windspeed)

# The mapvalues() command from the plyr library allows us to easily
# rename values in our variables.  Below we use this command to change season
# from numeric codings to season names.

bikes <- transform(bikes, 
                   season = mapvalues(season, c(1,2,3,4), 
                                      c("Winter", "Spring", "Summer", "Fall")))
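
As a quick sanity check on the rescaling above, the sketch below applies the same linear maps to the endpoints and midpoint of a [0, 1] scale. The vector `normalized` is made up for illustration and is not part of the bikeshare data.

```r
# Hypothetical normalized values spanning the [0, 1] scale
normalized <- c(0, 0.5, 1)

# Same linear maps as in the transform() call above
temp_c  <- 47 * normalized - 8    # temp in degrees C
atemp_c <- 66 * normalized - 16   # atemp in degrees C
hum_pct <- 100 * normalized       # humidity in percent
wind    <- 67 * normalized        # wind speed on its original scale

# Collect the results side by side for inspection
rescaled <- data.frame(normalized, temp_c, atemp_c, hum_pct, wind)
print(rescaled)
```

Printing the rescaled endpoints is an easy way to confirm that each variable lands in a plausible physical range before you start plotting.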

Let’s look at some boxplots of how bikeshare ride count varies with season.

qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")

There’s something funny going on here. Instead of showing up in seasonal order, the seasons in the plot are showing up in alphabetical order. The following command reorders the seasons appropriately.

bikes <- transform(bikes, season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall")))

Now let’s try that plot again.

qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")
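
The reordering step relies on how R assigns factor levels. The sketch below, which uses a made-up character vector rather than the bikeshare data, shows that factor() sorts levels alphabetically by default and follows the order you supply when levels is given explicitly.

```r
# Hypothetical character vector of season names
season_chr <- c("Winter", "Spring", "Summer", "Fall", "Winter", "Fall")

# Default: levels are the sorted unique values (alphabetical here)
default_fac <- factor(season_chr)
print(levels(default_fac))

# Explicit levels: the order given is the order used in plots and models
ordered_fac <- factor(season_chr,
                      levels = c("Winter", "Spring", "Summer", "Fall"))
print(levels(ordered_fac))

# The underlying values are unchanged; only the level order differs
print(table(ordered_fac))
```

The first level of a factor is also the reference level that lm() uses for a qualitative predictor, so the level order matters for more than plot layout.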

Here’s information on what the variables mean.

Problem 1: Qualitative predictors

The season variable is an example of what’s called a qualitative or categorical predictor. In R, such variables are called factors. This problem asks you to fit a model with a qualitative predictor and to interpret the findings.

(a) Fit a linear regression model with cnt as the response and season as the input. Use the summary() and kable() commands to produce a nice-looking coefficients table.
# Edit me
(b) How many total coefficients are there in the model?
  • Your answer here.
(c) How many coefficients are estimated for the season variable?
  • Your answer here.
(d) Interpret the coefficients of season in the model.
  • Your answer here.

Hint: If you have not previously studied how to interpret qualitative variables in regressions, begin by reading through the relevant sections of the Suggested readings for the Week 1 lectures.
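
For the mechanics of producing a coefficients table, here is a minimal sketch on R's built-in PlantGrowth data rather than the bikeshare data. The choice of dataset and the rounding to 3 digits are assumptions made for illustration only.

```r
# Built-in example data: plant weight under three treatment groups
data(PlantGrowth)

# group is a factor, so lm() treats it as a qualitative predictor
fit <- lm(weight ~ group, data = PlantGrowth)

# coef(summary(fit)) is a numeric matrix with one row per coefficient and
# columns for the estimate, standard error, t value and p-value
coef_table <- coef(summary(fit))

# Render with kable() when knitr is installed; otherwise print the matrix
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(coef_table, digits = 3))
} else {
  print(round(coef_table, 3))
}
```

Inside an .Rmd code chunk you can call kable() directly, without print(), and the table will be formatted when the document is knit.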


Problem 2: Multiple linear regression

In this problem we’ll practice fitting and interpreting the results of a multiple linear regression.

(a) Fit a regression model with cnt as the response and the following variables as inputs: temp, atemp, mnth, hum, windspeed. Use the summary() and kable() commands to produce a nice-looking coefficients table.
# Edit me
(b) Interpret the coefficients of mnth, windspeed and atemp in the model.
  • Your answer here.
(c) Which predictors are associated with increased ridership? Which predictors are associated with decreased ridership?
  • Your answer here.
(d) Which predictors are statistically significant at the 0.05 level?
  • Your answer here.

Problem 3: Dealing with collinearity

As you probably already know from your most recent regression class, collinear or highly correlated predictors can make interpreting regression coefficients problematic. In this problem you will try to diagnose and address collinearity issues in the data.

(a) Use the pairs() function on the set of variables used in Problem 2 to check if any of the predictor variables are highly correlated with one another. Your pairs plot should have scatterplots above the diagonal, and correlations below the diagonal.
# Edit me

Hint: A complete example of how to use the pairs() command to construct such plots may be found here: Pairs plot example
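
As a rough sketch of the kind of plot described in part (a), the code below draws a pairs plot with correlations below the diagonal for a few columns of the built-in mtcars data. The helper name panel.cor and the choice of columns are illustrative assumptions, and this is not necessarily the same code as the linked example.

```r
# Panel function that prints the correlation between the two variables.
# panel.cor is a user-defined helper, not a base R function.
panel.cor <- function(x, y, digits = 2, ...) {
  old <- par(usr = c(0, 1, 0, 1))  # draw text in unit-square coordinates
  on.exit(par(old))                # restore the panel's coordinates on exit
  r <- cor(x, y, use = "complete.obs")
  text(0.5, 0.5, format(r, digits = digits), cex = 1.5)
}

# Built-in example data; the four columns are picked only for illustration
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Scatterplots above the diagonal (the default panel), correlations below it
pairs(vars, lower.panel = panel.cor)
```

To apply the same idea to your own data frame, pass the columns of interest to pairs() in place of vars.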

(b) Are any of the predictors highly correlated? Are you surprised that these predictors are highly correlated, or can you think of a reason for why it makes sense that they should be correlated?
  • Your answer here.
(c) Refit your regression model, but this time omit the temp variable. Display the coefficients table for this model.
# Edit me
(d) What is the coefficient of atemp in this new model? Is it very different from the atemp coefficient estimated in Problem 2? Is it statistically significant? Explain your findings.
  • Your answer here.
(e) Here’s some made-up data.
  Y   X1   X2
 16    5  -10
 10    3   -6
 22    7  -14
 -5   -2    4
 28    9  -18
 31   10  -20
-14   -5   10
  7    2   -4
-11   -4    8
Without doing any model fitting, determine the least squares coefficient estimates \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\beta_2\) in the model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]

  • Your answer here.
(f) Is your answer unique? Can you think of two other choices of \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\beta_2\) that have the same RSS? Explain what’s happening.
  • Your answer here.
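
If you want to compare candidate coefficient vectors by hand, it can help to compute the residual sum of squares directly. The sketch below defines a hypothetical helper, rss(), and applies it to a small made-up data set that is unrelated to the table in part (e).

```r
# Residual sum of squares for the model y = b0 + b1 * x1 + b2 * x2.
# rss is a hypothetical helper name, not a function from any package.
rss <- function(beta, y, x1, x2) {
  fitted <- beta[1] + beta[2] * x1 + beta[3] * x2
  sum((y - fitted)^2)
}

# Made-up illustration data (not the homework table)
x1 <- c(1, 2, 3, 4)
x2 <- c(0, 1, 0, 1)
y  <- c(2, 5, 6, 9)

# Evaluate two candidate coefficient vectors (b0, b1, b2)
print(rss(c(0, 2, 1), y, x1, x2))  # 0: these coefficients fit exactly
print(rss(c(1, 1, 1), y, x1, x2))  # 14: residuals are 0, 1, 2, 3
```

Calling the same function with the columns of any data set lets you check whether two different coefficient choices give the same RSS.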

Problem 4: Exploring non-linearities

Hint: For this problem, you will find it useful to know about the jitter feature in graphics. Begin by reviewing the code at this link, and be sure to use what you feel to be an appropriate amount of jitter in your plots for (a), (b) and (c). You should not use jitter for parts (d) onward.
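
As a small illustration of jitter in ggplot2, the sketch below plots made-up data with a discrete x variable, so that many points would otherwise sit on top of one another. The data frame toy and the jitter width of 0.2 are assumptions chosen for the example, not recommended settings for the homework.

```r
library(ggplot2)

# Made-up data: a discrete x with many tied values, so points overplot
set.seed(1)
toy <- data.frame(group = rep(1:4, each = 50),
                  value = rnorm(200))

# width sets the horizontal jitter; height = 0 leaves the y values unchanged
p <- ggplot(toy, aes(x = group, y = value)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5)

print(p)
```

Try a few values of width: too little leaves the points stacked, while too much blurs the boundaries between neighboring x values.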

(a) Using ggplot2 graphics, construct a scatterplot of cnt (bikeshare count) across mnth (month of the year). Describe what you see. Does a linear relationship appear to be a good way of modeling how bikeshare count varies with month?
# Edit me
  • Your answer here.
Which do you think better describes the relationship between cnt and mnth: Polynomials, or Step Functions? Explain your answer.
  • Your answer here.
(d) Repeat parts (a) and (b) to determine appropriate degree polynomials for modeling the relationship between cnt and the other inputs: atemp, hum and windspeed. Summarize your choices. (Note: your polynomials can have different degrees for different inputs.)
# Edit me
  • Your answer here.
(e) Use your answers to parts (b) and (d) to fit a polynomial regression model that regresses cnt on polynomials in the input variables: atemp, mnth, hum, and windspeed. How does the R-squared of this model compare to the R-squared of the model you fit in Problem 3(c)?
# Edit me
  • Your answer here.
(f) What is the total number of parameters in the model you fit in part (e)? How does this compare to the number of parameters in the model fit in Problem 3(c)?
  • Your answer here.
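
For the mechanics of fitting a polynomial regression and comparing it with a linear fit, here is a minimal sketch on the built-in mtcars data. The dataset, the variables and the degrees (2 and 3) are arbitrary choices for illustration and say nothing about which degrees suit the bikeshare data.

```r
# Built-in example data, used only for illustration
data(mtcars)

# Baseline: linear in both inputs
fit_linear <- lm(mpg ~ hp + wt, data = mtcars)

# Polynomial terms via poly(); each input can have its own degree
fit_poly <- lm(mpg ~ poly(hp, 2) + poly(wt, 3), data = mtcars)

# R-squared of each fit
r2_linear <- summary(fit_linear)$r.squared
r2_poly   <- summary(fit_poly)$r.squared

# Number of estimated regression coefficients, including the intercept
n_linear <- length(coef(fit_linear))
n_poly   <- length(coef(fit_poly))

print(c(r2_linear = r2_linear, r2_poly = r2_poly))
print(c(n_linear = n_linear, n_poly = n_poly))
```

Because the polynomial model contains the linear model as a special case, its R-squared on the same data cannot be lower, which is worth keeping in mind when you weigh the gain in fit against the extra coefficients.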