Download the `homework1.Rmd` file from Blackboard or the course website.
Open `homework1.Rmd` in RStudio.
Replace the “Your Name Here” text in the `author:` field with your own name.
Supply your solutions to the homework by editing `homework1.Rmd`.
When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to `homework1_YourNameHere.Rmd`, and submit both the `.Rmd` file and the `.html` output file on Blackboard. (`YourNameHere` should be changed to your own name.)
Keystroke | Description |
---|---|
`<tab>` | Autocompletes commands and filenames, and lists arguments for functions. |
`<up>` | Cycles through previous commands in the console prompt. |
`<ctrl-up>` | Lists history of previous commands matching an unfinished one. |
`<ctrl-enter>` | Runs the current line from the source window in the Console. Good for trying out ideas from a source file. |
`<ESC>` | Aborts an unfinished command and gets you out of the + prompt. |
Note: Shown above are the Windows/Linux keys. For Mac OS X, the `<ctrl>` key should be substituted with the `<command>` (⌘) key.
Instead of sending code line-by-line with `<ctrl-enter>`, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the
Run your code in the Console and Knit HTML frequently to check for errors.
You may find it easier to solve a problem by interacting only with the Console at first, or by creating a separate `.R` source file that contains only R code and no Markdown.
Let’s start by loading the data.
bikes <- read.csv("http://www.andrew.cmu.edu/user/achoulde/95791/data/bikes.csv", header = TRUE)
# Transform temp and atemp to degrees C instead of [0,1] scale
# Transform humidity to %
# Transform wind speed (multiply by 67, the normalizing value)
bikes <- transform(bikes,
                   temp = 47 * temp - 8,
                   atemp = 66 * atemp - 16,
                   hum = 100 * hum,
                   windspeed = 67 * windspeed)
# The mapvalues() command from the plyr library allows us to easily
# rename values in our variables (this assumes library(plyr) has been loaded).
# Below we use this command to change season
# from numeric codings to season names.
bikes <- transform(bikes,
                   season = mapvalues(season, c(1,2,3,4),
                                      c("Winter", "Spring", "Summer", "Fall")))
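As a quick sanity check on the rescaling above, the following sketch applies the same kind of `transform()` call to a toy data frame. The values are made up for illustration and are not taken from `bikes.csv`. Under the stated formulas, a normalized temperature of 0 maps to -8 degrees C and a value of 1 maps to 39 degrees C.

```r
# Toy data on the normalized [0, 1] scale (hypothetical values)
toy <- data.frame(temp = c(0, 0.5, 1), hum = c(0.2, 0.5, 0.9))

# Same rescaling formulas as applied to bikes above
toy <- transform(toy,
                 temp = 47 * temp - 8,  # degrees C
                 hum  = 100 * hum)      # percent

print(toy)
```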
Let’s look at some boxplots of how bikeshare ride count varies with season.
qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")
There’s something funny going on here. Instead of showing up in seasonal order, the seasons in the plot are showing up in alphabetical order. The following command reorders the seasons appropriately.
bikes <- transform(bikes, season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall")))
Now let’s try that plot again.
qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")
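The ordering issue can be reproduced on a toy vector. By default, `factor()` sorts the levels of a character vector alphabetically; passing `levels` explicitly fixes the order. This is only a sketch on made-up values, not the `bikes` data.

```r
season_chr <- c("Winter", "Spring", "Summer", "Fall", "Spring")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(season_chr))
print(default_levels)

# Explicit levels give the seasonal order
ordered_levels <- levels(factor(season_chr,
                                levels = c("Winter", "Spring", "Summer", "Fall")))
print(ordered_levels)
```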
Here’s information on what the variables mean.
The `season` variable is an example of what’s called a qualitative or categorical predictor. In R, such variables are called factors. This problem asks you to fit a model with a qualitative predictor and to interpret the findings.
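To see how R handles a factor in a regression, it can help to look at the design matrix on a toy example. The data frame and the variable names `y` and `group` below are hypothetical. The sketch shows that `lm()` expands a factor into indicator (dummy) columns, with the first level absorbed into the intercept as the baseline.

```r
# Hypothetical data: a response and a three-level factor
toy <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# Design matrix that lm() builds behind the scenes
X <- model.matrix(y ~ group, data = toy)
print(colnames(X))

fit <- lm(y ~ group, data = toy)
print(coef(fit))
```

Each non-baseline coefficient is the difference in mean response between that level and the baseline level.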
Fit a linear regression model with `cnt` as the response and `season` as the input. Use the `summary()` and `kable()` commands to produce a nice-looking coefficients table.
# Edit me
`season` variable?
`season` in the model.
Hint: If you have not previously studied how to interpret qualitative variables in regressions, begin by reading through the relevant sections of the Suggested readings for the Week 1 lectures.
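One way to combine `summary()` and `kable()` is sketched below. It uses the built-in `mtcars` data rather than the homework data, and assumes the knitr package is installed. It is an illustration of the pattern, not the required solution.

```r
library(knitr)  # provides kable(); assumed to be installed

# Built-in mtcars data, used only for illustration
fit <- lm(mpg ~ wt, data = mtcars)

# coef(summary(fit)) extracts the coefficients matrix from the summary
coef_table <- coef(summary(fit))

# In an .Rmd chunk, kable() renders the matrix as a formatted table
kable(coef_table, digits = 3)
```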
In this problem we’ll practice fitting and interpreting the results of a multiple linear regression.
Fit a linear regression model with `cnt` as the response and the following variables as inputs: `temp`, `atemp`, `mnth`, `hum`, `windspeed`. Use the `summary()` and `kable()` commands to produce a nice-looking coefficients table.
# Edit me
`mnth`, `windspeed` and `atemp` in the model.
As you probably already know from your most recent regression class, collinear or highly correlated predictors can make interpreting regression coefficients problematic. In this problem you will try to diagnose and address collinearity issues in the data.
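The effect of collinearity on coefficient estimates can be seen in a small simulation. Everything below is simulated and hypothetical (the names `x1`, `x2`, `y` and the noise levels are arbitrary choices). The sketch compares the standard error of the `x1` coefficient with and without a near-duplicate predictor in the model.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

fit_single <- lm(y ~ x1)
fit_both   <- lm(y ~ x1 + x2)

# Standard error of the x1 coefficient in each model
se_single <- coef(summary(fit_single))["x1", "Std. Error"]
se_both   <- coef(summary(fit_both))["x1", "Std. Error"]

print(cor(x1, x2))
print(c(single = se_single, both = se_both))
```

When the two predictors are almost perfectly correlated, the model cannot separate their effects, so the standard error of each coefficient grows.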
`temp` variable. Display the coefficients table for this model.
# Edit me
`atemp` in this new model? Is it very different from the `atemp` coefficient estimated in part (b)? Is it statistically significant? Explain your findings.
Y | X1 | X2 |
---|---|---|
16 | 5 | -10 |
10 | 3 | -6 |
22 | 7 | -14 |
-5 | -2 | 4 |
28 | 9 | -18 |
31 | 10 | -20 |
-14 | -5 | 10 |
7 | 2 | -4 |
-11 | -4 | 8 |
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
Hint: For this problem, you will find it useful to know about the jitter feature in graphics. Begin by reviewing the code at this link, and be sure to use what you feel to be an appropriate amount of jitter in your plots for (a), (b) and (c). You should not use jitter for parts (d) onward.
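As a rough illustration of jitter, the sketch below plots simulated data with `ggplot2`'s `geom_jitter()`. The data frame `toy` and the jitter width of 0.25 are hypothetical choices for illustration, and this is not necessarily the approach shown at the linked page.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: an integer-valued x produces overplotted columns
toy <- data.frame(month = rep(1:12, each = 20),
                  count = rpois(240, lambda = 50))

# width controls horizontal jitter; height = 0 leaves the counts unchanged
p <- ggplot(toy, aes(x = month, y = count)) +
  geom_jitter(width = 0.25, height = 0, alpha = 0.5)
print(p)
```

Jittering only horizontally keeps the response values exact while spreading out points that share the same x value.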
Use `ggplot2`’s `stat_smooth()` overlays to try out different degree polynomial fits for modeling the relationship between `cnt` and `mnth`. Display the lowest degree polynomial fit that appears to nicely capture the trends in the data. Explain your choice.
# Edit me
Use `ggplot2`’s `stat_smooth()` overlays to try out different step functions for modeling the relationship between `cnt` and `mnth`. Display the model with the smallest number of “breaks” or “cuts” that nicely captures the trends in the data. Explain your choice.
# Edit me
`cnt` and `mnth`: Polynomials, or Step Functions? Explain your answer.
`cnt` and the other inputs: `atemp`, `hum` and `windspeed`. Summarize your choices. (Note: your polynomials can have different degrees for different inputs.)
# Edit me
`cnt` on polynomials in the input variables: `atemp`, `mnth`, `hum`, and `windspeed`. How does the R-squared of this model compare to the R-squared of the model you fit in Problem 3(d)?
# Edit me
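The `stat_smooth()` overlays mentioned in this problem can be sketched on simulated data. The data frame `toy`, the polynomial degree, and the break points below are all hypothetical choices for illustration. They show the syntax only and say nothing about which fit suits the bikeshare data.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data with a curved trend
toy <- data.frame(x = rep(1:12, each = 10))
toy$y <- 100 - (toy$x - 7)^2 + rnorm(nrow(toy), sd = 5)

base_plot <- ggplot(toy, aes(x = x, y = y)) + geom_point(alpha = 0.4)

# Polynomial fit: poly(x, 2) is a degree-2 polynomial in x
p_poly <- base_plot +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2))

# Step-function fit: cut() bins x, and lm() fits one mean per bin
p_step <- base_plot +
  stat_smooth(method = "lm", formula = y ~ cut(x, breaks = c(0, 4, 8, 12)))

print(p_poly)
print(p_step)
```

Changing the degree in `poly()` or the `breaks` vector in `cut()` is how different candidate fits can be compared visually.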