Homework 5

Problem 1: Linear regression with bikeshare data
Problem 2: Interpreting and testing linear models
Problem 3: Exploring statistical significance
Problem 4: A more robust approach to error bars
Problem 5: dlply + ldply

This homework is due by 1:20pm on Tuesday, February 27.

Problem 1: Linear regression with bikeshare data

For this problem we’ll be working with two years of bikeshare data from the Capital Bikeshare system in Washington DC. The dataset contains daily bikeshare counts, along with daily measurements on environmental and seasonal information that may affect the bikesharing.

Let’s start by loading the data.

bikes <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/bikes.csv", header = TRUE)

# Transform temp and atemp to degrees C instead of [0,1] scale
# Transform humidity to %
# Transform wind speed (multiply by 67, the normalizing value)

bikes <- transform(bikes,
                   temp = 47 * temp - 8,
                   atemp = 66 * atemp - 16,
                   hum = 100 * hum,
                   windspeed = 67 * windspeed)

Here’s information on what the variables mean.

instant: record index
- dteday : date
- season : season (1:Winter, 2:Spring, 3:Summer, 4:Fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
  - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Temperature in Celsius.
- atemp: Feeling temperature in Celsius.
- hum: Humidity (%)
- windspeed: Wind speed.
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

(a) Season: Factor or numeric?

Consider the following two regression models.

bikes.lm1 <- lm(cnt ~ yr + temp + hum + season, data = bikes)

bikes.lm2 <- lm(cnt ~ yr + temp + hum + as.factor(season), data = bikes)

What is the difference between these two models?

Replace this text with your answer. (do not delete the html tags)

(b) What is the interpretation the coefficient(s) of season in each of the models in part (a)?

Replace this text with your answer. (do not delete the html tags)

(c) Use ggplot2 graphics to construct boxplots of count for each season. Based on what you see from these plots, which model do you think makes more sense? Explain.

# Edit me

Replace this text with your answer. (do not delete the html tags)

(d) Using ggplot2 graphics, construct a scatterplot of cnt (bikeshare count) across mnth (month of the year). Describe what you see. Would a linear model be a good way of modeling how bikeshare count varies with month?

# Edit me

(e) Consider the following three models. Figures of the model fits are shown to help you interpret the models.

# Fit and plot first model
bikes.lm3 <- lm(cnt ~ mnth, data = bikes)

qplot(data = bikes, x = mnth, y = cnt) + stat_smooth(method = "lm")

# Fit and plot second model
bikes.lm4 <- lm(cnt ~ as.factor(mnth), data = bikes)

# Construct data frame that has fitted values for each month
mnth.df <- data.frame(mnth = unique(bikes$mnth))
df.lm4 <- data.frame(mnth = unique(bikes$mnth),
                     fitted = predict(bikes.lm4, newdata = mnth.df))

# Red points are the fitted values from the model
qplot(data = bikes, x = mnth, y = cnt, alpha = 0.5) + 
  guides(alpha = FALSE) + 
  geom_point(data = df.lm4, aes(x = mnth, y = fitted), colour = I("red"),
             alpha = 1, size = I(5))

# Fit and plot third model
bikes.lm5 <- lm(cnt ~ mnth + I(mnth^2), data = bikes)

qplot(data = bikes, x = mnth, y = cnt) + 
  stat_smooth(method = "lm", formula = y ~ x + I(x^2))

What are the differences between the models? How many coefficients are used to model the relationship between bikeshare count and month in each of the models?

Answer here

(f) Use the plot() function to produce diagnostic plots for bikes.lm3. Do you observe any problems?

# Edit me

(g) Produce diagnostic plots for bikes.lm4 and bikes.lm5. Are any of the problems you identified in part (f) resolved? Do any remain? Explain.

# Edit me

Problem 2: Interpreting and testing linear models

This problem once again uses the bikes data.

(a) Use the transform and mapvalues functions to map the values of the season and weathersit variables to something more interpretable. Your new weathersit variable should have levels: Clear, Cloud, Light.precip, Heavy.precip

# Edit me

(b) Fit a linear regression model with cnt as the outcome, and season, workingday, weathersit, temp, atemp and hum as covariates. Use the summary function to print a summary of the model showing the model coefficients.

Note that you may wish to code the workingday variable as a factor, though this will not change the estimated coefficient or its interpretation (workingday is already a 0-1 indicator).

# Edit me

(c) How do you interpret the coefficient of workingday in the model? What is its p-value? Is it statistically significant?

# Edit me

(d) How do you interpret the coefficient corresponding to Light.precip weather situation? What is its p-value? Is it statistically significant?

# Edit me

(e) Use the pairs function to construct a pairs plot of temp, atemp and hum. The bottom panel of your plot should show correlations (see example in Lecture 9).

# Edit me

Do you observe any strong colinearities between the variables?

# Edit me

(f) Use the update function to update your linear model by removing the temp variable. Has the coefficient of atemp changed? Has it changed a lot? Explain.

# Edit me

(g) How do you interpret the coefficient of atemp in the model from part (f)?

(h) Use the anova() function on your model from part (f) to assess whether weathersit is a statistically significant factor in the model. Interpret your findings.

# Edit me

Problem 3: Exploring statistical significance

This problem will guide you through the process of conducting a simulation to investigate statistical significance in regression.

In this setup, we’ll repeatedly simulate n observations of 3 variables: an outcome y, a covariate x1 that’s associated with y, and a covariate x2 that is unassociated with y. Our model is:

\[ y_i = 0.5 x_1 + \epsilon_i\]

where the \(\epsilon_i\) are independent normal noise variables having standard deviation 5 (i.e., Normal random variables with mean 0 and standard deviation 5).

Here’s the setup.

set.seed(12345)  # Set random number generator
n <- 200  # Number of observations
x1 <- runif(n, min = 0, max = 10) # Random covariate
x2 <- rnorm(n, 0, 10)              # Another random covariate

To generate a random realization of the outcome y, use the following command.

# Random realization of y
y <- 0.5 * x1 + rnorm(n, mean = 0, sd = 5)

Here’s are plots of that random realization of the outcome y, plotted against x1 and x2.

qplot(x = x1, y = y)

qplot(x = x2, y = y)

(a) Write code that implements the following simulation (you’ll want to use a for loop):

for 2000 simulations:
  generate a random realization of y
  fit regression of y on x1 and x2
  record the coefficient estimates, standard errors and p-values for x1 and x2

At the end you should have 2000 instances of estimated slopes, standard errors, and corresponding p-values for both x1 and x2. It’s most convenient to store these in a data frame.

# Note the cache = TRUE header here.  This tells R Markdown to store the output of this code chunk and only re-run the code when code in this chunk changes.  By caching you won't wind up re-running this code every time you knit.

# Edit me

(b) This problem has multiple parts.

Construct a histogram of the coefficient estimates for x1.
Calculate the average of the coefficient estimates for x1. Is the average close to the true value?
Calculate the average of the standard errors for the coefficient of x1. Calculate the standard deviation of the coefficient estimates for x1. Are these numbers similar?

# Edit me

Take-away from this problem: the Std. Error value in the linear model summary is an estimate of the standard deviation of the coefficient estimates.

(c) Repeat part (b) for x2.

# Edit me

(d) Construct a histogram of the p-values for the coefficient of x1. What do you see? What % of the time is the p-value significant at the 0.05 level?

# Edit me

(e) Repeat part (d) with x2. What % of the time is the p-value significant at the 0.05 level?

# Edit me

(f) Given a coefficient estimate \(\hat \beta\) and a standard error estimate \(\hat{se}(\hat\beta)\), we can construct an approximate 95% confidence interval using the “2 standard error rule”. i.e., \[ [\hat \beta - 2 \hat{se}, \, \hat \beta + 2 \hat{se}] \] is an approximate 95% confidence interval for the true unknown coefficient.

As part of your simulation you stored \(\hat \beta\) and \(\hat{se}\) values for 2000 simulation instances. Use these estimates to construct approximate confidence intervals and answer the following questions.

Question: In your simulation, what % of such confidence intervals constructed for the coefficnet of x1 actually contain the the true value of the coefficient (\(\beta_1 = 0.5\)).

Replace this text with your answer. (do not delete the html tags)

Question: In your simulation, what % of such confidence intervals constructed for the coefficient of x2 actually contain the the true value of the coefficient (\(\beta_2 = 0\)).

Replace this text with your answer. (do not delete the html tags)

Problem 4: A more robust approach to error bars

On homework 4 you completed several exercises that involved using ddply to calculate error bars via the t.test command. When the x-axis variable has one or more categories that have low counts, it’s possible for this approach to fail. Here’s an example of this issue.

# Data import & transformation
# DO NOT CHANGE THIS CODE
income.data <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/homework/nlsy97_income.csv", header=TRUE)
income.data <- subset(income.data,
                      select = c("R0536300", "T7545600", "T6656900"),
                      subset = T6656900 %in% c(1:9, 11))
income.data[income.data < 0] <- NA
income.data <- na.omit(income.data)
colnames(income.data) <- c("gender", "income", "household.size")
income.data <- transform(income.data, 
                         gender = mapvalues(gender, 1:2, c("Male", "Female")))

Let’s look at how many observations we have at every level of household size:

with(income.data, table(household.size, gender))

##               gender
## household.size Female Male
##             1     252  404
##             2     686  757
##             3     586  646
##             4     489  547
##             5     277  263
##             6     121  106
##             7      52   42
##             8      22   16
##             9      17    5
##             11      3    1

We can see that counts are very low in some of the categories. For instance, we have just 5 men in households of size 9.

(a) Construct a modified t-test function that operates in the following way (CI below stands for “confidence interval”):

Input: 2 inputs
- x: numeric vector
- group: 2-level factor vector of the same length as x
- k: integer, giving minimum group size
- flip.sign: logical. Should CI and mean be flipped?
Output: a vector with the following elements:
- lower: the lower value of the error bar (CI)
- upper: the upper value of the error bar (CI)
- is.signif: for groups of size >= k, this value should be 0 if CI overlaps 0, and 1 if it doesn’t. For groups of size < k, this value should always be 0.

If group has at least k observations in each level, return the lower and upper confidence interval values from running t.test with formula x ~ group.
If group has fewer than k observations in one or both levels of this factor, lower and upper should both be set equal to the mean difference.
If flip.sign = TRUE, the difference and confidence interval should be calculate as group2 - group1 instead of group1 - group2.

# Add comments to this function
betterTTestErrorBars <- function(x, groups, k = 5, flip.sign = TRUE) {
  # Edit me
}

(b) Modify the code below to use the betterTTestErrorBars command instead of the t.test command to produce a bar chart with error bars showing how the income gap varies with household size. Specify that the bars should be filled based on is.significant. Note: you’ll need to set eval = TRUE in the code chunk header for the code to run

# Calculate income gaps (male - female) and 95% confidence intervals
# for the gap
# MODIFY THIS CODE
gap.data.conf <- ddply(nlsy, ~ household.size, summarize, 
                       income.gap = mean(income[gender == "Male"], na.rm = TRUE) - mean(income[gender == "Female"], na.rm = TRUE),
                       upper = -t.test(income ~ gender)$conf.int[1],
                       lower = -t.test(income ~ gender)$conf.int[2],
                       is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))

# Re-order the race factor according to gap size
gap.data.conf <- transform(gap.data.conf,
                           race = reorder(race, income.gap))

# Plot, with error bars
# MODIFY THIS CODE TO WORK FOR household.size INSTEAD OF race
ggplot(data = gap.data.conf, aes(x = race, y = income.gap)) +
  geom_bar(stat = "identity") +
  xlab("Race") + 
  ylab("Income gap($)") +
  ggtitle("Income gap between men and women, by race") + 
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size=12))

Problem 5: dlply + ldply

For this problem we’ll return to the gapminder data set from Lecture 10.

gapminder <- read.delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt")

(a) Use dlply to produce a list of linear regression models, one for each country, regressing gdpPercap on year.

# Edit me

(b) Use ldply on your list from part (a) to produce a data frame displaying for each country whether the coefficient of year was significant at the 0.05 level. Your output should be a two column data frame: the first column gives the country, and the second column displays a 0 if the coefficient of year was not significant, and a 1 if it was significant.

# Edit me

(c) The following code produces a data frame giving the continent for each country.

summary.continent <- ddply(gapminder, ~ country, summarize, continent = unique(continent))

Use the merge function to merge your data frame from part (b) with summary.continent to produce a data frame that shows both the slope significance indicator and also the continent. (See Lecture 10 notes for examples of how to do this.)

# Edit me

(d) Using your data frame from part (c), produce a contingency table showing counts for each combination of slope significance and continent.

# Edit me

(e) What do you observe in the table from part (d)? What does a non-significant slope (coefficient of year) suggest about a country’s economic growth?

Replace this text with your answer. (do not delete the html tags)

Homework 5

Your Name Here

Assigned: February 15, 2018

This homework is due by 1:20pm on Tuesday, February 27.

Problem 1: Linear regression with bikeshare data

Problem 2: Interpreting and testing linear models

Problem 3: Exploring statistical significance

Problem 4: A more robust approach to error bars

Problem 5: dlply + ldply