This homework is due by 2:50PM Thursday, February 15.

Preamble: Loading packages and data

library(ggplot2)
library(ISLR)
library(MASS)
library(klaR)  # You may need to install this one
library(knitr)
library(glmnet)
library(plyr)
library(gam)

set.seed(14504008)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

options(scipen = 4)

# Adulthood data
n.obs <- 3000
age <- pmin(pmax(rnorm(n.obs, mean = 30, sd = 10), 5), 50)
is.adult <- as.numeric(age >= 18)
age.data <- data.frame(age = age, is.adult = as.factor(is.adult))
# Spam data
spamdf <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamdata.txt")
varinfo <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamids2.txt", sep=" ",  stringsAsFactors = FALSE)
is.test <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/traintest_indicator.txt", header=FALSE)
is.test <- as.integer(is.test[,1])
# Partition the spam data

# log-transform the x variable
spamdf.log <- spamdf
spamdf.log[, 1:(ncol(spamdf) - 1)] <- log(0.1 + spamdf[, 1:(ncol(spamdf) - 1)])

# Add names
colnames(spamdf.log) <- c(varinfo[,1], "is.spam")

spam.train <- subset(spamdf.log, subset = is.test == 0)
spam.test <- subset(spamdf.log, subset = is.test == 1)

Problem 1 [6 points]: Instability of Logistic regression

This question walks you through a simple example that illustrates the instability of logistic regression coefficient estimates in cases where the classes are clearly separable.

This instability can arise in practice when we have inputs \(X\) that are categorical variables with a large number of levels. In such cases, particularly when we have low cell counts, it is not uncommon for all observed outcomes in a particular category to be either all \(0\) or all \(1\). This leads the coefficient corresponding to that category to be very unstable.

(a) The age.data data frame contains information on 3000 individuals. We want to use the age variable to try to classify individuals as adults or non-adults. The outcome variable is.adult is 1 for adults and 0 for non-adults.
Following the geom_histogram(position = "fill") examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age.
# Edit me

  • Your answer here.

(b) Is this a difficult classification problem? Can you think of a simple rule that gives 100% classification accuracy for this task? Display a confusion matrix to verify that your rule works.
# Edit me

  • Your answer here.

(c) Fit a logistic regression to the data. Use the kable() command to print out a nice summary of your coefficients estimate table. Is the coefficient of age statistically significant?
# Edit me

  • Your answer here.

(d) Using a probability cutoff of 0.5, produce a confusion matrix for your logistic regression output. Calculate the mislcassification rate of the logistic regression classifier. Does the logistic regression classifier do a good job of classifying individuals as adult vs non-adult?
# Edit me

  • Your answer here.

(e) Construct a histogram of the estimated probabilities from your logistic regression. Describe what you see.
# Edit me

  • Your answer here.

(f) When we have a single predictor in logistic regression, we saw that the probability at \(X = x\) is given by

\[ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]

Let’s suppose we’ve ensured that \(\beta_0 = 0\). Then \(p(x)\) becomes

\[ p(x) = \frac{1}{1 + e^{-\beta_1 x}} \]

Suppose that we’re in an easy classification problem where the rule \(x > 0\) is a perfect clasifier. What does \(\beta_1\) have to be to ensure that \(p(x) = 1\) when \(x > 0\) and \(p(x) = 0\) when \(x < 0\)?.

  • Your answer here.

Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes

This problem introduces you to the klaR library, which provides a set of useful model fitting and visualization tools. You will also use some fitting functions from the MASS library.

You may find the tutorial at this link helpful for solving this problem.

We’re going to use Fisher’s famous iris data set for this problem. This data set comes pre-loaded with R. You can learn more about it by looking at the helpfile ?iris. It’s fair to say that everyone who has ever learned Data Mining in R has encountered the iris data at one point or another

(a) Use the lda function from the MASS library to build an LDA classifier predicting Species from the 4 measurements in the iris data. Call this fit iris.lda.
# Edit me
Explore the iris.lda object to answer the following: What are the group means and prior probabilities for each class?
# Edit me
Run the plot() command on your iris.lda object. This produces what is called a discriminant plot. When we have \(K\) possible classes, we get \(K-1\) so-called linear discriminants. You should think of these as “derived features” that provide a helpful low-dimensional representation of the data. The more spread out the classes appear in these discriminant plots, the better the LDA method performs (and vice versa). You may supply the argument col = as.numeric(iris$Species) to colour the points based on the true class label.
# Edit me
(b) Using the predict function, calculate the 3x3 confusion matrix for the lda classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?
# Edit me

  • Your answer here.

Again using the predict() function: What are the estimated posterior class probabilities for the 120th observation? You should run zapsmall() on the vector of posterior probabilities to get it to display nicely.
# Edit me
(c) Use the partimat() function from the klaR package with method = "lda" to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.
# Edit me
Two of the classes begin with the letter v, which makes the above plot hard to interpret. The following code produces a new data frame, where the Species column has been transformed according to: S = setosa, C = versicolor, N = verginica. Try constructing the plot again. Do all 2-variable combinations of the inputs do an equally good job of separating the three classes?
iris2 <- transform(iris, Species = mapvalues(Species, c("setosa", "versicolor", "virginica"),
                                             c("S", "C", "N")))
# Edit me

  • Your answer here.

(d) Using the iris2 data frame, run the partimat command again, this time with method = "qda". Does it look like allowing quadratic boundaries changes much?
# Edit me

  • Your answer here.

(e) Using the geom = "density" or geom_density() functionality in ggplot2, construct density plots for the 4 input variables. Your density plots should look similar to the ones shown for income and balance in Lecture 8. There are 3 classes in the iris data instead of two, so your plots should have 3 densities overlaid. The alpha argument may be used to change the transparency.
Based on these plots, does it look like Naive Bayes will be an effective classifier for the iris data? Explain.
# Edit me

  • Your answer here.

(f) Use the NaiveBayes() command with usekernel = TRUE to fit a Naive Bayes classifier to the iris data. Save your output as iris.nb. Produce a confusion matrix for the Naive Bayes classifier. What is the misclassification rate of Naive Bayes on this problem? How does the performance of Naive Bayes compare to that of LDA in this example?
# Edit me

  • Your answer here.

(g) What is the true class of the 120th observation? What are the estimated posterior probabilities for the 120th observation according to Naive Bayes? Are they similar to those estimated by LDA? Do LDA and Naive Bayes result in the same classification for this observation? Does either method classify this observation correctly?
# Edit me

  • Your answer here.

Problem 3 [7 points]: Additive Logistic Regression with spam data

In the preamble to this document I pre-processed spam data for you to use for this problem. You have two data sets: spam.train and spam.test. The outcome is is.spam, which is 1 if the given email is spam and 0 if it isn’t. You will use spam.train to build a spam classifier, which you will then test on the data in spam.test.

For more information on the meaning of these variables, you may refer to the variable information file here: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names

The input variables have extremely highly skewed distributions, so I applied the transformation log(0.1 + x) to every input variable. Thus the inputs are now approximately log-frequencies or log-counts instead of the raw frequencies/counts.

To answer the questions in part (a), you will find it helpful to know that this data was made publicly available by George Forman from Hewlett-Packard laboratories in Palo Alto California. These are emails from his HP Labs work account.

[2 points] (a) Fit a logistic regression model to the training data. Name your logistic regression object spam.logit. Remember that the formula specification y ~ . will regress y on every other variable in the specified data set.
Use the kable() command to produce a nice summary of the logistic regression model. Make sure you set the digits = argument appropriately so that the number of decimal places is reasonable for each column. Is increased frequency of the ! and $ characters associated with higher or lower likelihood that the email is spam?
There are several terms that are associated with decreased likelihood that this email is spam. Knowing what you do about the source of the data, pick out 3 terms with a negative coefficient where you can explain why the terms should have negative coefficients.
# Edit me

  • Your answer here.

(b) Using ggplot2 graphics, construct a single plot showing histograms of the estimated spam probabilities. Specify fill = as.factor(spam.logit$y) so that your plot colours the observations with is.spam = 1 differently from those with is.spam = 0. Does logistic regression appear to do a good job of separating spam from not spam?
# Edit me

  • Your answer here.

(c) What is the prevalence of spam in the training data? What is the prevalence of spam in the test data? Using a probability cutoff of 0.5, construct a confusion matrix for both the training data and test data. You will need to use the predict() function appropriately to get probability estimates for your test data. Look into the type argument of predict.glm() to see how to do this.
Calculate the misclassification rate for both the test data and training data. Is this a good misclassification rate, relative to the prevalence of spam in the data?
# Edit me

  • Your answer here.

(d) The code below constructs an additive formula for fitting an additive model with degree of freedom 5 smoothing splines for every input. Supply this formula to the gam command to fit a logistic additive model to the Training data. (Be sure that you are fitting a LOGISTIC additive model.) Call this fit spam.gam.
Use the plot() command to display the fitted splines for each term. You should colour the lines ‘ForestGreen’. You should also use the par(mfrow = ...) command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?
spam.formula <- formula(paste("is.spam ~ ", paste("s(", varinfo[,1], ", 4)", sep = "", collapse= " + ")))
# Edit me

  • Your answer here.

(e) Using a probability cutoff of 0.5, construct a confusion matrix to show how the logistic additive model performs on the test data. Calculate the misclassification rate. Compare this to the Test misclassification rate of the standard logistic regression model.
# Edit me

  • Your answer here.

(f) We all hate seeing spam in our inbox, but what we hate even more is when real emails wind up in the spam folder. Suppose that each time we see a spam email in our inbox, we get 1 happiness unit more sad Each time a real email winds up in the spam folder, we get 5 happiness units more sad. Our happiness is unaffected by correct classifications.
How much happiness do we lose using the Additive Model? How much happiness do we lose using the Logistic model? Report your answers as happiness loss-per-email-received. (Use the test data for your calculations)
# Edit me

  • Your answer here.