Homework 4: Classification

Preamble: Loading packages and data
Problem 1 [6 points]: Instability of Logistic regression
Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes
Problem 3 [7 points]: Additive Logistic Regression with spam data

This homework is due by 2:50PM Thursday, February 15.

Preamble: Loading packages and data

library(ggplot2)
library(ISLR)
library(MASS)
library(klaR)  # You may need to install this one
library(knitr)
library(glmnet)
library(plyr)
library(gam)

set.seed(14504008)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

options(scipen = 4)

# Adulthood data
n.obs <- 3000
age <- pmin(pmax(rnorm(n.obs, mean = 30, sd = 10), 5), 50)
is.adult <- as.numeric(age >= 18)
age.data <- data.frame(age = age, is.adult = as.factor(is.adult))

# Spam data
spamdf <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamdata.txt")
varinfo <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamids2.txt", sep=" ",  stringsAsFactors = FALSE)
is.test <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/traintest_indicator.txt", header=FALSE)
is.test <- as.integer(is.test[,1])

# Partition the spam data

# log-transform the x variable
spamdf.log <- spamdf
spamdf.log[, 1:(ncol(spamdf) - 1)] <- log(0.1 + spamdf[, 1:(ncol(spamdf) - 1)])

# Add names
colnames(spamdf.log) <- c(varinfo[,1], "is.spam")

spam.train <- subset(spamdf.log, subset = is.test == 0)
spam.test <- subset(spamdf.log, subset = is.test == 1)

Problem 1 [6 points]: Instability of Logistic regression

This question walks you through a simple example that illustrates the instability of logistic regression coefficient estimates in cases where the classes are clearly separable.

This instability can arise in practice when we have inputs $X$ that are categorical variables with a large number of levels. In such cases, particularly when we have low cell counts, it is not uncommon for all observed outcomes in a particular category to be either all $0$ or all $1$. This leads the coefficient corresponding to that category to be very unstable.

(a) The `age.data` data frame contains information on 3000 individuals. We want to use the `age` variable to try to classify individuals as adults or non-adults. The outcome variable `is.adult` is 1 for adults and 0 for non-adults.

Following the `geom_histogram(position = "fill")` examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age.

# Edit me

Your answer here.

(b) Is this a difficult classification problem? Can you think of a simple rule that gives 100% classification accuracy for this task? Display a confusion matrix to verify that your rule works.

# Edit me

Your answer here.

(c) Fit a logistic regression to the data. Use the `kable()` command to print out a nice summary of your coefficients estimate table. Is the coefficient of `age` statistically significant?

# Edit me

Your answer here.

(d) Using a probability cutoff of 0.5, produce a confusion matrix for your logistic regression output. Calculate the mislcassification rate of the logistic regression classifier. Does the logistic regression classifier do a good job of classifying individuals as adult vs non-adult?

# Edit me

Your answer here.

(e) Construct a histogram of the estimated probabilities from your logistic regression. Describe what you see.

# Edit me

Your answer here.

(f) When we have a single predictor in logistic regression, we saw that the probability at $X = x$ is given by

\[ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]

Let’s suppose we’ve ensured that $\beta_0 = 0$. Then $p(x)$ becomes

\[ p(x) = \frac{1}{1 + e^{-\beta_1 x}} \]

Suppose that we’re in an easy classification problem where the rule $x > 0$ is a perfect clasifier. What does $\beta_1$ have to be to ensure that $p(x) = 1$ when $x > 0$ and $p(x) = 0$ when $x < 0$?.

Your answer here.

Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes

This problem introduces you to the klaR library, which provides a set of useful model fitting and visualization tools. You will also use some fitting functions from the MASS library.

You may find the tutorial at this link helpful for solving this problem.

We’re going to use Fisher’s famous iris data set for this problem. This data set comes pre-loaded with R. You can learn more about it by looking at the helpfile ?iris. It’s fair to say that everyone who has ever learned Data Mining in R has encountered the iris data at one point or another

(a) Use the `lda` function from the `MASS` library to build an LDA classifier predicting `Species` from the 4 measurements in the `iris` data. Call this fit `iris.lda`.

# Edit me

Explore the `iris.lda` object to answer the following: What are the group means and prior probabilities for each class?

# Edit me

Run the `plot()` command on your `iris.lda` object. This produces what is called a discriminant plot. When we have $K$ possible classes, we get $K-1$ so-called linear discriminants. You should think of these as “derived features” that provide a helpful low-dimensional representation of the data. The more spread out the classes appear in these discriminant plots, the better the LDA method performs (and vice versa). You may supply the argument `col = as.numeric(iris$Species)` to colour the points based on the true class label.

# Edit me

(b) Using the `predict` function, calculate the 3x3 confusion matrix for the lda classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

# Edit me

Your answer here.

Again using the `predict()` function: What are the estimated posterior class probabilities for the 120th observation? You should run `zapsmall()` on the vector of posterior probabilities to get it to display nicely.

# Edit me

(c) Use the `partimat()` function from the `klaR` package with `method = "lda"` to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

# Edit me

Two of the classes begin with the letter v, which makes the above plot hard to interpret. The following code produces a new data frame, where the Species column has been transformed according to: `S = setosa`, `C = versicolor`, `N = verginica`. Try constructing the plot again. Do all 2-variable combinations of the inputs do an equally good job of separating the three classes?

iris2 <- transform(iris, Species = mapvalues(Species, c("setosa", "versicolor", "virginica"),
                                             c("S", "C", "N")))
# Edit me

Your answer here.

(d) Using the `iris2` data frame, run the `partimat` command again, this time with `method = "qda"`. Does it look like allowing quadratic boundaries changes much?

# Edit me

Your answer here.

(e) Using the `geom = "density"` or `geom_density()` functionality in `ggplot2`, construct density plots for the 4 input variables. Your density plots should look similar to the ones shown for `income` and `balance` in Lecture 8. There are 3 classes in the iris data instead of two, so your plots should have 3 densities overlaid. The `alpha` argument may be used to change the transparency.

Based on these plots, does it look like Naive Bayes will be an effective classifier for the iris data? Explain.

# Edit me

Your answer here.

(f) Use the `NaiveBayes()` command with `usekernel = TRUE` to fit a Naive Bayes classifier to the `iris` data. Save your output as `iris.nb`. Produce a confusion matrix for the Naive Bayes classifier. What is the misclassification rate of Naive Bayes on this problem? How does the performance of Naive Bayes compare to that of LDA in this example?

# Edit me

Your answer here.

(g) What is the true class of the 120th observation? What are the estimated posterior probabilities for the 120th observation according to Naive Bayes? Are they similar to those estimated by LDA? Do LDA and Naive Bayes result in the same classification for this observation? Does either method classify this observation correctly?

# Edit me

Your answer here.

Problem 3 [7 points]: Additive Logistic Regression with spam data

In the preamble to this document I pre-processed spam data for you to use for this problem. You have two data sets: spam.train and spam.test. The outcome is is.spam, which is 1 if the given email is spam and 0 if it isn’t. You will use spam.train to build a spam classifier, which you will then test on the data in spam.test.

For more information on the meaning of these variables, you may refer to the variable information file here: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names

The input variables have extremely highly skewed distributions, so I applied the transformation log(0.1 + x) to every input variable. Thus the inputs are now approximately log-frequencies or log-counts instead of the raw frequencies/counts.

To answer the questions in part (a), you will find it helpful to know that this data was made publicly available by George Forman from Hewlett-Packard laboratories in Palo Alto California. These are emails from his HP Labs work account.

[2 points] (a) Fit a logistic regression model to the training data. Name your logistic regression object `spam.logit`. Remember that the formula specification `y ~ .` will regress `y` on every other variable in the specified data set.

Use the `kable()` command to produce a nice summary of the logistic regression model. Make sure you set the `digits =` argument appropriately so that the number of decimal places is reasonable for each column. Is increased frequency of the `!` and `$` characters associated with higher or lower likelihood that the email is spam?

There are several terms that are associated with decreased likelihood that this email is spam. Knowing what you do about the source of the data, pick out 3 terms with a negative coefficient where you can explain why the terms should have negative coefficients.

# Edit me

Your answer here.

(b) Using `ggplot2` graphics, construct a single plot showing histograms of the estimated spam probabilities. Specify `fill = as.factor(spam.logit$y)` so that your plot colours the observations with `is.spam = 1` differently from those with `is.spam = 0`. Does logistic regression appear to do a good job of separating spam from not spam?

# Edit me

Your answer here.

(c) What is the prevalence of spam in the training data? What is the prevalence of spam in the test data? Using a probability cutoff of 0.5, construct a confusion matrix for both the training data and test data. You will need to use the `predict()` function appropriately to get probability estimates for your test data. Look into the `type` argument of `predict.glm()` to see how to do this.

Calculate the misclassification rate for both the test data and training data. Is this a good misclassification rate, relative to the prevalence of spam in the data?

# Edit me

Your answer here.

(d) The code below constructs an additive formula for fitting an additive model with degree of freedom 5 smoothing splines for every input. Supply this formula to the `gam` command to fit a logistic additive model to the Training data. (Be sure that you are fitting a LOGISTIC additive model.) Call this fit `spam.gam`.

Use the `plot()` command to display the fitted splines for each term. You should colour the lines ‘ForestGreen’. You should also use the `par(mfrow = ...)` command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?

spam.formula <- formula(paste("is.spam ~ ", paste("s(", varinfo[,1], ", 4)", sep = "", collapse= " + ")))
# Edit me

Your answer here.

(e) Using a probability cutoff of 0.5, construct a confusion matrix to show how the logistic additive model performs on the test data. Calculate the misclassification rate. Compare this to the Test misclassification rate of the standard logistic regression model.

# Edit me

Your answer here.

(f) We all hate seeing spam in our inbox, but what we hate even more is when real emails wind up in the spam folder. Suppose that each time we see a spam email in our inbox, we get 1 happiness unit more sad Each time a real email winds up in the spam folder, we get 5 happiness units more sad. Our happiness is unaffected by correct classifications.

How much happiness do we lose using the Additive Model? How much happiness do we lose using the Logistic model? Report your answers as happiness loss-per-email-received. (Use the test data for your calculations)

# Edit me

Your answer here.

Homework 4: Classification

Your Name Here

Assigned: February 10, 2018

This homework is due by 2:50PM Thursday, February 15.

Preamble: Loading packages and data

Problem 1 [6 points]: Instability of Logistic regression

(a) The `age.data` data frame contains information on 3000 individuals. We want to use the `age` variable to try to classify individuals as adults or non-adults. The outcome variable `is.adult` is 1 for adults and 0 for non-adults.

Following the `geom_histogram(position = "fill")` examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age.

(b) Is this a difficult classification problem? Can you think of a simple rule that gives 100% classification accuracy for this task? Display a confusion matrix to verify that your rule works.

(c) Fit a logistic regression to the data. Use the `kable()` command to print out a nice summary of your coefficients estimate table. Is the coefficient of `age` statistically significant?

(d) Using a probability cutoff of 0.5, produce a confusion matrix for your logistic regression output. Calculate the mislcassification rate of the logistic regression classifier. Does the logistic regression classifier do a good job of classifying individuals as adult vs non-adult?

(e) Construct a histogram of the estimated probabilities from your logistic regression. Describe what you see.

(f) When we have a single predictor in logistic regression, we saw that the probability at \(X = x\) is given by

Suppose that we’re in an easy classification problem where the rule \(x > 0\) is a perfect clasifier. What does \(\beta_1\) have to be to ensure that \(p(x) = 1\) when \(x > 0\) and \(p(x) = 0\) when \(x < 0\)?.

Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes

(a) Use the `lda` function from the `MASS` library to build an LDA classifier predicting `Species` from the 4 measurements in the `iris` data. Call this fit `iris.lda`.

Explore the `iris.lda` object to answer the following: What are the group means and prior probabilities for each class?

(b) Using the `predict` function, calculate the 3x3 confusion matrix for the lda classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

Again using the `predict()` function: What are the estimated posterior class probabilities for the 120th observation? You should run `zapsmall()` on the vector of posterior probabilities to get it to display nicely.

(c) Use the `partimat()` function from the `klaR` package with `method = "lda"` to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

(d) Using the `iris2` data frame, run the `partimat` command again, this time with `method = "qda"`. Does it look like allowing quadratic boundaries changes much?

Based on these plots, does it look like Naive Bayes will be an effective classifier for the iris data? Explain.

Problem 3 [7 points]: Additive Logistic Regression with spam data

[2 points] (a) Fit a logistic regression model to the training data. Name your logistic regression object `spam.logit`. Remember that the formula specification `y ~ .` will regress `y` on every other variable in the specified data set.

There are several terms that are associated with decreased likelihood that this email is spam. Knowing what you do about the source of the data, pick out 3 terms with a negative coefficient where you can explain why the terms should have negative coefficients.

Calculate the misclassification rate for both the test data and training data. Is this a good misclassification rate, relative to the prevalence of spam in the data?

Use the `plot()` command to display the fitted splines for each term. You should colour the lines ‘ForestGreen’. You should also use the `par(mfrow = ...)` command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?

(e) Using a probability cutoff of 0.5, construct a confusion matrix to show how the logistic additive model performs on the test data. Calculate the misclassification rate. Compare this to the Test misclassification rate of the standard logistic regression model.

How much happiness do we lose using the Additive Model? How much happiness do we lose using the Logistic model? Report your answers as happiness loss-per-email-received. (Use the test data for your calculations)

Homework 4: Classification

Your Name Here

Assigned: February 10, 2018

This homework is due by 2:50PM Thursday, February 15.

Preamble: Loading packages and data

Problem 1 [6 points]: Instability of Logistic regression

(a) The age.data data frame contains information on 3000 individuals. We want to use the age variable to try to classify individuals as adults or non-adults. The outcome variable is.adult is 1 for adults and 0 for non-adults.

Following the geom_histogram(position = "fill") examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age.

(b) Is this a difficult classification problem? Can you think of a simple rule that gives 100% classification accuracy for this task? Display a confusion matrix to verify that your rule works.

(c) Fit a logistic regression to the data. Use the kable() command to print out a nice summary of your coefficients estimate table. Is the coefficient of age statistically significant?

(d) Using a probability cutoff of 0.5, produce a confusion matrix for your logistic regression output. Calculate the mislcassification rate of the logistic regression classifier. Does the logistic regression classifier do a good job of classifying individuals as adult vs non-adult?

(e) Construct a histogram of the estimated probabilities from your logistic regression. Describe what you see.

(f) When we have a single predictor in logistic regression, we saw that the probability at \(X = x\) is given by

Suppose that we’re in an easy classification problem where the rule \(x > 0\) is a perfect clasifier. What does \(\beta_1\) have to be to ensure that \(p(x) = 1\) when \(x > 0\) and \(p(x) = 0\) when \(x < 0\)?.

Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes

(a) Use the lda function from the MASS library to build an LDA classifier predicting Species from the 4 measurements in the iris data. Call this fit iris.lda.

Explore the iris.lda object to answer the following: What are the group means and prior probabilities for each class?

(b) Using the predict function, calculate the 3x3 confusion matrix for the lda classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

Again using the predict() function: What are the estimated posterior class probabilities for the 120th observation? You should run zapsmall() on the vector of posterior probabilities to get it to display nicely.

(c) Use the partimat() function from the klaR package with method = "lda" to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

(d) Using the iris2 data frame, run the partimat command again, this time with method = "qda". Does it look like allowing quadratic boundaries changes much?

Based on these plots, does it look like Naive Bayes will be an effective classifier for the iris data? Explain.

Problem 3 [7 points]: Additive Logistic Regression with spam data

[2 points] (a) Fit a logistic regression model to the training data. Name your logistic regression object spam.logit. Remember that the formula specification y ~ . will regress y on every other variable in the specified data set.

There are several terms that are associated with decreased likelihood that this email is spam. Knowing what you do about the source of the data, pick out 3 terms with a negative coefficient where you can explain why the terms should have negative coefficients.

Calculate the misclassification rate for both the test data and training data. Is this a good misclassification rate, relative to the prevalence of spam in the data?

Use the plot() command to display the fitted splines for each term. You should colour the lines ‘ForestGreen’. You should also use the par(mfrow = ...) command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?

(e) Using a probability cutoff of 0.5, construct a confusion matrix to show how the logistic additive model performs on the test data. Calculate the misclassification rate. Compare this to the Test misclassification rate of the standard logistic regression model.

How much happiness do we lose using the Additive Model? How much happiness do we lose using the Logistic model? Report your answers as happiness loss-per-email-received. (Use the test data for your calculations)

(a) The `age.data` data frame contains information on 3000 individuals. We want to use the `age` variable to try to classify individuals as adults or non-adults. The outcome variable `is.adult` is 1 for adults and 0 for non-adults.

Following the `geom_histogram(position = "fill")` examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age.

(c) Fit a logistic regression to the data. Use the `kable()` command to print out a nice summary of your coefficients estimate table. Is the coefficient of `age` statistically significant?

(a) Use the `lda` function from the `MASS` library to build an LDA classifier predicting `Species` from the 4 measurements in the `iris` data. Call this fit `iris.lda`.

Explore the `iris.lda` object to answer the following: What are the group means and prior probabilities for each class?

(b) Using the `predict` function, calculate the 3x3 confusion matrix for the lda classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

Again using the `predict()` function: What are the estimated posterior class probabilities for the 120th observation? You should run `zapsmall()` on the vector of posterior probabilities to get it to display nicely.

(c) Use the `partimat()` function from the `klaR` package with `method = "lda"` to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

(d) Using the `iris2` data frame, run the `partimat` command again, this time with `method = "qda"`. Does it look like allowing quadratic boundaries changes much?

[2 points] (a) Fit a logistic regression model to the training data. Name your logistic regression object `spam.logit`. Remember that the formula specification `y ~ .` will regress `y` on every other variable in the specified data set.

Use the `plot()` command to display the fitted splines for each term. You should colour the lines ‘ForestGreen’. You should also use the `par(mfrow = ...)` command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?