library(ggplot2)
library(ISLR)
library(MASS)
library(klaR) # You may need to install this one
library(knitr)
library(glmnet)
library(plyr)
library(gam)
set.seed(14504008)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(scipen = 4)
# Adulthood data
n.obs <- 3000
age <- pmin(pmax(rnorm(n.obs, mean = 30, sd = 10), 5), 50)
is.adult <- as.numeric(age >= 18)
age.data <- data.frame(age = age, is.adult = as.factor(is.adult))
# Spam data
spamdf <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamdata.txt")
varinfo <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamids2.txt", sep=" ", stringsAsFactors = FALSE)
is.test <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/traintest_indicator.txt", header=FALSE)
is.test <- as.integer(is.test[,1])
# Partition the spam data
# log-transform the x variable
spamdf.log <- spamdf
spamdf.log[, 1:(ncol(spamdf) - 1)] <- log(0.1 + spamdf[, 1:(ncol(spamdf) - 1)])
# Add names
colnames(spamdf.log) <- c(varinfo[,1], "is.spam")
spam.train <- subset(spamdf.log, subset = is.test == 0)
spam.test <- subset(spamdf.log, subset = is.test == 1)
This question walks you through a simple example that illustrates the instability of logistic regression coefficient estimates in cases where the classes are clearly separable.
This instability can arise in practice when we have inputs \(X\) that are categorical variables with a large number of levels. In such cases, particularly when we have low cell counts, it is not uncommon for the observed outcomes in a particular category to be all \(0\) or all \(1\). This causes the coefficient estimate corresponding to that category to be very unstable.
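As a quick illustration of the phenomenon (a minimal sketch on made-up toy data, not part of the assignment), fitting `glm` to perfectly separated classes produces a warning and an enormous, unreliable slope estimate:

```r
# Toy example (hypothetical data): the classes are perfectly separated at x = 5,
# so the MLE for the slope diverges and glm() warns about fitted probabilities of 0 or 1.
x <- 1:10
y <- as.numeric(x > 5)
sep.fit <- glm(y ~ x, family = binomial)
summary(sep.fit)$coefficients   # huge slope estimate with a huge standard error
```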
The `age.data` data frame contains information on 3000 individuals. We want to use the `age` variable to try to classify individuals as adults or non-adults. The outcome variable `is.adult` is 1 for adults and 0 for non-adults.

Using the `geom_histogram(position = "fill")` examples [at this link](http://docs.ggplot2.org/0.9.3.1/geom_histogram.html), construct a conditional probability plot to show how the probability of being an adult varies with age.

# Edit me
# Edit me
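One possible starting point for the conditional probability plot (a sketch only; the binwidth and palette choices are assumptions, not part of the assignment):

```r
# Filled histogram: within each age bin, the bar shows the proportion of adults vs. non-adults
ggplot(age.data, aes(x = age, fill = is.adult)) +
  geom_histogram(position = "fill", binwidth = 1) +
  scale_fill_manual(values = cbPalette) +
  labs(y = "Proportion")
```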
Use the `kable()` command to print out a nice summary of your coefficient estimates table. Is the coefficient of `age` statistically significant?

# Edit me
# Edit me
# Edit me
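A hedged sketch of the fit-and-summarize step above; the object name `age.logit` and the `digits` setting are assumptions:

```r
# Logistic regression of adulthood on age, with the coefficient table passed to kable()
age.logit <- glm(is.adult ~ age, data = age.data, family = binomial)
kable(summary(age.logit)$coefficients, digits = 4)
```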
\[ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]
Let’s suppose we’ve ensured that \(\beta_0 = 0\). Then \(p(x)\) becomes
\[ p(x) = \frac{1}{1 + e^{-\beta_1 x}} \]
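To see why separability makes the estimate unstable, note that as \(\beta_1\) grows, \(p(x)\) approaches a step function at \(x = 0\), so once the classes are cleanly separated any sufficiently large \(\beta_1\) fits the data essentially perfectly. A small sketch of this behaviour:

```r
# p(x) = 1 / (1 + exp(-beta1 * x)) for increasingly large slopes
x.grid <- seq(-5, 5, length.out = 200)
p.fun <- function(beta1) 1 / (1 + exp(-beta1 * x.grid))
plot(x.grid, p.fun(1), type = "l", xlab = "x", ylab = "p(x)")
lines(x.grid, p.fun(5), col = "blue")
lines(x.grid, p.fun(50), col = "red")   # nearly a step function
```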
This problem introduces you to the `klaR` library, which provides a set of useful model fitting and visualization tools. You will also use some fitting functions from the `MASS` library.

You may find the tutorial at this link helpful for solving this problem.

We're going to use Fisher's famous `iris` data set for this problem. This data set comes pre-loaded with R. You can learn more about it by looking at the help file `?iris`. It's fair to say that everyone who has ever learned Data Mining in R has encountered the iris data at one point or another.
Use the `lda` function from the `MASS` library to build an LDA classifier predicting `Species` from the 4 measurements in the `iris` data. Call this fit `iris.lda`.

# Edit me
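A minimal sketch of the requested fit (the formula form is an assumption):

```r
# LDA of Species on all four measurements
iris.lda <- lda(Species ~ ., data = iris)
```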
Use your `iris.lda` object to answer the following: What are the group means and prior probabilities for each class?

# Edit me
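One way to read these off (a sketch; printing `iris.lda` also displays both):

```r
iris.lda$prior   # prior probabilities for each class
iris.lda$means   # group means of the four measurements
```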
Run the `plot()` command on your `iris.lda` object. This produces what is called a discriminant plot. When we have \(K\) possible classes, we get \(K-1\) so-called linear discriminants. You should think of these as "derived features" that provide a helpful low-dimensional representation of the data. The more spread out the classes appear in these discriminant plots, the better the LDA method performs (and vice versa). You may supply the argument `col = as.numeric(iris$Species)` to colour the points based on the true class label.

# Edit me
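A possible call, with the points coloured by the true species:

```r
plot(iris.lda, col = as.numeric(iris$Species))
```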
Using the `predict` function, calculate the 3x3 confusion matrix for the LDA classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

# Edit me
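A sketch of one way to form the confusion matrix and error rate; the object name `iris.pred` is an assumption:

```r
iris.pred <- predict(iris.lda)
table(predicted = iris.pred$class, actual = iris$Species)   # 3x3 confusion matrix
mean(iris.pred$class != iris$Species)                       # overall misclassification rate
```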
Using the output of the `predict()` function: What are the estimated posterior class probabilities for the 120th observation? You should run `zapsmall()` on the vector of posterior probabilities to get it to display nicely.

# Edit me
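A sketch of the posterior lookup for observation 120:

```r
zapsmall(predict(iris.lda)$posterior[120, ])
```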
Use the `partimat()` function from the `klaR` package with `method = "lda"` to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

# Edit me
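A possible call (rendering all of the bivariate panels can take a moment):

```r
partimat(Species ~ ., data = iris, method = "lda")
```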
The code below creates a new data frame `iris2` in which the species names are abbreviated: `S` = setosa, `C` = versicolor, `N` = virginica. Try constructing the plot again. Do all 2-variable combinations of the inputs do an equally good job of separating the three classes?

iris2 <- transform(iris, Species = mapvalues(Species, c("setosa", "versicolor", "virginica"),
                                             c("S", "C", "N")))
# Edit me
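A sketch of the same plot on the relabelled data:

```r
partimat(Species ~ ., data = iris2, method = "lda")
```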
Using the `iris2` data frame, run the `partimat` command again, this time with `method = "qda"`. Does it look like allowing quadratic boundaries changes much?

# Edit me
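And with quadratic boundaries (a sketch):

```r
partimat(Species ~ ., data = iris2, method = "qda")
```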
Using the `geom = "density"` or `geom_density()` functionality in `ggplot2`, construct density plots for the 4 input variables. Your density plots should look similar to the ones shown for `income` and `balance` in Lecture 8. There are 3 classes in the iris data instead of two, so your plots should have 3 densities overlaid. The `alpha` argument may be used to change the transparency.

# Edit me
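A sketch for one of the four inputs; the same pattern repeats for the other variables, and the `alpha` value shown is an assumption:

```r
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)
```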
Use the `NaiveBayes()` command with `usekernel = TRUE` to fit a Naive Bayes classifier to the `iris` data. Save your output as `iris.nb`. Produce a confusion matrix for the Naive Bayes classifier. What is the misclassification rate of Naive Bayes on this problem? How does the performance of Naive Bayes compare to that of LDA in this example?

# Edit me
# Edit me
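A hedged sketch of the Naive Bayes fit and its confusion matrix; the object name `nb.pred` is an assumption, and the kernel density estimates may emit warnings:

```r
iris.nb <- NaiveBayes(Species ~ ., data = iris, usekernel = TRUE)
nb.pred <- predict(iris.nb, newdata = iris[, 1:4])
table(predicted = nb.pred$class, actual = iris$Species)   # confusion matrix
mean(nb.pred$class != iris$Species)                       # misclassification rate
```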
In the preamble to this document I pre-processed spam data for you to use for this problem. You have two data sets: `spam.train` and `spam.test`. The outcome is `is.spam`, which is 1 if the given email is spam and 0 if it isn't. You will use `spam.train` to build a spam classifier, which you will then test on the data in `spam.test`.
For more information on the meaning of these variables, you may refer to the variable information file here: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names
The input variables have extremely skewed distributions, so I applied the transformation `log(0.1 + x)` to every input variable. Thus the inputs are now approximately log-frequencies or log-counts instead of the raw frequencies/counts.
To answer the questions in part (a), you will find it helpful to know that this data was made publicly available by George Forman from Hewlett-Packard Laboratories in Palo Alto, California. These are emails from his HP Labs work account.
Fit a logistic regression model on the training data, and call this fit `spam.logit`. Remember that the formula specification `y ~ .` will regress `y` on every other variable in the specified data set. Use the `kable()` command to produce a nice summary of the logistic regression model. Make sure you set the `digits =` argument appropriately so that the number of decimal places is reasonable for each column. Is increased frequency of the `!` and `$` characters associated with higher or lower likelihood that the email is spam?

# Edit me
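A possible sketch of the fit and the `kable()` summary; the object name matches the one requested above, but the `digits` choice is an assumption:

```r
spam.logit <- glm(is.spam ~ ., data = spam.train, family = binomial)
kable(summary(spam.logit)$coefficients, digits = 3)
```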
Using `ggplot2` graphics, construct a single plot showing histograms of the estimated spam probabilities. Specify `fill = as.factor(spam.logit$y)` so that your plot colours the observations with `is.spam = 1` differently from those with `is.spam = 0`. Does logistic regression appear to do a good job of separating spam from not spam?

# Edit me
Use the `predict()` function appropriately to get probability estimates for your test data. Look into the `type` argument of `predict.glm()` to see how to do this.

# Edit me
Use the `gam` command to fit a logistic additive model to the training data. (Be sure that you are fitting a LOGISTIC additive model.) Call this fit `spam.gam`. Use the `plot()` command to display the fitted splines for each term. You should colour the lines 'ForestGreen'. You should also use the `par(mfrow = ...)` command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear? The following formula is provided to simplify the model specification:

spam.formula <- formula(paste("is.spam ~ ", paste("s(", varinfo[,1], ", 4)", sep = "", collapse = " + ")))
# Edit me
# Edit me
# Edit me
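A hedged sketch of the additive fit and plotting setup, assuming the `gam` package's `gam()` and the `spam.formula` defined above; whether `col` is honoured by the term plots may depend on the package version:

```r
spam.gam <- gam(spam.formula, family = binomial, data = spam.train)
par(mfrow = c(15, 4))
plot(spam.gam, col = "ForestGreen")
```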