---
title: Lecture 14 - Prediction vs inference
author: 94-842
date: December 4, 2019
output: 
  html_document:
    toc: true
    toc_depth: 5
---

```{r}
library(tidyverse)
library(knitr)
library(pROC)
library(rpart)

options(scipen = 4)
```

### Prediction vs. Inference

Unless you've already taken a class on data mining or machine learning, a lot of the analytics tasks you've undertaken have probably taken the form of "inference" problems. Common inference problems include:

- Estimating and interpreting the parameters of a model
- Conducting hypothesis tests
- Constructing confidence intervals

These are all what [Mullainathan and Spiess](https://www.aeaweb.org/articles?id=10.1257/jep.31.2.87) call "$\hat \beta$" problems ("beta-hat problems"). Essentially, these are all problems that begin with you writing down a model

$$ y = X\beta + \epsilon, $$

estimating $\beta$, and drawing conclusions about the world based on those estimates and the corresponding statistical significance analyses.

Prediction problems are different. When we're doing prediction, we aren't interested in $\beta$. Instead, we're interested in being able to accurately predict $y$ from information $x$. These are what M&S call "$\hat y$" problems ("y-hat problems").

Prediction is a very useful paradigm to know about for a number of reasons. First, prediction problems are ubiquitous, even in policy settings. Second, prediction is much easier than inference: whereas inference typically relies on various modeling assumptions holding, prediction is largely assumption-free.

These notes introduce you to prediction in its most common form---binary classification---and teach you the basics of training and evaluating different classifiers.

### Classification

Many of the problems you'll come across in the future will likely be classification problems (and if they're not, you can typically turn them into classification problems). These are problems where your outcome variable $y$ is binary or categorical. E.g., $y$ might be an indicator that an email is spam, that a transaction is fraudulent, or that a student graduates from college.

There are certainly notable cases where a good prediction of a quantitative outcome is desired: e.g., stock prices, home values, crop yields. But "most" problems do wind up being ones of classification.

Let's get started. We'll use a marketing data set where we have observations on whether bank customers who were contacted by the bank's sales team opened a term deposit ("subscribed").

Let's start by loading the data.

```{r}
marketing <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/bank-full.csv", 
                        delim = ";")
```

What does the data contain?

```{r}
str(marketing)
```

```{r}
marketing <- marketing %>%
  mutate(y = as.numeric(y == "yes"))
```

Our outcome variable here is `y`, whether or not a person opens a term deposit account. You'll see above that we transformed the original yes/no `y` into a 0/1 indicator that the person subscribes.

#### Classification as a $\hat \beta$ problem: inference with logistic regression

You're already familiar with linear regression, and you could certainly regress `y` on the other variables in the data using linear regression to construct a model. That approach turns out to be ill-suited to the analysis of binary outcome data. Among other things, when $y$ is either $0$ or $1$, it is odd to fit a line to predict this type of outcome. For one, the linear model generally won't be constrained to predict values between 0 and 1; it can give negative values or values greater than 1.
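To illustrate that first point, here is a minimal sketch (the model name `marketing.lpm` is just for illustration) that fits a linear probability model, i.e., an ordinary `lm()` on the 0/1 outcome, to the `marketing` data loaded above and checks the range of its fitted values.

```{r}
# Fit a linear probability model (plain lm on the 0/1 outcome) and check
# whether its fitted values stay inside [0, 1]
marketing.lpm <- lm(y ~ ., data = marketing)
range(fitted(marketing.lpm))
```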
Linear increases on the probability scale are also unlikely to be a good description of the association between a given input $x$ and the outcome $y$.

The standard approach to modeling binary outcome data in regression is to use **logistic regression**. This is an example of a **generalized linear model** (GLM). GLMs generalize what you know about linear regression to outcome variable types that aren't well modeled by the "Gaussian" linear model $y = X\beta + \epsilon$. Obviously binary outcomes $y$ aren't generated by this sort of process.

The setup for logistic regression is essentially as follows. The observed outcome $y$ for an observation with features $x$ is thought to come from a Bernoulli$(p(x))$ random variable, where the success probability $p(x)$ is parameterized by

$$ \log\left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p $$

Essentially this is saying that instead of modeling $y$ (or, technically, $E(y \mid x)$) as a linear function of $x$, we'll model $y$ as a Bernoulli realization where the success probability $p$ is a function of $x$. The "linear" part of the generalized linear model is what's being illustrated in the above expression: a transformation of $p$ (the log-odds) is modeled as a linear function of $x$.

You can compare this with the standard linear regression model, which says that $y \sim N(\mu(x), \sigma^2)$, where the mean $\mu = E(y \mid x)$ is a linear function of $x$:

$$ \mu = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p $$

So let's fit a logistic regression model and look at what we find.

```{r}
marketing.glm <- glm(y ~ ., data = marketing, family = binomial())

summary(marketing.glm)
```

### Classification as a $\hat y$ problem: Try everything and see what sticks

#### First step: Set aside a test set

```{r}
# Initialize seed for random number generation
set.seed(12345)

# Upsample the positive cases to artificially overcome the class imbalance
marketing.more.idx <- sample(which(marketing$y == 1), 15000, replace = TRUE)
marketing.upsample <- rbind(marketing,
                            marketing[marketing.more.idx, ])

# Trim job strings to 5 characters
# marketing.upsample <- transform(marketing.upsample, job = strtrim(job, 5))

# Randomly select 20% of the data to be held out for model validation
test.indexes <- sample(1:nrow(marketing.upsample), 
                       round(0.2 * nrow(marketing.upsample)))
train.indexes <- setdiff(1:nrow(marketing.upsample), test.indexes)

# Just pull the covariates available to marketers (cols 1:8) and the outcome (col 17)
marketing.train <- marketing.upsample[train.indexes, c(1:8, 17)]
marketing.test <- marketing.upsample[test.indexes, c(1:8, 17)]
```

When building models it is important that we hold out a subset of our data, typically called a "test set", "validation set", or "holdout set". In this example we're holding out a random 20% of our data. The purpose of this test set is to ensure that we get reasonable estimates of the prediction accuracy of our model even if we make mistakes during our "training" process that result in "overfitting".

> When you have a large number of covariates, it's easy to overfit the data. When you overfit the training data, you get a model that describes the training data really well, but which doesn't give good predictions on unseen data.

![Overfitting](http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png)

source: [http://pingax.com/regularization-implementation-r/](http://pingax.com/regularization-implementation-r/)
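Before training anything, here's a quick sanity check on the split (a minimal sketch using only the objects created above): the sizes of the two pieces, and the fraction of subscribers in each, which should be roughly similar after the upsampling.

```{r}
# Sizes of the training and test sets (roughly an 80/20 split)
c(train = nrow(marketing.train), test = nrow(marketing.test))

# Fraction of subscribers in each piece
c(train = mean(marketing.train$y), test = mean(marketing.test$y))
```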
#### Second step: "Train" models

```{r}
library(glmnet)  # Regularized regression
library(ranger)  # Random forests
```

We'll start by fitting a logistic regression model to the training data.

```{r}
marketing.glm <- glm(y ~ ., data = marketing.train, family = binomial())

pred.test.glm <- as.numeric(predict(marketing.glm, newdata = marketing.test, 
                                    type = "response") > 0.5)
```

The code above fits a logistic regression model to the training data, and then gets predicted probabilities for the test data. The `> 0.5` comparison thresholds those probabilities at 0.5 to form predictions of whether each person subscribes.

#### Evaluate predictive performance

```{r}
# Confusion matrix for logistic regression
conf.glm <- table(marketing.test$y, pred.test.glm)
conf.glm
```

```{r}
# How accurate is our model?
sum(diag(conf.glm)) / sum(conf.glm)
```

That's way better than 50%! But... is 50% accuracy really the baseline we want? You often hear that something is "better than a coin flip" or "no better than a coin flip". Is a fair coin flip really the right baseline? Generally, no. Let's look at what fraction of the people in our training data actually subscribed.

```{r}
mean(marketing.train$y)
```

Hmm... So if we guessed that *no one* subscribes, our accuracy would already be `r 1 - mean(marketing.train$y)`. That makes our accuracy of `r sum(diag(conf.glm)) / sum(conf.glm)` a lot less impressive by comparison.

### Now let's fit some other models

#### Regularized logistic regression (Lasso)

```{r}
### Regularized logistic regression, with parameters tuned through cross-validation

# Extract y column
y.marketing <- marketing.train$y

# Get a numeric design matrix x
x.marketing <- model.matrix(~ . - y - 1, data = marketing.train)
x.marketing.test <- model.matrix(~ . - y - 1, data = marketing.test)

# Run cross-validated regularized regression
marketing.cv.glmnet <- cv.glmnet(x.marketing, y.marketing, family = "binomial")

# Have a look at the cv error plot
plot(marketing.cv.glmnet)
```

Let's get our predictions for the test data.

```{r}
# Extract predictions from the model selected by the 1se rule
# (the simplest model within 1 standard error of the minimum cv error)
pred.test.glmnet <- predict(marketing.cv.glmnet, x.marketing.test, 
                            s = "lambda.1se", type = "class")

# Confusion matrix for regularized logistic regression
conf.glmnet <- table(marketing.test$y, pred.test.glmnet)
conf.glmnet
```

How did we do?

```{r}
sum(diag(conf.glmnet)) / sum(conf.glmnet)
```

Well... that wasn't any better...

#### Tree model (with rpart)

```{r, fig.width = 10}
library(partykit)

marketing.tree <- rpart(as.factor(y) ~ ., data = marketing.train, 
                        control = rpart.control(minsplit = 50, cp = 0.002))
marketing.party <- as.party(marketing.tree)
plot(marketing.party, gp = gpar(fontsize = 10))
```

```{r}
# Predicted class-1 probabilities, thresholded at 0.5
pred.test.tree <- as.numeric(predict(marketing.tree, newdata = marketing.test)[, "1"] > 0.5)
```

```{r}
# Confusion matrix for tree model
conf.tree <- table(marketing.test$y, pred.test.tree)
conf.tree
```

How did we do?

```{r}
sum(diag(conf.tree)) / sum(conf.tree)
```

That's a little better.

#### Random forest (with ranger)

```{r}
marketing.rf <- ranger(y ~ ., data = marketing.train, importance = 'impurity')

pred.test.rf <- as.numeric(predict(marketing.rf, data = marketing.test)$predictions > 0.5)
```

```{r}
# Confusion matrix for random forest model
conf.rf <- table(marketing.test$y, pred.test.rf)
conf.rf
```

How did we do?

```{r}
sum(diag(conf.rf)) / sum(conf.rf)
```

Way better!
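For reference, here are the four models' test-set accuracies side by side (a small convenience snippet built from the confusion matrices computed above).

```{r}
# Test-set accuracy for each of the four models fit above
c(logistic = sum(diag(conf.glm)) / sum(conf.glm),
  lasso    = sum(diag(conf.glmnet)) / sum(conf.glmnet),
  tree     = sum(diag(conf.tree)) / sum(conf.tree),
  rf       = sum(diag(conf.rf)) / sum(conf.rf))
```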
But is overall accuracy really what we care about? How will we use this model in the future? Presumably we'll be using the model to help guide a new marketing campaign. In that case our task will be to select a subset of new customers to contact, instead of contacting everyone. How do we think about our model's performance in that type of setting?

Here's a function that calculates a bunch of classification metrics based on a model's confusion table. We'll apply it to all of our models.

```{r}
classSummary <- function(tbl) {
  n <- sum(tbl)                        # total number of cases
  prev <- sum(tbl[2, ]) / sum(tbl)     # prevalence: fraction of actual positives
  acc <- sum(diag(tbl)) / n            # overall accuracy
  prop.pos <- sum(tbl[, 2]) / n        # fraction predicted positive
  ppv <- tbl[2, 2] / sum(tbl[, 2])     # positive predictive value (precision)
  fpr <- tbl[1, 2] / sum(tbl[1, ])     # false positive rate
  fnr <- tbl[2, 1] / sum(tbl[2, ])     # false negative rate
  spec <- 1 - fpr                      # specificity (true negative rate)
  sens <- 1 - fnr                      # sensitivity (true positive rate)
  lr.pos <- sens / fpr                 # positive likelihood ratio
  lr.neg <- fnr / spec                 # negative likelihood ratio
  out <- data.frame(value = round(c(n, prev, acc, prop.pos, ppv, fpr, fnr, 
                                    spec, sens, lr.pos, lr.neg), 3))
  rownames(out) <- c("count", "prevalence", "accuracy", "prop.positive", 
                     "PPV", "FPR", "FNR", "Specificity (TNR)", 
                     "Sensitivity (TPR)", "LR+", "LR-")
  out
}
```

```{r}
classSummary(conf.glm)
```

```{r}
classSummary(conf.glmnet)
```

```{r}
classSummary(conf.tree)
```

```{r}
classSummary(conf.rf)
```

Let's bind those together to make them easier to compare.

```{r}
tibble(metric = rownames(classSummary(conf.glm)),
       logistic = classSummary(conf.glm)$value,
       lasso = classSummary(conf.glmnet)$value,
       tree = classSummary(conf.tree)$value,
       rf = classSummary(conf.rf)$value)
```

```{r}
marketing.preds <- tibble(glm = predict(marketing.glm, newdata = marketing.test, type = "response"),
                          lasso = predict(marketing.cv.glmnet, x.marketing.test, 
                                          s = "lambda.min", type = "response")[, 1],
                          tree = predict(marketing.tree, newdata = marketing.test)[, "1"],
                          rf = predict(marketing.rf, data = marketing.test)$predictions,
                          y = marketing.test$y)
```

Let's look at ROC curves and the AUC. ROC curves trace out the TPR on the y axis and the FPR on the x axis as we vary the threshold used for classification.

```{r}
roc.list <- with(marketing.preds, roc(y ~ glm + lasso + tree + rf))
plot(roc.list[[1]])
plot(roc.list[[2]], col = "red", add = TRUE)
plot(roc.list[[3]], col = "purple", add = TRUE)
plot(roc.list[[4]], col = "steelblue", add = TRUE)
```

Let's calculate the AUC (the area under the ROC curve) for each model. An AUC of 0.5 corresponds to random guessing, and an AUC of 1 corresponds to a perfect classifier.

```{r}
with(marketing.preds, auc(y ~ glm))
with(marketing.preds, auc(y ~ lasso))
with(marketing.preds, auc(y ~ tree))
with(marketing.preds, auc(y ~ rf))
```

OK... but what variables are important? We don't have p-values or coefficient estimates, but we do have "importance" measures that tell us how important variables are for *predictions*.

```{r}
sort(marketing.rf$variable.importance)
```

```{r}
library(edarf)

pd <- partial_dependence(marketing.rf, data = marketing.test, vars = c("balance"))
plot_pd(pd)
```

One of the reasons that the logistic models might not be performing well is that variables like balance appear to have non-linear relationships with the outcome. There looks to be a sharp discontinuity in the relationship between the outcome and balance, as modeled by the random forest.
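As a hedged follow-up (not part of the original analysis), one way to let the plain logistic regression capture this kind of non-linearity is to add a simple threshold term for balance. The cutoff at 0 and the object names `marketing.glm.nl`, `pred.test.glm.nl`, and `conf.glm.nl` below are illustrative guesses, not values established above.

```{r}
# Illustrative sketch: augment the logistic regression with a threshold
# term for balance (the cutoff at 0 is a guess, not a tuned value)
marketing.glm.nl <- glm(y ~ . + I(balance > 0), data = marketing.train, 
                        family = binomial())

pred.test.glm.nl <- as.numeric(predict(marketing.glm.nl, newdata = marketing.test, 
                                       type = "response") > 0.5)

# Does the test accuracy move at all?
conf.glm.nl <- table(marketing.test$y, pred.test.glm.nl)
sum(diag(conf.glm.nl)) / sum(conf.glm.nl)
```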