---
title: "Homework 5: Classification, Tree-based methods"
author: "Your Name Here"
date: 'Assigned: February 17, 2018'
output:
  html_document:
    toc: true
    toc_depth: 3
    theme: cerulean
    highlight: tango
---

##### This homework is due by **2:50PM on Tuesday, February 27**.

### Preamble: Loading packages and data

```{r, message=FALSE}
library(ggplot2)
library(ISLR)
library(MASS)
library(partykit)
library(caret)
library(rpart)
library(randomForest)
library(pROC)
```

```{r, cache = TRUE}
# Read in the marketing data
marketing <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/bank-full.csv",
                        header = TRUE, sep = ";")
```

```{r}
set.seed(531)
# Upsample the data to artificially overcome sample imbalance
marketing.more.idx <- sample(which(marketing$y == "yes"), 15000, replace = TRUE)
marketing.upsample <- rbind(marketing, marketing[marketing.more.idx, ])

# Randomly select 20% of the data to be held out for model validation
test.indexes <- sample(1:nrow(marketing.upsample), round(0.2 * nrow(marketing.upsample)))
train.indexes <- setdiff(1:nrow(marketing.upsample), test.indexes)

# Just pull the covariates available to marketers (cols 1:8) and the outcome (col 17)
marketing.train <- marketing.upsample[train.indexes, c(1:8, 17)]
marketing.test <- marketing.upsample[test.indexes, c(1:8, 17)]
```

### Problem 1 [7 points]: Classifier performance metrics

> In this problem we'll assume that we have a binary classification problem where our outcome variable $Y \in \{0, 1\}$. Your main task is to construct a function that calculates various kinds of classifier performance metrics.

##### Here's some code for generating fake input to test out your function. I provide sample output for both parts of this problem.

```{r}
set.seed(826)
score.fake <- runif(200)
y.fake <- as.numeric(runif(200) <= score.fake)
```

##### (a) [3 points] Code up the function specified below.

##### Input

| Argument | Description                                                           |
|----------|-----------------------------------------------------------------------|
| `score`  | length-n vector giving a score for every observation                  |
| `y`      | true observed class label for each observation                        |
| `cutoff` | score cutoff: classify $\hat y = 1$ if `score` >= `cutoff`            |
| `type`   | which performance metric(s) to return; `type = "all"` calculates all  |

##### Output

Your output will be a list containing the following elements

| Argument   | Description                                                     |
|------------|-----------------------------------------------------------------|
| `conf.mat` | the confusion matrix for the classifier                         |
| `perf`     | a data frame containing all of the desired metrics              |

> Example output:

```
# Cutoff 0.6
classMetrics(score.fake, y.fake, cutoff = 0.6, type = "all")
$conf.mat
         observed
predicted  0  1
        0 82 31
        1 15 72

$perf
                value
accuracy    0.7700000
sensitivity 0.6990291
specificity 0.8453608
ppv         0.8275862
npv         0.7256637
precision   0.8275862
recall      0.6990291

# Cutoff 0.2
classMetrics(score.fake, y.fake, cutoff = 0.2, type = "all")
$conf.mat
         observed
predicted   0   1
        0  36   3
        1  61 100

$perf
                value
accuracy    0.6800000
sensitivity 0.9708738
specificity 0.3711340
ppv         0.6211180
npv         0.9230769
precision   0.6211180
recall      0.9708738

# Precision and recall only
classMetrics(score.fake, y.fake, cutoff = 0.2, type = c("precision", "recall"))
$conf.mat
         observed
predicted   0   1
        0  36   3
        1  61 100

$perf
              value
precision 0.6211180
recall    0.9708738
```

```{r}
classMetrics <- function(score, y, cutoff,
                         type = c("all", "accuracy", "sensitivity", "specificity",
                                  "ppv", "npv", "precision", "recall")) {
  # This command throws an error if the user specifies a "type" that
  # isn't supported by this function
  type <- match.arg(type, several.ok = TRUE)
  # Edit me
}
```
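For orientation (not the required answer), here is a minimal sketch of one way such a function could be implemented. The name `classMetricsSketch` and the internal bookkeeping are my own illustrative choices; it is meant only to reproduce the structure of the sample output above.

```{r}
# Illustrative sketch only -- one of many valid ways to implement classMetrics.
classMetricsSketch <- function(score, y, cutoff,
                               type = c("all", "accuracy", "sensitivity", "specificity",
                                        "ppv", "npv", "precision", "recall")) {
  type <- match.arg(type, several.ok = TRUE)
  # Rows = predicted class, columns = observed class, as in the sample output.
  # Coercing to factors with levels c(0, 1) keeps the table 2x2 at extreme cutoffs.
  predicted <- factor(as.numeric(score >= cutoff), levels = c(0, 1))
  observed  <- factor(y, levels = c(0, 1))
  conf.mat  <- table(predicted, observed)
  tn <- conf.mat[1, 1]; fn <- conf.mat[1, 2]
  fp <- conf.mat[2, 1]; tp <- conf.mat[2, 2]
  vals <- c(accuracy    = (tp + tn) / sum(conf.mat),
            sensitivity = tp / (tp + fn),
            specificity = tn / (tn + fp),
            ppv         = tp / (tp + fp),
            npv         = tn / (tn + fn),
            precision   = tp / (tp + fp),
            recall      = tp / (tp + fn))
  if (!("all" %in% type)) {
    vals <- vals[type]
  }
  list(conf.mat = conf.mat, perf = data.frame(value = vals))
}
```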
##### (b) [3 points] A plotting routine.

This function allows you to specify an x-axis variable and a y-axis variable. If `xvar = NULL`, the x-axis variable should be taken to be `score`, and should range from the smallest to the largest value of `score`. If `flip.x = TRUE`, you should plot `1 - xvar_metric` on the x-axis. E.g., if `xvar = "specificity"` and `flip.x = TRUE`, your plot should have `1 - specificity` as the x-axis variable.

```{r}
plotClassMetrics <- function(score, y, xvar = NULL,
                             yvar = c("accuracy", "sensitivity", "specificity",
                                      "ppv", "npv", "precision", "recall"),
                             flip.x = FALSE) {
  yvar <- match.arg(yvar)
  # Edit me
}
```

> Example output:

```
# ROC curve
plotClassMetrics(score.fake, y.fake, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)
```

![](http://www.andrew.cmu.edu/user/achoulde/95791/homework/roc.png)

```
plotClassMetrics(score.fake, y.fake, yvar = "precision")
```

![](http://www.andrew.cmu.edu/user/achoulde/95791/homework/precision.png)
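Again for orientation only, here is a minimal sketch of one way the plotting routine could work, reusing `classMetricsSketch` from above. The cutoff grid and helper names are my own; any implementation that sweeps cutoffs over the range of `score` would do.

```{r}
# Illustrative sketch only. Sweeps over cutoffs at the observed score values and
# plots the resulting metric curve with ggplot2 (loaded in the preamble).
plotClassMetricsSketch <- function(score, y, xvar = NULL,
                                   yvar = c("accuracy", "sensitivity", "specificity",
                                            "ppv", "npv", "precision", "recall"),
                                   flip.x = FALSE) {
  yvar <- match.arg(yvar)
  cutoffs <- sort(unique(score))
  # Pull a single named metric out of the sketch function at a given cutoff
  getMetric <- function(metric, cutoff) {
    classMetricsSketch(score, y, cutoff = cutoff, type = metric)$perf[metric, "value"]
  }
  y.vals <- sapply(cutoffs, function(ct) getMetric(yvar, ct))
  if (is.null(xvar)) {
    x.vals <- cutoffs
    x.lab  <- "score"
  } else {
    x.vals <- sapply(cutoffs, function(ct) getMetric(xvar, ct))
    x.lab  <- xvar
    if (flip.x) {
      x.vals <- 1 - x.vals
      x.lab  <- paste("1 -", xvar)
    }
  }
  df <- data.frame(x = x.vals, y = y.vals)
  ggplot(df, aes(x = x, y = y)) +
    geom_line() +
    labs(x = x.lab, y = yvar)
}
```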
### Problem 2 [9 points]: Decision trees, with nicer plots

> We'll need to construct `rpart` objects instead of `tree` objects in order to use the more advanced plotting routine from the `partykit` library. The syntax for `rpart` is similar to that of `tree`, and was demonstrated in the Lab for week 4. For additional details, you may refer to [the following link](http://www.statmethods.net/advstats/cart.html).

> We will be using the `marketing` data, which has been split into `marketing.train` and `marketing.test` in the preamble of this document. All model fitting should be done on `marketing.train`. The outcome variable in the data set is `y`, denoting whether the customer opened up a CD or not.

> This data comes from a Portuguese banking institution that ran a marketing campaign to try to get clients to subscribe to a "term deposit" (a CD). A CD is an account that you can put money into that guarantees a fixed interest rate over a certain period of time (e.g., 2 years). The catch is that if you try to withdraw your money before the term ends, you will typically incur heavy penalties or "early withdrawal fees".

> Suppose that you're hired as a decision support analyst at this bank, and your first job is to use the data to figure out who the marketing team should contact for their next CD marketing campaign. That is, the marketing team will pull up a new spreadsheet containing the contact information, age, job, marital status, education level, default history, mortgage status, and personal loan status for tens of thousands of clients, and they want you to tell them who they should contact.

##### (a) Fit a decision tree to the data using the `rpart()` function. Call this tree `marketing.tree`. The syntax is exactly the same as for the `tree` function you saw in Lab 4. Use the `plot` and `text` functions to visualize the tree. Show a text print-out of the tree. Which variables get used in fitting the tree?

```{r, fig.height = 7}
# Edit me
```

- **Your answer here**

##### (b) The `as.party` command converts the `rpart` tree you fit in part (a) to a `party` object that has a much better plot function. Run `plot` on the object created below. Also run the `print` function.

##### In the plot, you'll see a node labeled Node 8. How many observations fall into this leaf node? What does the shaded bar shown below this node mean? Do observations falling into this node get classified as `"yes"` or `"no"`?

```{r, fig.height = 7, fig.width = 9}
# marketing.party <- as.party(marketing.tree)
# Edit me
```

- **Your answer here**

##### (c) We got a pretty shallow tree in part (a). Here we'll practice growing larger (deeper) trees and pruning them back. The code below grows a tree out to a complexity parameter value of `cp = 0.002`, while ensuring that no node is split unless it contains at least `minsplit = 100` observations.

##### Run the `plotcp` command on this tree to get a plot of the cross-validated error. Also look at the `cptable` attribute of `marketing.full`. Observe that all of the errors are reported relative to that of the 1-node "tree".

```{r}
# marketing.full <- rpart(y ~ ., data = marketing.train,
#                         control = rpart.control(minsplit = 100, cp = 0.002))
# Edit me
```

##### (d) The horizontal dotted line is 1 standard error above the minimum CV error for the range of `cp` values shown. Apply the 1-SE rule to determine which value of `cp` to use for pruning. Print this value of `cp`.

```{r}
# Edit me
```

##### (e) Use the `prune` command (`prune(rpart.fit, cp = )`) to prune `marketing.full` to the level of complexity you settled on in part (d). Call your pruned tree `marketing.pruned`. Display a text print-out of your tree.

```{r}
# Edit me
```

> The questions below all refer to `marketing.pruned`.

##### (f) The code below converts your `marketing.pruned` tree into a `party` object. Plot these results, supplying the argument `gp = gpar(fontsize = 10)` to make the text more easily legible. Notice the use of `gpar` to set the `fontsize` for the plot.

##### Which node has the highest proportion of individuals who were observed to open a CD? How many individuals are in this node? Describe the characteristics of these individuals.

```{r, fig.width = 14, fig.height = 12}
# Uncomment the code below
# marketing.pruned.party <- as.party(marketing.pruned)
# Edit me
```

- **Your answer here**

##### (g) Use the `predict` function on your pruned tree to get estimated probabilities of opening a CD for everyone in `marketing.test`. Use your `classMetrics` function to get classification metrics at probability `cutoff` values of `0.25`, `0.4` and `0.5`. Use your `plotClassMetrics` command to construct an ROC curve.

```{r}
# Edit me
```
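As a hedged sketch of how parts (a) through (g) might fit together: the chunk below is shown with `eval = FALSE`, since it depends on your completed `classMetrics` and `plotClassMetrics`. Object names other than those given in the problem statement (`cp.tab`, `err.1se`, `cp.1se`, `tree.prob`, `y.test`) are my own illustrative choices, and the 1-SE rule shown (largest `cp` whose cross-validated error is within one standard error of the minimum) is one common convention.

```{r, eval = FALSE}
# Illustrative sketch of the Problem 2 workflow, not the official solution.
# (a) Fit a default tree
marketing.tree <- rpart(y ~ ., data = marketing.train)

# (c) Grow a deeper tree
marketing.full <- rpart(y ~ ., data = marketing.train,
                        control = rpart.control(minsplit = 100, cp = 0.002))
plotcp(marketing.full)

# (d) 1-SE rule: largest cp whose CV error is within 1 SE of the minimum
cp.tab  <- marketing.full$cptable
err.1se <- min(cp.tab[, "xerror"]) + cp.tab[which.min(cp.tab[, "xerror"]), "xstd"]
cp.1se  <- max(cp.tab[cp.tab[, "xerror"] <= err.1se, "CP"])

# (e) Prune to the selected complexity
marketing.pruned <- prune(marketing.full, cp = cp.1se)

# (g) Test-set probabilities of "yes", metrics at several cutoffs, and an ROC curve
tree.prob <- predict(marketing.pruned, newdata = marketing.test, type = "prob")[, "yes"]
y.test    <- as.numeric(marketing.test$y == "yes")
classMetrics(tree.prob, y.test, cutoff = 0.25, type = "all")
classMetrics(tree.prob, y.test, cutoff = 0.4,  type = "all")
classMetrics(tree.prob, y.test, cutoff = 0.5,  type = "all")
plotClassMetrics(tree.prob, y.test, xvar = "specificity", yvar = "sensitivity",
                 flip.x = TRUE)
```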
##### (h) Which of the cutoffs considered in part (g) gives the highest sensitivity? Which gives the highest specificity? In this marketing problem, do you think we want high sensitivity or high specificity? Explain.

- **Your answer here**

##### (i) Suppose that it costs \$10 on average to call a customer, and that when a customer opens a CD, you gain \$50 on average. This means that when you fail to call a customer who would've opened a CD, you effectively lose \$50. What value of the probability cutoff should we choose if we want to maximize profit? Explain. (Use `marketing.test` for this calculation.)

```{r}
# Edit me
```

- **Your answer here**

### Problem 3 [4 points]: Random forests

##### (a) Use the `randomForest` command to fit a random forest to `marketing.train`. Call your fit `marketing.rf`. Show a print-out of your random forest fit. This print-out contains a confusion matrix. Are the predicted classes given as the rows or columns of this table?

```{r, cache = TRUE}
# Edit me
```

- **Your answer here**

##### (b) Construct a variable importance plot of your random forest fit. Which variables turn out to be the most important?

```{r}
# Edit me
```

- **Your answer here**

##### (c) Use the `predict` command to obtain probability estimates on the test data. Use your `classMetrics` function to calculate performance metrics at `cutoff = 0.3`. Compare the metrics to those of the pruned tree `marketing.pruned` at the same `cutoff`.

```{r}
# Edit me
```

- **Your answer here**

##### (d) Use the `roc` function from the `pROC` package to get an ROC curve for the random forest. Overlay the ROC curve for the pruned tree (use `steelblue` as the colour). Calculate the AUC for both methods. Do we do better with random forests than with a single tree? Are most of the gains at high or low values of specificity? I.e., is the random forest performing better in the regime we actually care about?

```{r, fig.height = 5, fig.width = 5}
# Edit me
```

- **Your answer here**
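Finally, a hedged sketch of the Problem 3 workflow. As with the Problem 2 sketch, this is shown with `eval = FALSE` because it depends on your completed `classMetrics` and on `tree.prob` and `y.test` from the earlier sketch; `rf.prob`, `roc.rf`, and `roc.tree` are my own illustrative names.

```{r, eval = FALSE}
# Illustrative sketch of the Problem 3 workflow, not the official solution.
# (a) Fit the forest and print it (the print-out includes the OOB confusion matrix)
marketing.rf <- randomForest(y ~ ., data = marketing.train)
marketing.rf

# (b) Variable importance plot
varImpPlot(marketing.rf)

# (c) Test-set probabilities of "yes" and metrics at cutoff = 0.3
rf.prob <- predict(marketing.rf, newdata = marketing.test, type = "prob")[, "yes"]
y.test  <- as.numeric(marketing.test$y == "yes")
classMetrics(rf.prob, y.test, cutoff = 0.3, type = "all")

# (d) ROC curves and AUC for both models via pROC
roc.rf   <- roc(marketing.test$y, rf.prob)
roc.tree <- roc(marketing.test$y, tree.prob)  # tree.prob from the Problem 2 sketch
plot(roc.rf)
lines(roc.tree, col = "steelblue")
auc(roc.rf)
auc(roc.tree)
```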