--- title: "Lecture 6 - more ggplot2" author: "Prof. Alexandra Chouldechova" date: "94842" output: html_document: fig_height: 5 fig_width: 5 toc: yes toc_depth: 5 pdf_document: toc: yes --- ###Agenda This lecture continues our introduction to the `ggplot2` package developed by Hadley Wickham. Let's begin by loading the packages we'll use this class ```{r} library(tidyverse) library(MASS) # Useful for data sets ``` ```{r} cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") ``` ### Review of ggplot2 basics ggplot2 has a slightly steeper learning curve than the base graphics functions, but it also generally produces far better and more easily customizable graphics. There are two basic calls in ggplot: - `qplot(x, y, ..., data)`: a "quick-plot" routine, which essentially replaces the base `plot()` - `ggplot(data, aes(x, y, ...), ...)`: defines a graphics object from which plots can be generated, along with *aesthetic mappings* that specify how variables are mapped to visual properties. #### ggplot function The `ggplot2` library comes with a dataset called `diamonds`. Let's look at it ```{r} dim(diamonds) diamonds ``` It is a data frame of 53,940 diamonds, recording their attributes such as carat, cut, color, clarity, and price. We will make a scatterplot showing the price as a function of the carat (size). (The data set is large so the plot may take a few moments to generate.) ```{r fig.width=10, fig.height=4, dpi=70, cache=TRUE} diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price)) diamond.plot + geom_point() ``` Let's take a step back and try to understand the ggplot syntax. 1) The first thing we did was to define a graphics object, `diamond.plot`. This definition told R that we're using the `diamonds` data, and that we want to display `carat` on the x-axis, and `price` on the y-axis. 2) We then called `diamond.plot + geom_point()` to get a scatterplot. The arguments passed to `aes()` are called **mappings**. Mappings specify what variables are used for what purpose. When you use `geom_point()` in the second line, it pulls `x`, `y`, `colour`, `size`, etc., from the **mappings** specified in the `ggplot()` command. You can also specify some arguments to `geom_point` directly if you want to specify them for each plot separately instead of pre-specifying a default. Here we shrink the points to a smaller size, and use the `alpha` argument to make the points transparent. ```{r fig.width=10, fig.height=4, dpi=70, cache=TRUE} diamond.plot + geom_point(size = 0.7, alpha = 0.3) ``` If we wanted to let point color depend on the color indicator of the diamond, we could do so in the following way. ```{r fig.width=10, fig.height=6, dpi=70, cache=TRUE} diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color)) diamond.plot + geom_point() ``` To make the scatterplot look more typical, we can switch to logarithmic coordinate axis spacing. ```{r, eval = TRUE} diamond.plot + geom_point() + coord_trans(x = "log10", y = "log10") ``` We can create plots showing the relationship between variables across different values of a factor. For instance, here's a scatterplot showing how diamond price varies with carat size, conditioned on color. It's created using the `facet_wrap(~ factor1 + factor2 + ... + factorn)` command. ```{r, fig.width=12, fig.height=6, dpi=70, cache=TRUE} diamond.plot <- ggplot(data=diamonds, aes(x=carat, y=price, colour = color)) diamond.plot + geom_point() + facet_wrap(~ cut) + coord_trans(x = "log10", y = "log10") ``` You can also use `facet_grid()` to produce this type of output. ```{r, fig.width=11, fig.height=4.5, dpi=70, cache=TRUE} diamond.plot + geom_point() + facet_grid(. ~ cut) + coord_trans(x = "log10", y = "log10") ``` ```{r, fig.width = 8, fig.height = 10, dpi = 70, cache = TRUE} diamond.plot + geom_point() + facet_grid(cut ~ .) + coord_trans(x = "log10", y = "log10") ``` `ggplot` can create a lot of different kinds of plots, just like lattice. Here are some examples. Function | Description ---------|------------ `geom_point(...)` | Points, i.e., scatterplot `geom_bar(...)` | Bar chart `geom_line(...)` | Line chart `geom_boxplot(...)` | Boxplot `geom_violin(...)` | Violin plot `geom_density(...)` | Density plot with one variable `geom_density2d(...)` | Density plot with two variables `geom_histogram(...)` | Histogram #### Visualizing means We'll use the `birthwt` data for this example, so let's start by loading an cleaning it. All of this code is borrowed from Lecture 5. ```{r} birthwt <- as_tibble(MASS::birthwt) birthwt <- birthwt %>% rename(birthwt.below.2500 = low, mother.age = age, mother.weight = lwt, mother.smokes = smoke, previous.prem.labor = ptl, hypertension = ht, uterine.irr = ui, physician.visits = ftv, birthwt.grams = bwt) # Assign better labels to categorical variables birthwt <- birthwt %>% mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>% mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"), ~ recode_factor(.x, `0` = "no", `1` = "yes")) ``` Previously we calculated the following table: ```{r} bwt.summary <- birthwt %>% group_by(race, mother.smokes) %>% summarize(mean.birthwt = round(mean(birthwt.grams), 0)) bwt.summary ``` We can plot this table in a nice bar chart as follows: ```{r, fig.width = 6} # Define basic aesthetic parameters p.bwt <- ggplot(data = bwt.summary, aes(y = mean.birthwt, x = race, fill = mother.smokes)) # Pick colors for the bars bwt.colors <- c("#009E73", "#999999") # Display barchart p.bwt + geom_bar(stat = "identity", position = "dodge") + ylab("Average birthweight") + xlab("Mother's race") + guides(fill = guide_legend(title = "Mother's smoking status")) + scale_fill_manual(values=bwt.colors) ```

#### Histograms and density plots ```{r} base.plot <- ggplot(birthwt, aes(x = mother.age)) + xlab("Mother's age") base.plot + geom_histogram() base.plot + geom_histogram(aes(fill = race)) base.plot + geom_density() base.plot + geom_density(aes(fill = race), alpha = 0.5) ``` #### Box plots and violin plots ```{r} base.plot <- ggplot(birthwt, aes(x = as.factor(physician.visits), y = birthwt.grams)) + xlab("Number of first trimester physician visits") + ylab("Baby's birthweight (grams)") # Box plot base.plot + geom_boxplot() # Violin plot base.plot + geom_violin() ``` #### Statistical overlays in ggplot One of the main advantages of `ggplot` is that it makes it really easy to overlay model fits, confidence bars, etc. We'll get more practice with this next week, when we talk more about how to do statistical inference in R. For the time being, let's just display a scatterplot smoother. ```{r fig.width=12, fig.height=6, dpi=70, cache=TRUE} diamond.plot + stat_smooth() ``` This shows curves modelling the relationship between diamond size in carats and price for the different diamond colours. We could use the `geom_point()` command to also display the underlying points, but that would make the curves difficult to see. Let's try it anyway. ```{r fig.width=12, fig.height=6, dpi=70, cache=TRUE} diamond.plot + geom_point() + stat_smooth() ``` We can make this look better by decreasing the point opacity so that the trend curves aren't so obscured. ```{r fig.width=12, fig.height=6, dpi=70, cache=TRUE} diamond.plot + geom_point(size = 1.5, alpha = 0.25) + stat_smooth() ```

#### Birthweight regression overlay In addition to the more flexible smoother in the example above, we can also overlay more structured summaries such as regression curves. Here's the plot that we saw at the end of Lecture 5, which makes good use of regression line overlays. ```{r, fig.height=5, fig.width=6, fig.align='center'} ggplot(birthwt, aes(x=mother.age, y=birthwt.grams, shape=mother.smokes, color=mother.smokes)) + geom_point() + # Adds points (scatterplot) geom_smooth(method = "lm") + # Adds regression lines ylab("Birth Weight (grams)") + # Changes y-axis label xlab("Mother's Age (years)") + # Changes x-axis label ggtitle("Birth Weight by Mother's Age") # Changes plot title ``` Recall that when we calculated correlations between birth weight and mother's age, we got the following results. ```{r} birthwt %>% group_by(mother.smokes) %>% summarize(cor_bwt_age = cor(birthwt.grams, mother.age)) ``` This tells us that the association between mother's age and the baby's birthweight seems to depend on the mother's smoking status. Among mothers that smoke, this association is negative (older mothers tend to give birth to lower weight babies), while among mothers that don't smoke, the association is positive. We can read off the same conclusions more simply from the plot. The regression lines capture the association between birthweight and mother's age. The fact that the observed points are very spread out around the regression lines visually conveys the fact that the correlation between birthweight and mother's age is not very large. ##### Is that an outlier? The plot shows us something that the correlation calculation cannot: there's a non-smoking mother who was 45 years old when she gave birth, and the combination of her age and the baby's weight (around 5000g) are well outside the range of all other data points. Let's see what happens when we remove this observation and re-plot. ```{r, fig.height=5, fig.width=6, fig.align='center'} # Get new data set that no longer contains the outlier birthwt.sub <- subset(birthwt, subset = mother.age < 40) ggplot(birthwt.sub, aes(x=mother.age, y=birthwt.grams, shape=mother.smokes, color=mother.smokes)) + geom_point() + # Adds points (scatterplot) geom_smooth(method = "lm") + # Adds regression lines ylab("Birth Weight (grams)") + # Changes y-axis label xlab("Mother's Age (years)") + # Changes x-axis label ggtitle("Birth Weight by Mother's Age") # Changes plot title ``` The general trends appear to be qualitatively the same, but the association in the non-smoking group appears to be far less strong than we were led to believe when we still had the outlying point in our data.

#### A dot plot/bar chart example. For this example we're going to use some data based on the 2011 Consumer Expenditure (CE) Survey. This data is stored in a tab-delimited file called `mtbi.txt`. ```{r} expenses <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/mtbi.txt", sep="\t", header=TRUE) dim(expenses) head(expenses, 20) ``` The data is ordered from highest to lowest count (the count is then number of times that an item in the given ucc category was purchased). From `head()` we can see that the 20 most common expenses are things like food/drink, gas, cell phone bills, mortgage payments, etc. This is what we'd expect. Now, let's create a dot plot showing the counts for each of the top 20 expenses. ```{r fig.width=8, fig.height=8, dpi=70} expense.plot <- ggplot(data = expenses[1:20, ], mapping = aes(y = descr, x = count)) expense.plot + geom_point() ``` This doesn't look very good... What happened? Well, `R` doesn't know that the `count` variable is important for the purpose of ordering. The default behaviour is to show the plot with factor levels ordered alphabetically. This makes sense in many cases, but not in ours. To re-order the levels of a factor, we use the `reorder()` function. Here's how we re-order the levels of `descr` based on the value of `count`. ```{r} expenses <- mutate(expenses, descr = reorder(descr, count)) ``` Now let's try that plotting function again... ```{r fig.width=8, fig.height=8, dpi=70} expense.plot <- ggplot(data = expenses[1:20, ], mapping = aes(y = descr, x = count)) expense.plot + geom_point() ``` This kind of data is typically represented with bar charts instead of dot plots. Let's create a bar chart to capture the same information. ```{r fig.width=8, fig.height=5, dpi=70} barchart.fig = ggplot(data = expenses[1:20, ], mapping = aes(x = descr, y = count)) barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3]) ``` There's an issue with this plot: the labels along the x-axis have all blended together and are incomprehensible. To adjust the text, we can use the `theme` command. ```{r fig.width=8, fig.height=8, dpi=70} barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3]) + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) ``` We should probably also change the axis labels and title to something more meaningful ```{r fig.width=8, fig.height=8, dpi=70} barchart.fig + geom_bar(stat = "identity", fill = cbPalette[3]) + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) + labs(x = "", y = "Times reported", title = "Most Commonly Reported Purchases") ``` ### ggplot2 also does maps We'll need the `maps` library for this example. ```{r} library(maps) ``` One of the data sets that comes with R is the `USArrests` data ```{r} head(USArrests) ``` Here's how we can get a headmap of Murder rates (per 100,000 population) on a map of the US. ```{r, fig.width = 6.5, fig.height = 3.5, fig.align='center'} # Create data frame for map data (US states) states <- map_data("state") # Here's what the states data frame looks like str(states) # Make a copy of the data frame to manipulate arrests <- USArrests # Convert everything to lower case names(arrests) <- tolower(names(arrests)) arrests$region <- tolower(rownames(USArrests)) # Merge the map data with the arrests data based on region choro <- merge(states, arrests, sort = FALSE, by = "region") choro <- choro[order(choro$order), ] # Plot a map, filling in the states based on murder rate qplot(long, lat, data = choro, group = group, fill = murder, geom = "polygon") ``` Essentially what's happening here is that the map data (here called `states`) includes the latitute and longitude coordinates for the boundaries of each state. Specifying `geom = "polygon"` in the `qplot` function results in the states being traced out based on the lat/long coordinates. The `arrests` data provides crime rates for all the states, which allows us to color each polygon (state) based on the murder rate. It may be counter-intuitive that light blue indicates areas with high murder rates while dark blue indicates areas with low murder rates. We can fix this by specifying a different gradient using the `scale_colour_gradient()` command. Here's an example. ```{r, fig.width = 6.5, fig.height = 3.5, fig.align='center'} qplot(long, lat, data = choro, group = group, fill = murder, geom = "polygon") + scale_fill_gradient(low = "#56B1F7", high = "#132B43") ``` All we've done here is swapped the defaults for `low` and `high`.