In @sec-exploratory-data-analysis, you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know---even before looking---which variables the plot will display.
You made each plot for a purpose, could quickly look at it, and then move on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.
Now that you understand your data, you need to *communicate* your understanding to others.
Your audience will likely not share your background knowledge and will not be deeply invested in the data.
To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible.
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.
This chapter focuses on the tools you need to create good graphics.
We assume that you know what you want, and just need to know how to do it.
For that reason, we highly recommend pairing this chapter with a good general visualization book.
We particularly like [The Truthful Art](https://www.amazon.com/gp/product/0321934075/), by Albert Cairo.
It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.
### Prerequisites
In this chapter, we'll focus once again on ggplot2.
We'll also use a little dplyr for data manipulation, **scales** to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including **ggrepel** ([https://ggrepel.slowkow.com](https://ggrepel.slowkow.com/)) by Kamil Slowikowski and **patchwork** ([https://patchwork.data-imaginist.com](https://patchwork.data-imaginist.com/)) by Thomas Lin Pedersen.
Don't forget that you'll need to install those packages with `install.packages()` if you don't already have them.
If you need to add more text, there are two other useful labels: `subtitle` adds additional detail in a smaller font beneath the title and `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
#| Scatterplot of highway versus city fuel efficiency. Shapes and
#| colors of points are determined by type of drive train.
ggplot(mpg, aes(x = cty, y = hwy, color = drv, shape = drv)) +
geom_point() +
labs(
x = "City MPG",
y = "Highway MPG",
shape = "Type of\ndrive train",
color = "Type of\ndrive train"
)
```
3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
## Annotations
In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
The first tool you have at your disposal is `geom_text()`.
`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
This makes it possible to add textual labels to your plots.
There are two possible sources of labels.
First, you might have a tibble that provides labels.
In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called `label_info`.
```{r}
label_info <- mpg |>
group_by(drv) |>
arrange(desc(displ)) |>
slice_head(n = 1) |>
mutate(
drive_type = case_when(
drv == "f" ~ "front-wheel drive",
drv == "r" ~ "rear-wheel drive",
drv == "4" ~ "4-wheel drive"
)
) |>
select(displ, hwy, drv, drive_type)
label_info
```
Then, we use this new data frame to directly label the three groups to replace the legend with labels placed directly on the plot.
Using the `fontface` and `size` arguments we can customize the look of the text labels.
They're larger than the rest of the text on the plot and bolded.
- Use `geom_rect()` to draw a rectangle around points of interest.
The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.
Alternatively, look into the [ggforce package](https://ggforce.data-imaginist.com/index.html), specifically [`geom_mark_hull()`](https://ggforce.data-imaginist.com/reference/geom_mark_hull.html), which allows you to annotate subsets of points with hulls.
Another handy function for adding annotations to plots is `annotate()`.
As a rule of thumb, geoms are generally useful for highlighting a subset of the data while `annotate()` is useful for adding one or few annotation elements to a plot.
The `x` and `y` aesthetics in both define where the annotation should start, and the `xend` and `yend` aesthetics in the segment annotation define the end location of the segment.
`scale_x_continuous()` puts the numeric values from `displ` on a continuous number line on the x-axis, `scale_color_discrete()` chooses colors for each of the `class` of car, etc.
You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether.
The `labels` argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc.
The plot on the left shows default labelling with `label_dollar()`, which adds a dollar sign as well as a thousand separator comma.
The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix "K" (for "thousands") as well as adding custom breaks.
Note that `breaks` is in the original scale of the data.
Note that for the `breaks` argument we pulled out the `start` variable as a vector with `presidential$start` because we can't do an aesthetic mapping for this argument.
Also note that the specification of breaks and labels for date and datetime scales is a little different:
You can also use `legend.position = "none"` to suppress the display of the legend altogether.
To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`.
The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger.
This is particularly useful if you have used a low `alpha` to display many points on a plot.
Instead of just tweaking the details a little, you can instead replace the scale altogether.
There are two types of scales you're mostly likely to want to switch out: continuous position scales and color scales.
Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and color, you'll be able to quickly pick up other scale replacements.
It's very useful to plot transformations of your variable.
For example, it's easier to see the precise relationship between `carat` and `price` if we log transform them:
The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.[^communication-1]
[^communication-1]: You can use a tool like [SimDaltonism](https://michelf.ca/projects/sim-daltonism/) to simulate color blindness to test these images.
#| Two scatterplots of highway mileage versus engine size where both color
#| and shape of points are based on drive type. The color palette is not
#| the default ggplot2 palette.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv, shape = drv)) +
scale_color_brewer(palette = "Set1")
```
The ColorBrewer scales are documented online at <https://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
@fig-brewer shows the complete list of all palettes.
The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
For continuous color, you can use the built-in `scale_color_gradient()` or `scale_fill_gradient()`.
If you have a diverging scale, you can use `scale_color_gradient2()`.
That allows you to give, for example, positive and negative values different colors.
That's sometimes also useful if you want to distinguish points above or below the mean.
Another option is to use the viridis color scales.
The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white.
These scales are available as continuous (`c`), discrete (`d`), and binned (`b`) palettes in ggplot2.
Note that all color scales come in two varieties: `scale_color_*()` and `scale_fill_*()` for the `color` and `fill` aesthetics respectively (the color scales are available in both UK and US spellings).
Let's compare these to the two plots below where the plot on the left sets the `limits` on individual scales and the plot on the right sets them in `coord_cartesian()`.
We can see that reducing the limits is equivalent to subsetting the data.
Therefore, to zoom in on a region of the plot, it's generally best to use `coord_cartesian()`.
On the other hand, setting the `limits` on individual scales is generally more useful if you want to *expand* the limits, e.g., to match scales across different plots.
For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
```
In this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report.
### Exercises
1. Why doesn't the following code override the default scale?
```{r}
#| fig-show: "hide"
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
scale_color_gradient(low = "white", high = "red") +
coord_fixed()
```
2. What is the first argument to every scale?
How does it compare to `labs()`?
3. Change the display of the presidential terms by:
[^communication-2]: Many people wonder why the default theme has a gray background.
This was a deliberate choice because it puts the data forward while still making the grid lines visible.
The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out.
The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background.
Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.
Note that customization of the legend box and plot title elements of the theme are done with `element_*()` functions.
These functions specify the styling of non-data components, e.g., the title text is bolded in the `face` argument of `element_text()` and the legend border color is defined in the `color` argument of `element_rect()`.
The theme elements that control the position of the title and the caption are `plot.title.position` and `plot.caption.position`, respectively.
In the following plot these are set to `"plot"` to indicate these elements are aligned to the entire plot area, instead of the plot panel (the default).
#| scatterplot of highway mileage versus engine size, third plot is a
#| scatterplot of highway mileage versus city mileage, and the third plot is
#| side-by-side boxplots of highway mileage versus drive train) placed next
#| to each other.
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Plot 3")
(p1 | p3) / p2
```
Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots.
We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with `& theme(legend.position = "top")`.
Note the use of the `&` operator here instead of the usual `+`.
This is because we're modifying the theme for the patchwork plot as opposed to the individual ggplots.
The legend is placed on top, inside the `guide_area()`.
Finally, we have also customized the heights of the various components of our patchwork -- the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatterplot 4.
If you'd like to learn more about combining and layout out multiple plots with patchwork, we recommend looking through the guides on the package website: <https://patchwork.data-imaginist.com>.
### Exercises
1. What happens if you omit the parentheses in the following plot layout.
#| Three plots: Plot 1 is a scatterplot of highway mileage versus engine size.
#| Plot 2 is side-by-side box plots of highway mileage versus drive train.
#| Plot 3 is side-by-side box plots of city mileage versus drive train.
#| Plots 1 is on the first row. Plots 2 and 3 are on the next row, each span
#| half the width of Plot 1. Plot 1 is labelled "Fig. A", Plot 2 is labelled
#| "Fig. B", and Plot 3 is labelled "Fig. C".
p1 / (p2 + p3) +
plot_annotation(
tag_levels = c("A"),
tag_prefix = "Fig. ",
tag_suffix = ":"
)
```
## Summary
In this chapter you've learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot.
You've also learned about combining multiple plots in a single graph using both simple and complex plot layouts.
While you've so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we've barely scratched the surface of what you can create with ggplot2.
If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, [*ggplot2: Elegant Graphics for Data Analysis*](https://ggplot2-book.org).
Other useful resources are the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang and [*Fundamentals of Data Visualization*](https://clauswilke.com/dataviz/) by Claus Wilke.