---
layout: default
title: Data Visualization
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
```
# Visualize Data
Visualization makes data decipherable. Have you ever tried to study data in a table? From a cognitive standpoint, tabular data is difficult to comprehend. You can examine table values one at a time, but you cannot attend to many values at once. This makes it hard to spot patterns in the data. See this for yourself; can you spot the striking relationship between $X$ and $Y$ in the table below?
```{r echo=FALSE}
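# Sample 12 noisy x values and place the matching points on a circle
X <- rep(seq(0.1, 1.9, length.out = 6), 2) + runif(12, -0.1, 0.1)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
# Shuffle all 12 rows so the pattern is not obvious in the table
row_order <- sample(seq_along(X))
knitr::kable(round(data.frame(X = X[row_order], Y = Y[row_order]), 2))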
```
In contrast, visualized data is cognitively transparent. Once you visualize data in a graph, you can instantly see the relationships between data points. You can spot the structure of the data, and you can read off individual values as necessary.
```{r echo=FALSE}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
This chapter will teach you how to visualize your data with R and the `ggplot2` package. R contains a number of methods for making graphs, but the `ggplot2` method is one of the most beautiful and most versatile. `ggplot2` implements the *grammar of graphics*, a coherent system for describing and building graphs. The advantage is tremendous: with `ggplot2`, you can do more, and do it faster, by learning one system and applying it in many places.
## Outline
In *Section 1*, you will learn how to make scatterplots, the most popular type of data visualization. Along the way, you will learn how to add information to your plots with color, size, shape, and facets, and how to change the "type" of your plot with _geoms_.
*Section 2* shows how to build bar charts. Here you will learn how to plot summaries of your data with _stats_ and how to control the placement of objects with _positions_.
*Section 3* explains how to make histograms and how to fine-tune your plots with _parameters_. You will also learn the best ways to display comparisons in your plots.
*Section 4* draws on examples in the first three sections to teach the _grammar of graphics_, a versatile system for describing and building any plot.
*Section 5* concludes the chapter by showing how to customize your plots with labels, legends, and color schemes.
## Prerequisites
```{r}
library(ggplot2)
```
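
The examples in this chapter use the `mpg` data set that ships with `ggplot2`. As a quick orientation, the chunk below peeks at the variables that the plots rely on. The column selection is only an illustrative choice.

```{r}
library(ggplot2)

# mpg is a data frame of fuel economy data bundled with ggplot2
dim(mpg)

# Show the first rows of the variables used in this chapter
head(mpg[, c("displ", "hwy", "cyl", "drv", "class")])
```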
## Scatterplots
What is the relationship between fuel efficiency and engine size?
```{r}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))
```
The `ggplot2` syntax may seem strange at first. For this section, just go with it. Make the graphs and experiment by changing the code whenever an idea strikes you. Concentrate on being able to visualize the data. Enjoy your new powers. In the next section, we will explain the reasons behind the syntax.
Visualizations help you discover what you do not expect to see. Here, a few groups of points fall outside the overall trend. How can we explain these outlying groups? By adding a third variable to the plot.
### Aesthetics
An aesthetic is a visual property of the point. We can map aesthetics to variables in our data set to convey information.
```{r}
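# Map class to the color of each point
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))
# Map class to shape (the default palette has only six shapes for seven classes)
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class))
# Map class to size (ggplot2 warns that size is not advised for a discrete variable)
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = class))
# Map class to alpha, the transparency of each point
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, alpha = class))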
```
The two groups are hybrids and sports cars.
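
One way to check which cars make up an outlying group is to subset the data directly. The sketch below looks at cars with large engines but relatively high highway mileage. The cutoffs `displ > 5` and `hwy > 20` are illustrative assumptions chosen by eye, not values from the text.

```{r}
library(ggplot2)

# Cars with big engines that still get comparatively good highway mileage
# (the thresholds are assumptions chosen by eye from the scatterplot)
outlying <- subset(mpg, displ > 5 & hwy > 20)

# Tabulate the class of each car in this group
table(outlying$class)
```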
What is a variable? What are the differences between discrete and continuous variables?
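
To illustrate the distinction, the sketch below maps color first to `class`, a discrete variable, and then to `cty` (city miles per gallon), a continuous one. `ggplot2` picks a palette of distinct colors in the first case and a color gradient in the second. The choice of `cty` is an assumption made for the example.

```{r}
library(ggplot2)

# Discrete variable: each class gets its own distinct color
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

# Continuous variable: values are shown along a color gradient
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = cty))
```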
### Facets
Facets divide your plot into subplots. Each subplot shows a subset of related data points.
```{r}
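# One column of subplots per value of cyl
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
# One row of subplots per value of drv
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
# A grid with drv in the rows and cyl in the columns
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
# One subplot per class, wrapped over several rows
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class)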
```
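
`facet_wrap()` also accepts `nrow` and `ncol` arguments that control how the subplots are laid out. A minimal sketch, where two rows is an arbitrary choice:

```{r}
library(ggplot2)

# Wrap the seven class subplots into two rows
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
p
```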
### Geoms
How are these two plots similar?
```{r echo = FALSE, message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))
ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy))
```
They both contain the same x variable, the same y variable, and if we look closely, we can see that they are plotting the same data. But the plots are not identical.
Each plot uses a different visual object to represent the data. The first plot represents each observation in the data set with a point. The second plot represents the entire group of observations with a smoothed line. You could say that these two graphs are different "types" of plots, or that they "draw" different things. In `ggplot2` syntax, we say that they use different _geoms_.
_geom_ is short for geometrical object. The geom of the plot determines what type of visual object the plot uses to represent the data. So far, all of our plots have used the point geom, which is how you create scatterplots.
The new plot uses the smooth geom, a smooth line fitted to the data. You can use different geoms to plot the same data. To change the geom in your plot, change the `geom_` function that you add to `ggplot()`. For example, to go from the first plot above to the second, replace `geom_point()` in
```{r eval=FALSE}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))
```
with `geom_smooth()`, like this:
```{r eval=FALSE, message = FALSE}
ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy))
```
`ggplot2` comes with 37 `geom_` functions that you can use (the exact number depends on the version you have installed). You can also find additional `geom_` functions in other R packages. All `geom_` functions behave similarly. They each take a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line.
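
For instance, the smooth geom understands a `linetype` aesthetic that the point geom does not. The sketch below, which assumes we want to separate cars by drive train (`drv`), draws one line per value of `drv`:

```{r, message = FALSE}
library(ggplot2)

# linetype makes sense for lines, so geom_smooth() can use it
p <- ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy, linetype = drv))
p
```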
#### Multiple geoms
You can add multiple geoms to the same plot by adding multiple `geom_` functions to the plot call. For instance, it is common to combine a geom that displays the raw data with a geom that displays a summary of the data:
```{r, message = FALSE}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  geom_smooth(aes(x = displ, y = hwy))
```
`ggplot2` will place each new geom on top of the preceding geom. This system lets you build a sophisticated graph layer by layer, that is, geom by geom.
Notice that our call now contains some redundant code. We call `aes(x = displ, y = hwy)` twice. It is unwise to repeat code because each repetition creates a chance to make a typo or error. Repetitions also make your code harder to read and write.
You can avoid repetition by passing a set of aesthetics to `ggplot()`. `ggplot2` will treat these aesthetics as global mappings that apply to each geom in the graph. You can then remove the mapping arguments in the individual layers.
```{r, message = FALSE}
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```
If you supply a mapping argument in a `geom_` function, `ggplot2` will add the local aesthetics to the global aesthetics _for that geom only_. This provides an easy way to differentiate geoms.
```{r, message = FALSE}
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
```
Notice that the smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings. In that case, `geom_smooth()` will draw a separate, differently colored line for each class of cars.
```{r, message = FALSE, warning = FALSE}
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth()
```
Local aesthetics will override global aesthetics in any geom where the two conflict.
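
As an illustration of such a conflict, the sketch below maps color to `class` globally and then overrides it with `drv` inside `geom_smooth()`. The points keep the global mapping, while the smooth layer draws one line per drive train. Using `drv` here is an assumption made for the example. Note that both variables end up sharing a single color legend.

```{r, message = FALSE}
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +                  # inherits the global color = class
  geom_smooth(aes(color = drv))   # local color = drv overrides the global mapping
p
```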
## Bar Charts
After scatterplots, the most common type of plot is probably the bar chart. A bar chart is simply a graph that uses the bar geom.
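
A minimal sketch of a bar chart, assuming we want to count the cars in each `class`. `geom_bar()` needs only an `x` aesthetic here because it counts the rows in each group for us:

```{r}
library(ggplot2)

# One bar per class; bar height is the number of cars in that class
p <- ggplot(mpg) +
  geom_bar(aes(x = class))
p
```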
### Stats
### Position
Why does our graph appear to only display 126 points? There are 234 observations in the data set. Why do the points appear to be arranged on a grid?
The `hwy` and `displ` measurements were rounded to the nearest integer and the nearest tenth, respectively. As a result, many points overlap each other. In other words, their positions collide. You can prevent this by adding a little bit of random noise to each point.
```{r}
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), position = "jitter")
```
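
You can confirm the overplotting by counting the distinct combinations of `displ` and `hwy`. The sketch below also shows `geom_jitter()`, a shortcut for `geom_point(position = "jitter")`. The `width` and `height` values are arbitrary choices for illustration.

```{r}
library(ggplot2)

# Number of distinct (displ, hwy) positions versus number of observations
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))
c(positions = n_positions, observations = nrow(mpg))

# geom_jitter() adds random noise; width and height set how far points may move
ggplot(mpg) +
  geom_jitter(aes(x = displ, y = hwy), width = 0.1, height = 0.5)
```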
## Histograms
## The Grammar of Graphics
## Customizing Plots