# Exploratory Data Analysis (EDA)

```{r include = FALSE}
library(ggplot2)
library(dplyr)
knitr::opts_chunk$set(fig.height = 2)
```

Visualization and transformation are the most useful tools for exploring your data, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA involves iteratively

1. forming questions about your data
2. searching for answers by visualizing and transforming your data
3. using what you discover to refine your questions about the data, or to choose new questions to investigate

There is no formal way to do Exploratory Data Analysis because you must be free to investigate every insight that occurs to you. However, some tactics will lead more reliably to insights than others. This chapter will teach you a basic toolkit of the most useful EDA techniques.

## Questions

> "There are no routine statistical questions, only questionable statistical routines."---Sir David Cox

> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey

EDA begins with questions. The questions that you ask about your data will guide your attention as you search for insights. Good questions will lead you to discoveries that let you ask better questions.

There is no rule about which questions you should ask to guide your research. You will often begin with one set of questions and then replace them as your understanding of your data deepens. If you ever find yourself at a loss for questions, two types of questions will always be useful for making discoveries with your data. You can loosely word them as

1. What type of **variation** occurs **within** my variables? and

2. What type of **covariation** occurs **between** my variables?

The rest of this chapter will look at these two questions. I'll show you the best ways to use visualization and summaries to explore variation and covariation. Our discussion will lead to a model of data science itself, the model that we've built this book around. To make the discussion easier, let's define some terms:

* A _variable_ is a quantity, quality, or property that you can measure.

* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

* An _observation_ is a set of measurements that you make under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a data point.

## Variation

> "What type of variation occurs within my variables?"

**Variation** is the tendency of the values of a variable to change from measurement to measurement.

You can see variation easily in real life: if you measure any continuous variable twice, and precisely enough, you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.

```{r, variation, echo = FALSE}

mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))

knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation due to measurement error obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
```

Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron).

Each variable contains its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you have observed for the variable.

### Visualizing distributions

How you visualize the distribution will depend on whether your variable is categorical or continuous.

A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.

```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

The height of the bars displays how many observations occurred with each x value. If you would like these exact values, you can compute them with R's `table()` function.

```{r}
table(diamonds$cut)
```

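If you would rather keep the counts in a data frame, one option is dplyr's `count()`. The chunk below is a minimal sketch that assumes the ggplot2 and dplyr packages are installed; `cut_counts` is just an illustrative name.

```{r}
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# One row per level of cut, with the number of diamonds in column n
cut_counts <- count(diamonds, cut)
cut_counts
```

The `n` column holds the same numbers that set the bar heights in the chart above.
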
A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram.

```{r message = FALSE}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
```

A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.

In a histogram, the intervals are known as **bins** and the process of creating intervals is known as **binning**. You can set the binwidth of the intervals with the `binwidth` argument of `geom_histogram()`, which is measured in the units of the $x$ axis. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks with a binwidth of 0.01.

```{r message = FALSE}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01)
```

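To check bar heights numerically, you can bin the variable yourself. The sketch below assumes ggplot2's `cut_width()` helper together with dplyr's `count()`; `carat_bins` and the column name `bin` are illustrative. With a width of 0.5 and its default settings, `cut_width()` should place bin edges at 0.25, 0.75, and so on, which matches the first histogram in this section.

```{r}
library(ggplot2)
library(dplyr)

# Assign each diamond to a 0.5-carat-wide bin, then count the diamonds per bin
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, width = 0.5))
carat_bins
```

Changing `width` here has the same effect on the counts that changing `binwidth` has on the histogram.
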
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` or `geom_density()` instead of `geom_histogram()`. `geom_freqpoly()` makes a frequency polygon, a line that connects the tops of the bars that would appear in a histogram. Like `geom_histogram()`, `geom_freqpoly()` accepts a `binwidth` argument.

`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays $density$, not $count$, on the y axis; the area under each curve will be normalized to one, no matter how many total observations occur in the subgroup.

```{r message = FALSE, fig.show='hold', fig.width=3}
zoom <- coord_cartesian(xlim = c(55, 70))

ggplot(data = diamonds) +
  geom_freqpoly(aes(x = depth, color = cut), binwidth = 0.2) +
  zoom

ggplot(data = diamonds) +
  geom_density(aes(x = depth, color = cut), adjust = 3) +
  zoom
```

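You can check the claim that each density curve encloses an area of one. The sketch below uses base R's `density()` as a stand-in for the estimate that `geom_density()` draws (an assumption about what the geom computes, made for illustration) and approximates the area with a simple Riemann sum. The names `ideal_depth`, `dens`, `step`, and `area` are illustrative.

```{r}
library(ggplot2)  # provides the diamonds data set

# Kernel density estimate of depth for a single subgroup
ideal_depth <- diamonds$depth[diamonds$cut == "Ideal"]
dens <- density(ideal_depth, adjust = 3)

# density() evaluates the curve on an evenly spaced grid,
# so the area is roughly the sum of the heights times the grid spacing
step <- diff(dens$x[1:2])
area <- sum(dens$y) * step
area
```

The result should be very close to one, and it should stay close to one if you swap in a different level of `cut`, even though the subgroups differ greatly in size.
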
## Follow up questions

Now that you can visualize variation, what should you look for in your plots? Here are the most useful types of information in any graph:

* *Typical Values*

In both bar charts and histograms, tall bars reveal common values of a variable. Shorter bars reveal less common or rare values. Places that do not have bars reveal values that were not seen in your data. To turn this information into a useful question, look for anything unexpected:

+ Which values are the most common? Why?

+ Which values are the most rare? Why?

+ Is there an unusual pattern to the frequencies? Why?

+ Do the typical values change if you look at individual subgroups of the data?

For example, the histogram below suggests several interesting questions: Why are there more diamonds at whole carats and common fractions of carats? Why are there slightly more diamonds above each of these peaks than there are slightly below each of these peaks?

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) + xlim(0, 3)
```

* *Range of values*

The range, or spread, of values in the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess.

```{r echo = FALSE, message = FALSE, fig.height = 2}
mpg$hwy2 <- mpg$hwy / 10 + 22
ggplot(mpg) + geom_histogram(aes(x = hwy2), binwidth = 1) + xlim(10, 45)
```

If the variable takes on a wide set of values, like below, the possibility that your guess will be far off the mark is much greater. The extreme possibilities are farther away.

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(mpg) + geom_histogram(aes(x = hwy), binwidth = 1) + xlim(10, 45)
```

As a quick rule, narrow distributions imply less uncertainty when making predictions about a variable; wide distributions imply more uncertainty. Ask yourself:

+ Do your data show a surprising amount of certainty or uncertainty? Why?

+ Does the spread of the data change if you look at individual subgroups of the data?

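One way to approach the subgroup question is to summarise the spread numerically. The sketch below assumes dplyr is available and uses the drive type (`drv`) of the `mpg` data as the grouping variable; that choice, and the name `hwy_spread`, are only for illustration.

```{r}
library(ggplot2)  # provides the mpg data set
library(dplyr)

# Range, inter-quartile range, and standard deviation of hwy within each drive type
hwy_spread <- mpg %>%
  group_by(drv) %>%
  summarise(
    min = min(hwy),
    max = max(hwy),
    iqr = IQR(hwy),
    sd  = sd(hwy)
  )
hwy_spread
```

If the rows differ noticeably, the spread depends on the subgroup, which is itself a hint of covariation.
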
* *Outliers*

Outliers are data points that do not seem to fit the overall pattern of variation, like the diamond on the far right of the histogram below. This diamond has a y dimension of `r diamonds$y[which(diamonds$y > 50)]` mm, which is much larger than the other diamonds.

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(diamonds[24000:24500, ]) + geom_histogram(aes(x = y), binwidth = 0.25)
```

An outlier is a signal that something unique happened to the observation. Whenever you spot an outlier, ask yourself:

+ What can explain the unusual value?

If you can figure out what happened, a discovery might follow. In the case above, the unique event was a measurement error.

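Once a plot suggests an outlier, it helps to pull the matching rows out of the data and look at their other values. This is a minimal sketch assuming dplyr; the 20 mm cutoff is an arbitrary threshold chosen for illustration, and `unusual` is an illustrative name.

```{r}
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Keep only diamonds whose y dimension is implausibly large,
# along with the columns that help judge whether the value is an error
unusual <- diamonds %>%
  filter(y > 20) %>%   # 20 mm is an assumed cutoff, not a rule
  select(carat, price, x, y, z)
unusual
```

Comparing `y` with `x`, `z`, and `carat` for these rows is one way to judge whether the value is plausible or a recording mistake.
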
* *Clusters*

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

+ How are the observations within each cluster similar to each other?

+ How are the observations in separate clusters different from each other?

+ How can you explain or describe the clusters?

The histogram below displays the length in minutes of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. You can spot two distinct clusters; Old Faithful appears to oscillate between short and long eruptions.

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_histogram(aes(x = eruptions))
```

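A rough way to start describing the clusters is to split the data at a value between the two peaks and compare the groups. The sketch below assumes dplyr; the 3 minute split point is read off the histogram by eye and is an assumption, and `eruption_groups` is an illustrative name.

```{r}
library(dplyr)

# Label each eruption as short or long, then summarise each group
eruption_groups <- faithful %>%
  mutate(cluster = ifelse(eruptions < 3, "short", "long")) %>%  # 3 minutes: assumed cutoff
  group_by(cluster) %>%
  summarise(
    n = n(),
    mean_eruption = mean(eruptions),
    mean_waiting = mean(waiting)
  )
eruption_groups
```

Differences between the two rows in a second variable, such as `waiting`, suggest where to look for an explanation of the clusters.
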
To answer many of the follow-up questions above, you will need to make a new graph that includes two or more variables and then look for:

* *Patterns*

Patterns in your data provide clues about covariation. If a relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

+ Could this pattern be due to coincidence (i.e. random chance)?

+ How can you describe the relationship implied by the pattern?

+ How strong is the relationship implied by the pattern?

+ What other variable might be involved in the relationship?

+ Does the relationship change if you look at individual subgroups of the data?

Each of these questions is an example of the second general question that I proposed for EDA. Let's look at that question now.

## Covariation

> "What type of covariation occurs between my variables?"

If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a systematic way.

### Visualizing covariation

The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variable.

#### Two categorical variables

Visualize covariation between categorical variables with `geom_count()`.

```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```

The size of each circle in the plot displays how many observations occurred at each combination of values. As with bar charts, you can calculate the specific values with `table()`. Covariation will appear as a strong association between specific x values and specific y values.

```{r}
table(diamonds$color, diamonds$cut)
```

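An alternative view of the same counts is a tile plot. The sketch below is one possible approach, not the only one: it assumes dplyr's `count()` to tabulate each combination and ggplot2's `geom_tile()` to map the count to fill color. `cut_color` is an illustrative name.

```{r}
library(ggplot2)
library(dplyr)

# One row per combination of color and cut, with the count in column n
cut_color <- count(diamonds, color, cut)

# Darker or lighter tiles mark combinations that are more or less common
ggplot(data = cut_color) +
  geom_tile(mapping = aes(x = cut, y = color, fill = n))
```

Some readers find differences in fill easier to compare than differences in circle size, although that is a matter of preference.
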
#### One categorical variable and one continuous variable

Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. You make a boxplot with `geom_boxplot()`. The chart below shows several boxplots, one for each level of the cut variable. Each boxplot represents the distribution of depth values for points with the given level of cut.

```{r}
ggplot(data = diamonds) +
  geom_boxplot(aes(x = cut, y = depth))
```

How should you interpret a boxplot? Each boxplot consists of:

* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side.

* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These points are unusual enough that they may be outliers, so they are plotted individually for inspection. Since diamonds is a large data set, quite a few points fall in this range.

* A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.

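The pieces of a boxplot described above can be computed by hand, which may make them easier to remember. The sketch below does this for the depth of Fair cut diamonds using base R's `quantile()`. It follows the definitions given in the list, so the numbers may differ slightly from what `geom_boxplot()` draws if the geom uses a different quantile method. All variable names are illustrative.

```{r}
library(ggplot2)  # provides the diamonds data set

fair_depth <- diamonds$depth[diamonds$cut == "Fair"]

# Edges of the box and the median line
q <- quantile(fair_depth, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Observations beyond these fences are drawn as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
n_outlying <- sum(fair_depth < lower_fence | fair_depth > upper_fence)

# Whiskers reach to the most extreme observations inside the fences
inside <- fair_depth[fair_depth >= lower_fence & fair_depth <= upper_fence]
whiskers <- range(inside)

q
whiskers
n_outlying
```
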
Boxplots make it especially easy to see if the locations or spreads of the distributions change across values. Simply compare the median lines of the distributions, or the heights of the boxes. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median depth value of each group.

```{r}
ggplot(data = diamonds) +
  geom_boxplot(aes(x = reorder(cut, depth, FUN = median), y = depth))
```

`geom_boxplot()` works best when the x variable is categorical, but if you wish to invert the axes, you can easily do so with `coord_flip()`.

```{r}
ggplot(data = diamonds) +
  geom_boxplot(aes(x = cut, y = depth)) +
  coord_flip()
```

`geom_violin()` provides an alternate version of a boxplot. In a violin plot, the width of the "box" displays a kernel density estimate of the shape of the distribution.

```{r}
ggplot(data = diamonds) +
  geom_violin(aes(x = cut, y = depth)) +
  coord_flip()
```

#### Two continuous variables

Visualize covariation between two continuous variables with a scatterplot. Covariation will appear as a structure or pattern in the data points. For example, we saw in Chapter 1 that a positive relationship exists between the carat size and price of a diamond.

```{r}
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price))
```

The easiest relationship to spot between two variables is a straight line. Often you can get a sense of the exact non-linear relationship between two variables by "bending" the data into a straight line with `coord_trans()`. For example, the series of charts below suggests that price and carat size have a relationship of the form $price = carat^{\beta}$. (This type of relationship is straightened by a log-log transform.)

```{r fig.show='hold', fig.width=3, fig.height=3}
p <- ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price))

p + coord_trans(y = "sqrt")
p + coord_trans(y = "log10")
p + coord_trans(x = "log10", y = "log10")
```

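If the log-log plot looks like a straight line, its slope gives a rough estimate of $\beta$. The reasoning is that a relationship of the form $price = \alpha \cdot carat^{\beta}$ becomes $\log_{10}(price) = \log_{10}(\alpha) + \beta \log_{10}(carat)$ after taking logs; the constant $\alpha$ is an assumption added here so that the line does not have to pass through the origin. The sketch below fits that line with base R's `lm()`. It is meant as a quick exploratory check rather than a careful model, and `fit` and `beta` are illustrative names.

```{r}
library(ggplot2)  # provides the diamonds data set

# Straight-line fit on the log-log scale
fit <- lm(log10(price) ~ log10(carat), data = diamonds)

# The slope of the line is the estimated exponent
beta <- coef(fit)[["log10(carat)"]]
beta
```

A slope greater than one would mean that price rises faster than carat size.
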
Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black. You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.

`geom_bin2d()` and `geom_hex()` both divide the coordinate plane into two dimensional bins and then use fill to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.

```{r fig.show='hold', fig.width=3, fig.height=3}
ggplot(data = diamonds) +
  geom_bin2d(aes(x = carat, y = price))

ggplot(data = diamonds) +
  geom_hex(aes(x = carat, y = price))
```

`geom_density2d()` fits a 2D kernel density estimate to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data.

```{r}
ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_density2d()
```

As you search for covariation, also keep an eye out for outliers and clusters. Two dimensional plots can reveal outliers and clusters that are not visible in one dimensional plots.

For example, some points in the plot on the left have an unusual combination of $x$ and $y$ values, which makes them outliers even though their $x$ and $y$ values appear common when examined separately.

The two dimensional pattern in the plot on the right reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.

```{r fig.show='hold', fig.width=3, fig.height=3}
ggplot(data = diamonds) +
  geom_point(aes(x = x, y = y)) +
  coord_cartesian(xlim = c(3, 12), ylim = c(3, 12))

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter() +
  geom_rug(position = "jitter")
```

#### Three or more variables

You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional" graph that you can rotate with your mouse. Below is an example from plotly.

```{r eval = FALSE}
library(plotly)
# plotly 4.0 and later expects data columns to be given as formulas (~)
plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Width,
  color = ~Species, type = "scatter3d", mode = "markers")
```

![](images/EDA-plotly.png)

You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retina. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.

You can use aesthetics such as color and size to add a third or fourth variable to a two dimensional graph. You can also explore complex relationships two variables at a time. If you need to investigate the relationship between multiple variables at the same time, I recommend that you use modelling techniques, which we will discuss soon.

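As a small illustration of adding variables through aesthetics, the sketch below maps four variables from the `mpg` data onto one two dimensional scatterplot. The choice of variables is arbitrary and only meant to show the mechanics.

```{r}
library(ggplot2)

# x and y carry two variables; color and size carry a third and a fourth
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, size = cyl))
```

Plots like this become hard to read quickly, which is one reason to move on to models when many variables are involved.
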