r4ds/variation.Rmd

# Exploratory Data Analysis (EDA)

```{r include = FALSE}
library(ggplot2)
library(dplyr)
library(broom)
knitr::opts_chunk$set(fig.height = 2)
```


This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA involves iteratively

1. forming questions about your data 
2. searching for answers by visualizing and transforming your data
3. using what you discover to refine your questions about the data, or to choose new questions to investigate

There is no formal way to do Exploratory Data Analysis because you must be free to investigate every idea that occurs to you. However, some tactics will reliably lead to insights. This chapter will teach you a basic toolkit of these useful EDA techniques. Our discussion will lead to a model of data science itself, the model that I've built this book around. 

## Questions

> "There are no routine statistical questions, only questionable statistical routines."---Sir David Cox

> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey

Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to ask questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.

During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as

1. What type of **variation** occurs **within** my variables? and

2. What type of **covariation** occurs **between** my variables?

The rest of this chapter will look at these two questions. I'll explain what variation and covariation are, and I'll show you several ways to answer each question. To make the discussion easier, let's define some terms: 

* A _variable_ is a quantity, quality, or property that you can measure. 

* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
  
* An _observation_ is a set of measurements that you make under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a data point.

* _tabular data_ is a set of values, each associated with a variable and an observation. Tabular data is _tidy_ if each value is placed in its own "cell", each variable in its own column, and each observation in its own row. Throughout the rest of this chapter I will use the word data to mean tidy tabular data.

## Variation

> "What type of variation occurs within my variables?"

**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough, you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.

```{r, variation, echo = FALSE}

mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))

knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation due to measurement error obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
```

Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).

Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you have observed for the variable.

### Visualizing distributions

How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.

```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

The height of the bars displays how many observations occurred at each x value. You can compute these values manually with `table()`. 

```{r}
table(diamonds$cut)
```

A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram.

```{r message = FALSE}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
```

A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar. 

You can set the width of the intervals in a histogram with the `binwidth` argument, which  is measured in the units of the $x$ axis. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks with a binwidth of 0.01.

```{r message = FALSE}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01)
```

If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` or `geom_density2d()` instead of `geom_histogram()`. `geom_freqpoly()` makes a frequency polygon, a line that connects the tops of the bars that would appear in a histogram. Like `geom_histogram()`, `geom_freqpoly()` accepts a binwidth argument.

`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays _density_---not count---on the y axis; the area under each curve will be normalized to one, no matter how many total observations occur in the subgroup, which makes it easier to compare subgroups. 

```{r message = FALSE, fig.show='hold', fig.width=3}
zoom <- coord_cartesian(xlim = c(55, 70))

ggplot(data = diamonds) +
  geom_freqpoly(aes(x = depth, color = cut), binwidth = 0.2) +
  zoom

ggplot(data = diamonds) +
  geom_density(aes(x = depth, color = cut), adjust = 3) +
  zoom
```

### Asking questions about variation

Now that you can visualize variation, what should you look for in your plots? and what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).

* *Typical values*
   
    In both bar charts and histograms, tall bars reveal common values of a variable. Shorter bars reveal less common or rare values. Places that do not have bars reveal seemingly impossible values. To turn this information into a useful question, look for anything unexpected:
    
    + Which values are the most common? Why might that be?
    
    + Which values are the most rare? Why might that be?
    
    + Is there an unusual pattern in the distribution? Why might that be?
    
    + Do the typical values change if you look at subgroups of the data?
    
    As an example, the histogram below suggests several interesting questions: Why are there more diamonds at whole carats and common fractions of carats? Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
    
    ```{r echo = FALSE, message = FALSE, warning = FALSE, fig.height = 2}
    ggplot(data = diamonds) +
      geom_histogram(aes(x = carat), binwidth = 0.01) + xlim(0, 3)
    ```
    
* *Range of values*

    The range of values, or spread, of the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess. 
    
    ```{r echo = FALSE, message = FALSE, fig.height = 2}
    mpg$hwy2 <- mpg$hwy / 10 + 22
    ggplot(mpg) + geom_histogram(aes(x = hwy2), binwidth = 1) + xlim(10, 45)
    ```
    
    If the variable takes on a wide set of values, like below, the possibility that your guess will be far off the mark is much greater. The extreme possibilities are farther away. 
    
    ```{r echo = FALSE, message = FALSE, fig.height = 2}
    ggplot(mpg) + geom_histogram(aes(x = hwy), binwidth = 1) + xlim(10, 45)
    ```    
    
    As a quick rule, wide distributions imply less certainty when making predictions about a variable; narrow distributions imply more certainty. A distribution with only a single repeated value implies complete certainty: your variable is a constant. Ask yourself
    
    + Do your data show a surprising amount of certainty or uncertainty? Why?
    
    + Does the range of the distribution change if you look at individual subgroups of the data?
    
* *Outliers*

    Outliers are data points that do not seem to fit the overall pattern of variation, like the diamond on the far right of the histogram below. This diamond has a y dimension of `r diamonds$y[which(diamonds$y > 50)]` mm, which is much larger than the other diamonds.
    
    ```{r echo = FALSE, message = FALSE, fig.height = 2}
    ggplot(diamonds[24000:24500, ]) + geom_histogram(aes(x = y), binwidth = 0.25)
    ```   
    
    An outlier is a signal that something unique happened to the observation. Whenever you spot an outlier, ask yourself
    
    + What can explain the unusual value?
    
    If you can figure out what happened, a discovery might follow. In the case above, the unique event was a measurement error.

* *Clusters*

    Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
    
    + How are the observations within each cluster similar to each other?
    
    + How are the observations in separate clusters different from each other?
    
    + How can you explain or describe the clusters?
    
    + Why might the appearance of clusters be misleading?
   
    The histogram below displays two distinct clusters. It shows the length in minutes of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park; Old Faithful appears to oscillate between short and long eruptions.

    ```{r echo = FALSE, message = FALSE, fig.height = 2}
    ggplot(faithful) + geom_histogram(aes(x = eruptions))
    ```  

Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable. Questions about relationships are examples of the second general question that I proposed for EDA. Let's look at that question now.

## Covariation

> "What type of covariation occurs between my variables?"

If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.

### Visualizing two categorical variables

Visualize covariation between categorical variables with `geom_count()`. 

```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specifc x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.

```{r}
table(diamonds$color, diamonds$cut)
```

### Visualizing one categorical variable and one continuous variable

Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. Each boxplot consists of:

* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side. 

* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection. 

* A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-boxplot.pdf")
```

The chart below shows several boxplots, one for each level of the class variable in the mpg data set. Each boxplot represents the distribution of hwy values for points with the given level of class. To make boxplots, use `geom_boxplot()`.

```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = class, y = hwy))
```

Covariation will appear as a systematic change in the medians or IQR's of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.

```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
```

`geom_boxplot()` works best when the categorical variable is on the x axis. You can invert the axes with `coord_flip()`.

```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```

If you wish to add more information to your boxplots, use `geom_violin()`. In a violin plot, the width of the "box" displays a kernel density estimate of the shape of the distribution.

```{r}
ggplot(data = mpg) +
  geom_violin(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```


### Vizualizing two continuous variables

Visualize covariation between two continuous variables with a scatterplot, i.e. `geom_point()`. Covariation will appear as a structure or pattern in the data points. For example, an exponential relationship seems to exist between the carat size and price of a diamond.

```{r}
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price))
```

Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black (as above). You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or  `geom_density2d()`.

`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.

```{r fig.show='hold', fig.width=3}
ggplot(data = diamonds) +
  geom_bin2d(aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = diamonds) +
  geom_hex(aes(x = carat, y = price))
```

`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data when your data set is not big.


```{r}
ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_density2d()
```

### Asking questions about covariation

When you explore plots of covariation, look for the following sources of insight:

* *Outliers* 

    Two dimensional plots can reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of $x$ and $y$ values, which makes the points outliers even though their $x$ and $y$ values appear normal when examined separately.
    
    ```{r echo = FALSE}
    ggplot(data = diamonds) +
      geom_point(aes(x = x, y = y)) +
      coord_cartesian(xlim = c(3, 12), ylim = c(3, 12))
    ```


* *Clusters*

    Plots of covariation can also reveal clusters that may not be visible in plots of variation. For example, the two dimensional pattern in the plot below reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.

    ```{r echo = FALSE}
    ggplot(data = iris, aes(y = Sepal.Length, x = Sepal.Width)) +
      geom_jitter() +
      geom_density2d(h= c(1,1)) +
      geom_rug(position = "jitter")
    ```

* *Patterns*

    Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
    
    + Could this pattern be due to coincidence (i.e. random chance)?
    
    + How can you describe the relationship implied by the pattern?
    
    + How strong is the relationship implied by the pattern?
    
    + What other variables might affect the relationship?
    
    + Does the relationship change if you look at individual subgroups of the data?
    
    A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also reveals the two clusters that we noticed above.
    
    ```{r echo = FALSE, message = FALSE, fig.height = 2}
    ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
    ``` 

    Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

### Visualizing three or more variables

In general, outliers, clusters, and patterns become easier to spot as you look at the interaction of more and more variables. However, as you include more variables in your plot, data becomes harder to visualize.

You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional," graph that you can rotate with your mouse. Below is an example from plotly, displayed as a static image.

```{r eval = FALSE}
library(plotly)
plot_ly(data = iris, x = Sepal.Length, y = Sepal.Width, z = Petal.Width, color = Species, type = "scatter3d", mode = "markers")
```

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```

You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retinas. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of 3D space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.

This doesn't mean that you should ignore complex interactions in your data. You can explore multivariate relationships in several ways. You can

* visualize each combination of variables in a multivariate relationship, two at a time

* use aesthetics and facetting to add additional variables to a 2D plot

* use a clustering algorithm to spot clusters in multivariate space

* use a modeling algorithm to spot patterns and outliers in multivariate space

## Clusters

A clustering algorithm computes the distances between data points in n-dimensional space. It then uses an algorithm to group points into clusters based on how near or far they are from each other.

## Models

A model is a type of summary that describes the relationships in your data. You can use a model to reveal patterns and outliers that only appear in n-dimensional space. To see how this works, consider the simple linear model below. I've fit it to a two dimensional pattern so we can visualize the results.

```{r echo = FALSE}
diamonds2 <- filter(diamonds, x > 3, y > 3, y < 12)
diamond_mod <- lm(y ~ x, data = diamonds2)
resids <- augment(diamond_mod)
diamonds3 <- bind_rows(filter(resids, abs(.resid) > 0.5),  
                       sample_n(filter(resids, abs(.resid) <= 0.5), 1000)) %>% 
  select(x, y)

ggplot(diamonds3, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

The model describes the relationship between x and y as 

$$\hat{y} = 0.13 + 0.98 x$$

which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$ such that a one unit increase in $x$ is associated with an approximately one unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$ to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).

Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have a residual that is an outlier in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimenisonal plot. 

```{r echo = FALSE, fig.width = 3, fig.show='hold'}
diamond_mod <- lm(y ~ x, data = diamonds3)
resids <- augment(diamond_mod)

ggplot(resids) +
  geom_histogram(aes(x = .resid), binwidth = 0.1)

ggplot(resids) +
  geom_point(aes(x = x, y = .resid))
```

You can examine coefficients, model statistics, and residuals of a model fit to an n-dimensional relationship in the same way, without visualizing the raw data in n-dimensional space.

I'll postpone teaching you how to fit and interpret models with R until Part 4. Altough models are something simple, a description of your data set, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. You'll be able to do more with models if you learn a few more skills before you begin to model data.

## A last word on variables and observations

Variables, observations, and visualization

Every data set contains more information than it displays. You can use the values in your data to calculate new variables, as well as new, group-level, observations. This section will show you how to calculate new variables and observations, which you can use in visualizations, clustering algorithms, and modeling algorithms.

### Making new variables

Use dplyr's `mutate()` function to calculate mew variables from your existing variables.

```{r}
diamonds %>% 
  mutate(volume = x * y * z) %>% 
  head()
```

The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purr. You will learn more about purrr in Chapter ?.

### Making new observations

If your data set contains subgroups, you can derive a new data set from it of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.

```{r}
mpg %>% 
  group_by(class) %>% 
  summarise(n_obs = n(), avg_hwy = mean(hwy), sd_hwy = sd(hwy))
```

Group level observations and group geoms

## Summary

Data is difficult to comprehend, which means that you need to visualize, model, and transform it. 

Once you comprehend the information in your data, you can make inferences from your data.

But all of this will involve a computer. To make head way, you will need to know how to program in a computer language (R), import data to use with that language, and tidy the data into the format that works best for that language.

When you are finished you will want to report and reproduce your results.