This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.
To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
In this chapter we'll combine what you've learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." --- John Tukey
Your goal during EDA is to develop an understanding of your data.
The easiest way to do this is to use questions as tools to guide your investigation.
When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
And like most creative processes, the key to asking *quality* questions is to generate a large *quantity* of questions.
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.
On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.
- A **value** is the state of a variable when you measure it.
The value of a variable may change from measurement to measurement.
- An **observation** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
An observation will contain several values, each associated with a different variable.
I'll sometimes refer to an observation as a data point.
**Variation** is the tendency of the values of a variable to change from measurement to measurement.
You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results.
This is true even if you measure quantities that are constant, like the speed of light.
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information.
The best way to understand that pattern is to visualise the distribution of the variable's values.
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
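Here's a minimal sketch of such a histogram, using the `diamonds` dataset that ships with ggplot2 (the 0.5 binwidth is just one choice):

```r
library(ggplot2)

# Count how many diamonds fall into each 0.5-carat bin
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```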
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable.
You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
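A sketch of that zoom, assuming dplyr is loaded alongside ggplot2 (the 0.1 binwidth is illustrative):

```r
library(dplyr)
library(ggplot2)

# Keep only diamonds smaller than three carats
smaller <- diamonds %>%
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```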
Now that you can visualise variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).
The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park.
Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
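A sketch of that histogram, using the `faithful` dataset that ships with base R (the 0.25-minute binwidth is one reasonable choice):

```r
# eruptions records the length of each eruption in minutes
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)
```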
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable.
There are so many observations in the common bins that the rare bins are too short to see (although maybe if you stare intently at 0 you'll spot something).
To make it easy to see the unusual values, we need to zoom to small values of the y-axis with `coord_cartesian()`:
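A sketch (the `ylim` cutoff of 50 is an arbitrary choice for this illustration):

```r
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  # Zoom the count axis without dropping any data
  coord_cartesian(ylim = c(0, 50))
```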
The `y` variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can't have a width of 0mm, so these values must be incorrect.
We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
It's good practice to repeat your analysis with and without the outliers.
If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on.
However, if they have a substantial effect on your results, you shouldn't drop them without justification.
You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
1. Drop the entire row with the strange values.
I don't recommend this option: just because one measurement is invalid doesn't mean all the measurements in that row are.
Additionally, if you have low quality data, by the time you've applied this approach to every variable you might find that you don't have any data left!
2. Instead, I recommend replacing the unusual values with missing values.
The easiest way to do this is to use `mutate()` to replace the variable with a modified copy.
You can use the `ifelse()` function to replace unusual values with `NA`:
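A sketch on the diamonds data, assuming dplyr is loaded (the cutoffs of 3mm and 20mm are judgment calls based on the plots above):

```r
# Any width outside the plausible range becomes NA
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```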
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution.
These three lines give you a sense of the spread of the distribution and whether the distribution is symmetric about the median or skewed to one side.
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot).
It supports the counterintuitive finding that better quality diamonds are cheaper on average!
In the exercises, you'll be challenged to figure out why.
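For reference, a compact comparison like this can be sketched as follows; because `cut` is an ordered factor, the boxplots line up from Fair to Ideal:

```r
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```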
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of "outlying values".
One approach to remedy this problem is the letter value plot.
Install the lvplot package, and try using `geom_lv()` to display the distribution of price vs cut.
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.
For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
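The base plot that those packages would reorder or make interactive is a simple count-based tile plot; a sketch:

```r
# Tile colour encodes how many diamonds share each colour/cut combination
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
```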
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above).
You've already seen one way to fix the problem: using the `alpha` aesthetic to add transparency.
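For example, a sketch with a small alpha so that dense regions read as darker (1/100 is an illustrative value):

```r
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```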
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`.
By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points.
One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
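A sketch combining the two ideas, reusing the `smaller` subset of diamonds under three carats from earlier (the 0.1-carat width is illustrative):

```r
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  # One boxplot per 0.1-carat bin; wider boxes mean more diamonds
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```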
3. How does the price distribution of very large diamonds compare to small diamonds?
Is it as you expect, or does it surprise you?
4. Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price.
5. Two dimensional plots reveal outliers that are not visible in one dimensional plots.
For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions.
The scatterplot also displays the two clusters that we noticed above.
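A sketch of that scatterplot:

```r
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))
```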
Models are a tool for extracting patterns out of data.
For example, consider the diamonds data.
It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.
It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
The following code fits a model that predicts `price` from `carat` and then computes the residuals (the difference between the actual value and the predicted value).
The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
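A sketch of that idea with `lm()` and modelr's `add_residuals()`; the log-log specification is one reasonable choice for this strongly skewed relationship, not the only one:

```r
library(modelr)

# Model price as a function of carat on the log scale
mod <- lm(log(price) ~ log(carat), data = diamonds)

# Attach residuals and back-transform them to the original scale
diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))

ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))
```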
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
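Plotting those residuals against cut makes the pattern visible; a sketch:

```r
ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
```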
You'll learn how models, and the modelr package, work in the final part of the book, [model](#model-intro).
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
If you want to learn more about the mechanics of ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>.
It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation.
Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.