Merge branch 'master' of github.com:hadley/r4ds

This commit is contained in:
hadley 2016-07-25 15:07:06 -05:00
commit df7c438612
2 changed files with 21 additions and 21 deletions

38
EDA.Rmd

@ -2,17 +2,17 @@
## Introduction ## Introduction
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an interative cycle. You: This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:
1. Generate questions about your data. 1. Generate questions about your data.
1. Search for answers by visualizing, transforming, and modeling your data. 1. Search for answers by visualising, transforming, and modelling your data.
1. Use what you learn to refine your questions and/or generate new questions. 1. Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues you will home in on a few particularly productive areas that you'll eventually write up and communicate to others. EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to your on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you're asking questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling. EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you're asking questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
### Prerequisites ### Prerequisites
@ -32,7 +32,7 @@ library(dplyr)
> vague, than an exact answer to the wrong question, which can always be made > vague, than an exact answer to the wrong question, which can always be made
> precise." --- John Tukey > precise." --- John Tukey
Your goal during EDA is to develop your understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make. Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find. EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.
@ -79,11 +79,11 @@ options(old)
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable's values. Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable's values.
### Visualizing distributions ### Visualising distributions
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart: How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
```{r} ```{r}
ggplot(data = diamonds) + ggplot(data = diamonds) +
@ -109,7 +109,7 @@ You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_wid
diamonds %>% count(cut_width(carat, 0.5)) diamonds %>% count(cut_width(carat, 0.5))
``` ```
A histogram divides the x axis into equally spaced bins and then uses the height of a bar to display the number observations fall in each bun. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar. A histogram divides the x axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth. You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
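A minimal sketch of that zoomed-in view. The chapter's own code for this step falls outside the hunk, so the definition of `smaller` as the diamonds under three carats and the 0.1 binwidth are assumptions made here for illustration:

```{r}
library(ggplot2)
library(dplyr)

# Assumption: `smaller` keeps only the diamonds under three carats
smaller <- diamonds %>% filter(carat < 3)

# A narrower binwidth (0.1 carat rather than 0.5) exposes finer structure
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```

Rerunning the last call with a few different `binwidth` values is a quick way to see how much the apparent shape depends on that choice.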
@ -127,7 +127,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
geom_freqpoly(binwidth = 0.1) geom_freqpoly(binwidth = 0.1)
``` ```
Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?). Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
### Typical values ### Typical values
@ -135,7 +135,7 @@ In both bar charts and histograms, tall bars show the common values of a variabl
* Which values are the most common? Why? * Which values are the most common? Why?
* Which values are the rare? Why? Does that match your expectations? * Which values are rare? Why? Does that match your expectations?
* Can you see any unusual patterns? What might explain them? * Can you see any unusual patterns? What might explain them?
@ -211,8 +211,8 @@ When you discover an outlier it's a good idea to trace it back as far as possibl
might decide which dimension is the length, width, and depth. might decide which dimension is the length, width, and depth.
1. Explore the distribution of `price`. Do you discover anything unusual 1. Explore the distribution of `price`. Do you discover anything unusual
or surprising? (Hint: carefully think about reasonsable values of or surprising? (Hint: carefully think about the `binwidth` and make sure
`binwidth` and experiment.) you)
1. How many diamonds have 0.99 carats? Why? 1. How many diamonds have 0.99 carats? Why?
@ -286,7 +286,7 @@ However this plot isn't great because there are many more non-cancelled flights
## Covariation ## Covariation
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved. If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.
### A categorical and continuous variable ### A categorical and continuous variable
@ -311,7 +311,7 @@ ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(aes(colour = cut), binwidth = 500) geom_freqpoly(aes(colour = cut), binwidth = 500)
``` ```
There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average cut! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot. There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of: Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
@ -342,7 +342,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why. We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is wrose than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how hwy mileage varies across classes: `cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
```{r} ```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@ -378,7 +378,7 @@ ggplot(data = mpg) +
How does this compare to using `coord_flip()`? How does this compare to using `coord_flip()`?
1. One problem with boxplots is that they were developed in an era of 1. One problem with boxplots is that they were developed in an era of
much smaller datasets and tend to display an prohibitively large much smaller datasets and tend to display a prohibitively large
number of "outlying values". One approach to remedy this problem is number of "outlying values". One approach to remedy this problem is
the letter value plot. Install the lvplot package, and try using the letter value plot. Install the lvplot package, and try using
`geom_lvplot()` to display the distribution of price vs cut. What `geom_lvplot()` to display the distribution of price vs cut. What
@ -430,7 +430,7 @@ If the categorical variables are unordered, you might want to use the seriation
delays vary by destination and month of year. What makes the delays vary by destination and month of year. What makes the
plot difficult to read? How could you improve it? plot difficult to read? How could you improve it?
1. Why is slightly better to use `aes(x = color, y = cut)` rather 1. Why is it slightly better to use `aes(x = color, y = cut)` rather
than `aes(x = cut, y = color)` in the example above? than `aes(x = cut, y = color)` in the example above?
### Two continuous variables ### Two continuous variables
@ -462,14 +462,14 @@ ggplot(data = smaller) +
geom_hex(aes(x = carat, y = price)) geom_hex(aes(x = carat, y = price))
``` ```
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group displaying a boxplot: Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group, display a boxplot:
```{r} ```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) geom_boxplot(aes(group = cut_width(carat, 0.1)))
``` ```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell the each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot to be proportional to the number of points with `varwidth = TRUE`. `cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
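As a rough sketch of the `varwidth = TRUE` option, the boxplot call above can be repeated with that one argument added. Here `smaller` is assumed to be the diamonds under three carats; its definition is not part of this hunk:

```{r}
library(ggplot2)
library(dplyr)

# Assumption: `smaller` keeps only the diamonds under three carats
smaller <- diamonds %>% filter(carat < 3)

# varwidth = TRUE draws wider boxes for bins that hold more observations
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```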
Another approach is to display approximately the same number of points in each bin. That's the job of `cut_number()`: Another approach is to display approximately the same number of points in each bin. That's the job of `cut_number()`:
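One way this could look, again assuming `smaller` is the subset of diamonds under three carats. The choice of 20 bins is arbitrary and only for illustration:

```{r}
library(ggplot2)
library(dplyr)

# Assumption: `smaller` keeps only the diamonds under three carats
smaller <- diamonds %>% filter(carat < 3)

# cut_number() picks the bin edges so each bin holds roughly the same
# number of diamonds; the bins therefore differ in width along the x axis
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))
```

Compared with `cut_width()`, the boxes now summarise similar amounts of data, at the cost of unequal bin widths.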


@ -389,7 +389,7 @@ There are many functions for creating new variables that you can use with `mutat
feature we'll come back to in modelling. feature we'll come back to in modelling.
All else being equal, I recommend using `log2()` because it's easy to All else being equal, I recommend using `log2()` because it's easy to
interpret: an difference of 1 on the log scale corresponds to doubling on interpret: a difference of 1 on the log scale corresponds to doubling on
the original scale and a difference of -1 corresponds to halving. the original scale and a difference of -1 corresponds to halving.
* Offsets: `lead()` and `lag()` allow you to refer to leading or lagging * Offsets: `lead()` and `lag()` allow you to refer to leading or lagging