Merge branch 'master' of github.com:hadley/r4ds
commit
8da00ed69e
2
EDA.Rmd
|
@ -10,7 +10,7 @@ This chapter will show you how to use visualisation and transformation to explor
|
||||||
|
|
||||||
1. Use what you learn to refine your questions and or generate new questions.
|
1. Use what you learn to refine your questions and or generate new questions.
|
||||||
|
|
||||||
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel be free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
|
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.
|
||||||
|
|
||||||
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
|
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
|
||||||
|
|
||||||
|
|
|
@ -2,7 +2,7 @@
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
|
|
||||||
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter which will teach you how to transform your data using the dplyr package and new dataset on flights departing New York City in 2013.
|
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
@ -14,7 +14,7 @@ library(nycflights13)
|
||||||
library(ggplot2)
|
library(ggplot2)
|
||||||
```
|
```
|
||||||
|
|
||||||
Take careful note of the message that's printed when you load dplyr - it tells you that dplyr overwrite some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
|
Take careful note of the message that's printed when you load dplyr - it tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
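As a minimal sketch of the namespaced calls mentioned above: the `pkg::fun()` form always reaches the named package's version, whether or not dplyr is attached. The vectors and variable names here are made up for illustration.

```r
# Illustrative data; any numeric vector would do.
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter (here a 3-point moving average).
# It is unrelated to dplyr::filter(), which picks rows of a data frame.
smoothed <- stats::filter(x, rep(1 / 3, 3))

# base::intersect() returns the values common to both vectors.
common <- base::intersect(1:5, 4:8)

smoothed
common
```

Writing the prefix out is a little verbose, but it makes it unambiguous which function you mean once both versions are available.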
|
||||||
|
|
||||||
### nycflights13
|
### nycflights13
|
||||||
|
|
||||||
|
@ -125,7 +125,7 @@ filter(flights, !(arr_delay > 120 | dep_delay > 120))
|
||||||
filter(flights, arr_delay <= 120, dep_delay <= 120)
|
filter(flights, arr_delay <= 120, dep_delay <= 120)
|
||||||
```
|
```
|
||||||
|
|
||||||
As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll when you should use them in [conditional execution].
|
As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll learn when you should use them in [conditional execution].
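A short sketch of the difference, using made-up logical vectors: `&` and `|` work element by element, which is what `filter()` needs (one `TRUE` or `FALSE` per row), while `&&` and `||` are meant for single values.

```r
# Illustrative logical vectors, one value per "row".
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Vectorised: one result per element.
both <- x & y
either <- x | y

# Scalar: intended for a single TRUE/FALSE on each side.
# Recent versions of R signal an error if you pass longer vectors here.
single <- TRUE && FALSE

both
either
single
```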
|
||||||
|
|
||||||
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
|
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
|
||||||
|
|
||||||
|
@ -309,7 +309,7 @@ select(flights, time_hour, air_time, everything())
|
||||||
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
|
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
|
||||||
```
|
```
|
||||||
|
|
||||||
1. Does the result of running the following code suprise you? How do the
|
1. Does the result of running the following code surprise you? How do the
|
||||||
select helpers deal with case by default? How can you change that default?
|
select helpers deal with case by default? How can you change that default?
|
||||||
|
|
||||||
```{r, eval = FALSE}
|
```{r, eval = FALSE}
|
||||||
|
@ -784,7 +784,7 @@ daily <- group_by(flights, year, month, day)
|
||||||
(per_year <- summarise(per_month, flights = sum(flights)))
|
(per_year <- summarise(per_month, flights = sum(flights)))
|
||||||
```
|
```
|
||||||
|
|
||||||
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In otherwords, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
|
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
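The point can be checked with a small base R sketch. The two groups below are invented numbers, chosen so that the groups have different sizes.

```r
# Two hypothetical groups of unequal size.
g1 <- c(1, 2, 3)
g2 <- c(4, 5, 6, 7, 100)
all_values <- c(g1, g2)

# Sums roll up exactly.
sum_of_sums <- sum(c(sum(g1), sum(g2)))
overall_sum <- sum(all_values)

# Means roll up only if each group mean is weighted by its group size.
naive_mean <- mean(c(mean(g1), mean(g2)))
weighted_mean <- weighted.mean(c(mean(g1), mean(g2)), c(length(g1), length(g2)))
overall_mean <- mean(all_values)

# Medians do not roll up: the groupwise medians have lost the information needed.
median_of_medians <- median(c(median(g1), median(g2)))
overall_median <- median(all_values)

c(sum_of_sums, overall_sum)
c(naive_mean, weighted_mean, overall_mean)
c(median_of_medians, overall_median)
```

Here the sums agree, the weighted mean matches the overall mean while the unweighted one does not, and the median of medians differs from the overall median.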
|
||||||
|
|
||||||
### Ungrouping
|
### Ungrouping
|
||||||
|
|
||||||
|
@ -814,7 +814,7 @@ daily %>%
|
||||||
Which is more important: arrival delay or departure delay?
|
Which is more important: arrival delay or departure delay?
|
||||||
|
|
||||||
1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
|
1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
|
||||||
) is slightly sup-optimal. Why? Which is the most important column?
|
) is slightly suboptimal. Why? Which is the most important column?
|
||||||
|
|
||||||
1. Look at the number of cancelled flights per day. Is there a pattern?
|
1. Look at the number of cancelled flights per day. Is there a pattern?
|
||||||
Is the proportion of cancelled flights related to the average delay?
|
Is the proportion of cancelled flights related to the average delay?
|
||||||
|
@ -874,7 +874,7 @@ Functions that work most naturally in grouped mutates and filters are known as
|
||||||
1. Delays are typically temporally correlated: even once the problem that
|
1. Delays are typically temporally correlated: even once the problem that
|
||||||
caused the initial delay has been resolved, later flights are delayed
|
caused the initial delay has been resolved, later flights are delayed
|
||||||
to allow earlier flights to leave. Using `lag()` explore how the delay
|
to allow earlier flights to leave. Using `lag()` explore how the delay
|
||||||
of a flight is related to the delay of the immediately preceeding flight.
|
of a flight is related to the delay of the immediately preceding flight.
|
||||||
|
|
||||||
1. Look at each destination. Can you find flights that are suspiciously
|
1. Look at each destination. Can you find flights that are suspiciously
|
||||||
fast? (i.e. flights that represent a potential data entry error). Compute
|
fast? (i.e. flights that represent a potential data entry error). Compute
|
||||||
|
|
|
@ -211,7 +211,7 @@ ggplot(shapes, aes(x, y)) +
|
||||||
1. What happens if you set an aesthetic to something other than a variable
|
1. What happens if you set an aesthetic to something other than a variable
|
||||||
name, like `displ < 5`?
|
name, like `displ < 5`?
|
||||||
|
|
||||||
1. Vignettes are long-form guides the documentation things about
|
1. Vignettes are long-form guides that document things about
|
||||||
a package that affect many functions. ggplot2 has two vignettes.
|
a package that affect many functions. ggplot2 has two vignettes.
|
||||||
How can you find them and what do they describe? (Hint: Google is
|
How can you find them and what do they describe? (Hint: Google is
|
||||||
your friend.)
|
your friend.)
|
||||||
|
@ -220,7 +220,7 @@ ggplot(shapes, aes(x, y)) +
|
||||||
|
|
||||||
As you start to run R code, you're likely to run into problems. Don't worry --- it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!
|
As you start to run R code, you're likely to run into problems. Don't worry --- it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!
|
||||||
|
|
||||||
Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easiest to start from scratch again by pressing `Escape` to abort processing the current command.
|
Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every `(` is matched with a `)` and every `"` is paired with another `"`. Sometimes you'll run the code and nothing happens. Check the left-hand of your console: if it's a `+`, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easy to start from scratch again by pressing `Escape` to abort processing the current command.
|
||||||
|
|
||||||
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
|
One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
|
||||||
|
|
||||||
|
@ -248,8 +248,8 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||||
To facet your plot on the combination of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
|
To facet your plot on the combination of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
ggplot(data = mpg) +
|
||||||
geom_point() +
|
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||||
facet_grid(drv ~ cyl)
|
facet_grid(drv ~ cyl)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -410,7 +410,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||||
A histogram? An area chart?
|
A histogram? An area chart?
|
||||||
|
|
||||||
1. Run this code in your head and predict what the output will look like.
|
1. Run this code in your head and predict what the output will look like.
|
||||||
Run the code in R and check your predictions.
|
Then, run the code in R and check your predictions.
|
||||||
|
|
||||||
```{r, eval = FALSE}
|
```{r, eval = FALSE}
|
||||||
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
|
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
|
||||||
|
@ -496,7 +496,7 @@ Stats are the most subtle part of plotting because you can't see them directly.
|
||||||
|
|
||||||
1. You might want to override the default stat. In the code below, I change
|
1. You might want to override the default stat. In the code below, I change
|
||||||
the stat of `geom_bar()` from count (the default) to identity. This lets
|
the stat of `geom_bar()` from count (the default) to identity. This lets
|
||||||
me map to the height of the bars to the raw values of a $y$ variable.
|
me map the height of the bars to the raw values of a $y$ variable.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
demo <- tibble::tibble(
|
demo <- tibble::tibble(
|
||||||
|
|