Integrate feedback from @jennybc

hadley
2016-10-03 16:08:44 -05:00
parent 6a4c1c9270
commit c8b586514b
7 changed files with 78 additions and 59 deletions

@@ -25,12 +25,24 @@ flights
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)`, which will open the dataset in the RStudio viewer.) It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might also have noticed the row of three letter abbreviations under the column names. These describe the type of each variable:
You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:
* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
* `dttm` stands for date-times (a date + a time).
There are three other common types of variables that aren't used in this dataset but you'll encounter later in the book:
* `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
* `fctr` stands for factors, which R uses to represent categorical variables
with fixed possible values.
* `date` stands for dates.
### Dplyr basics
@@ -48,9 +60,9 @@ All verbs work similarly:
1. The first argument is a data frame.
1. The subsequent arguments describe what to do with the data frame.
You can refer to columns in the data frame directly without using `$`.
1. The subsequent arguments describe what to do with the data frame,
using the variable names (without quotes).
1. The result is a new data frame.
Together, these properties make it easy to chain multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work.
@@ -92,15 +104,13 @@ sqrt(2) ^ 2 == 2
1/49 * 49 == 1
```
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `dplyr::near()`:
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on `==`, use `near()`:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
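To see why `==` fails here, it can help to write a tolerance-based comparison by hand. The sketch below uses base R only; `approx_equal()` is a hypothetical helper written for illustration, and its default tolerance (the square root of machine epsilon) is an assumption rather than a statement about how `near()` is implemented.

```{r}
# Hypothetical helper: treat two numbers as equal if they differ by less
# than a small tolerance. The default tolerance is an assumption.
approx_equal <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

# The exact comparisons fail because of rounding error...
sqrt(2) ^ 2 == 2
1 / 49 * 49 == 1

# ...and the error itself is tiny but non-zero.
sqrt(2) ^ 2 - 2

# The tolerance-based comparisons succeed.
approx_equal(sqrt(2) ^ 2, 2)
approx_equal(1 / 49 * 49, 1)
```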
(Remember that we use `::` to be explicit about where a function lives. If dplyr is installed, `dplyr::near()` will always work. If you want to use the shorter `near()`, you need to make sure you have loaded dplyr with `library(dplyr)`.)
### Logical operators
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
@@ -117,6 +127,12 @@ filter(flights, month == 11 | month == 12)
The order of operations doesn't work like English. You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "find all flights that departed in November or December". Instead, it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December. This is quite confusing!
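The coercion described above can be checked in base R. This is a minimal sketch on a small made-up vector (`month` here is hypothetical, not the `flights` column). Note that R evaluates `==` before `|`, so the comparison against `11 | 12` has to be written with parentheses.

```{r}
# Hypothetical month values standing in for flights$month.
month <- c(1, 11, 12, 6)

# `|` treats non-zero numbers as TRUE, so this is just TRUE.
11 | 12

# Comparing numbers with TRUE coerces TRUE to 1, so only January matches.
month == (11 | 12)

# Without parentheses, `==` runs first and `<logical> | 12` is TRUE
# for every element, so nothing is filtered out at all.
month == 11 | 12

# What was actually intended: test each month explicitly.
month == 11 | month == 12
```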
A useful short-hand for this problem is `x %in% y`. This will select every row where `x` is one of the values in `y`. We could use it to rewrite the code above:
```{r, eval = FALSE}
nov_dec <- filter(flights, month %in% c(11, 12))
```
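As a rough check that `%in%` matches the longer `|` form, here is a base R sketch on a made-up vector (`month` is hypothetical). The two forms differ only for missing values: `%in%` returns `FALSE` for `NA`, while `==` returns `NA`.

```{r}
# Hypothetical month values, including one missing value.
month <- c(1, 11, 12, 6, NA)

# TRUE where month is one of the listed values; NA gives FALSE.
month %in% c(11, 12)

# Same result for non-missing values; NA stays NA.
month == 11 | month == 12
```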
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
@@ -126,20 +142,6 @@ filter(flights, arr_delay <= 120, dep_delay <= 120)
As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll learn when you should use them in [conditional execution].
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
```{r}
df <- tibble(
x = c(FALSE, TRUE, FALSE),
y = c(TRUE, FALSE, TRUE)
)
filter(df, cumany(x)) # all rows after first TRUE
filter(df, cumall(y)) # all rows until first FALSE
```
(`tibble()` creates a dataset "by hand". You'll learn more about it in [tibbles].)
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.
### Missing values
@@ -702,18 +704,10 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
* Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. These work
similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default
value if that position does not exist (i.e. you're trying to get the 3rd
element from a group that only has two elements).
These functions are complementary to filtering on ranks. Filtering gives
you all variables, with each observation in a separate row. Summarising
gives you one row per group, with multiple variables:
element from a group that only has two elements). For example, we can
find the first and last departure for each day:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
@@ -721,6 +715,16 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
last_dep = last(dep_time)
)
```
These functions are complementary to filtering on ranks. Filtering gives
you all variables, with each observation in a separate row:
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
```
* Counts: You've seen `n()`, which takes no arguments, and returns the
size of the current group. To count the number of non-missing values, use
@@ -847,6 +851,7 @@ Grouping is most useful in conjunction with `summarise()`, but you can also do c
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
```
* Standardise to compute per group metrics:
@@ -872,6 +877,9 @@ Functions that work most naturally in grouped mutates and filters are known as
1. What time of day should you fly if you want to avoid delays as much
as possible?
1. For each destination, compute the total minutes of delay. For each
flight, compute the proportion of the total delay for its destination.
1. Delays are typically temporally correlated: even once the problem that
caused the initial delay has been resolved, later flights are delayed
to allow earlier flights to leave. Using `lag()`, explore how the delay