commit c8e2bb01b9
@@ -64,7 +64,7 @@ It prints differently because it has a different "class" to usual data frames:
 class(flights)
 ```
 
-This is called a `tbl_df` (prounced tibble diff) or a `data_frame` (pronunced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
+This is called a `tbl_df` (pronounced tibble diff) or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
 
 You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
 
@@ -299,7 +299,7 @@ filter(df, is.na(x) | x > 1)
 * There were operated by United, American, or Delta.
 * That were delayed by more two hours.
 * That arrived more than two hours late, but didn't leave late.
-* We delayed by at least an hour, but made up over 30 minutes in flight.
+* Were delayed by at least an hour, but made up over 30 minutes in flight.
 * Departed between midnight and 6am.
 
 1. How many flights have a missing `dep_time`? What other variables are
@@ -399,7 +399,7 @@ This function works similarly to the `select` argument in `base::subset()`. Beca
 1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
 `arr_time`, and `arr_delay` from `flights`.
 
-## Add new variable with `mutate()`
+## Add new variables with `mutate()`
 
 Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`.
 
@@ -459,7 +459,7 @@ There are many functions for creating new variables. The key property is that th
 the proportion of a total and `y - mean(y)` computes the difference from
 the mean, and so on.
 
-* Modular arithmetic: `%/%` (integer divison) and `%%` (remainder), where
+* Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where
 `x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because
 it allows you to break integers up into pieces. For example, in the
 flights dataset, you can compute `hour` and `minute` from `dep_time` with:
@@ -563,7 +563,7 @@ by_day <- group_by(flights, year, month, day)
 summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
 ```
 
-Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: groued summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
+Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
 
 ### Combining multiple operations with the pipe
 
@@ -594,7 +594,7 @@ There are three steps:
 * Filter to remove noisy points and Honolulu airport which is almost
 twice as far away as the next closest airport.
 
-This code is a little frustraing to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
+This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
 
 There's another way to tackle the same problem with the pipe, `%>%`:
 
@@ -745,11 +745,9 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
 )
 ```
 
-mean(arr_delay[arr_delay > 0])
-
 * Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation,
 or standard deviation or sd for short, is the standard measure of spread.
-The interquartile range (`IQR()`) and median absolute deviation `mad(x)`
+The interquartile range `IQR()` and median absolute deviation `mad(x)`
 are robust equivalents that maybe more useful if you have outliers.
 
 ```{r}
@@ -778,7 +776,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
 group that only has two elements).
 
 These functions are complementary to filtering on ranks. Filtering gives
-you all variables, which each observation in a separate row. Summarising
+you all variables, with each observation in a separate row. Summarising
 gives you one row per group, with multiple variables:
 
 ```{r}
@@ -849,7 +847,7 @@ daily <- group_by(flights, year, month, day)
 (per_year <- summarise(per_month, flights = sum(flights)))
 ```
 
-Becareful when progressively rolling up summaries: it's ok for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
+Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
 
 ### Ungrouping
 
@@ -857,7 +855,7 @@ If you need to remove grouping, and return to operations on ungrouped data, use
 
 ### Exercises
 
-1. Brainstorm at least 5 different ways to assess the typically delay
+1. Brainstorm at least 5 different ways to assess the typical delay
 characteristics of a group of flights. Consider the following scenarios:
 
 * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
@@ -921,7 +919,7 @@ Functions that work most naturally in grouped mutates and filters are known as
 1. What time of day should you fly if you want to avoid delays as much
 as possible?
 
-1. Delays are typically temporarily correlated: even once the problem that
+1. Delays are typically temporally correlated: even once the problem that
 1. Delays are typically temporally correlated: even once the problem that
 caused the initial delay has been resolved, later flights are delayed
 to allow earlier flights to leave. Using `lag()` explore how the delay
 of a flight is related to the delay of the flight that left just