commit
c8e2bb01b9
|
@ -64,7 +64,7 @@ It prints differently because it has a different "class" to usual data frames:
|
||||||
class(flights)
|
class(flights)
|
||||||
```
|
```
|
||||||
|
|
||||||
This is called a `tbl_df` (prounced tibble diff) or a `data_frame` (pronunced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
|
This is called a `tbl_df` (pronounced tibble diff) or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
|
||||||
|
|
||||||
You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
|
You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
|
||||||
|
|
||||||
|
@ -299,7 +299,7 @@ filter(df, is.na(x) | x > 1)
|
||||||
* There were operated by United, American, or Delta.
|
* There were operated by United, American, or Delta.
|
||||||
* That were delayed by more two hours.
|
* That were delayed by more two hours.
|
||||||
* That arrived more than two hours late, but didn't leave late.
|
* That arrived more than two hours late, but didn't leave late.
|
||||||
* We delayed by at least an hour, but made up over 30 minutes in flight.
|
* Were delayed by at least an hour, but made up over 30 minutes in flight.
|
||||||
* Departed between midnight and 6am.
|
* Departed between midnight and 6am.
|
||||||
|
|
||||||
1. How many flights have a missing `dep_time`? What other variables are
|
1. How many flights have a missing `dep_time`? What other variables are
|
||||||
|
@ -399,7 +399,7 @@ This function works similarly to the `select` argument in `base::subset()`. Beca
|
||||||
1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
|
1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
|
||||||
`arr_time`, and `arr_delay` from `flights`.
|
`arr_time`, and `arr_delay` from `flights`.
|
||||||
|
|
||||||
## Add new variable with `mutate()`
|
## Add new variables with `mutate()`
|
||||||
|
|
||||||
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`.
|
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`.
|
||||||
|
|
||||||
|
@ -459,7 +459,7 @@ There are many functions for creating new variables. The key property is that th
|
||||||
the proportion of a total and `y - mean(y)` computes the difference from
|
the proportion of a total and `y - mean(y)` computes the difference from
|
||||||
the mean, and so on.
|
the mean, and so on.
|
||||||
|
|
||||||
* Modular arithmetic: `%/%` (integer divison) and `%%` (remainder), where
|
* Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where
|
||||||
`x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because
|
`x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because
|
||||||
it allows you to break integers up into pieces. For example, in the
|
it allows you to break integers up into pieces. For example, in the
|
||||||
flights dataset, you can compute `hour` and `minute` from `dep_time` with:
|
flights dataset, you can compute `hour` and `minute` from `dep_time` with:
|
||||||
|
@ -563,7 +563,7 @@ by_day <- group_by(flights, year, month, day)
|
||||||
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
|
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
|
||||||
```
|
```
|
||||||
|
|
||||||
Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: groued summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
|
Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
|
||||||
|
|
||||||
### Combining multiple operations with the pipe
|
### Combining multiple operations with the pipe
|
||||||
|
|
||||||
|
@ -594,7 +594,7 @@ There are three steps:
|
||||||
* Filter to remove noisy points and Honolulu airport which is almost
|
* Filter to remove noisy points and Honolulu airport which is almost
|
||||||
twice as far away as the next closest airport.
|
twice as far away as the next closest airport.
|
||||||
|
|
||||||
This code is a little frustraing to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
|
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
|
||||||
|
|
||||||
There's another way to tackle the same problem with the pipe, `%>%`:
|
There's another way to tackle the same problem with the pipe, `%>%`:
|
||||||
|
|
||||||
|
@ -744,12 +744,10 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
||||||
avg_delay2 = mean(arr_delay[arr_delay > 0])
|
avg_delay2 = mean(arr_delay[arr_delay > 0])
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
mean(arr_delay[arr_delay > 0])
|
|
||||||
|
|
||||||
* Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation,
|
* Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation,
|
||||||
or standard deviation or sd for short, is the standard measure of spread.
|
or standard deviation or sd for short, is the standard measure of spread.
|
||||||
The interquartile range (`IQR()`) and median absolute deviation `mad(x)`
|
The interquartile range `IQR()` and median absolute deviation `mad(x)`
|
||||||
are robust equivalents that maybe more useful if you have outliers.
|
are robust equivalents that maybe more useful if you have outliers.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -778,7 +776,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
|
||||||
group that only has two elements).
|
group that only has two elements).
|
||||||
|
|
||||||
These functions are complementary to filtering on ranks. Filtering gives
|
These functions are complementary to filtering on ranks. Filtering gives
|
||||||
you all variables, which each observation in a separate row. Summarising
|
you all variables, with each observation in a separate row. Summarising
|
||||||
gives you one row per group, with multiple variables:
|
gives you one row per group, with multiple variables:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -849,7 +847,7 @@ daily <- group_by(flights, year, month, day)
|
||||||
(per_year <- summarise(per_month, flights = sum(flights)))
|
(per_year <- summarise(per_month, flights = sum(flights)))
|
||||||
```
|
```
|
||||||
|
|
||||||
Becareful when progressively rolling up summaries: it's ok for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
|
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
|
||||||
|
|
||||||
### Ungrouping
|
### Ungrouping
|
||||||
|
|
||||||
|
@ -857,7 +855,7 @@ If you need to remove grouping, and return to operations on ungrouped data, use
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Brainstorm at least 5 different ways to assess the typically delay
|
1. Brainstorm at least 5 different ways to assess the typical delay
|
||||||
characteristics of a group of flights. Consider the following scenarios:
|
characteristics of a group of flights. Consider the following scenarios:
|
||||||
|
|
||||||
* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
|
* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
|
||||||
|
@ -921,7 +919,7 @@ Functions that work most naturally in grouped mutates and filters are known as
|
||||||
1. What time of day should you fly if you want to avoid delays as much
|
1. What time of day should you fly if you want to avoid delays as much
|
||||||
as possible?
|
as possible?
|
||||||
|
|
||||||
1. Delays are typically temporarily correlated: even once the problem that
|
1. Delays are typically temporally correlated: even once the problem that
|
||||||
caused the initial delay has been resolved, later flights are delayed
|
caused the initial delay has been resolved, later flights are delayed
|
||||||
to allow earlier flights to leave. Using `lag()` explore how the delay
|
to allow earlier flights to leave. Using `lag()` explore how the delay
|
||||||
of a flight is related to the delay of the flight that left just
|
of a flight is related to the delay of the flight that left just
|
||||||
|
|