Merge pull request #32 from radugrosu/patch-1

Update transform.Rmd
Hadley Wickham 2016-01-16 14:40:33 -06:00
commit c8e2bb01b9
1 changed files with 11 additions and 13 deletions

@ -64,7 +64,7 @@ It prints differently because it has a different "class" to usual data frames:
class(flights) class(flights)
``` ```
This is called a `tbl_df` (prounced tibble diff) or a `data_frame` (pronunced "data underscore frame"; cf. `data dot frame`). Generally, however, we won't worry about this relatively minor difference and will refer to everything as data frames. This is called a `tbl_df` (pronounced tibble diff) or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we won't worry about this relatively minor difference and will refer to everything as data frames.
You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful. You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
@ -299,7 +299,7 @@ filter(df, is.na(x) | x > 1)
* That were operated by United, American, or Delta. * That were operated by United, American, or Delta.
* That were delayed by more than two hours. * That were delayed by more than two hours.
* That arrived more than two hours late, but didn't leave late. * That arrived more than two hours late, but didn't leave late.
* We delayed by at least an hour, but made up over 30 minutes in flight. * Were delayed by at least an hour, but made up over 30 minutes in flight.
* Departed between midnight and 6am. * Departed between midnight and 6am.
1. How many flights have a missing `dep_time`? What other variables are 1. How many flights have a missing `dep_time`? What other variables are
@ -399,7 +399,7 @@ This function works similarly to the `select` argument in `base::subset()`. Beca
1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, 1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
`arr_time`, and `arr_delay` from `flights`. `arr_time`, and `arr_delay` from `flights`.
## Add new variable with `mutate()` ## Add new variables with `mutate()`
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`. Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`.
@ -459,7 +459,7 @@ There are many functions for creating new variables. The key property is that th
the proportion of a total and `y - mean(y)` computes the difference from the proportion of a total and `y - mean(y)` computes the difference from
the mean, and so on. the mean, and so on.
* Modular arithmetic: `%/%` (integer divison) and `%%` (remainder), where * Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where
`x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because `x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because
it allows you to break integers up into pieces. For example, in the it allows you to break integers up into pieces. For example, in the
flights dataset, you can compute `hour` and `minute` from `dep_time` with: flights dataset, you can compute `hour` and `minute` from `dep_time` with:
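As a minimal base R sketch of that identity (the `dep_time` values below are made-up HHMM integers rather than rows from `flights`, and this is not necessarily the chapter's own code):

```r
# Hypothetical departure times in HHMM form (e.g. 517 means 5:17am).
dep_time <- c(517L, 1545L, 2359L)

# Integer division by 100 strips the last two digits, leaving the hour.
hour <- dep_time %/% 100L

# The remainder after dividing by 100 is the minute.
minute <- dep_time %% 100L

# The identity x == y * (x %/% y) + (x %% y) recovers the original value.
reconstructed <- 100L * hour + minute
stopifnot(all(reconstructed == dep_time))

hour    # 5 15 23
minute  # 17 45 59
```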
@ -563,7 +563,7 @@ by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
``` ```
Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: groued summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe. Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
### Combining multiple operations with the pipe ### Combining multiple operations with the pipe
@ -594,7 +594,7 @@ There are three steps:
* Filter to remove noisy points and Honolulu airport which is almost * Filter to remove noisy points and Honolulu airport which is almost
twice as far away as the next closest airport. twice as far away as the next closest airport.
This code is a little frustraing to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down. This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
There's another way to tackle the same problem with the pipe, `%>%`: There's another way to tackle the same problem with the pipe, `%>%`:
@ -745,11 +745,9 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
) )
``` ```
mean(arr_delay[arr_delay > 0])
* Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The root mean squared deviation, * Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The root mean squared deviation,
or standard deviation or sd for short, is the standard measure of spread. or standard deviation or sd for short, is the standard measure of spread.
The interquartile range (`IQR()`) and median absolute deviation `mad(x)` The interquartile range `IQR()` and median absolute deviation `mad(x)`
are robust equivalents that may be more useful if you have outliers. are robust equivalents that may be more useful if you have outliers.
```{r} ```{r}
@ -778,7 +776,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
group that only has two elements). group that only has two elements).
These functions are complementary to filtering on ranks. Filtering gives These functions are complementary to filtering on ranks. Filtering gives
you all variables, which each observation in a separate row. Summarising you all variables, with each observation in a separate row. Summarising
gives you one row per group, with multiple variables: gives you one row per group, with multiple variables:
```{r} ```{r}
@ -849,7 +847,7 @@ daily <- group_by(flights, year, month, day)
(per_year <- summarise(per_month, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights)))
``` ```
Becareful when progressively rolling up summaries: it's ok for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median). Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
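A small base R sketch may make the roll-up caveat concrete. The vectors `x` and `g` are invented toy data with two groups of unequal size; they are not part of `flights`:

```r
# Hypothetical toy data: group "a" has three values, group "b" has two.
x <- c(1, 2, 3, 100, 200)
g <- c("a", "a", "a", "b", "b")

groups        <- split(x, g)
group_sums    <- sapply(groups, sum)
group_means   <- sapply(groups, mean)
group_medians <- sapply(groups, median)
group_n       <- sapply(groups, length)

# Sums roll up exactly: the sum of groupwise sums is the overall sum.
sum(group_sums) == sum(x)                          # TRUE (306 == 306)

# Means need weighting by group size.
mean(group_means)                                  # 76, not the overall mean
weighted.mean(group_means, group_n)                # 61.2, matches mean(x)

# Medians cannot be rolled up: median of medians is not the overall median.
median(group_medians)                              # 76
median(x)                                          # 3
```

The unweighted mean of group means is wrong here only because the groups differ in size; with equal-sized groups it would agree with the overall mean, which is why this kind of error is easy to miss.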
### Ungrouping ### Ungrouping
@ -857,7 +855,7 @@ If you need to remove grouping, and return to operations on ungrouped data, use
### Exercises ### Exercises
1. Brainstorm at least 5 different ways to assess the typically delay 1. Brainstorm at least 5 different ways to assess the typical delay
characteristics of a group of flights. Consider the following scenarios: characteristics of a group of flights. Consider the following scenarios:
* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of
@ -921,7 +919,7 @@ Functions that work most naturally in grouped mutates and filters are known as
1. What time of day should you fly if you want to avoid delays as much 1. What time of day should you fly if you want to avoid delays as much
as possible? as possible?
1. Delays are typically temporarily correlated: even once the problem that 1. Delays are typically temporally correlated: even once the problem that
caused the initial delay has been resolved, later flights are delayed caused the initial delay has been resolved, later flights are delayed
to allow earlier flights to leave. Using `lag()` explore how the delay to allow earlier flights to leave. Using `lag()` explore how the delay
of a flight is related to the delay of the flight that left just of a flight is related to the delay of the flight that left just