Merge branch 'master' of github.com:hadley/r4ds
@@ -64,7 +64,7 @@ It prints differently because it has a different "class" to usual data frames:
 class(flights)
 ```
 
-This is called a `tbl_df` (prounced tibble diff) or a `data_frame` (pronunced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
+This is called a `tbl_df` (pronounced tibble diff) or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we want worry about this relatively minor difference and will refer to everything as data frames.
 
 You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
 
@@ -299,7 +299,7 @@ filter(df, is.na(x) | x > 1)
     * There were operated by United, American, or Delta.
     * That were delayed by more two hours.
     * That arrived more than two hours late, but didn't leave late.
-    * We delayed by at least an hour, but made up over 30 minutes in flight.
+    * Were delayed by at least an hour, but made up over 30 minutes in flight.
     * Departed between midnight and 6am.
 
 1.  How many flights have a missing `dep_time`? What other variables are 
@@ -399,7 +399,7 @@ This function works similarly to the `select` argument in `base::subset()`. Beca
 1.  Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
     `arr_time`, and `arr_delay` from `flights`.
 
-## Add new variable with `mutate()`
+## Add new variables with `mutate()`
 
 Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`. 
 
@@ -459,7 +459,7 @@ There are many functions for creating new variables. The key property is that th
     the proportion of a total and `y - mean(y)` computes the difference from 
     the mean, and so on.
     
-*   Modular arithmetic: `%/%` (integer divison) and `%%` (remainder), where
+*   Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where
     `x == y * (x %/% y) + (x %% y)`. Modular arithmetic is a handy tool because 
     it allows you to break integers up into pieces. For example, in the 
     flights dataset, you can compute `hour` and `minute` from `dep_time` with:
@@ -563,7 +563,7 @@ by_day <- group_by(flights, year, month, day)
 summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
 ```
 
-Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: groued summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
+Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this idea, we need to introduce a powerful new idea: the pipe.
 
 ### Combining multiple operations with the pipe
 
@@ -594,7 +594,7 @@ There are three steps:
 * Filter to remove noisy points and Honolulu airport which is almost
   twice as far away as the next closest airport.
 
-This code is a little frustraing to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down. 
+This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down. 
 
 There's another way to tackle the same problem with the pipe, `%>%`:
 
@@ -744,12 +744,10 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
         avg_delay2 = mean(arr_delay[arr_delay > 0])
       )
     ```
     
-     mean(arr_delay[arr_delay > 0])
-
 *   Measure of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation,
     or standard deviation or sd for short, is the standard measure of spread.
-    The interquartile range (`IQR()`) and median absolute deviation `mad(x)`
+    The interquartile range `IQR()` and median absolute deviation `mad(x)`
     are robust equivalents that maybe more useful if you have outliers.
     
     ```{r}
@@ -778,7 +776,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
     group that only has two elements). 
     
     These functions are complementary to filtering on ranks. Filtering gives
-    you all variables, which each observation in a separate row. Summarising
+    you all variables, with each observation in a separate row. Summarising
     gives you one row per group, with multiple variables:
     
     ```{r}
@@ -849,7 +847,7 @@ daily <- group_by(flights, year, month, day)
 (per_year  <- summarise(per_month, flights = sum(flights)))
 ```
 
-Becareful when progressively rolling up summaries: it's ok for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
+Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting for means and variances, and it's not possible to do it exactly for rank-based statistics like the median (i.e. the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median).
 
 ### Ungrouping
 
@@ -857,7 +855,7 @@ If you need to remove grouping, and return to operations on ungrouped data, use
 
 ### Exercises
 
-1.  Brainstorm at least 5 different ways to assess the typically delay 
+1.  Brainstorm at least 5 different ways to assess the typical delay 
     characteristics of a group of flights. Consider the following scenarios:
     
     * A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of 
@@ -921,7 +919,7 @@ Functions that work most naturally in grouped mutates and filters are known as
 1.  What time of day should you fly if you want to avoid delays as much
     as possible?
     
-1.  Delays are typically temporarily correlated: even once the problem that
+1.  Delays are typically temporally correlated: even once the problem that
     caused the initial delay has been resolved, later flights are delayed 
     to allow earlier flights to leave. Using `lag()` explore how the delay
     of a flight is related to the delay of the flight that left just
 