Changes from @mine-cetinkaya-rundel

hadley
2016-07-31 11:32:16 -05:00
parent fb8f3e5884
commit 9cf3badbf0
11 changed files with 152 additions and 86 deletions

@@ -192,7 +192,7 @@ filter(df, is.na(x) | x > 1)
1. Find all flights that
1. Were delayed by more than two hours
1. Had an arrival delay of two or more hours.
1. Flew to Houston (`IAH` or `HOU`)
1. Were operated by United, American, or Delta
1. Departed in summer (July, August, and September)
@@ -276,13 +276,7 @@ There are a number of helper functions you can use within `select()`:
See `?select` for more details.
It's possible to use `select()` to rename variables:
```{r}
select(flights, tail_num = tailnum)
```
But because `select()` drops all the variables not explicitly mentioned, it's not that useful. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
`select()` can be used to rename variables, but it's rarely useful because it drops all the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
```{r}
rename(flights, tail_num = tailnum)
@@ -619,15 +613,16 @@ RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends th
--------------------------------------------------------------------------------
There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player. When I plot the skill of the batter against the number of times batted, you see two patterns:
There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
1. As above, the variation in our aggregate decreases as we get more
data points.
2. There's a positive correlation between skill (batting average, `ba`) and
number of opportunities to hit the ball (at bat, `ab`). This is because
teams control who gets to play, and obviously they'll pick their best
players.
2. There's a positive correlation between skill (`ba`) and opportunities to
hit the ball (`ab`). This is because teams control who gets to play,
and obviously they'll pick their best players.
```{r}
# Convert to a tibble so it prints nicely
@@ -650,7 +645,8 @@ batters %>%
This also has important implications for ranking. If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
```{r}
batters %>% arrange(desc(ba))
batters %>%
arrange(desc(ba))
```
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
@@ -744,7 +740,8 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
a count:
```{r}
not_cancelled %>% count(dest)
not_cancelled %>%
count(dest)
```
You can optionally provide a weight variable. For example, you could use
@@ -813,6 +810,11 @@ daily %>%
Which is more important: arrival delay or departure delay?
1. Come up with another approach that will give you the same output as
`not_cancelled %>% count(dest)` and
`not_cancelled %>% count(tailnum, wt = distance)` (without using
`count()`).
1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
) is slightly suboptimal. Why? Which is the most important column?