Changes from @mine-cetinkaya-rundel
@@ -192,7 +192,7 @@ filter(df, is.na(x) | x > 1)
 
 1. Find all flights that
 
-1. Were delayed by more two hours
+1. Had an arrival delay of two or more hours.
 1. Flew to Houston (`IAH` or `HOU`)
 1. Were operated by United, American, or Delta
 1. Departed in summer (July, August, and September)
@@ -276,13 +276,7 @@ There are a number of helper functions you can use within `select()`:
 
 See `?select` for more details.
 
-It's possible to use `select()` to rename variables:
-
-```{r}
-select(flights, tail_num = tailnum)
-```
-
-But because `select()` drops all the variables not explicitly mentioned, it's not that useful. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
+`select()` can be used to rename variables, but it's rarely useful because it drops all the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
 
 ```{r}
 rename(flights, tail_num = tailnum)
@@ -619,15 +613,16 @@ RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends th
 
 --------------------------------------------------------------------------------
 
-There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player. When I plot the skill of the batter against the number of times batted, you see two patterns:
+There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
+
+When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
 
 1. As above, the variation in our aggregate decreases as we get more
 data points.
 
-2. There's a positive correlation between skill (batting average, `ba`) and
-number of opportunities to hit the ball (at bat, `ab`). This is because
-teams control who gets to play, and obviously they'll pick their best
-players.
+2. There's a positive correlation between skill (`ba`) and opportunities to
+hit the ball (`ab`). This is because teams control who gets to play,
+and obviously they'll pick their best players.
 
 ```{r}
 # Convert to a tibble so it prints nicely
@@ -650,7 +645,8 @@ batters %>%
 This also has important implications for ranking. If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
 
 ```{r}
-batters %>% arrange(desc(ba))
+batters %>%
+arrange(desc(ba))
 ```
 
 You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
@@ -744,7 +740,8 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
 a count:
 
 ```{r}
-not_cancelled %>% count(dest)
+not_cancelled %>%
+count(dest)
 ```
 
 You can optionally provide a weight variable. For example, you could use
@@ -813,6 +810,11 @@ daily %>%
 
 Which is more important: arrival delay or departure delay?
 
+1. Come up with another approach that will give you the same output as
+`not_cancelled %>% count(dest)` and
+`not_cancelled %>% count(tailnum, wt = distance)` (without using
+`count()`).
+
 1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
 ) is slightly suboptimal. Why? Which is the most important column?
 