Changes from @mine-cetinkaya-rundel

2016-07-31 11:32:16 -05:00
parent fb8f3e5884
commit 9cf3badbf0
11 changed files with 152 additions and 86 deletions
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -192,7 +192,7 @@ filter(df, is.na(x) | x > 1)

 1.  Find all flights that

-    1. Were delayed by more two hours
+    1. Had an arrival delay of two or more hours.
    1. Flew to Houston (`IAH` or `HOU`)
    1. Were operated by United, American, or Delta
    1. Departed in summer (July, August, and September)
@@ -276,13 +276,7 @@ There are a number of helper functions you can use within `select()`:
   
 See `?select` for more details.

-It's possible to use `select()` to rename variables:
-
-```{r}
-select(flights, tail_num = tailnum)
-```
-
-But because `select()` drops all the variables not explicitly mentioned, it's not that useful. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
+`select()` can be used to rename variables, but it's rarely useful because it drops all the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:

 ```{r}
 rename(flights, tail_num = tailnum)
@@ -619,15 +613,16 @@ RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends th

 --------------------------------------------------------------------------------

-There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player.  When I plot the skill of the batter against the number of times batted, you see two patterns:
+There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player.  
+
+When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:

 1.  As above, the variation in our aggregate decreases as we get more 
    data points.
    
-2.  There's a positive correlation between skill (batting average, `ba`) and 
-    number of opportunities to hit the ball (at bat, `ab`). This is because
-    teams control who gets to play, and obviously they'll pick their best
-    players.
+2.  There's a positive correlation between skill (`ba`) and opportunities to 
+    hit the ball (`ab`). This is because teams control who gets to play, 
+    and obviously they'll pick their best players.

 ```{r}
 # Convert to a tibble so it prints nicely
@@ -650,7 +645,8 @@ batters %>%
 This also has important implications for ranking. If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:

 ```{r}
-batters %>% arrange(desc(ba))
+batters %>% 
+  arrange(desc(ba))
 ```

 You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
@@ -744,7 +740,8 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
    a count:
    
    ```{r}
-    not_cancelled %>% count(dest)
+    not_cancelled %>% 
+      count(dest)
    ```
    
    You can optionally provide a weight variable. For example, you could use 
@@ -813,6 +810,11 @@ daily %>%
    
    Which is more important: arrival delay or departure delay?

+1.  Come up with another approach that will give you the same output as 
+    `not_cancelled %>% count(dest)` and 
+    `not_cancelled %>% count(tailnum, wt = distance)` (without using 
+    `count()`).
+
 1.  Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
    ) is slightly suboptimal. Why? Which is the most important column?