Fix code language
@@ -11,7 +11,7 @@ Introduction</h1>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
|
||||
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
|
||||
library(tidyverse)
|
||||
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
|
||||
@@ -30,7 +30,7 @@ library(tidyverse)
|
||||
nycflights13</h2>
|
||||
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights
|
||||
<pre data-type="programlisting" data-code-language="r">flights
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
@@ -58,7 +58,7 @@ dplyr basics</h2>
|
||||
<li><p>The result is always a new data frame.</p></li>
|
||||
</ol><p>Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, <code>|></code>. The pipe takes the thing on its left and passes it along to the function on its right so that <code>x |> f(y)</code> is equivalent to <code>f(x, y)</code>, and <code>x |> f(y) |> g(z)</code> is equivalent to <code>g(f(x, y), z)</code>. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH") |>
|
||||
group_by(year, month, day) |>
|
||||
summarize(
|
||||
@@ -81,7 +81,7 @@ Rows</h1>
|
||||
</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(arr_delay > 120)
|
||||
#> # A tibble: 10,034 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
@@ -99,7 +99,7 @@ Rows</h1>
|
||||
</div>
|
||||
<p>As well as <code>></code> (greater than), you can use <code>>=</code> (greater than or equal to), <code><</code> (less than), <code><=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Flights that departed on January 1
|
||||
<pre data-type="programlisting" data-code-language="r"># Flights that departed on January 1
|
||||
flights |>
|
||||
filter(month == 1 & day == 1)
|
||||
#> # A tibble: 842 × 19
|
||||
@@ -135,7 +135,7 @@ flights |>
|
||||
</div>
|
||||
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># A shorter way to select flights that departed in January or February
|
||||
<pre data-type="programlisting" data-code-language="r"># A shorter way to select flights that departed in January or February
|
||||
flights |>
|
||||
filter(month %in% c(1, 2))
|
||||
#> # A tibble: 51,955 × 19
|
||||
@@ -155,7 +155,7 @@ flights |>
|
||||
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
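<p>As a quick check of the <code>%in%</code> shortcut, the following sketch compares it with the longer <code>|</code> and <code>==</code> form. The data frame <code>toy</code> and its <code>id</code> column are hypothetical stand-ins for <code>flights</code>, chosen so the result is easy to verify by eye:</p>

```r
library(dplyr)

# Hypothetical toy data standing in for the month column of flights.
toy <- tibble(month = c(1, 2, 3, 2, 12), id = 1:5)

# The long form: one comparison per value, combined with |.
with_or <- toy |> filter(month == 1 | month == 2)

# The shortcut: keep rows whose month is one of the listed values.
with_in <- toy |> filter(month %in% c(1, 2))

# Both approaches keep the same rows.
identical(with_or$id, with_in$id)
```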
|
||||
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code><-</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">jan1 <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">jan1 <- flights |>
|
||||
filter(month == 1 & day == 1)</pre>
|
||||
</div>
|
||||
</section>
|
||||
@@ -165,7 +165,7 @@ flights |>
|
||||
Common mistakes</h2>
|
||||
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month = 1)
|
||||
#> Error in `filter()`:
|
||||
#> ! We detected a named input.
|
||||
@@ -174,7 +174,7 @@ Common mistakes</h2>
|
||||
</div>
|
||||
<p>Another mistake is to write “or” statements like you would in English:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 1 | 2)</pre>
|
||||
</div>
|
||||
<p>This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in <a href="#sec-boolean-operations" data-type="xref">#sec-boolean-operations</a>.</p>
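<p>As a brief preview, here is a minimal sketch on a hypothetical three-row data frame (<code>toy</code> is not part of the chapter’s data). It relies on the fact that R evaluates <code>==</code> before <code>|</code> and treats a non-zero number as <code>TRUE</code> in a logical context:</p>

```r
library(dplyr)

# Hypothetical toy data: one row per month value.
toy <- tibble(month = c(1, 2, 3))

# Parsed as (month == 1) | 2. The bare 2 is coerced to TRUE,
# so the condition is TRUE for every row and nothing is filtered out.
wrong <- toy |> filter(month == 1 | 2)

# Spelling out both comparisons gives the intended result.
right <- toy |> filter(month == 1 | month == 2)

nrow(wrong)
nrow(right)
```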
|
||||
@@ -186,7 +186,7 @@ Common mistakes</h2>
|
||||
</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(year, month, day, dep_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
@@ -204,7 +204,7 @@ Common mistakes</h2>
|
||||
</div>
|
||||
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(desc(dep_delay))
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
@@ -222,7 +222,7 @@ Common mistakes</h2>
|
||||
</div>
|
||||
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival but left roughly on time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_delay <= 10 & dep_delay >= -10) |>
|
||||
arrange(desc(arr_delay))
|
||||
#> # A tibble: 239,109 × 19
|
||||
@@ -271,7 +271,7 @@ Columns</h1>
|
||||
</h2>
|
||||
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
@@ -293,7 +293,7 @@ Columns</h1>
|
||||
</div>
|
||||
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
@@ -316,7 +316,7 @@ Columns</h1>
|
||||
</div>
|
||||
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use a variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
@@ -339,7 +339,7 @@ Columns</h1>
|
||||
</div>
|
||||
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which allows you to see the inputs and outputs from your calculations:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
@@ -365,7 +365,7 @@ Columns</h1>
|
||||
</h2>
|
||||
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Select columns by name
|
||||
<pre data-type="programlisting" data-code-language="r"># Select columns by name
|
||||
flights |>
|
||||
select(year, month, day)
|
||||
#> # A tibble: 336,776 × 3
|
||||
@@ -436,7 +436,7 @@ flights |>
|
||||
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
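<p>To give a flavor of pattern-based selection, here is a small sketch. The one-row data frame <code>toy</code> is hypothetical and only borrows a few column names from <code>flights</code>; the pattern <code>"_delay$"</code> matches names that end in <code>_delay</code>:</p>

```r
library(dplyr)

# Hypothetical one-row data frame reusing some flights column names.
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

# Keep only the columns whose names end in "_delay".
delays <- toy |> select(matches("_delay$"))

names(delays)
```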
|
||||
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
select(tail_num = tailnum)
|
||||
#> # A tibble: 336,776 × 1
|
||||
#> tail_num
|
||||
@@ -457,7 +457,7 @@ flights |>
|
||||
</h2>
|
||||
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
rename(tail_num = tailnum)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
@@ -483,7 +483,7 @@ flights |>
|
||||
</h2>
|
||||
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(time_hour, air_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> time_hour air_time year month day dep_time sched_dep…¹ dep_d…²
|
||||
@@ -501,7 +501,7 @@ flights |>
|
||||
</div>
|
||||
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(year:dep_time, .after = time_hour)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
|
||||
@@ -547,13 +547,13 @@ Exercises</h2>
|
||||
<li>
|
||||
<p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">variables <- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">variables <- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">select(flights, contains("TIME"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">select(flights, contains("TIME"))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
@@ -570,7 +570,7 @@ Groups</h1>
|
||||
</h2>
|
||||
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> # Groups: month [12]
|
||||
@@ -596,7 +596,7 @@ Groups</h1>
|
||||
</h2>
|
||||
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slight simplification; later on you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay)
|
||||
@@ -614,7 +614,7 @@ Groups</h1>
|
||||
</div>
|
||||
<p>Uh-oh! Something has gone wrong and all of our results are <code>NA</code> (pronounced “N-A”), R’s symbol for a missing value. We’ll come back to discuss missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>, but for now we’ll remove them by using <code>na.rm = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE)
|
||||
@@ -632,7 +632,7 @@ Groups</h1>
|
||||
</div>
|
||||
<p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
@@ -668,7 +668,7 @@ The<code>slice_</code> functions</h2>
|
||||
<code>df |> slice_sample(n = 1)</code> takes one random row.</li>
|
||||
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
slice_max(arr_delay, n = 1)
|
||||
#> # A tibble: 108 × 19
|
||||
@@ -688,7 +688,7 @@ The<code>slice_</code> functions</h2>
|
||||
</div>
|
||||
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarize(max_delay = max(arr_delay, na.rm = TRUE))
|
||||
#> Warning: There was 1 warning in `summarize()`.
|
||||
@@ -714,7 +714,7 @@ The<code>slice_</code> functions</h2>
|
||||
Grouping by multiple variables</h2>
|
||||
<p>You can create groups using more than one variable. For example, we could make a group for each day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily <- flights |>
|
||||
group_by(year, month, day)
|
||||
daily
|
||||
#> # A tibble: 336,776 × 19
|
||||
@@ -734,7 +734,7 @@ daily
|
||||
</div>
|
||||
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily_flights <- daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily_flights <- daily |>
|
||||
summarize(
|
||||
n = n()
|
||||
)
|
||||
@@ -743,7 +743,7 @@ daily
|
||||
</div>
|
||||
<p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily_flights <- daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily_flights <- daily |>
|
||||
summarize(
|
||||
n = n(),
|
||||
.groups = "drop_last"
|
||||
@@ -757,7 +757,7 @@ daily
|
||||
Ungrouping</h2>
|
||||
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily |>
|
||||
ungroup() |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
@@ -787,7 +787,7 @@ Exercises</h2>
|
||||
Case study: aggregates and sample size</h1>
|
||||
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">delays <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">delays <- flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
group_by(tailnum) |>
|
||||
summarize(
|
||||
@@ -803,7 +803,7 @@ ggplot(delays, aes(delay)) +
|
||||
</div>
|
||||
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(delays, aes(n, delay)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(n, delay)) +
|
||||
geom_point(alpha = 1/10)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
|
||||
@@ -812,7 +812,7 @@ ggplot(delays, aes(delay)) +
|
||||
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the central limit theorem *cough*.</span>.</p>
|
||||
<p>When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">delays |>
|
||||
<pre data-type="programlisting" data-code-language="r">delays |>
|
||||
filter(n > 25) |>
|
||||
ggplot(aes(n, delay)) +
|
||||
geom_point(alpha = 1/10) +
|
||||
@@ -824,7 +824,7 @@ ggplot(delays, aes(delay)) +
|
||||
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|></code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
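<p>The switch between the two operators can be sketched on a tiny example. The data frame <code>toy</code> and its values are hypothetical; the point is only where <code>|></code> ends and <code>+</code> begins:</p>

```r
library(dplyr)
library(ggplot2)

# Hypothetical summary data: flights per plane and average delay.
toy <- tibble(n = c(5, 30, 60, 120), delay = c(40, 12, 8, 5))

p <- toy |>
  filter(n > 25) |>        # dplyr steps are chained with |>
  ggplot(aes(n, delay)) +  # from ggplot() onward, layers are added with +
  geom_point()

p
```

<p>Because the pipe binds more tightly than <code>+</code>, the filtered data frame reaches <code>ggplot()</code> first, and the layers are then added to the resulting plot object.</p>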
|
||||
<p>There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the <strong>Lahman</strong> package to compare what proportion of times a player hits the ball vs. the number of attempts they take:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters <- Lahman::Batting |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters <- Lahman::Batting |>
|
||||
group_by(playerID) |>
|
||||
summarize(
|
||||
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
@@ -846,7 +846,7 @@ batters
|
||||
<ol type="1"><li><p>As above, the variation in our aggregate decreases as we get more data points.</p></li>
|
||||
<li><p>There’s a positive correlation between skill (<code>perf</code>) and opportunities to hit the ball (<code>n</code>) because obviously teams want to give their best batters the most opportunities to hit the ball.</p></li>
|
||||
</ol><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters |>
|
||||
filter(n > 100) |>
|
||||
ggplot(aes(n, perf)) +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
@@ -857,7 +857,7 @@ batters
|
||||
</div>
|
||||
<p>This also has important implications for ranking. If you naively sort on <code>desc(perf)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters |>
|
||||
arrange(desc(perf))
|
||||
#> # A tibble: 20,166 × 3
|
||||
#> playerID perf n
|
||||
|
||||