Don't transform non-crossref links

This commit is contained in:
Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions

View File

@@ -30,13 +30,13 @@ library(tidyverse)
#> ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
<p>Take careful note of the conflicts message thats printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, youll need to use their full names: <code><a href="#chp-https://rdrr.io/r/stats/filter" data-type="xref">#chp-https://rdrr.io/r/stats/filter</a></code> and <code><a href="#chp-https://rdrr.io/r/stats/lag" data-type="xref">#chp-https://rdrr.io/r/stats/lag</a></code>. So far weve mostly ignored which package a function comes from because most of the time it doesnt matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which function a package comes from, well use the same syntax as R: <code>packagename::functionname()</code>.</p>
<p>Take careful note of the conflicts message thats printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, youll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far weve mostly ignored which package a function comes from because most of the time it doesnt matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which function a package comes from, well use the same syntax as R: <code>packagename::functionname()</code>.</p>
</section>
<section id="nycflights13" data-type="sect2">
<h2>
nycflights13</h2>
<p>To explore the basic dplyr verbs, were going to use <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0" data-type="xref">#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0</a>, and is documented in <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>.</p>
<p>To explore the basic dplyr verbs, were going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights
#&gt; # A tibble: 336,776 × 19
@@ -81,13 +81,13 @@ dplyr basics</h2>
<section id="rows" data-type="sect1">
<h1>
Rows</h1>
<p>The most important verbs that operate on rows are <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, which changes which rows are present without changing their order, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p>
<p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p>
<section id="filter" data-type="sect2">
<h2>
<code>filter()</code>
</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, youll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, youll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(arr_delay &gt; 120)
@@ -161,7 +161,7 @@ flights |&gt;
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>Well come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesnt modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesnt modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">jan1 &lt;- flights |&gt;
filter(month == 1 &amp; day == 1)</pre>
@@ -171,7 +171,7 @@ flights |&gt;
<section id="common-mistakes" data-type="sect2">
<h2>
Common mistakes</h2>
<p>When youre starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> will let you know when this happens:</p>
<p>When youre starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month = 1)
@@ -192,7 +192,7 @@ Common mistakes</h2>
<h2>
<code>arrange()</code>
</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, dep_time)
@@ -210,7 +210,7 @@ Common mistakes</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/desc</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(desc(dep_delay))
@@ -228,7 +228,7 @@ Common mistakes</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>You can combine <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left on roughly on time:</p>
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left on roughly on time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt;
@@ -264,20 +264,20 @@ Exercises</h2>
<li><p>Sort <code>flights</code> to find the flights with longest departure delays. Find the flights that left earliest in the morning.</p></li>
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
<li><p>Which flights traveled the farthest? Which traveled the shortest?</p></li>
<li><p>Does it matter what order you used <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> in if youre using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
<li><p>Does it matter what order you used <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> in if youre using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
</ol></section>
</section>
<section id="columns" data-type="sect1">
<h1>
Columns</h1>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> creates new columns that are functions of the existing columns; <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> change which columns are present, their names, or their positions.</p>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions.</p>
<section id="sec-mutate" data-type="sect2">
<h2>
<code>mutate()</code>
</h2>
<p>The job of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, youll learn a large set of functions that you can use to manipulate different types of variables. For now, well stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, youll learn a large set of functions that you can use to manipulate different types of variables. For now, well stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
@@ -299,7 +299,7 @@ Columns</h1>
#&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre>
</div>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see whats happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code>.</span>:</p>
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see whats happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
@@ -369,7 +369,7 @@ Columns</h1>
<h2>
<code>select()</code>
</h2>
<p>Its not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables youre interested in. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<p>Its not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables youre interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Select columns by name
flights |&gt;
@@ -430,7 +430,7 @@ flights |&gt;
#&gt; 6 UA N39463 EWR ORD
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are a number of helper functions you can use within <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p>
<p>There are a number of helper functions you can use within <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<ul><li>
<code>starts_with("abc")</code>: matches names that begin with “abc”.</li>
<li>
@@ -439,8 +439,8 @@ flights |&gt;
<code>contains("ijk")</code>: matches names that contain “ijk”.</li>
<li>
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
</ul><p>See <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) youll also be use <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code> to select variables that match a pattern.</p>
<p>You can rename variables as you <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) youll also be use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
select(tail_num = tailnum)
@@ -461,7 +461,7 @@ flights |&gt;
<h2>
<code>rename()</code>
</h2>
<p>If you just want to keep all the existing variables and just want to rename a few, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> instead of <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p>
<p>If you just want to keep all the existing variables and just want to rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
rename(tail_num = tailnum)
@@ -479,15 +479,15 @@ flights |&gt;
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>It works exactly the same way as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, but keeps all the variables that arent explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="#chp-https://rdrr.io/pkg/janitor/man/clean_names" data-type="xref">#chp-https://rdrr.io/pkg/janitor/man/clean_names</a></code> which provides some useful automated cleaning.</p>
<p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that arent explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> which provides some useful automated cleaning.</p>
</section>
<section id="relocate" data-type="sect2">
<h2>
<code>relocate()</code>
</h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> moves variables to the front:</p>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(time_hour, air_time)
@@ -505,7 +505,7 @@ flights |&gt;
#&gt; # dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, and abbreviated
#&gt; # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to choose where to put them:</p>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(year:dep_time, .after = time_hour)
@@ -549,9 +549,9 @@ Exercises</h2>
<ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li>
<li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> call?</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> call?</p></li>
<li>
<p>What does the <code><a href="#chp-https://tidyselect.r-lib.org/reference/all_of" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/all_of</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
</div>
@@ -568,13 +568,13 @@ Exercises</h2>
<section id="groups" data-type="sect1">
<h1>
Groups</h1>
<p>So far youve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, well focus on the most important functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, and the slice family of functions.</p>
<p>So far youve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, well focus on the most important functions: <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and the slice family of functions.</p>
<section id="group_by" data-type="sect2">
<h2>
<code>group_by()</code>
</h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month)
@@ -593,14 +593,14 @@ Groups</h1>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> doesnt change the data but, if you look closely at the output, youll notice that its now “grouped by” month. This means subsequent operations will now work “by month”.</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesnt change the data but, if you look closely at the output, youll notice that its now “grouped by” month. This means subsequent operations will now work “by month”.</p>
</section>
<section id="sec-summarize" data-type="sect2">
<h2>
<code>summarize()</code>
</h2>
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slightly simplification; later on youll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slightly simplification; later on youll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month) |&gt;
@@ -636,7 +636,7 @@ Groups</h1>
#&gt; 6 6 20.8
#&gt; # … with 6 more rows</pre>
</div>
<p>You can create any number of summaries in a single call to <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. Youll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>, which returns the number of rows in each group:</p>
<p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Youll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month) |&gt;
@@ -692,7 +692,7 @@ The<code>slice_</code> functions</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>This is similar to computing the max delay with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, but you get the whole row instead of the single summary:</p>
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
@@ -761,7 +761,7 @@ daily
<section id="ungrouping" data-type="sect2">
<h2>
Ungrouping</h2>
<p>You might also want to remove grouping outside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. You can do this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>.</p>
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily |&gt;
ungroup() |&gt;
@@ -783,15 +783,15 @@ Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
<li><p>Find the most delayed flight to each destination.</p></li>
<li><p>How do delays vary over the course of the day. Illustrate your answer with a plot.</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="#chp-https://dplyr.tidyverse.org/reference/slice" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/slice</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> does in terms of the dplyr verbs you just learn. What does the <code>sort</code> argument to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> do?</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learn. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
</ol></section>
</section>
<section id="sec-sample-size" data-type="sect1">
<h1>
Case study: aggregates and sample size</h1>
<p>Whenever you do any aggregation, its always a good idea to include a count (<code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>). That way, you can ensure that youre not drawing conclusions based on very small amounts of data. For example, lets look at the planes (identified by their tail number) that have the highest average delays:</p>
<p>Whenever you do any aggregation, its always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that youre not drawing conclusions based on very small amounts of data. For example, lets look at the planes (identified by their tail number) that have the highest average delays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">delays &lt;- flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
@@ -882,7 +882,7 @@ batters
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, youve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, those that manipulate the columns (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>), and those that manipulate groups (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>). In this chapter, weve focused on these “whole data frame” tools, but you havent yet learned much about what you can do with the individual variable. Well come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>In this chapter, youve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>). In this chapter, weve focused on these “whole data frame” tools, but you havent yet learned much about what you can do with the individual variable. Well come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>For now, well pivot back to workflow, and in the next chapter youll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that lead from magrittrs <code>%&gt;%</code> to base Rs <code>|&gt;</code>.</p>