More minor page count tweaks & fixes

And re-convert with latest htmlbook
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions
--- a/oreilly/data-transform.html
+++ b/oreilly/data-transform.html
@@ -1,12 +1,12 @@
 <section data-type="chapter" id="chp-data-transform">
 <h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1>
-<section id="introduction" data-type="sect1">
+<section id="data-transform-introduction" data-type="sect1">
 <h1>
 Introduction</h1>
 <p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
 <p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>

-<section id="prerequisites" data-type="sect2">
+<section id="data-transform-prerequisites" data-type="sect2">
 <h2>
 Prerequisites</h2>
 <p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
@@ -15,14 +15,14 @@ Prerequisites</h2>
 library(tidyverse)
 #&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
 #&gt; ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3      
-#&gt; ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000 
+#&gt; ✔ forcats   0.5.2           ✔ stringr   1.5.0      
 #&gt; ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8      
-#&gt; ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001 
+#&gt; ✔ lubridate 1.9.0           ✔ tidyr     1.3.0      
 #&gt; ✔ purrr     1.0.1           
 #&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
 #&gt; ✖ dplyr::filter() masks stats::filter()
 #&gt; ✖ dplyr::lag()    masks stats::lag()
-#&gt; ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors</pre>
+#&gt; ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
 </div>
 <p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
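<p>As a rough illustration of the <code>packagename::functionname()</code> syntax, the sketch below calls both versions of <code>filter()</code> side by side. The vector <code>x</code> and the tibble <code>df</code> are made-up toy data, not part of this chapter’s dataset.</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
x <- c(1, 2, 3, 4, 5)
df <- tibble(id = 1:5, value = x)

# dplyr's filter(): keeps the rows of a data frame that satisfy a condition
kept <- dplyr::filter(df, value > 3)

# Base R's filter(): applies a linear filter to a numeric series;
# here the weights c(0.5, 0.5) give a two-point moving average
smoothed <- stats::filter(x, rep(1 / 2, 2))
```

<p>Writing the package name explicitly removes any ambiguity about which function runs, at the cost of a little extra typing.</p>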
 </section>
@@ -43,9 +43,7 @@ nycflights13</h2>
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably <code>View(flights)</code>, which will open an interactive scrollable and filterable view. Otherwise you can use <code>print(flights, width = Inf)</code> to show all columns, or call <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>:</p>
 <div class="cell">
@@ -103,7 +101,7 @@ Rows</h1>

 <section id="filter" data-type="sect2">
 <h2>
-<code>filter()</code>
+filter()
 </h2>
 <p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
 <div class="cell">
@@ -119,9 +117,7 @@ Rows</h1>
 #&gt; 5  2013     1     1     1505           1310       115     1638           1431
 #&gt; 6  2013     1     1     1525           1340       105     1831           1626
 #&gt; # … with 10,028 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>As well as <code>&gt;</code> (greater than), you can use <code>&gt;=</code> (greater than or equal to), <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&amp;</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
 <div class="cell">
@@ -138,9 +134,7 @@ flights |&gt;
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …

 # Flights that departed in January or February
 flights |&gt; 
@@ -155,9 +149,7 @@ flights |&gt;
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
 <div class="cell">
@@ -174,9 +166,7 @@ flights |&gt;
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
 <p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
@@ -208,7 +198,7 @@ Common mistakes</h2>

 <section id="arrange" data-type="sect2">
 <h2>
-<code>arrange()</code>
+arrange()
 </h2>
 <p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
 <div class="cell">
@@ -224,9 +214,7 @@ Common mistakes</h2>
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
 <div class="cell">
@@ -242,9 +230,7 @@ Common mistakes</h2>
 #&gt; 5  2013     7    22      845           1600      1005     1044           1815
 #&gt; 6  2013     4    10     1100           1900       960     1342           2211
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival but left roughly on time:</p>
 <div class="cell">
@@ -261,17 +247,15 @@ Common mistakes</h2>
 #&gt; 5  2013     9    19      648            641         7     1035            810
 #&gt; 6  2013     4    18      655            700        -5     1213            950
 #&gt; # … with 239,103 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 </section>

 <section id="distinct" data-type="sect2">
 <h2>
-<code>distinct()</code>
+distinct()
 </h2>
-<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want to the distinct combination of some variables, so you can also optionally supply column names:</p>
+<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r"># This would remove any duplicate rows if there were any
 flights |&gt; 
@@ -286,9 +270,7 @@ flights |&gt;
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …

 # This finds all unique origin and destination pairs.
 flights |&gt; 
@@ -307,7 +289,7 @@ flights |&gt;
 <p>Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> for <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and then filtering as needed.</p>
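<p>A minimal sketch of that <code>count()</code>-then-filter approach, assuming a small made-up tibble called <code>trips</code> (a hypothetical stand-in for <code>flights</code>) that contains one repeated origin and destination pair:</p>

```r
library(dplyr)

# Hypothetical toy data with one repeated origin/dest pair
trips <- tibble(
  origin = c("JFK", "JFK", "LGA", "EWR"),
  dest   = c("LAX", "LAX", "ORD", "SFO")
)

# count() returns one row per combination, plus a column n
# holding the number of times that combination occurs
pair_counts <- trips |> count(origin, dest)

# Combinations that occur more than once (the duplicates)
duplicated_pairs <- pair_counts |> filter(n > 1)

# Combinations that occur exactly once (never duplicated)
unique_pairs <- pair_counts |> filter(n == 1)
```

<p>Unlike <code>distinct()</code>, this keeps the number of occurrences, so you can filter on it in either direction.</p>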
 </section>

-<section id="exercises" data-type="sect2">
+<section id="data-transform-exercises" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li>
@@ -334,7 +316,7 @@ Columns</h1>

 <section id="sec-mutate" data-type="sect2">
 <h2>
-<code>mutate()</code>
+mutate()
 </h2>
 <p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
 <div class="cell">
@@ -353,9 +335,7 @@ Columns</h1>
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 13 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, speed &lt;dbl&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
 <div class="cell">
@@ -375,9 +355,7 @@ Columns</h1>
 #&gt; 5    19  394.  2013     1     1      554            600        -6      812
 #&gt; 6   -16  288.  2013     1     1      554            558        -4      740
 #&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
-#&gt; #   arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
-#&gt; #   origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
-#&gt; #   minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
+#&gt; #   arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
 </div>
 <p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use the variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
 <div class="cell">
@@ -397,14 +375,12 @@ Columns</h1>
 #&gt; 5  2013     1     1    19  394.      554            600        -6      812
 #&gt; 6  2013     1     1   -16  288.      554            558        -4      740
 #&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
-#&gt; #   arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
-#&gt; #   origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
-#&gt; #   minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
+#&gt; #   arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
 </div>
 <p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which allows you to see the inputs and outputs from your calculations:</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">flights |&gt; 
-  mutate(,
+  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
@@ -425,7 +401,7 @@ Columns</h1>

 <section id="sec-select" data-type="sect2">
 <h2>
-<code>select()</code>
+select()
 </h2>
 <p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
 <div class="cell">
@@ -470,8 +446,7 @@ flights |&gt;
 #&gt; 5      554            600        -6      812            837       -25 DL     
 #&gt; 6      554            558        -4      740            728        12 UA     
 #&gt; # … with 336,770 more rows, and 9 more variables: flight &lt;int&gt;,
-#&gt; #   tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
-#&gt; #   hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;
+#&gt; #   tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, …

 # Select all columns that are characters
 flights |&gt; 
@@ -516,7 +491,7 @@ flights |&gt;

 <section id="rename" data-type="sect2">
 <h2>
-<code>rename()</code>
+rename()
 </h2>
 <p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
 <div class="cell">
@@ -532,9 +507,7 @@ flights |&gt;
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
 <p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> which provides some useful automated cleaning.</p>
@@ -542,7 +515,7 @@ flights |&gt;

 <section id="relocate" data-type="sect2">
 <h2>
-<code>relocate()</code>
+relocate()
 </h2>
 <p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
 <div class="cell">
@@ -558,9 +531,7 @@ flights |&gt;
 #&gt; 5 2013-01-01 06:00:00      116  2013     1     1      554            600
 #&gt; 6 2013-01-01 05:00:00      150  2013     1     1      554            558
 #&gt; # … with 336,770 more rows, and 12 more variables: dep_delay &lt;dbl&gt;,
-#&gt; #   arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;,
-#&gt; #   flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, distance &lt;dbl&gt;,
-#&gt; #   hour &lt;dbl&gt;, minute &lt;dbl&gt;</pre>
+#&gt; #   arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, …</pre>
 </div>
 <p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
 <div class="cell">
@@ -576,9 +547,7 @@ flights |&gt;
 #&gt; 5            600        -6      812            837       -25 DL         461
 #&gt; 6            558        -4      740            728        12 UA        1696
 #&gt; # … with 336,770 more rows, and 12 more variables: tailnum &lt;chr&gt;,
-#&gt; #   origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
-#&gt; #   minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
-#&gt; #   dep_time &lt;int&gt;
+#&gt; #   origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, …
 flights |&gt; 
  relocate(starts_with("arr"), .before = dep_time)
 #&gt; # A tibble: 336,776 × 19
@@ -591,13 +560,11 @@ flights |&gt;
 #&gt; 5  2013     1     1      812       -25      554            600        -6
 #&gt; 6  2013     1     1      740        12      554            558        -4
 #&gt; # … with 336,770 more rows, and 11 more variables: sched_arr_time &lt;int&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 </section>

-<section id="exercises-1" data-type="sect2">
+<section id="data-transform-exercises-1" data-type="sect2">
 <h2>
 Exercises</h2>
 <div class="cell">
@@ -629,7 +596,7 @@ Groups</h1>

 <section id="group_by" data-type="sect2">
 <h2>
-<code>group_by()</code>
+group_by()
 </h2>
 <p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
 <div class="cell">
@@ -646,16 +613,14 @@ Groups</h1>
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.</p>
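<p>To make the point concrete, here is a small sketch on a made-up tibble (<code>toy</code> is hypothetical, not the <code>flights</code> data). Grouping leaves the rows and columns untouched and only records the grouping variable, which a later verb then uses:</p>

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(month = c(1, 1, 2), dep_delay = c(10, 20, 30))

by_month <- toy |> group_by(month)

# Same number of rows and the same columns as before grouping
same_shape <- nrow(by_month) == nrow(toy) && identical(names(by_month), names(toy))

# The grouping is stored as metadata on the data frame
grouping <- group_vars(by_month)

# A subsequent verb now works "by month": one row per group
monthly <- by_month |> summarize(avg_delay = mean(dep_delay))
```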
 </section>

 <section id="sec-summarize" data-type="sect2">
 <h2>
-<code>summarize()</code>
+summarize()
 </h2>
 <p>The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code><span data-type="footnote">Or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, if you prefer British English.</span>, as shown by the following example, which computes the average departure delay by month:</p>
 <div class="cell">
@@ -717,7 +682,7 @@ Groups</h1>

 <section id="the-slice_-functions" data-type="sect2">
 <h2>
-The<code>slice_</code> functions</h2>
+The slice_ functions</h2>
 <p>There are five handy functions that allow you to pick off specific rows within each group:</p>
 <ul><li>
 <code>df |&gt; slice_head(n = 1)</code> takes the first row from each group.</li>
@@ -745,9 +710,7 @@ The<code>slice_</code> functions</h2>
 #&gt; 5  2013     7    22     2257            759       898      121           1026
 #&gt; 6  2013     7    10     2056           1505       351     2347           1758
 #&gt; # … with 102 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
 <div class="cell">
@@ -791,9 +754,7 @@ daily
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
 <div class="cell">
@@ -834,7 +795,7 @@ Ungrouping</h2>
 <p>As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.</p>
 </section>

-<section id="exercises-2" data-type="sect2">
+<section id="data-transform-exercises-2" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
@@ -996,7 +957,7 @@ batters
 <p>You can find a good explanation of this problem and how to overcome it at <a href="http://varianceexplained.org/r/empirical_bayes_baseball/" class="uri">http://varianceexplained.org/r/empirical_bayes_baseball/</a> and <a href="https://www.evanmiller.org/how-not-to-sort-by-average-rating.html" class="uri">https://www.evanmiller.org/how-not-to-sort-by-average-rating.html</a>.</p>
 </section>

-<section id="summary" data-type="sect1">
+<section id="data-transform-summary" data-type="sect1">
 <h1>
 Summary</h1>
 <p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). We’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>