Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions


@@ -65,15 +65,15 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
#&gt; tailnum year type manufacturer model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows</pre>
</div>
</li>
<li>
@@ -81,17 +81,16 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹wind_speed, ²wind_gust</pre>
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
#&gt; # … with 26,109 more rows, and 5 more variables: wind_gust &lt;dbl&gt;,
#&gt; # precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@@ -102,7 +101,7 @@ Primary and foreign keys</h2>
<li>
<code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code> .</li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
</ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
@@ -110,7 +109,7 @@ Primary and foreign keys</h2>
<div class="cell-output-display">
<figure id="fig-flights-relationships"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa is connected to flights$origin and flights$dest. planes$tailnum is connected to flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
</figure>
</div>
</div>
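<p>One way to sanity-check a foreign key is to look for values that have no match in the table holding the primary key. The sketch below is an illustration rather than part of the original text: it assumes dplyr 1.1.0 or later (for <code>join_by()</code>) and the nycflights13 package, and the name <code>unmatched</code> is hypothetical. It uses <code>anti_join()</code> to list the <code>flights$tailnum</code> values that do not appear in <code>planes$tailnum</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Keep only flights whose tailnum has no matching row in planes,
# then count how often each unmatched tailnum occurs.
unmatched &lt;- flights |&gt;
  filter(!is.na(tailnum)) |&gt;
  anti_join(planes, join_by(tailnum)) |&gt;
  count(tailnum, sort = TRUE)

unmatched</pre>
</div>
<p>Any rows returned identify planes that flew but are not recorded in <code>planes</code>, which is worth knowing before joining the two tables.</p>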
@@ -182,19 +181,18 @@ Surrogate keys</h2>
mutate(id = row_number(), .before = 1)
flights2
#&gt; # A tibble: 336,776 × 20
#&gt; id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable
#&gt; # names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre>
#&gt; id year month day dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 1 2013 1 1 517 515 2 830
#&gt; 2 2 2013 1 1 533 529 4 850
#&gt; 3 3 2013 1 1 542 540 2 923
#&gt; 4 4 2013 1 1 544 545 -1 1004
#&gt; 5 5 2013 1 1 554 600 -6 812
#&gt; 6 6 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating with other humans: it's much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.</p>
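<p>Before relying on a surrogate key, it can be worth confirming that it really behaves like a primary key. The following is a minimal sketch, assuming dplyr and nycflights13 are installed; it rebuilds <code>flights2</code> as above and checks that no <code>id</code> is duplicated or missing. The names <code>duplicated_ids</code> and <code>missing_ids</code> are hypothetical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

flights2 &lt;- flights |&gt;
  mutate(id = row_number(), .before = 1)

# A primary key must be unique: no id may occur more than once.
duplicated_ids &lt;- flights2 |&gt;
  count(id) |&gt;
  filter(n &gt; 1)

# A primary key must also be complete: no id may be missing.
missing_ids &lt;- flights2 |&gt;
  filter(is.na(id))

nrow(duplicated_ids)
nrow(missing_ids)</pre>
</div>
<p>If both checks come back empty, <code>id</code> identifies each flight uniquely.</p>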
</section>
@@ -312,16 +310,16 @@ Specifying join keys</h2>
left_join(planes)
#&gt; Joining with `by = join_by(year, tailnum)`
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier type manufa…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
#&gt; year time_hour origin dest tailnum carrier type manufacturer
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, and 5 more variables: model &lt;chr&gt;,
#&gt; # engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
<div class="cell">
@@ -341,7 +339,7 @@ Specifying join keys</h2>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
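<p>As a rough illustration of the <code>suffix</code> argument, the sketch below replaces the default <code>.x</code> and <code>.y</code> with more descriptive endings. It assumes dplyr 1.1.0 or later and nycflights13; the suffix strings <code>"_flight"</code> and <code>"_plane"</code> and the name <code>joined</code> are arbitrary choices made for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

joined &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier) |&gt;
  # suffix is applied only to non-key columns present in both tables
  left_join(planes, join_by(tailnum), suffix = c("_flight", "_plane"))

joined |&gt;
  select(tailnum, year_flight, year_plane)</pre>
</div>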
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It's important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That's why this type of join is often called an <strong>equi-join</strong>. You'll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<p>Secondly, it's how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin:</code></p>
<p>Secondly, it's how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(airports, join_by(dest == faa))
@@ -461,12 +459,12 @@ Exercises</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |&gt;
semi_join(flights, join_by(faa == dest)) |&gt;
ggplot(aes(lon, lat)) +
ggplot(aes(x = lon, y = lat)) +
borders("state") +
geom_point() +
coord_quickmap()</pre>
</div>
<p>You might want to use the <code>size</code> or <code>colour</code> of the points to display the average delay for each airport.</p>
<p>You might want to use the <code>size</code> or <code>color</code> of the points to display the average delay for each airport.</p>
</li>
<li><p>What happened on June 13, 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
</ol></section>
@@ -493,8 +491,8 @@ y &lt;- tribble(
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption>Graphical representation of two simple tables. The coloured <code>key</code> columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption>Graphical representation of two simple tables. The colored <code>key</code> columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
</figure>
</div>
</div>
@@ -710,11 +708,11 @@ Non-equi joins</h1>
<div class="cell-output-display">
<figure id="fig-inner-both"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one. " width="415"/></p>
<figcaption>An left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
<figcaption>A left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
</figure>
</div>
</div>
<p>When we move away from equi-joins we'll always show the keys, because the key values will often different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr's join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
<p>When we move away from equi-joins we'll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr's join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
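<p>The greater-than-or-equal match described above could be written roughly as follows. This is a sketch under stated assumptions: it needs dplyr 1.1.0 or later, and the contents of <code>x</code> and <code>y</code> are assumed here (three rows each, with keys 1, 2, 3 and 1, 2, 4) to mirror the small example tables. The name <code>gte</code> is hypothetical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Assumed example tables; the values are illustrative.
x &lt;- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y &lt;- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# In join_by(), the left-hand side refers to x and the right-hand side to y,
# so this keeps every pair of rows where x$key is at least y$key.
gte &lt;- x |&gt;
  inner_join(y, join_by(key &gt;= key))

gte</pre>
</div>
<p>Because the keys are no longer guaranteed to be equal, the result carries both <code>key.x</code> and <code>key.y</code>.</p>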
<div class="cell">
<div class="cell-output-display">
@@ -746,10 +744,10 @@ Cross joins</h2>
</figure>
</div>
</div>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we're joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we're joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>. Cross joins use a different join function because there's no distinction between inner/left/right/full when you're matching every row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |&gt; left_join(df, join_by())
df |&gt; cross_join(df)
#&gt; # A tibble: 16 × 2
#&gt; name.x name.y
#&gt; &lt;chr&gt; &lt;chr&gt;