More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="joins-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>It's rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must <strong>join</strong> them together to answer the questions that you're interested in. This chapter will introduce you to two important types of joins:</p>
@@ -8,7 +8,7 @@ Introduction</h1>
<li>Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.</li>
</ul><p>We'll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we'll discuss how joins work, focusing on their action on the rows. We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.</p>
<section id="prerequisites" data-type="sect2">
<section id="joins-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
@@ -22,7 +22,7 @@ library(nycflights13)</pre>
<section id="keys" data-type="sect1">
<h1>
Keys</h1>
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, with on each table. In this section, you'll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You'll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you'll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You'll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
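<p>As a rough sketch of what checking a key can look like, assuming dplyr and nycflights13 are installed, you might confirm that <code>tailnum</code> is unique and never missing in <code>planes</code>. If the key is valid, both pipelines should return zero rows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# A primary key must identify each row uniquely: no value may appear twice
planes |&gt;
  count(tailnum) |&gt;
  filter(n &gt; 1)

# ... and it must never be missing
planes |&gt;
  filter(is.na(tailnum))</pre>
</div>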
<section id="primary-and-foreign-keys" data-type="sect2">
<h2>
@@ -46,51 +46,52 @@ Primary and foreign keys</h2>
</li>
<li>
<p><code>airports</code> records data about each airport. You can identify each airport by its three letter airport code, making <code>faa</code> the primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America…
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America…
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America…
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America…
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America…
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America…
#&gt; # … with 1,452 more rows</pre>
#&gt; faa name lat lon alt tz dst
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A
#&gt; 6 0A9 Elizabethton Municipal Airpo 36.4 -82.2 1593 -5 A
#&gt; # … with 1,452 more rows, and 1 more variable: tzone &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manufacturer model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing mul… EMBRAER EMB- 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing mul… EMBRAER EMB- 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows</pre>
#&gt; tailnum year type manufacturer model engines
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 N10156 2004 Fixed wing multi… EMBRAER EMB-145XR 2
#&gt; 2 N102UW 1998 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 3 N103US 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 4 N104UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 5 N10575 2002 Fixed wing multi… EMBRAER EMB-145LR 2
#&gt; 6 N105UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; # … with 3,316 more rows, and 3 more variables: seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
<div class="cell">
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
#&gt; # … with 26,109 more rows, and 5 more variables: wind_gust &lt;dbl&gt;,
#&gt; # precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; origin year month day hour temp dewp humid wind_dir
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240
#&gt; # … with 26,109 more rows, and 6 more variables: wind_speed &lt;dbl&gt;,
#&gt; # wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, </pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@@ -139,23 +140,20 @@ weather |&gt;
filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;,
weather |&gt;
filter(is.na(time_hour) | is.na(origin))
#&gt; # A tibble: 0 × 15
#&gt; # … with 15 variables: origin &lt;chr&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;,
#&gt; # wind_speed &lt;dbl&gt;, wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;,</pre>
</div>
</section>
<section id="surrogate-keys" data-type="sect2">
<h2>
Surrogate keys</h2>
<p>So far we haven't talked about the primary key for <code>flights</code>. It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if have some way to describe them to others.</p>
<p>So far we haven't talked about the primary key for <code>flights</code>. It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if we have some way to describe them to others.</p>
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
@@ -190,14 +188,12 @@ flights2
#&gt; 5 5 2013 1 1 554 600 -6 812
#&gt; 6 6 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;,</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating to other humans: it's much easier to tell someone to take a look at flight 2001 than to say "look at UA430 which departed 9am 2013-01-03".</p>
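<p>A minimal sketch of how such a surrogate key might be created and used, assuming dplyr and nycflights13 are installed. The name <code>flights_id</code> is only illustrative and is not used elsewhere in the chapter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Number the rows and place the new id column first
flights_id &lt;- flights |&gt;
  mutate(id = row_number(), .before = 1)

# Anyone can now pull up the same observation by its id
flights_id |&gt;
  filter(id == 2001)</pre>
</div>
<p>Because the id is just the row number, it is only stable as long as the row order of <code>flights</code> does not change.</p>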
</section>
<section id="exercises" data-type="sect2">
<section id="joins-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>We forgot to draw the relationship between <code>weather</code> and <code>airports</code> in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>. What is the relationship and how should it appear in the diagram?</p></li>
@@ -211,7 +207,7 @@ Exercises</h2>
<section id="sec-mutating-joins" data-type="sect1">
<h1>
Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, <code>anti_join(), and full_join()</code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, you'll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, you'll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2">
@@ -271,15 +267,15 @@ flights2
left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191
#&gt; # … with 336,770 more rows</pre>
#&gt; year time_hour origin dest tailnum carrier type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wing multi en…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wing multi en…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wing multi en…
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wing multi en…
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wing multi en…
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wing multi en…
#&gt; # … with 336,770 more rows, and 2 more variables: engines &lt;int&gt;, seats &lt;int&gt;</pre>
</div>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there's no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell">
@@ -326,16 +322,16 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing …
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing …
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing …
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing …
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing …
#&gt; # … with 336,770 more rows, and 6 more variables: manufacturer &lt;chr&gt;,
#&gt; # model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
#&gt; year.x time_hour origin dest tailnum carrier year.y
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012
#&gt; # … with 336,770 more rows, and 7 more variables: type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, </pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It's important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That's why this type of join is often called an <strong>equi-join</strong>. You'll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
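<p>The fuller form and the <code>suffix</code> argument can be combined in a small sketch like the following. It assumes <code>flights2</code> holds the six columns shown in the output above, and the suffix strings are arbitrary choices made for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Assumed definition, matching the columns printed above
flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)

# Explicit equality, with custom suffixes for the clashing year columns
flights2 |&gt;
  left_join(planes, join_by(tailnum == tailnum), suffix = c("_flight", "_plane"))</pre>
</div>
<p>With these suffixes the two <code>year</code> columns should appear as <code>year_flight</code> and <code>year_plane</code> rather than <code>year.x</code> and <code>year.y</code>.</p>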
@@ -344,30 +340,30 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George … 30.0 -95.3
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George … 30.0 -95.3
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami I… 25.8 -80.3
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfi… 33.6 -84.4
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago… 42.0 -87.9
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George Bush Interco…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George Bush Interco…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfield Jackson …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago Ohare Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;
flights2 |&gt;
left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark … 40.7 -74.2
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guar… 40.8 -73.9
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F … 40.6 -73.8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F … 40.6 -73.8
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guar… 40.8 -73.9
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark … 40.7 -74.2
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark Liberty Intl
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guardia
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F Kennedy Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F Kennedy Intl
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guardia
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark Liberty Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
</div>
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
@@ -396,17 +392,17 @@ Filtering joins</h2>
<pre data-type="programlisting" data-code-language="r">airports |&gt;
semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunpo… 35.0 -107. 5355 -7 A Amer
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Amer…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Amer…
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Amer…
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Amer…
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Amer
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque Internati… 35.0 -107. 5355 -7 A America/Denver
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_Yo
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_Yo
#&gt; 4 ANC Ted Stevens Anchorage 61.2 -150. 152 -9 A America/Anchor
#&gt; 5 ATL Hartsfield Jackson At 33.6 -84.4 1026 -5 A America/New_Yo
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chicago
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that as missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
anti_join(airports, join_by(dest == faa)) |&gt;
@@ -437,7 +433,7 @@ Filtering joins</h2>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="joins-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the <code>weather</code> data. Can you see any patterns?</p></li>
@@ -655,15 +651,15 @@ Allow multiple rows</h2>
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; # … with 284,164 more rows</pre>
#&gt; tailnum type engines seats year time_hour origin
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 06:00:00 EWR
#&gt; 2 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 10:00:00 EWR
#&gt; 3 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 15:00:00 EWR
#&gt; 4 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 06:00:00 EWR
#&gt; 5 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 11:00:00 EWR
#&gt; 6 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 18:00:00 EWR
#&gt; # … with 284,164 more rows, and 2 more variables: dest &lt;chr&gt;, carrier &lt;chr&gt;</pre>
</div>
</section>
@@ -814,19 +810,19 @@ Rolling joins</h2>
<p>Now imagine that you have a table of employee birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees &lt;- tibble(
name = wakefield::name(100),
name = sample(babynames::babynames$name, 100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
#&gt; # A tibble: 100 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11
#&gt; 2 Santania 2022-03-01
#&gt; 3 Gardell 2022-03-04
#&gt; 4 Cyrille 2022-11-15
#&gt; 5 Kynli 2022-07-09
#&gt; 6 Sever 2022-02-03
#&gt; name birthday
#&gt; &lt;chr&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13
#&gt; 2 Shonnie 2022-03-30
#&gt; 3 Burnard 2022-01-10
#&gt; 4 Omer 2022-11-25
#&gt; 5 Hillel 2022-07-30
#&gt; 6 Curlie 2022-12-11
#&gt; # … with 94 more rows</pre>
</div>
<p>And for each employee we want to find the last party date that comes before (or on) their birthday. We can express that with a rolling join:</p>
@@ -834,27 +830,22 @@ employees
<pre data-type="programlisting" data-code-language="r">employees |&gt;
left_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 100 × 4
#&gt; name birthday q party
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11
#&gt; 2 Santania 2022-03-01 1 2022-01-10
#&gt; 3 Gardell 2022-03-04 1 2022-01-10
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03
#&gt; 5 Kynli 2022-07-09 2 2022-04-04
#&gt; 6 Sever 2022-02-03 1 2022-01-10
#&gt; name birthday q party
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10
#&gt; 3 Burnard 2022-01-10 1 2022-01-10
#&gt; 4 Omer 2022-11-25 4 2022-10-03
#&gt; 5 Hillel 2022-07-30 3 2022-07-11
#&gt; 6 Curlie 2022-12-11 4 2022-10-03
#&gt; # … with 94 more rows</pre>
</div>
<p>There is, however, one problem with this approach: the folks with birthdays before January 10 don't get a party:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |&gt;
anti_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 4 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Janeida 2022-01-04
#&gt; 2 Aires 2022-01-07
#&gt; 3 Mikalya 2022-01-06
#&gt; 4 Carlynn 2022-01-08</pre>
#&gt; # A tibble: 0 × 2
#&gt; # … with 2 variables: name &lt;chr&gt;, birthday &lt;date&gt;</pre>
</div>
<p>To resolve that issue we'll need to tackle the problem a different way, with overlap joins.</p>
</section>
@@ -910,19 +901,19 @@ parties
<pre data-type="programlisting" data-code-language="r">employees |&gt;
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
#&gt; # A tibble: 100 × 6
#&gt; name birthday q party start end
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Santania 2022-03-01 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Gardell 2022-03-04 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Kynli 2022-07-09 2 2022-04-04 2022-04-04 2022-07-10
#&gt; 6 Sever 2022-02-03 1 2022-01-10 2022-01-01 2022-04-03
#&gt; name birthday q party start end
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Burnard 2022-01-10 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Omer 2022-11-25 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Hillel 2022-07-30 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 6 Curlie 2022-12-11 4 2022-10-03 2022-10-03 2022-12-31
#&gt; # … with 94 more rows</pre>
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="joins-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -951,7 +942,7 @@ x |&gt; full_join(y, by = "key", keep = TRUE)
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="joins-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.</p>