More work on O'Reilly book

* Make width narrower
* Convert deps to table
* Strip chapter status
This commit is contained in:
Hadley Wickham
2022-11-18 11:05:00 -06:00
parent 5895db09cd
commit 69b4597f3b
33 changed files with 784 additions and 1048 deletions

View File

@@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@@ -57,14 +49,14 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America…
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America…
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America…
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America…
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America…
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America…
#&gt; # … with 1,452 more rows</pre>
</div>
</li>
@@ -73,14 +65,14 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi en EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi en EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi en AIRBUS… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
</div>
</li>
@@ -89,16 +81,17 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, and 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹wind_speed, ²wind_gust</pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@@ -147,8 +140,8 @@ weather |&gt;
filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;,
#&gt; # engine &lt;chr&gt;
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;
weather |&gt;
filter(is.na(time_hour) | is.na(origin))
@@ -189,18 +182,19 @@ Surrogate keys</h2>
mutate(id = row_number(), .before = 1)
flights2
#&gt; # A tibble: 336,776 × 20
#&gt; id year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,arr_delay</pre>
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable
#&gt; # names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre>
</div>
<p>Surrogate keys can be particular useful when communicating to other humans: its much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.</p>
</section>
@@ -247,14 +241,14 @@ flights2
left_join(airlines)
#&gt; Joining with `by = join_by(carrier)`
#&gt; # A tibble: 336,776 × 7
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines In
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines In
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines I
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines In
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or we could find out the temperature and wind speed when each plane departed:</p>
@@ -279,14 +273,14 @@ flights2
left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wi… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wi… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wi… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wi… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wi… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191
#&gt; # … with 336,770 more rows</pre>
</div>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, theres no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
@@ -318,14 +312,14 @@ Specifying join keys</h2>
left_join(planes)
#&gt; Joining with `by = join_by(year, tailnum)`
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier type manufactu…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; year time_hour origin dest tailnum carrier type manufa…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
</div>
@@ -334,17 +328,16 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y type manuf…¹
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed … BOEING
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed … BOEING
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed … BOEING
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed … AIRBUS
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed … BOEING
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed … BOEING
#&gt; # … with 336,770 more rows, 5 more variables: model &lt;chr&gt;, engines &lt;int&gt;,
#&gt; # seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name
#&gt; # ¹manufacturer</pre>
#&gt; year.x time_hour origin dest tailnum carrier year.y type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing
#&gt; # … with 336,770 more rows, and 6 more variables: manufacturer &lt;chr&gt;,
#&gt; # model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. Its important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. Thats why this type of join is often called an <strong>equi-join</strong>. Youll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
@@ -353,30 +346,30 @@ Specifying join keys</h2>
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Geor… 30.0 -95.3 97
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Geor… 30.0 -95.3 97
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miam… 25.8 -80.3 8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hart… 33.6 -84.4 1026
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chic… 42.0 -87.9 668
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George … 30.0 -95.3
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George … 30.0 -95.3
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami I… 25.8 -80.3
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfi… 33.6 -84.4
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago… 42.0 -87.9
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;
flights2 |&gt;
left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newa… 40.7 -74.2 18
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La G… 40.8 -73.9 22
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John… 40.6 -73.8 13
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John… 40.6 -73.8 13
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La G… 40.8 -73.9 22
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newa… 40.7 -74.2 18
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;</pre>
#&gt; year time_hour origin dest tailnum carrier name lat lon
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark … 40.7 -74.2
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guar… 40.8 -73.9
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F … 40.6 -73.8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F … 40.6 -73.8
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guar… 40.8 -73.9
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark … 40.7 -74.2
#&gt; # … with 336,770 more rows, and 4 more variables: alt &lt;dbl&gt;, tz &lt;dbl&gt;,
#&gt; # dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
</div>
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
@@ -405,14 +398,14 @@ Filtering joins</h2>
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Americ
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Americ
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Americ
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Americ
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Americ
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Americ
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunpo 35.0 -107. 5355 -7 A Amer…
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Amer…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Amer…
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Amer…
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Amer…
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Amer…
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that dont have a match in <code>y</code>. Theyre useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values dont show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that as missing from <code>airports</code> by looking for flights that dont have a matching destination airport:</p>
@@ -664,14 +657,14 @@ Allow multiple rows</h2>
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wi… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed wi… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed wi… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed wi… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed wi… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed wi… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; # … with 284,164 more rows</pre>
</div>
</section>