More minor page count tweaks & fixes
And re-convert with latest htmlbook
		@@ -1,6 +1,6 @@
 | 
			
		||||
<section data-type="chapter" id="chp-joins">
 | 
			
		||||
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1>
 | 
			
		||||
<section id="introduction" data-type="sect1">
 | 
			
		||||
<section id="joins-introduction" data-type="sect1">
 | 
			
		||||
<h1>
 | 
			
		||||
Introduction</h1>
 | 
			
		||||
<p>It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must <strong>join</strong> them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:</p>
 | 
			
		||||
@@ -8,7 +8,7 @@ Introduction</h1>
 | 
			
		||||
<li>Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.</li>
 | 
			
		||||
</ul><p>We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.</p>
 | 
			
		||||
 | 
			
		||||
<section id="prerequisites" data-type="sect2">
 | 
			
		||||
<section id="joins-prerequisites" data-type="sect2">
 | 
			
		||||
<h2>
 | 
			
		||||
Prerequisites</h2>
 | 
			
		||||
<p>In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
 | 
			
		||||
@@ -22,7 +22,7 @@ library(nycflights13)</pre>
 | 
			
		||||
<section id="keys" data-type="sect1">
 | 
			
		||||
<h1>
 | 
			
		||||
Keys</h1>
 | 
			
		||||
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, with on each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
 | 
			
		||||
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
 | 
			
		||||
 | 
			
		||||
<section id="primary-and-foreign-keys" data-type="sect2">
 | 
			
		||||
<h2>
 | 
			
		||||
@@ -46,51 +46,52 @@ Primary and foreign keys</h2>
 | 
			
		||||
</li>
 | 
			
		||||
<li>
 | 
			
		||||
<p><code>airports</code> records data about each airport. You can identify each airport by its three letter airport code, making <code>faa</code> the primary key.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<div class="cell" data-r.options="{"width":67}">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">airports
 | 
			
		||||
#> # A tibble: 1,458 × 8
 | 
			
		||||
#>   faa   name                             lat   lon   alt    tz dst   tzone   
 | 
			
		||||
#>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>   
 | 
			
		||||
#> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America…
 | 
			
		||||
#> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America…
 | 
			
		||||
#> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America…
 | 
			
		||||
#> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America…
 | 
			
		||||
#> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America…
 | 
			
		||||
#> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America…
 | 
			
		||||
#> # … with 1,452 more rows</pre>
 | 
			
		||||
#>   faa   name                            lat   lon   alt    tz dst  
 | 
			
		||||
#>   <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr>
 | 
			
		||||
#> 1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A    
 | 
			
		||||
#> 2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A    
 | 
			
		||||
#> 3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A    
 | 
			
		||||
#> 4 06N   Randall Airport                41.4 -74.4   523    -5 A    
 | 
			
		||||
#> 5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A    
 | 
			
		||||
#> 6 0A9   Elizabethton Municipal Airpo…  36.4 -82.2  1593    -5 A    
 | 
			
		||||
#> # … with 1,452 more rows, and 1 more variable: tzone <chr></pre>
 | 
			
		||||
</div>
 | 
			
		||||
</li>
 | 
			
		||||
<li>
 | 
			
		||||
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<div class="cell" data-r.options="{"width":67}">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">planes
 | 
			
		||||
#> # A tibble: 3,322 × 9
 | 
			
		||||
#>   tailnum  year type            manufacturer model engines seats speed engine
 | 
			
		||||
#>   <chr>   <int> <chr>           <chr>        <chr>   <int> <int> <int> <chr> 
 | 
			
		||||
#> 1 N10156   2004 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
 | 
			
		||||
#> 2 N102UW   1998 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
 | 
			
		||||
#> 3 N103US   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
 | 
			
		||||
#> 4 N104UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
 | 
			
		||||
#> 5 N10575   2002 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
 | 
			
		||||
#> 6 N105UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
 | 
			
		||||
#> # … with 3,316 more rows</pre>
 | 
			
		||||
#>   tailnum  year type              manufacturer    model     engines
 | 
			
		||||
#>   <chr>   <int> <chr>             <chr>           <chr>       <int>
 | 
			
		||||
#> 1 N10156   2004 Fixed wing multi… EMBRAER         EMB-145XR       2
 | 
			
		||||
#> 2 N102UW   1998 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
 | 
			
		||||
#> 3 N103US   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
 | 
			
		||||
#> 4 N104UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
 | 
			
		||||
#> 5 N10575   2002 Fixed wing multi… EMBRAER         EMB-145LR       2
 | 
			
		||||
#> 6 N105UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2
 | 
			
		||||
#> # … with 3,316 more rows, and 3 more variables: seats <int>,
 | 
			
		||||
#> #   speed <int>, engine <chr></pre>
 | 
			
		||||
</div>
 | 
			
		||||
</li>
 | 
			
		||||
<li>
 | 
			
		||||
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<div class="cell" data-r.options="{"width":67}">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">weather
 | 
			
		||||
#> # A tibble: 26,115 × 15
 | 
			
		||||
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
 | 
			
		||||
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 | 
			
		||||
#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 | 
			
		||||
#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 | 
			
		||||
#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 | 
			
		||||
#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 | 
			
		||||
#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 | 
			
		||||
#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 | 
			
		||||
#> # … with 26,109 more rows, and 5 more variables: wind_gust <dbl>,
 | 
			
		||||
#> #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm></pre>
 | 
			
		||||
#>   origin  year month   day  hour  temp  dewp humid wind_dir
 | 
			
		||||
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
 | 
			
		||||
#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270
 | 
			
		||||
#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250
 | 
			
		||||
#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240
 | 
			
		||||
#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250
 | 
			
		||||
#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260
 | 
			
		||||
#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240
 | 
			
		||||
#> # … with 26,109 more rows, and 6 more variables: wind_speed <dbl>,
 | 
			
		||||
#> #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, …</pre>
 | 
			
		||||
</div>
 | 
			
		||||
</li>
 | 
			
		||||
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
 | 
			
		||||
@@ -139,23 +140,20 @@ weather |>
 | 
			
		||||
  filter(is.na(tailnum))
 | 
			
		||||
#> # A tibble: 0 × 9
 | 
			
		||||
#> # … with 9 variables: tailnum <chr>, year <int>, type <chr>,
 | 
			
		||||
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
 | 
			
		||||
#> #   speed <int>, engine <chr>
 | 
			
		||||
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, …
 | 
			
		||||
 | 
			
		||||
weather |> 
 | 
			
		||||
  filter(is.na(time_hour) | is.na(origin))
 | 
			
		||||
#> # A tibble: 0 × 15
 | 
			
		||||
#> # … with 15 variables: origin <chr>, year <int>, month <int>, day <int>,
 | 
			
		||||
#> #   hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>,
 | 
			
		||||
#> #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
 | 
			
		||||
#> #   visib <dbl>, time_hour <dttm></pre>
 | 
			
		||||
#> #   hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, …</pre>
 | 
			
		||||
</div>
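<p>The same kind of check can be tried on a small table. The sketch below is only an illustration: <code>toy</code>, <code>code</code>, and <code>value</code> are made-up names and values, not part of nycflights13. It assumes that a valid primary key should have neither duplicated nor missing values, and looks for both.</p>

```r
library(dplyr)

# Hypothetical toy table: `code` is the candidate primary key.
# It deliberately contains one duplicate ("BBB") and one missing value.
toy <- tibble(
  code  = c("AAA", "BBB", "BBB", NA),
  value = 1:4
)

# Key values that occur more than once
dupes <- toy |>
  count(code) |>
  filter(n > 1)

# Rows where the key is missing
missing <- toy |>
  filter(is.na(code))

dupes
missing
```

<p>If both results have zero rows, the candidate key passes these two checks; here neither does, so <code>code</code> would not be a usable primary key for this toy table.</p>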
 | 
			
		||||
</section>
 | 
			
		||||
 | 
			
		||||
<section id="surrogate-keys" data-type="sect2">
 | 
			
		||||
<h2>
 | 
			
		||||
Surrogate keys</h2>
 | 
			
		||||
<p>So far we haven’t talked about the primary key for <code>flights</code>. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if have some way to describe them to others.</p>
 | 
			
		||||
<p>So far we haven’t talked about the primary key for <code>flights</code>. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.</p>
 | 
			
		||||
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">flights |> 
 | 
			
		||||
@@ -190,14 +188,12 @@ flights2
 | 
			
		||||
#> 5     5  2013     1     1      554            600        -6      812
 | 
			
		||||
#> 6     6  2013     1     1      554            558        -4      740
 | 
			
		||||
#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
 | 
			
		||||
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
 | 
			
		||||
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 | 
			
		||||
#> #   minute <dbl>, time_hour <dttm></pre>
 | 
			
		||||
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, …</pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.</p>
 | 
			
		||||
</section>
 | 
			
		||||
 | 
			
		||||
<section id="exercises" data-type="sect2">
 | 
			
		||||
<section id="joins-exercises" data-type="sect2">
 | 
			
		||||
<h2>
 | 
			
		||||
Exercises</h2>
 | 
			
		||||
<ol type="1"><li><p>We forgot to draw the relationship between <code>weather</code> and <code>airports</code> in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>. What is the relationship and how should it appear in the diagram?</p></li>
 | 
			
		||||
@@ -211,7 +207,7 @@ Exercises</h2>
 | 
			
		||||
<section id="sec-mutating-joins" data-type="sect1">
 | 
			
		||||
<h1>
 | 
			
		||||
Basic joins</h1>
 | 
			
		||||
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
 | 
			
		||||
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
 | 
			
		||||
<p>In this section, you’ll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, you’ll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
 | 
			
		||||
 | 
			
		||||
<section id="mutating-joins" data-type="sect2">
 | 
			
		||||
@@ -271,15 +267,15 @@ flights2
 | 
			
		||||
  left_join(planes |> select(tailnum, type, engines, seats))
 | 
			
		||||
#> Joining with `by = join_by(tailnum)`
 | 
			
		||||
#> # A tibble: 336,776 × 9
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier type   engines seats
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <int> <int>
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed…       2   149
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed…       2   149
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed…       2   178
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed…       2   200
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed…       2   178
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed…       2   191
 | 
			
		||||
#> # … with 336,770 more rows</pre>
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier type                
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed wing multi en…
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed wing multi en…
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed wing multi en…
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed wing multi en…
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed wing multi en…
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed wing multi en…
 | 
			
		||||
#> # … with 336,770 more rows, and 2 more variables: engines <int>, seats <int></pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
@@ -326,16 +322,16 @@ Specifying join keys</h2>
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">flights2 |> 
 | 
			
		||||
  left_join(planes, join_by(tailnum))
 | 
			
		||||
#> # A tibble: 336,776 × 14
 | 
			
		||||
#>   year.x time_hour           origin dest  tailnum carrier year.y type        
 | 
			
		||||
#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>       
 | 
			
		||||
#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing …
 | 
			
		||||
#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing …
 | 
			
		||||
#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing …
 | 
			
		||||
#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing …
 | 
			
		||||
#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing …
 | 
			
		||||
#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing …
 | 
			
		||||
#> # … with 336,770 more rows, and 6 more variables: manufacturer <chr>,
 | 
			
		||||
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr></pre>
 | 
			
		||||
#>   year.x time_hour           origin dest  tailnum carrier year.y
 | 
			
		||||
#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int>
 | 
			
		||||
#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999
 | 
			
		||||
#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998
 | 
			
		||||
#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990
 | 
			
		||||
#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012
 | 
			
		||||
#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991
 | 
			
		||||
#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012
 | 
			
		||||
#> # … with 336,770 more rows, and 7 more variables: type <chr>,
 | 
			
		||||
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, …</pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
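<p>As a rough illustration of the <code>suffix</code> argument, the sketch below joins two small made-up tables that share a non-key <code>year</code> column. The names <code>flights_toy</code> and <code>planes_toy</code>, their values, and the suffixes themselves are assumptions chosen for the example, not part of nycflights13.</p>

```r
library(dplyr)

# Hypothetical stand-ins for flights2 and planes (values are made up)
flights_toy <- tibble(tailnum = c("N1", "N2"), year = c(2013L, 2013L))
planes_toy  <- tibble(tailnum = c("N1", "N2"), year = c(1999L, 2004L))

# `suffix` takes a character vector of length 2: the first element is
# appended to clashing names from `x`, the second to those from `y`.
joined <- flights_toy |>
  left_join(planes_toy, join_by(tailnum), suffix = c("_flight", "_built"))

names(joined)
```

<p>Descriptive suffixes like these can make the origin of each column clearer than the default <code>.x</code> and <code>.y</code>.</p>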
 | 
			
		||||
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
 | 
			
		||||
@@ -344,30 +340,30 @@ Specifying join keys</h2>
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">flights2 |> 
 | 
			
		||||
  left_join(airports, join_by(dest == faa))
 | 
			
		||||
#> # A tibble: 336,776 × 13
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier name       lat   lon
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl>
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George …  30.0 -95.3
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George …  30.0 -95.3
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami I…  25.8 -80.3
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>      NA    NA  
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfi…  33.6 -84.4
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago…  42.0 -87.9
 | 
			
		||||
#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>,
 | 
			
		||||
#> #   dst <chr>, tzone <chr>
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier name                
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bush Interco…
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bush Interco…
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl          
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>                
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfield Jackson …
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago Ohare Intl  
 | 
			
		||||
#> # … with 336,770 more rows, and 6 more variables: lat <dbl>, lon <dbl>,
 | 
			
		||||
#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
 | 
			
		||||
 | 
			
		||||
flights2 |> 
 | 
			
		||||
  left_join(airports, join_by(origin == faa))
 | 
			
		||||
#> # A tibble: 336,776 × 13
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier name       lat   lon
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl>
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark …  40.7 -74.2
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guar…  40.8 -73.9
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F …  40.6 -73.8
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F …  40.6 -73.8
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guar…  40.8 -73.9
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark …  40.7 -74.2
 | 
			
		||||
#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>,
 | 
			
		||||
#> #   dst <chr>, tzone <chr></pre>
 | 
			
		||||
#>    year time_hour           origin dest  tailnum carrier name               
 | 
			
		||||
#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>              
 | 
			
		||||
#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark Liberty Intl
 | 
			
		||||
#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guardia         
 | 
			
		||||
#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F Kennedy Intl
 | 
			
		||||
#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F Kennedy Intl
 | 
			
		||||
#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guardia         
 | 
			
		||||
#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark Liberty Intl
 | 
			
		||||
#> # … with 336,770 more rows, and 6 more variables: lat <dbl>, lon <dbl>,
 | 
			
		||||
#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr></pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
 | 
			
		||||
<ul><li>
 | 
			
		||||
@@ -396,17 +392,17 @@ Filtering joins</h2>
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">airports |> 
 | 
			
		||||
  semi_join(flights2, join_by(faa == dest))
 | 
			
		||||
#> # A tibble: 101 × 8
 | 
			
		||||
#>   faa   name                               lat    lon   alt    tz dst   tzone
 | 
			
		||||
#>   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 | 
			
		||||
#> 1 ABQ   Albuquerque International Sunpo…  35.0 -107.   5355    -7 A     Amer…
 | 
			
		||||
#> 2 ACK   Nantucket Mem                     41.3  -70.1    48    -5 A     Amer…
 | 
			
		||||
#> 3 ALB   Albany Intl                       42.7  -73.8   285    -5 A     Amer…
 | 
			
		||||
#> 4 ANC   Ted Stevens Anchorage Intl        61.2 -150.    152    -9 A     Amer…
 | 
			
		||||
#> 5 ATL   Hartsfield Jackson Atlanta Intl   33.6  -84.4  1026    -5 A     Amer…
 | 
			
		||||
#> 6 AUS   Austin Bergstrom Intl             30.2  -97.7   542    -6 A     Amer…
 | 
			
		||||
#>   faa   name                     lat    lon   alt    tz dst   tzone          
 | 
			
		||||
#>   <chr> <chr>                  <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
 | 
			
		||||
#> 1 ABQ   Albuquerque Internati…  35.0 -107.   5355    -7 A     America/Denver 
 | 
			
		||||
#> 2 ACK   Nantucket Mem           41.3  -70.1    48    -5 A     America/New_Yo…
 | 
			
		||||
#> 3 ALB   Albany Intl             42.7  -73.8   285    -5 A     America/New_Yo…
 | 
			
		||||
#> 4 ANC   Ted Stevens Anchorage…  61.2 -150.    152    -9 A     America/Anchor…
 | 
			
		||||
#> 5 ATL   Hartsfield Jackson At…  33.6  -84.4  1026    -5 A     America/New_Yo…
 | 
			
		||||
#> 6 AUS   Austin Bergstrom Intl   30.2  -97.7   542    -6 A     America/Chicago
 | 
			
		||||
#> # … with 95 more rows</pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don’t have a match in <code>y</code>. They’re useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don’t show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that as missing from <code>airports</code> by looking for flights that don’t have a matching destination airport:</p>
 | 
			
		||||
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don’t have a match in <code>y</code>. They’re useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don’t show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don’t have a matching destination airport:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">flights2 |> 
 | 
			
		||||
  anti_join(airports, join_by(dest == faa)) |> 
 | 
			
		||||
@@ -437,7 +433,7 @@ Filtering joins</h2>
 | 
			
		||||
</div>
 | 
			
		||||
</section>
 | 
			
		||||
 | 
			
		||||
<section id="exercises-1" data-type="sect2">
 | 
			
		||||
<section id="joins-exercises-1" data-type="sect2">
 | 
			
		||||
<h2>
 | 
			
		||||
Exercises</h2>
 | 
			
		||||
<ol type="1"><li><p>Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the <code>weather</code> data. Can you see any patterns?</p></li>
 | 
			
		||||
@@ -655,15 +651,15 @@ Allow multiple rows</h2>
 | 
			
		||||
 | 
			
		||||
plane_flights
 | 
			
		||||
#> # A tibble: 284,170 × 9
 | 
			
		||||
#>   tailnum type   engines seats  year time_hour           origin dest  carrier
 | 
			
		||||
#>   <chr>   <chr>    <int> <int> <int> <dttm>              <chr>  <chr> <chr>  
 | 
			
		||||
#> 1 N10156  Fixed…       2    55  2013 2013-01-10 06:00:00 EWR    PIT   EV     
 | 
			
		||||
#> 2 N10156  Fixed…       2    55  2013 2013-01-10 10:00:00 EWR    CHS   EV     
 | 
			
		||||
#> 3 N10156  Fixed…       2    55  2013 2013-01-10 15:00:00 EWR    MSP   EV     
 | 
			
		||||
#> 4 N10156  Fixed…       2    55  2013 2013-01-11 06:00:00 EWR    CMH   EV     
 | 
			
		||||
#> 5 N10156  Fixed…       2    55  2013 2013-01-11 11:00:00 EWR    MCI   EV     
 | 
			
		||||
#> 6 N10156  Fixed…       2    55  2013 2013-01-11 18:00:00 EWR    PWM   EV     
 | 
			
		||||
#> # … with 284,164 more rows</pre>
 | 
			
		||||
#>   tailnum type                 engines seats  year time_hour           origin
 | 
			
		||||
#>   <chr>   <chr>                  <int> <int> <int> <dttm>              <chr> 
 | 
			
		||||
#> 1 N10156  Fixed wing multi en…       2    55  2013 2013-01-10 06:00:00 EWR   
 | 
			
		||||
#> 2 N10156  Fixed wing multi en…       2    55  2013 2013-01-10 10:00:00 EWR   
 | 
			
		||||
#> 3 N10156  Fixed wing multi en…       2    55  2013 2013-01-10 15:00:00 EWR   
 | 
			
		||||
#> 4 N10156  Fixed wing multi en…       2    55  2013 2013-01-11 06:00:00 EWR   
 | 
			
		||||
#> 5 N10156  Fixed wing multi en…       2    55  2013 2013-01-11 11:00:00 EWR   
 | 
			
		||||
#> 6 N10156  Fixed wing multi en…       2    55  2013 2013-01-11 18:00:00 EWR   
 | 
			
		||||
#> # … with 284,164 more rows, and 2 more variables: dest <chr>, carrier <chr></pre>
 | 
			
		||||
</div>
 | 
			
		||||
</section>
 | 
			
		||||
 | 
			
		||||
@@ -814,19 +810,19 @@ Rolling joins</h2>
 | 
			
		||||
<p>Now imagine that you have a table of employee birthdays:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">employees <- tibble(
 | 
			
		||||
  name = wakefield::name(100),
 | 
			
		||||
  name = sample(babynames::babynames$name, 100),
 | 
			
		||||
  birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
 | 
			
		||||
)
 | 
			
		||||
employees
 | 
			
		||||
#> # A tibble: 100 × 2
 | 
			
		||||
#>   name       birthday  
 | 
			
		||||
#>   <variable> <date>    
 | 
			
		||||
#> 1 Lindzy     2022-08-11
 | 
			
		||||
#> 2 Santania   2022-03-01
 | 
			
		||||
#> 3 Gardell    2022-03-04
 | 
			
		||||
#> 4 Cyrille    2022-11-15
 | 
			
		||||
#> 5 Kynli      2022-07-09
 | 
			
		||||
#> 6 Sever      2022-02-03
 | 
			
		||||
#>   name    birthday  
 | 
			
		||||
#>   <chr>   <date>    
 | 
			
		||||
#> 1 Case    2022-09-13
 | 
			
		||||
#> 2 Shonnie 2022-03-30
 | 
			
		||||
#> 3 Burnard 2022-01-10
 | 
			
		||||
#> 4 Omer    2022-11-25
 | 
			
		||||
#> 5 Hillel  2022-07-30
 | 
			
		||||
#> 6 Curlie  2022-12-11
 | 
			
		||||
#> # … with 94 more rows</pre>
 | 
			
		||||
</div>
 | 
			
		||||
<p>And for each employee we want to find the last party date that comes before (or on) their birthday. We can express that with a rolling join:</p>
 | 
			
		||||
@@ -834,27 +830,22 @@ employees
<pre data-type="programlisting" data-code-language="r">employees |> 
  left_join(parties, join_by(closest(birthday >= party)))
#> # A tibble: 100 × 4
#>   name       birthday       q party     
#>   <variable> <date>     <int> <date>    
#> 1 Lindzy     2022-08-11     3 2022-07-11
#> 2 Santania   2022-03-01     1 2022-01-10
#> 3 Gardell    2022-03-04     1 2022-01-10
#> 4 Cyrille    2022-11-15     4 2022-10-03
#> 5 Kynli      2022-07-09     2 2022-04-04
#> 6 Sever      2022-02-03     1 2022-01-10
#>   name    birthday       q party     
#>   <chr>   <date>     <int> <date>    
#> 1 Case    2022-09-13     3 2022-07-11
#> 2 Shonnie 2022-03-30     1 2022-01-10
#> 3 Burnard 2022-01-10     1 2022-01-10
#> 4 Omer    2022-11-25     4 2022-10-03
#> 5 Hillel  2022-07-30     3 2022-07-11
#> 6 Curlie  2022-12-11     4 2022-10-03
#> # … with 94 more rows</pre>
</div>
<p>There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |> 
  anti_join(parties, join_by(closest(birthday >= party)))
#> # A tibble: 4 × 2
#>   name       birthday  
#>   <variable> <date>    
#> 1 Janeida    2022-01-04
#> 2 Aires      2022-01-07
#> 3 Mikalya    2022-01-06
#> 4 Carlynn    2022-01-08</pre>
#> # A tibble: 0 × 2
#> # … with 2 variables: name <chr>, birthday <date></pre>
</div>
<p>To resolve that issue we’ll need to tackle the problem a different way, with overlap joins.</p>
</section>
@@ -910,19 +901,19 @@ parties
<pre data-type="programlisting" data-code-language="r">employees |> 
  inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
#> # A tibble: 100 × 6
#>   name       birthday       q party      start      end       
#>   <variable> <date>     <int> <date>     <date>     <date>    
#> 1 Lindzy     2022-08-11     3 2022-07-11 2022-07-11 2022-10-02
#> 2 Santania   2022-03-01     1 2022-01-10 2022-01-01 2022-04-03
#> 3 Gardell    2022-03-04     1 2022-01-10 2022-01-01 2022-04-03
#> 4 Cyrille    2022-11-15     4 2022-10-03 2022-10-03 2022-12-31
#> 5 Kynli      2022-07-09     2 2022-04-04 2022-04-04 2022-07-10
#> 6 Sever      2022-02-03     1 2022-01-10 2022-01-01 2022-04-03
#>   name    birthday       q party      start      end       
#>   <chr>   <date>     <int> <date>     <date>     <date>    
#> 1 Case    2022-09-13     3 2022-07-11 2022-07-11 2022-10-02
#> 2 Shonnie 2022-03-30     1 2022-01-10 2022-01-01 2022-04-03
#> 3 Burnard 2022-01-10     1 2022-01-10 2022-01-01 2022-04-03
#> 4 Omer    2022-11-25     4 2022-10-03 2022-10-03 2022-12-31
#> 5 Hillel  2022-07-30     3 2022-07-11 2022-07-11 2022-10-02
#> 6 Curlie  2022-12-11     4 2022-10-03 2022-10-03 2022-12-31
#> # … with 94 more rows</pre>
</div>
</section>

<section id="exercises-2" data-type="sect2">
<section id="joins-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -951,7 +942,7 @@ x |> full_join(y, by = "key", keep = TRUE)
</ol></section>
</section>

<section id="summary" data-type="sect1">
<section id="joins-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.</p>
 