Re-render book for O'Reilly
@@ -65,15 +65,15 @@ Primary and foreign keys</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">planes
 #> # A tibble: 3,322 × 9
-#>   tailnum  year type                 manuf…¹ model engines seats speed engine
-#>   <chr>   <int> <chr>                <chr>   <chr>   <int> <int> <int> <chr> 
-#> 1 N10156   2004 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo…
-#> 2 N102UW   1998 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
-#> 3 N103US   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
-#> 4 N104UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
-#> 5 N10575   2002 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo…
-#> 6 N105UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
-#> # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
+#>   tailnum  year type            manufacturer model engines seats speed engine
+#>   <chr>   <int> <chr>           <chr>        <chr>   <int> <int> <int> <chr> 
+#> 1 N10156   2004 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
+#> 2 N102UW   1998 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
+#> 3 N103US   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
+#> 4 N104UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
+#> 5 N10575   2002 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
+#> 6 N105UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
+#> # … with 3,316 more rows</pre>
 </div>
 </li>
 <li>
@@ -81,17 +81,16 @@ Primary and foreign keys</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">weather
 #> # A tibble: 26,115 × 15
-#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_sp…¹ wind_…²
-#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>     <dbl>   <dbl>
-#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270     10.4       NA
-#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250      8.06      NA
-#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240     11.5       NA
-#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250     12.7       NA
-#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260     12.7       NA
-#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240     11.5       NA
-#> # … with 26,109 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
-#> #   visib <dbl>, time_hour <dttm>, and abbreviated variable names
-#> #   ¹wind_speed, ²wind_gust</pre>
+#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
+#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
+#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
+#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
+#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
+#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
+#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
+#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
+#> # … with 26,109 more rows, and 5 more variables: wind_gust <dbl>,
+#> #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm></pre>
 </div>
 </li>
 </ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@@ -102,7 +101,7 @@ Primary and foreign keys</h2>
 <li>
 <code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
 <li>
-<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code> .</li>
+<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
 <li>
 <code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
 </ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
@@ -110,7 +109,7 @@ Primary and foreign keys</h2>
 <div class="cell-output-display">
 
 <figure id="fig-flights-relationships"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa connected to the flights$origin and flights$dest. planes$tailnum is connected to the flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
-<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
+<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
 </figure>
 </div>
 </div>
@@ -182,19 +181,18 @@ Surrogate keys</h2>
   mutate(id = row_number(), .before = 1)
 flights2
 #> # A tibble: 336,776 × 20
-#>      id  year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
-#>   <int> <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl>
-#> 1     1  2013     1     1      517        515       2     830     819      11
-#> 2     2  2013     1     1      533        529       4     850     830      20
-#> 3     3  2013     1     1      542        540       2     923     850      33
-#> 4     4  2013     1     1      544        545      -1    1004    1022     -18
-#> 5     5  2013     1     1      554        600      -6     812     837     -25
-#> 6     6  2013     1     1      554        558      -4     740     728      12
-#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
-#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
-#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
-#> #   names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
-#> #   ⁵arr_delay</pre>
+#>      id  year month   day dep_time sched_dep_time dep_delay arr_time
+#>   <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
+#> 1     1  2013     1     1      517            515         2      830
+#> 2     2  2013     1     1      533            529         4      850
+#> 3     3  2013     1     1      542            540         2      923
+#> 4     4  2013     1     1      544            545        -1     1004
+#> 5     5  2013     1     1      554            600        -6      812
+#> 6     6  2013     1     1      554            558        -4      740
+#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
+#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm></pre>
 </div>
 <p>Surrogate keys can be particular useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.</p>
 </section>
@@ -312,16 +310,16 @@ Specifying join keys</h2>
   left_join(planes)
 #> Joining with `by = join_by(year, tailnum)`
 #> # A tibble: 336,776 × 13
-#>    year time_hour           origin dest  tailnum carrier type  manufa…¹ model
-#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>    <chr>
-#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>     <NA> 
-#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>     <NA> 
-#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>     <NA> 
-#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>     <NA> 
-#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>     <NA> 
-#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>     <NA> 
-#> # … with 336,770 more rows, 4 more variables: engines <int>, seats <int>,
-#> #   speed <int>, engine <chr>, and abbreviated variable name ¹manufacturer</pre>
+#>    year time_hour           origin dest  tailnum carrier type  manufacturer
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
+#> # … with 336,770 more rows, and 5 more variables: model <chr>,
+#> #   engines <int>, seats <int>, speed <int>, engine <chr></pre>
 </div>
 <p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
 <div class="cell">
@@ -341,7 +339,7 @@ Specifying join keys</h2>
 </div>
 <p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
 <p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
-<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flight2</code> and <code>airports</code> table: either by <code>dest</code> or <code>origin:</code></p>
+<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flight2</code> and <code>airports</code> table: either by <code>dest</code> or <code>origin</code>:</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">flights2 |> 
   left_join(airports, join_by(dest == faa))
@@ -461,12 +459,12 @@ Exercises</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">airports |>
   semi_join(flights, join_by(faa == dest)) |>
-  ggplot(aes(lon, lat)) +
+  ggplot(aes(x = lon, y = lat)) +
     borders("state") +
     geom_point() +
     coord_quickmap()</pre>
 </div>
-<p>You might want to use the <code>size</code> or <code>colour</code> of the points to display the average delay for each airport.</p>
+<p>You might want to use the <code>size</code> or <code>color</code> of the points to display the average delay for each airport.</p>
 </li>
 <li><p>What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
 </ol></section>
@@ -493,8 +491,8 @@ y <- tribble(
 <div class="cell">
 <div class="cell-output-display">
 
-<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
-<figcaption>Graphical representation of two simple tables. The coloured <code>key</code> columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
+<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
+<figcaption>Graphical representation of two simple tables. The colored <code>key</code> columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
 </figure>
 </div>
 </div>
@@ -710,11 +708,11 @@ Non-equi joins</h1>
 <div class="cell-output-display">
 
 <figure id="fig-inner-both"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join betwen x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one. " width="415"/></p>
-<figcaption>An left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
+<figcaption>A left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
 </figure>
 </div>
 </div>
-<p>When we move away from equi-joins we’ll always show the keys, because the key values will often different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction equi and non-equi joins so will always show both keys when you perform a non-equi join.</p>
+<p>When we move away from equi-joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction equi and non-equi joins so will always show both keys when you perform a non-equi join.</p>
 <div class="cell">
 <div class="cell-output-display">
 
@@ -746,10 +744,10 @@ Cross joins</h2>
 </figure>
 </div>
 </div>
-<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
+<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
-df |> left_join(df, join_by())
+df |> cross_join(df)
 #> # A tibble: 16 × 2
 #>   name.x name.y
 #>   <chr>  <chr> 