Re-render book for O'Reilly
@@ -65,15 +65,15 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes
#> # A tibble: 3,322 × 9
#> tailnum year type manuf…¹ model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
#> 1 N10156 2004 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
#> 2 N102UW 1998 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#> 3 N103US 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#> 4 N104UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#> 5 N10575 2002 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
#> 6 N105UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
#> # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
#> 1 N10156 2004 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#> 2 N102UW 1998 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#> 3 N103US 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#> 4 N104UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#> 5 N10575 2002 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
#> 6 N105UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
#> # … with 3,316 more rows</pre>
</div>
</li>
<li>
@@ -81,17 +81,16 @@ Primary and foreign keys</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">weather
#> # A tibble: 26,115 × 15
#> origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#> # … with 26,109 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
#> # visib <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹wind_speed, ²wind_gust</pre>
#> origin year month day hour temp dewp humid wind_dir wind_speed
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
#> # … with 26,109 more rows, and 5 more variables: wind_gust <dbl>,
#> # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm></pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
@@ -102,7 +101,7 @@ Primary and foreign keys</h2>
<li>
<code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code> .</li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
</ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
@@ -110,7 +109,7 @@ Primary and foreign keys</h2>
<div class="cell-output-display">
<figure id="fig-flights-relationships"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa is connected to flights$origin and flights$dest. planes$tailnum is connected to flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
</figure>
</div>
</div>
@@ -182,19 +181,18 @@ Surrogate keys</h2>
mutate(id = row_number(), .before = 1)
flights2
#> # A tibble: 336,776 × 20
#> id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 2013 1 1 517 515 2 830 819 11
#> 2 2 2013 1 1 533 529 4 850 830 20
#> 3 3 2013 1 1 542 540 2 923 850 33
#> 4 4 2013 1 1 544 545 -1 1004 1022 -18
#> 5 5 2013 1 1 554 600 -6 812 837 -25
#> 6 6 2013 1 1 554 558 -4 740 728 12
#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
#> # names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#> # ⁵arr_delay</pre>
#> id year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 1 2013 1 1 517 515 2 830
#> 2 2 2013 1 1 533 529 4 850
#> 3 3 2013 1 1 542 540 2 923
#> 4 4 2013 1 1 544 545 -1 1004
#> 5 5 2013 1 1 554 600 -6 812
#> 6 6 2013 1 1 554 558 -4 740
#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm></pre>
</div>
<p>Surrogate keys can be particularly useful when communicating with other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.</p>
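<p>A minimal sketch of that kind of lookup, assuming <code>flights2</code> is built from <code>flights</code> with the <code>mutate()</code> call shown above (it is rebuilt here so the example stands alone) and that dplyr and nycflights13 are installed. The name <code>flight_2001</code> is purely illustrative:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Rebuild flights2 with a surrogate key based on row order
flights2 <- flights |>
  mutate(id = row_number(), .before = 1)

# One short number is enough to pin down exactly one flight
flight_2001 <- flights2 |>
  filter(id == 2001)

flight_2001 |>
  select(id, carrier, flight, time_hour)</pre>
</div>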
</section>
@@ -312,16 +310,16 @@ Specifying join keys</h2>
left_join(planes)
#> Joining with `by = join_by(year, tailnum)`
#> # A tibble: 336,776 × 13
#> year time_hour origin dest tailnum carrier type manufa…¹ model
#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA> <NA>
#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA> <NA>
#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA> <NA>
#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA> <NA>
#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA> <NA>
#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA> <NA>
#> # … with 336,770 more rows, 4 more variables: engines <int>, seats <int>,
#> # speed <int>, engine <chr>, and abbreviated variable name ¹manufacturer</pre>
#> year time_hour origin dest tailnum carrier type manufacturer
#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA>
#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA>
#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA>
#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA>
#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA>
#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA>
#> # … with 336,770 more rows, and 5 more variables: model <chr>,
#> # engines <int>, seats <int>, speed <int>, engine <chr></pre>
</div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column, but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code>, so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
<div class="cell">
@@ -341,7 +339,7 @@ Specifying join keys</h2>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
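<p>As a minimal sketch of overriding those suffixes: the <code>_flight</code> and <code>_plane</code> labels and the <code>joined</code> name are illustrative choices, and a narrower <code>flights2</code> is rebuilt here (an assumption about how it was defined) so the example stands alone. It assumes dplyr 1.1.0 or later and nycflights13 are installed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# A narrower copy of flights, rebuilt so the example is self-contained
flights2 <- flights |>
  select(year, time_hour, origin, dest, tailnum, carrier)

# Join on tailnum only and label the two clashing year columns explicitly;
# the suffix values are hypothetical, chosen for illustration
joined <- flights2 |>
  left_join(planes, join_by(tailnum), suffix = c("_flight", "_plane"))

joined |>
  select(tailnum, year_flight, year_plane)</pre>
</div>
<p>Descriptive suffixes like these can make it easier to remember which table each <code>year</code> came from than the default <code>.x</code> and <code>.y</code>.</p>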
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flight2</code> and <code>airports</code> table: either by <code>dest</code> or <code>origin:</code></p>
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |>
left_join(airports, join_by(dest == faa))
@@ -461,12 +459,12 @@ Exercises</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |>
semi_join(flights, join_by(faa == dest)) |>
ggplot(aes(lon, lat)) +
ggplot(aes(x = lon, y = lat)) +
borders("state") +
geom_point() +
coord_quickmap()</pre>
</div>
<p>You might want to use the <code>size</code> or <code>colour</code> of the points to display the average delay for each airport.</p>
<p>You might want to use the <code>size</code> or <code>color</code> of the points to display the average delay for each airport.</p>
</li>
<li><p>What happened on June 13, 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
</ol></section>
@@ -493,8 +491,8 @@ y <- tribble(
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption>Graphical representation of two simple tables. The coloured <code>key</code> columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption>Graphical representation of two simple tables. The colored <code>key</code> columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
</figure>
</div>
</div>
@@ -710,11 +708,11 @@ Non-equi joins</h1>
<div class="cell-output-display">
<figure id="fig-inner-both"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one." width="415"/></p>
<figcaption>An left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
<figcaption>A left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
</figure>
</div>
</div>
<p>When we move away from equi-joins we’ll always show the keys, because the key values will often different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction equi and non-equi joins so will always show both keys when you perform a non-equi join.</p>
<p>When we move away from equi-joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
<div class="cell">
<div class="cell-output-display">
@@ -746,10 +744,10 @@ Cross joins</h2>
</figure>
</div>
</div>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |> left_join(df, join_by())
df |> cross_join(df)
#> # A tibble: 16 × 2
#> name.x name.y
#> <chr> <chr>