Don't transform non-crossref links

This commit is contained in:
Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions

View File

@@ -127,7 +127,7 @@ Primary and foreign keys</h2>
<section id="checking-primary-keys" data-type="sect2">
<h2>
Checking primary keys</h2>
<p>Now that that weve identified the primary keys in each table, its good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
<p>Now that that weve identified the primary keys in each table, its good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt;
count(tailnum) |&gt;
@@ -219,13 +219,13 @@ Exercises</h2>
<section id="sec-mutating-joins" data-type="sect1">
<h1>
Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, youll learn how to use one mutating join, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, and two filtering joins, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>. In the next section, youll learn exactly how these functions work, and about the remaining <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>.</p>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, youll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, youll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2">
<h2>
Mutating joins</h2>
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, the join functions add variables to the right, so if your dataset has many variables, you wont see the new ones. For these examples, well make it easier to see whats going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code> to avoid this problem.</span>:</p>
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the join functions add variables to the right, so if your dataset has many variables, you wont see the new ones. For these examples, well make it easier to see whats going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> to avoid this problem.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 &lt;- flights |&gt;
select(year, time_hour, origin, dest, tailnum, carrier)
@@ -241,7 +241,7 @@ flights2
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are four types of mutating join, but theres one that youll use almost all of the time: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>. Its special because the output will always have the same rows as <code>x</code><span data-type="footnote">Thats not 100% true, but youll get a warning whenever it isnt.</span>. The primary use of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> is to add in additional metadata. For example, we can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> to add the full airline name to the <code>flights2</code> data:</p>
<p>There are four types of mutating join, but theres one that youll use almost all of the time: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>. Its special because the output will always have the same rows as <code>x</code><span data-type="footnote">Thats not 100% true, but youll get a warning whenever it isnt.</span>. The primary use of <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> is to add in additional metadata. For example, we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> to add the full airline name to the <code>flights2</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(airlines)
@@ -289,7 +289,7 @@ flights2
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191
#&gt; # … with 336,770 more rows</pre>
</div>
<p>When <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, theres no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, theres no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
filter(tailnum == "N3ALAA") |&gt;
@@ -312,7 +312,7 @@ flights2
<section id="specifying-join-keys" data-type="sect2">
<h2>
Specifying join keys</h2>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> will use all variables that appear in both data frames as the join key, the so called <strong>natural</strong> join. This is a useful heuristic, but it doesnt always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> will use all variables that appear in both data frames as the join key, the so called <strong>natural</strong> join. This is a useful heuristic, but it doesnt always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes)
@@ -329,7 +329,7 @@ Specifying join keys</h2>
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
</div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code>:</p>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, join_by(tailnum))
@@ -383,7 +383,7 @@ flights2 |&gt;
<code>by = "x"</code> corresponds to <code>join_by(x)</code>.</li>
<li>
<code>by = c("a" = "x")</code> corresponds to <code>join_by(a == x)</code>.</li>
</ul><p>Now that it exists, we prefer <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code> since it provides a clearer and more flexible specification.</p>
</ul><p>Now that it exists, we prefer <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code> since it provides a clearer and more flexible specification.</p>
</section>
<section id="filtering-joins" data-type="sect2">
@@ -570,7 +570,7 @@ y &lt;- tribble(
<section id="row-matching" data-type="sect2">
<h2>
Row matching</h2>
<p>So far weve explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand whats going lets first narrow our focus to the <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p>
<p>So far weve explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand whats going lets first narrow our focus to the <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p>
<div class="cell">
<div class="cell-output-display">
@@ -606,7 +606,7 @@ df1 |&gt;
#&gt; 2 2 x2 y2
#&gt; 3 2 x2 y3</pre>
</div>
<p>This is one reason we like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> — if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p>
<p>This is one reason we like <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> — if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p>
<p>You can gain further control over row matching with two arguments:</p>
<ul><li>
<code>unmatched</code> controls what happens when a row in <code>x</code> fails to match any rows in <code>y</code>. It defaults to <code>"drop"</code> which will silently drop any unmatched rows.</li>
@@ -635,7 +635,7 @@ df1 |&gt;
#&gt; ! Each row of `x` must have a match in `y`.
#&gt; Row 1 of `x` does not have a match.</pre>
</div>
<p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p>
<p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p>
</section>
<section id="allow-multiple-rows" data-type="sect2">
@@ -703,7 +703,7 @@ Filtering joins</h2>
<h1>
Non-equi joins</h1>
<p>So far youve only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now were going to relax that restriction and discuss other ways of determining if a pair of rows match.</p>
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> keys and <code>y</code> are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> keys and <code>y</code> are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x |&gt; left_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 3 × 4
@@ -956,7 +956,7 @@ x |&gt; full_join(y, by = "key", keep = TRUE)
#&gt; 4 NA &lt;NA&gt; 4 y3</pre>
</div>
</li>
<li><p>When finding if any party period overlapped with another party period we used <code>q &lt; q</code> in the <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code>? Why? What happens if you remove this inequality?</p></li>
<li><p>When finding if any party period overlapped with another party period we used <code>q &lt; q</code> in the <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>? Why? What happens if you remove this inequality?</p></li>
</ol></section>
</section>