Don't transform non-crossref links

Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions

@@ -42,7 +42,7 @@ Last observation carried forward</h2>
"Katherine Burke", 1, 4
)</pre>
</div>
<p>You can fill in these missing values with <code><a href="#chp-https://tidyr.tidyverse.org/reference/fill" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/fill</a></code>. It works like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, taking a set of columns:</p>
<p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment |&gt;
fill(everything())
@@ -60,14 +60,14 @@ Last observation carried forward</h2>
<section id="fixed-values" data-type="sect2">
<h2>
Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> to replace them:</p>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre>
</div>
<p>Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>. If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/na_if" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/na_if</a></code>:</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99)
@@ -147,7 +147,7 @@ Pivoting</h2>
<section id="complete" data-type="sect2">
<h2>
Complete</h2>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year, qtr)
@@ -162,7 +162,7 @@ Complete</h2>
#&gt; 6 2021 2 0.92
#&gt; # … with 2 more rows</pre>
</div>
<p>Typically, you'll call <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<p>Typically, you'll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year = 2019:2021, qtr)
@@ -178,14 +178,14 @@ Complete</h2>
#&gt; # … with 6 more rows</pre>
</div>
<p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p>
<p>In some cases, the complete set of observations can't be generated by a simple combination of variables. In that case, you can do manually what <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>.</p>
<p>In some cases, the complete set of observations can't be generated by a simple combination of variables. In that case, you can do manually what <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">dplyr::full_join()</a></code>.</p>
</section>
<section id="joins" data-type="sect2">
<h2>
Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You'll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don't have a match in <code>y</code>. For example, we can use two <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>s to reveal that we're missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don't have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we're missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
@@ -236,7 +236,7 @@ Factors and empty groups</h1>
age = c(34L, 88L, 75L, 47L, 56L),
)</pre>
</div>
<p>And we want to count the number of smokers with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>:</p>
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2
@@ -244,7 +244,7 @@ Factors and empty groups</h1>
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 no 5</pre>
</div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2
@@ -273,7 +273,7 @@ ggplot(health, aes(smoker)) +
</div>
</div>
</div>
<p>The same problem comes up more generally with <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker, .drop = FALSE) |&gt;
@@ -309,8 +309,8 @@ x2 &lt;- numeric()
length(x2)
#&gt; [1] 0</pre>
</div>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code>.</p>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker) |&gt;