Fix code language

Hadley Wickham
2022-11-18 11:26:25 -06:00
parent 69b4597f3b
commit 868a35ca71
29 changed files with 912 additions and 907 deletions

@@ -11,7 +11,7 @@ Introduction</h1>
Prerequisites</h2>
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
@@ -26,7 +26,7 @@ Explicit missing values</h1>
Last observation carried forward</h2>
<p>A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment &lt;- tribble(
<pre data-type="programlisting" data-code-language="r">treatment &lt;- tribble(
~person, ~treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
@@ -36,7 +36,7 @@ Last observation carried forward</h2>
</div>
<p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment |&gt;
<pre data-type="programlisting" data-code-language="r">treatment |&gt;
fill(everything())
#&gt; # A tibble: 4 × 3
#&gt; person treatment response
@@ -54,14 +54,14 @@ Last observation carried forward</h2>
Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, NA)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre>
</div>
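<p>As a small sketch beyond the example above, <code>coalesce()</code> also accepts several vectors and returns, position by position, the first value that is not missing. The vectors <code>x</code> and <code>y</code> here are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- c(1, NA, 3, NA)
y &lt;- c(NA, 20, 30, NA)

# Take x where present, then y, then fall back to 0
filled &lt;- coalesce(x, y, 0)
filled
#&gt; [1]  1 20  3  0</pre>
</div>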
<p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, -99)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99)
#&gt; [1] 1 4 5 7 NA</pre>
</div>
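<p>The following is a minimal sketch of both routes, under some assumptions: the CSV text and the column names <code>id</code> and <code>score</code> are invented for illustration, and passing literal data with <code>I()</code> assumes readr 2.0 or later. The second half shows one possible way to handle several sentinel values at once with <code>if_else()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical CSV text standing in for a real file
csv &lt;- I("id,score\n1,10\n2,-99\n3,7\n")

# Treat "-99" as missing while reading
scores &lt;- read_csv(csv, na = c("", "NA", "-99"), show_col_types = FALSE)
scores$score
#&gt; [1] 10 NA  7

# Several sentinel values at once, after the data has been read
x &lt;- c(1, 4, -999, 7, -99)
cleaned &lt;- if_else(x %in% c(-99, -999), NA_real_, x)
cleaned
#&gt; [1]  1  4 NA  7 NA</pre>
</div>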
@@ -72,7 +72,7 @@ na_if(x, -99)
NaN</h2>
<p>Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a <code>NaN</code> (pronounced “nan”), or <strong>n</strong>ot <strong>a</strong> <strong>n</strong>umber. It’s not that important to know about because it generally behaves just like <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(NA, NaN)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN)
x * 10
#&gt; [1] NA NaN
x == 1
@@ -83,7 +83,7 @@ is.na(x)
<p>In the rare case you need to distinguish an <code>NA</code> from a <code>NaN</code>, you can use <code>is.nan(x)</code>.</p>
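<p>As a brief illustration with a made-up vector, <code>is.na()</code> is true for both kinds of missing value, while <code>is.nan()</code> is true only for <code>NaN</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN, 1)

# Both NA and NaN count as missing
is.na(x)
#&gt; [1]  TRUE  TRUE FALSE

# Only NaN is "not a number"
is.nan(x)
#&gt; [1] FALSE  TRUE FALSE</pre>
</div>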
<p>You’ll generally encounter a <code>NaN</code> when you perform a mathematical operation that has an indeterminate result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">0 / 0
<pre data-type="programlisting" data-code-language="r">0 / 0
#&gt; [1] NaN
0 * Inf
#&gt; [1] NaN
@@ -101,7 +101,7 @@ sqrt(-1)
Implicit missing values</h1>
<p>So far we’ve talked about missing values that are <strong>explicitly</strong> missing, i.e. you can see an <code>NA</code> in your data. But missing values can also be <strong>implicitly</strong> missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">stocks &lt;- tibble(
year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
@@ -122,7 +122,7 @@ Implicit missing values</h1>
Pivoting</h2>
<p>You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot <code>stocks</code> to put <code>qtr</code> in the columns, both missing values become explicit:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
pivot_wider(
names_from = qtr,
values_from = price
@@ -141,7 +141,7 @@ Pivoting</h2>
Complete</h2>
<p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
complete(year, qtr)
#&gt; # A tibble: 8 × 3
#&gt; year qtr price
@@ -156,7 +156,7 @@ Complete</h2>
</div>
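<p>One way to see exactly which rows <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> added is to compare its output with the original data. This is only a sketch: it rebuilds the <code>stocks</code> tibble from earlier in the chapter, and the names <code>completed</code> and <code>added</code> are introduced here for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Make every year/quarter combination explicit
completed &lt;- stocks |&gt;
  complete(year, qtr)

# Keep only the rows that were not in the original data:
# here, the single implicitly missing observation (2021, quarter 1)
added &lt;- completed |&gt;
  anti_join(stocks, by = c("year", "qtr"))
added</pre>
</div>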
<p>Typically, you’ll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
complete(year = 2019:2021, qtr)
#&gt; # A tibble: 12 × 3
#&gt; year qtr price
@@ -179,7 +179,7 @@ Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
flights |&gt;
distinct(faa = dest) |&gt;
@@ -222,7 +222,7 @@ Exercises</h2>
Factors and empty groups</h1>
<p>A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">health &lt;- tibble(
name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
age = c(34L, 88L, 75L, 47L, 56L),
@@ -230,7 +230,7 @@ Factors and empty groups</h1>
</div>
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker)
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
@@ -238,7 +238,7 @@ Factors and empty groups</h1>
</div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker, .drop = FALSE)
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
@@ -247,7 +247,7 @@ Factors and empty groups</h1>
</div>
<p>The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(health, aes(smoker)) +
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(smoker)) +
geom_bar() +
scale_x_discrete()
@@ -267,7 +267,7 @@ ggplot(health, aes(smoker)) +
</div>
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
<pre data-type="programlisting" data-code-language="r">health |&gt;
group_by(smoker, .drop = FALSE) |&gt;
summarise(
n = n(),
@@ -291,7 +291,7 @@ ggplot(health, aes(smoker)) +
</div>
<p>We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A vector containing two missing values
<pre data-type="programlisting" data-code-language="r"># A vector containing two missing values
x1 &lt;- c(NA, NA)
length(x1)
#&gt; [1] 2
@@ -304,7 +304,7 @@ length(x2)
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code>, which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
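<p>A short sketch of that behavior, using an empty vector <code>x</code> and a made-up vector of new data <code>y</code>. Base R warns when <code>min()</code> is given no non-missing values, so the warning is silenced here:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- numeric()   # an empty vector
y &lt;- c(3, 8)     # hypothetical new data

# sum(x) / length(x) is 0 / 0
mean(x)
#&gt; [1] NaN

# The minimum of an empty vector is Inf (with a warning)
empty_min &lt;- suppressWarnings(min(x))
empty_min
#&gt; [1] Inf

# Combining with new data gives the minimum of the new data
min(empty_min, min(y)) == min(c(x, y))
#&gt; [1] TRUE</pre>
</div>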
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
<pre data-type="programlisting" data-code-language="r">health |&gt;
group_by(smoker) |&gt;
summarise(
n = n(),