More minor page count tweaks & fixes

And re-convert with latest htmlbook
This commit is contained in:
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

View File

@@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-logicals">
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="logicals-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, youll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. Its relatively rare to find logical vectors in your raw data, but youll create and manipulate them in the course of almost every analysis.</p>
<p>Well begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then youll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. Well finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
<section id="prerequisites" data-type="sect2">
<section id="logicals-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Most of the functions youll learn about in this chapter are provided by base R, so we dont need the tidyverse, but well still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. Well also continue to draw examples from the nycflights13 dataset.</p>
@@ -56,9 +56,7 @@ Comparisons</h1>
#&gt; 5 2013 1 1 606 610 -4 837 845
#&gt; 6 2013 1 1 607 607 0 858 915
#&gt; # … with 172,280 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
<p>Its useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
@@ -151,17 +149,14 @@ x == y
filter(dep_time == NA)
#&gt; # A tibble: 0 × 19
#&gt; # … with 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;,
#&gt; # sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;,</pre>
</div>
<p>Instead well need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
</section>
<section id="is.na" data-type="sect2">
<h2>
<code>is.na()</code>
is.na()
</h2>
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
<div class="cell">
@@ -186,9 +181,7 @@ is.na(c("a", NA, "b"))
#&gt; 5 2013 1 2 NA 1540 NA NA 1747
#&gt; 6 2013 1 2 NA 1620 NA NA 1746
#&gt; # … with 8,249 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
<div class="cell">
@@ -205,9 +198,7 @@ is.na(c("a", NA, "b"))
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
flights |&gt;
filter(month == 1, day == 1) |&gt;
@@ -222,14 +213,12 @@ flights |&gt;
#&gt; 5 2013 1 1 517 515 2 830 819
#&gt; 6 2013 1 1 533 529 4 850 830
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
<p>Well come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="logicals-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
@@ -295,9 +284,7 @@ Order of operations</h2>
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
<p>This code doesnt error but it also doesnt seem to have worked. Whats going on? Here, R first evaluates <code>month == 11</code> creating a logical vector, which we call <code>nov</code>. It computes <code>nov | 12</code>. When you use a number with a logical operator it converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code> which will always be <code>TRUE</code>, so every row will be selected:</p>
<div class="cell">
@@ -322,7 +309,7 @@ Order of operations</h2>
<section id="in" data-type="sect2">
<h2>
<code>%in%</code>
%in%
</h2>
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code> .</p>
<div class="cell">
@@ -357,13 +344,11 @@ c(1, 2, NA) %in% NA
#&gt; 5 2013 1 1 NA 1500 NA NA 1825
#&gt; 6 2013 1 1 NA 600 NA NA 901
#&gt; # … with 8,797 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="logicals-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
@@ -496,7 +481,7 @@ Logical subsetting</h2>
<p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
</section>
<section id="exercises-2" data-type="sect2">
<section id="logicals-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
@@ -511,7 +496,7 @@ Conditional transformations</h1>
<section id="if_else" data-type="sect2">
<h2>
<code>if_else()</code>
if_else()
</h2>
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when its <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyrs <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base Rs <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if you variables have incompatible types.</span>. Youll always use the first three argument of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
<p>Lets begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
@@ -547,12 +532,13 @@ if_else(is.na(x1), y1, x1)
<section id="case_when" data-type="sect2">
<h2>
<code>case_when()</code>
case_when()
</h2>
<p>dplyrs <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQLs <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else youll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when its <code>TRUE</code>, <code>output</code> will be used.</p>
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
<pre data-type="programlisting" data-code-language="r">x &lt;- c(-3:3, NA)
case_when(
x == 0 ~ "0",
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
@@ -582,7 +568,7 @@ if_else(is.na(x1), y1, x1)
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
x &gt; 0 ~ "+ve",
x &gt; 3 ~ "big"
x &gt; 2 ~ "big"
)
#&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
</div>
@@ -595,8 +581,8 @@ if_else(is.na(x1), y1, x1)
arr_delay &lt; -30 ~ "very early",
arr_delay &lt; -15 ~ "early",
abs(arr_delay) &lt;= 15 ~ "on time",
arr_delay &gt; 15 ~ "late",
arr_delay &gt; 60 ~ "very late",
arr_delay &lt; 60 ~ "late",
arr_delay &lt; Inf ~ "very late",
),
.keep = "used"
)
@@ -611,6 +597,7 @@ if_else(is.na(x1), y1, x1)
#&gt; 6 12 on time
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Be wary when writing this sort of complex <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> statement; my first two attempts used a mix of <code>&lt;</code> and <code>&gt;</code> and I kept accidentally creating overlapping conditions.</p>
</section>
<section id="compatible-types" data-type="sect2">
@@ -639,7 +626,7 @@ case_when(
</section>
</section>
<section id="summary" data-type="sect1">
<section id="logicals-summary" data-type="sect1">
<h1>
Summary</h1>
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>=&gt;</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>