More minor page count tweaks & fixes

And re-convert with latest htmlbook
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions
--- a/oreilly/logicals.html
+++ b/oreilly/logicals.html
@@ -1,12 +1,12 @@
 <section data-type="chapter" id="chp-logicals">
 <h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1>
-<section id="introduction" data-type="sect1">
+<section id="logicals-introduction" data-type="sect1">
 <h1>
 Introduction</h1>
 <p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
 <p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>

-<section id="prerequisites" data-type="sect2">
+<section id="logicals-prerequisites" data-type="sect2">
 <h2>
 Prerequisites</h2>
 <p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
@@ -56,9 +56,7 @@ Comparisons</h1>
 #&gt; 5  2013     1     1      606            610        -4      837            845
 #&gt; 6  2013     1     1      607            607         0      858            915
 #&gt; # … with 172,280 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
 <div class="cell">
@@ -151,17 +149,14 @@ x == y
  filter(dep_time == NA)
 #&gt; # A tibble: 0 × 19
 #&gt; # … with 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,
-#&gt; #   sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;,
-#&gt; #   sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
-#&gt; #   tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
-#&gt; #   hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
+#&gt; #   sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;, …</pre>
 </div>
 <p>Instead we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
 </section>

 <section id="is.na" data-type="sect2">
 <h2>
-<code>is.na()</code>
+is.na()
 </h2>
 <p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
 <div class="cell">
@@ -186,9 +181,7 @@ is.na(c("a", NA, "b"))
 #&gt; 5  2013     1     2       NA           1540        NA       NA           1747
 #&gt; 6  2013     1     2       NA           1620        NA       NA           1746
 #&gt; # … with 8,249 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
 <div class="cell">
@@ -205,9 +198,7 @@ is.na(c("a", NA, "b"))
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …

 flights |&gt; 
  filter(month == 1, day == 1) |&gt; 
@@ -222,14 +213,12 @@ flights |&gt;
 #&gt; 5  2013     1     1      517            515         2      830            819
 #&gt; 6  2013     1     1      533            529         4      850            830
 #&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
 </section>
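<p>The <code>arrange()</code> behavior described above can be sketched on a toy tibble. The data frame <code>df</code> and its column <code>x</code> are made up for illustration and are not part of nycflights13; this is a minimal sketch, assuming dplyr is installed.</p>

```r
library(dplyr)

# Hypothetical data: one missing value among three numbers.
df <- tibble(x = c(3, NA, 1, 2))

# By default, arrange() sorts ascending and puts NA last.
df |> arrange(x) |> pull(x)
#> [1]  1  2  3 NA

# Sorting by desc(is.na(x)) first moves the rows where x is missing to the top;
# the remaining rows are then sorted by x as usual.
df |> arrange(desc(is.na(x)), x) |> pull(x)
#> [1] NA  1  2  3
```

<p>This works because <code>is.na(x)</code> is itself a logical vector, and <code>FALSE</code> sorts before <code>TRUE</code>, so wrapping it in <code>desc()</code> puts the <code>TRUE</code> rows first.</p>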

-<section id="exercises" data-type="sect2">
+<section id="logicals-exercises" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
@@ -295,9 +284,7 @@ Order of operations</h2>
 #&gt; 5  2013     1     1      554            600        -6      812            837
 #&gt; 6  2013     1     1      554            558        -4      740            728
 #&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 <p>This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, R converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>, so every row will be selected:</p>
 <div class="cell">
@@ -322,7 +309,7 @@ Order of operations</h2>

 <section id="in" data-type="sect2">
 <h2>
-<code>%in%</code>
+%in%
 </h2>
 <p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
 <div class="cell">
@@ -357,13 +344,11 @@ c(1, 2, NA) %in% NA
 #&gt; 5  2013     1     1       NA           1500        NA       NA           1825
 #&gt; 6  2013     1     1       NA            600        NA       NA            901
 #&gt; # … with 8,797 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
-#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
-#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
-#&gt; #   time_hour &lt;dttm&gt;</pre>
+#&gt; #   carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
 </div>
 </section>
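<p>To compare the two approaches side by side, here is a small base R sketch. The vector <code>month</code> below is a hypothetical stand-in for the <code>month</code> column, not data taken from <code>flights</code>.</p>

```r
# Hypothetical month values (not taken from flights).
month <- c(1, 11, 12, 6)

# Parsed as (month == 11) | 12; the number 12 is coerced to TRUE,
# so every element comes out TRUE.
month == 11 | 12
#> [1] TRUE TRUE TRUE TRUE

# Spelling out both comparisons gives the intended result.
month == 11 | month == 12
#> [1] FALSE  TRUE  TRUE FALSE

# %in% expresses the same test more compactly.
month %in% c(11, 12)
#> [1] FALSE  TRUE  TRUE FALSE
```

<p>The <code>%in%</code> form also scales better: adding a third month means adding one value to the vector rather than another <code>|</code> clause.</p>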

-<section id="exercises-1" data-type="sect2">
+<section id="logicals-exercises-1" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
@@ -496,7 +481,7 @@ Logical subsetting</h2>
 <p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
 </section>

-<section id="exercises-2" data-type="sect2">
+<section id="logicals-exercises-2" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
@@ -511,7 +496,7 @@ Conditional transformations</h1>

 <section id="if_else" data-type="sect2">
 <h2>
-<code>if_else()</code>
+if_else()
 </h2>
 <p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
 <p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
@@ -547,12 +532,13 @@ if_else(is.na(x1), y1, x1)

 <section id="case_when" data-type="sect2">
 <h2>
-<code>case_when()</code>
+case_when()
 </h2>
 <p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
 <p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
 <div class="cell">
-<pre data-type="programlisting" data-code-language="r">case_when(
+<pre data-type="programlisting" data-code-language="r">x &lt;- c(-3:3, NA)
+case_when(
  x == 0   ~ "0",
  x &lt; 0    ~ "-ve", 
  x &gt; 0    ~ "+ve",
@@ -582,7 +568,7 @@ if_else(is.na(x1), y1, x1)
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">case_when(
  x &gt; 0 ~ "+ve",
-  x &gt; 3 ~ "big"
+  x &gt; 2 ~ "big"
 )
 #&gt; [1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA</pre>
 </div>
@@ -595,8 +581,8 @@ if_else(is.na(x1), y1, x1)
      arr_delay &lt; -30       ~ "very early",
      arr_delay &lt; -15       ~ "early",
      abs(arr_delay) &lt;= 15  ~ "on time",
-      arr_delay &gt; 15        ~ "late",
-      arr_delay &gt; 60        ~ "very late",
+      arr_delay &lt; 60        ~ "late",
+      arr_delay &lt; Inf       ~ "very late",
    ),
    .keep = "used"
  )
@@ -611,6 +597,7 @@ if_else(is.na(x1), y1, x1)
 #&gt; 6        12 on time
 #&gt; # … with 336,770 more rows</pre>
 </div>
+<p>Be wary when writing this sort of complex <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> statement; my first two attempts used a mix of <code>&lt;</code> and <code>&gt;</code> and I kept accidentally creating overlapping conditions.</p>
 </section>
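<p>Because <code>case_when()</code> uses the first condition that matches, overlapping conditions can silently hide later branches. The sketch below illustrates this on a hypothetical vector of delays (the values are invented for illustration and missing values are left out for brevity), assuming dplyr is installed.</p>

```r
library(dplyr)

# Hypothetical arrival delays in minutes, chosen to hit every branch.
arr_delay <- c(-45, -20, 0, 30, 90)

# Overlapping conditions: 90 matches `> 15` first, so "very late" is never used.
overlapping <- case_when(
  arr_delay > 15 ~ "late",
  arr_delay > 60 ~ "very late"
)
overlapping
#> [1] NA     NA     NA     "late" "late"

# Using `<` with increasing thresholds means each value falls into the first
# bin whose upper bound it is below, so the branches cannot shadow each other.
labelled <- case_when(
  arr_delay < -30      ~ "very early",
  arr_delay < -15      ~ "early",
  abs(arr_delay) <= 15 ~ "on time",
  arr_delay < 60       ~ "late",
  arr_delay < Inf      ~ "very late"
)
labelled
#> [1] "very early" "early"      "on time"    "late"       "very late"
```

<p>Sticking to one comparison direction and ordering the thresholds makes it easier to check by eye that every value lands in exactly one branch.</p>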

 <section id="compatible-types" data-type="sect2">
@@ -639,7 +626,7 @@ case_when(
 </section>
 </section>

-<section id="summary" data-type="sect1">
+<section id="logicals-summary" data-type="sect1">
 <h1>
 Summary</h1>
 <p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>