More minor page count tweaks & fixes
And re-convert with latest htmlbook
@@ -1,12 +1,12 @@
|
||||
<section data-type="chapter" id="chp-logicals">
|
||||
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<section id="logicals-introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
|
||||
<p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<section id="logicals-prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
|
||||
@@ -56,9 +56,7 @@ Comparisons</h1>
|
||||
#> 5 2013 1 1 606 610 -4 837 845
|
||||
#> 6 2013 1 1 607 607 0 858 915
|
||||
#> # … with 172,280 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …</pre>
|
||||
</div>
|
||||
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
|
||||
<div class="cell">
|
||||
@@ -151,17 +149,14 @@ x == y
|
||||
filter(dep_time == NA)
|
||||
#> # A tibble: 0 × 19
|
||||
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
|
||||
#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
|
||||
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
|
||||
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
|
||||
#> # hour <dbl>, minute <dbl>, time_hour <dttm></pre>
|
||||
#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>, …</pre>
|
||||
</div>
|
||||
<p>Instead we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
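<p>To see why <code>filter(dep_time == NA)</code> returns no rows, it can help to try the same comparison on a small vector. The following is a minimal sketch using a made-up vector <code>x</code> that is not part of the flights data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(1, NA, 3)

# Comparing against NA gives NA for every element, never TRUE,
# so filter() has no rows it can keep
x == NA
#> [1] NA NA NA

# is.na() asks the question we actually mean: "is this value missing?"
is.na(x)
#> [1] FALSE  TRUE FALSE</pre>
</div>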
|
||||
</section>
|
||||
|
||||
<section id="is.na" data-type="sect2">
|
||||
<h2>
|
||||
<code>is.na()</code>
|
||||
is.na()
|
||||
</h2>
|
||||
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
|
||||
<div class="cell">
|
||||
@@ -186,9 +181,7 @@ is.na(c("a", NA, "b"))
|
||||
#> 5 2013 1 2 NA 1540 NA NA 1747
|
||||
#> 6 2013 1 2 NA 1620 NA NA 1746
|
||||
#> # … with 8,249 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …</pre>
|
||||
</div>
|
||||
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end, but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
|
||||
<div class="cell">
|
||||
@@ -205,9 +198,7 @@ is.na(c("a", NA, "b"))
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …
|
||||
|
||||
flights |>
|
||||
filter(month == 1, day == 1) |>
|
||||
@@ -222,14 +213,12 @@ flights |>
|
||||
#> 5 2013 1 1 517 515 2 830 819
|
||||
#> 6 2013 1 1 533 529 4 850 830
|
||||
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …</pre>
|
||||
</div>
|
||||
<p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
|
||||
</section>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<section id="logicals-exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
|
||||
@@ -295,9 +284,7 @@ Order of operations</h2>
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …</pre>
|
||||
</div>
|
||||
<p>This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, R converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>, so every row will be selected:</p>
|
||||
<div class="cell">
|
||||
@@ -322,7 +309,7 @@ Order of operations</h2>
|
||||
|
||||
<section id="in" data-type="sect2">
|
||||
<h2>
|
||||
<code>%in%</code>
|
||||
%in%
|
||||
</h2>
|
||||
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
|
||||
<div class="cell">
|
||||
@@ -357,13 +344,11 @@ c(1, 2, NA) %in% NA
|
||||
#> 5 2013 1 1 NA 1500 NA NA 1825
|
||||
#> 6 2013 1 1 NA 600 NA NA 901
|
||||
#> # … with 8,797 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<section id="logicals-exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
|
||||
@@ -496,7 +481,7 @@ Logical subsetting</h2>
|
||||
<p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
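<p>The difference in group size is easier to see on a tiny data frame. The sketch below assumes a made-up tibble called <code>toy</code> (three flights on day 1, two on day 2); the names <code>toy</code>, <code>filtered</code>, and <code>subsetted</code> are only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data: day 1 has 3 flights (2 delayed), day 2 has 2 flights (1 delayed)
toy <- tibble(
  day = c(1, 1, 1, 2, 2),
  arr_delay = c(10, -5, 20, 30, -2)
)

# Filter first: n() only sees the delayed rows that survived the filter
filtered <- toy |>
  filter(arr_delay > 0) |>
  group_by(day) |>
  summarize(behind = mean(arr_delay), n = n())

# Subset inside summarize(): n() still sees every row in the group
subsetted <- toy |>
  group_by(day) |>
  summarize(behind = mean(arr_delay[arr_delay > 0]), n = n())

filtered$n   # 2 and 1: delayed flights per day
subsetted$n  # 3 and 2: all flights per day</pre>
</div>
<p>Both versions compute the same average delay per day; only the meaning of <code>n</code> changes.</p>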
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<section id="logicals-exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
|
||||
@@ -511,7 +496,7 @@ Conditional transformations</h1>
|
||||
|
||||
<section id="if_else" data-type="sect2">
|
||||
<h2>
|
||||
<code>if_else()</code>
|
||||
if_else()
|
||||
</h2>
|
||||
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
|
||||
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
|
||||
@@ -547,12 +532,13 @@ if_else(is.na(x1), y1, x1)
|
||||
|
||||
<section id="case_when" data-type="sect2">
|
||||
<h2>
|
||||
<code>case_when()</code>
|
||||
case_when()
|
||||
</h2>
|
||||
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
|
||||
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(-3:3, NA)
|
||||
case_when(
|
||||
x == 0 ~ "0",
|
||||
x < 0 ~ "-ve",
|
||||
x > 0 ~ "+ve",
|
||||
@@ -582,7 +568,7 @@ if_else(is.na(x1), y1, x1)
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
x > 0 ~ "+ve",
|
||||
x > 3 ~ "big"
|
||||
x > 2 ~ "big"
|
||||
)
|
||||
#> [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
|
||||
</div>
|
||||
@@ -595,8 +581,8 @@ if_else(is.na(x1), y1, x1)
|
||||
arr_delay < -30 ~ "very early",
|
||||
arr_delay < -15 ~ "early",
|
||||
abs(arr_delay) <= 15 ~ "on time",
|
||||
arr_delay > 15 ~ "late",
|
||||
arr_delay > 60 ~ "very late",
|
||||
arr_delay < 60 ~ "late",
|
||||
arr_delay < Inf ~ "very late",
|
||||
),
|
||||
.keep = "used"
|
||||
)
|
||||
@@ -611,6 +597,7 @@ if_else(is.na(x1), y1, x1)
|
||||
#> 6 12 on time
|
||||
#> # … with 336,770 more rows</pre>
|
||||
</div>
|
||||
<p>Be wary when writing this sort of complex <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> statement; my first two attempts used a mix of <code><</code> and <code>></code> and I kept accidentally creating overlapping conditions.</p>
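<p>Overlapping conditions are easy to miss because <code>case_when()</code> uses the first condition that matches and ignores the rest. Here is a minimal sketch of the problem, assuming a made-up vector <code>delay</code> of arrival delays in minutes (the variable names are only for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

delay <- c(-40, 5, 30, 90)  # hypothetical arrival delays

# Overlapping conditions: 90 already matches `delay > 15`,
# so the "very late" branch is never reached
overlapping <- case_when(
  delay > 15 ~ "late",
  delay > 60 ~ "very late",
  TRUE ~ "not late"
)

# Putting the most specific condition first fixes the labels
ordered <- case_when(
  delay > 60 ~ "very late",
  delay > 15 ~ "late",
  TRUE ~ "not late"
)

overlapping[4]  # "late"
ordered[4]      # "very late"</pre>
</div>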
|
||||
</section>
|
||||
|
||||
<section id="compatible-types" data-type="sect2">
|
||||
@@ -639,7 +626,7 @@ case_when(
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<section id="logicals-summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>></code>, <code><</code>, <code><=</code>, <code>>=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>
|
||||
|
||||