More on logical + numbers

2022-03-17 14:15:24 -05:00 · 2022-03-17 14:15:24 -05:00 · a73755838f
parent 0b5782dd45
commit a73755838f
2 changed files with 167 additions and 91 deletions
--- a/logicals.Rmd
+++ b/logicals.Rmd
@@ -1,4 +1,4 @@
-# Logicals and numbers {#logicals-numbers}
+# Logicals and numbers {#logicals}

 ```{r, results = "asis", echo = FALSE}
 status("drafting")
@@ -7,7 +7,8 @@ status("drafting")
 ## Introduction

 In this chapter, you'll learn useful tools for working with logical vectors.
-The elements in a logical vector can have one of three possible values: `TRUE`, `FALSE`, and `NA`.
+Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`.
+Despite that simplicity, they're an extremely powerful tool.

 ### Prerequisites

@@ -18,44 +19,93 @@ library(nycflights13)

 ## Comparisons

-Some times you'll get data that already includes logical vectors but in most cases you'll create them by using a comparison.
+Sometimes you'll get data that already includes logical vectors, but in most cases you'll create them by using a comparison operator: `<`, `<=`, `>`, `>=`, `!=`, or `==`.

-`<`, `<=`, `>`, `>=`, `!=`, and `==`.
-If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
+### In `mutate()`

-A useful shortcut is `between(x, low, high)` which is a bit less typing than `x >= low & x <= high)`.
-If you want an exclusive between or left-open right-closed etc, you'll need to write by hand.
+So far, you've mostly created logical vectors implicitly within `filter()`:
+
+```{r}
+flights |> 
+  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
+```
+
+But it's useful to know that this is a shortcut: you can also create the underlying logical variables explicitly inside `mutate()`:
+
+```{r}
+flights |> 
+  mutate(
+    daytime = dep_time > 600 & dep_time < 2000,
+    approx_ontime = abs(arr_delay) < 20,
+    .keep = "used"
+  )
+```
+
+So the filter above could also be written as:
+
+```{r}
+flights |> 
+  mutate(
+    daytime = dep_time > 600 & dep_time < 2000,
+    approx_ontime = abs(arr_delay) < 20,
+  ) |> 
+  filter(daytime & approx_ontime)
+```
+
+This is an important technique when you're doing complicated subsetting because it allows you to double-check the intermediate steps.
+
+### Floating point comparison

 Beware when using `==` with numbers as results might surprise you!
+You might think that the following two computations yield 1 and 2:
+
+```{r}
+(1 / 49 * 49)
+sqrt(2) ^ 2
+```
+
+But if you test them for equality, you'll discover that they're not what you expect!

 ```{r}
-(sqrt(2) ^ 2) == 2
 (1 / 49 * 49) == 1
+(sqrt(2) ^ 2) == 2
 ```

-Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
+That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!), so in most cases the number you see is actually an approximation.
+R usually rounds these numbers to avoid displaying a bunch of unimportant digits.
+You can use the `digits` argument to `format()` to force R to display more:

 ```{r}
-(sqrt(2) ^ 2) - 2
-(1 / 49 * 49) - 1
+format(1 / 49 * 49, digits = 20)
+format(sqrt(2) ^ 2, digits = 20)
 ```

-So instead of relying on `==`, use `near()`, which does the comparison with a small amount of tolerance:
+Instead of relying on `==`, you can use `dplyr::near()`, which does the comparison with a small amount of tolerance:

 ```{r}
 near(sqrt(2) ^ 2,  2)
 near(1 / 49 * 49, 1)
 ```
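+
+The idea behind that tolerance can be sketched in base R. This assumes, as described in the dplyr documentation, that `near()` compares the absolute difference against a default tolerance of `.Machine$double.eps^0.5`:
+
+```{r}
+x <- sqrt(2) ^ 2
+
+# The difference from 2 is tiny, but it is not exactly zero
+x - 2
+
+# Treat the two numbers as equal when they differ by less than the tolerance
+abs(x - 2) < .Machine$double.eps^0.5
+```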

-Alternatively, you might want to use `round()` to trim off extra digits.
+### `is.na()`
+
+Another common way to create a logical vector is with `is.na()`.
+This is particularly important in conjunction with `filter()` because `filter()` only selects rows where the value is `TRUE`; rows where the value is `FALSE` or `NA` are automatically dropped.
+
+```{r}
+flights |> filter(is.na(dep_delay) | is.na(arr_delay))
+flights |> filter(is.na(dep_delay) != is.na(arr_delay))
+```

 ## Boolean algebra

-For other types of combinations, you'll need to use Boolean operators yourself: `|` is "or" and `!` is "not".
-Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
+Once you have multiple logical vectors, you can combine them together using Boolean algebra: `&` is "and", `|` is "or", and `!` is "not".
+`xor()` provides one final useful operation: exclusive or.
+Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.

 ```{r bool-ops}
 #| echo: false
+#| out.width: NULL
 #| fig.cap: > 
 #|    Complete set of boolean operations. `x` is the left-hand
 #|    circle, `y` is the right-hand circle, and the shaded region show 
@@ -70,71 +120,122 @@ Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
 knitr::include_graphics("diagrams/transform-logical.png")
 ```

+As well as `&` and `|`, R also has `&&` and `||`.
+Don't use them in dplyr functions!
+These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
+They're important for programming so you'll learn more about them in Section \@ref(conditional-execution).
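+
+As a minimal sketch of the difference, using made-up vectors (`x` and `y` are illustrative names, not columns of `flights`):
+
+```{r}
+x <- c(TRUE, FALSE, TRUE)
+y <- c(TRUE, TRUE, FALSE)
+
+# `&` works element by element, so it returns one value per element
+x & y
+
+# `&&` is meant for single values, such as the condition of an `if` statement
+x[1] && y[1]
+```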
+
 The following code finds all flights that departed in November or December:

 ```{r, eval = FALSE}
-flights |> filter(month == 11 | month == 12)
+flights |> 
+  filter(month == 11 | month == 12)
 ```

 Note that the order of operations doesn't work like English.
-You can't write `filter(flights, month == 11 | 12)`, which you might read as "find all flights that departed in November or December".
-Instead it does something rather confusing.
-First it evaluates `11 | 12` which is equivalent to `TRUE | TRUE`, which returns `TRUE`.
+You can't think "find all flights that departed in November or December" and write `flights |> filter(month == 11 | 12)`.
+This code will not error, but it will do something rather confusing.
+Because `==` is evaluated before `|`, R first evaluates `month == 11`, which creates a logical vector.
-Then it evaluates `month == TRUE`.
-Since month is numeric, this is equivalent to `month == 1`, so that expression finds all flights in January!
+It then computes the "or" of that vector and `12`.
+When you use a number with a logical operator, everything apart from 0 is converted to `TRUE`, so this is equivalent to `(month == 11) | TRUE`.
+That is always `TRUE`, so `flights |> filter(month == 11 | 12)` returns every row!
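+
+A quick console check shows how R actually groups this expression: `==` is applied before `|`. This is only an illustration, and `month_vec` is a made-up vector rather than a column of `flights`:
+
+```{r}
+month_vec <- c(1, 11, 12)
+
+# `==` has higher precedence than `|`, so these two lines are the same
+month_vec == 11 | 12
+(month_vec == 11) | 12
+
+# The comparison you probably meant
+month_vec == 11 | month_vec == 12
+```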

-An easy way to solve this problem is to use `%in%`.
+### `%in%`
+
+An easy way to avoid this issue is to use `%in%`.
 `x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y` .
-So we could use it to rewrite the code above:
+So we could instead write:

 ```{r, eval = FALSE}
-nov_dec <- flights |> filter(month %in% c(11, 12))
+flights |> 
+  filter(month %in% c(11, 12))
 ```
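+
+To see what `%in%` returns on its own, here is a small sketch with a made-up vector (`month_vec` is a hypothetical name, not part of `flights`):
+
+```{r}
+month_vec <- c(1, 5, 11, 12)
+
+# One result per element of the left-hand side
+month_vec %in% c(11, 12)
+```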

 Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
 For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

 ```{r, eval = FALSE}
-flights |> filter(!(arr_delay > 120 | dep_delay > 120))
-flights |> filter(arr_delay <= 120, dep_delay <= 120)
-```
-
-As well as `&` and `|`, R also has `&&` and `||`.
-Don't use them in dplyr functions!
-These are called short-circuiting operators and you'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
-
-## Missing values {#logical-missing}
-
-`filter()` only selects rows where the logical expression is `TRUE`; it doesn't select rows where it's missing or `FALSE`.
-If you want to find rows containing missing values, you'll need to convert missingness into a logical vector using `is.na()`.
-
-```{r}
-flights |> filter(is.na(dep_delay) | is.na(arr_delay))
-flights |> filter(is.na(dep_delay) != is.na(arr_delay))
-```
-
-## In mutate()
-
-Whenever you start using complicated, multi-part expressions in `filter()`, consider making them explicit variables instead.
-That makes it much easier to check your work.When checking your work, a particularly useful `mutate()` argument is `.keep = "used"`: this will just show you the variables you've used, along with the variables that you created.
-This makes it easy to see the variables involved side-by-side.
-
-```{r}
 flights |> 
-  mutate(is_cancelled = is.na(dep_delay) | is.na(arr_delay), .keep = "used") |> 
-  filter(is_cancelled)
+  filter(!(arr_delay > 120 | dep_delay > 120))
+flights |> 
+  filter(arr_delay <= 120 & dep_delay <= 120)
 ```

-## Cumulative functions
+### Missing values {#logical-missing}
+
+The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
+
+```{r}
+NA & c(TRUE, FALSE, NA)
+NA | c(TRUE, FALSE, NA)
+```
+
+<!-- Draw truth tables? -->
+
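+To understand what's going on, think about `x | TRUE`: regardless of whether `x` is `TRUE` or `FALSE`, the result is still `TRUE`.
+That means even if you don't know what `x` is (i.e. it's missing), the result must still be `TRUE`.
+Similar reasoning applies to `x & FALSE`: whatever `x` is, the result is `FALSE`.
+In every other case the answer depends on the unknown value, so the result is `NA`.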
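+
+These rules can be checked case by case; the following is just an illustrative sketch:
+
+```{r}
+# The result is known whatever the missing value is
+NA | TRUE
+NA & FALSE
+
+# The result depends on the missing value, so it stays missing
+NA | FALSE
+NA & TRUE
+```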
+
+## Summaries
+
+There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.
+
+`any()` and `all()` --- `any()` will return `TRUE` if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`.
+Like other summary functions, they can return `NA` when missing values are present and the answer depends on them; as usual, you can make the missing values go away with `na.rm = TRUE`.
+We could use this to see if there were any days where every flight was delayed:
+
+```{r}
+not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))
+
+not_cancelled |> 
+  group_by(year, month, day) |> 
+  filter(all(arr_delay >= 0))
+```
+
+`sum()` and `mean()` are particularly useful with logical vectors because when you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
+That means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s.
+That lets us find the day with the highest proportion of delayed flights:
+
+```{r}
+not_cancelled |> 
+  group_by(year, month, day) |> 
+  summarise(prop_delayed = mean(arr_delay > 0)) |> 
+  arrange(desc(prop_delayed))
+```
+
+Or we could ask how many flights left before 5am, which are usually flights that were delayed from the previous day:
+
+```{r}
+not_cancelled |> 
+  group_by(year, month, day) |> 
+  summarise(n_early = sum(dep_time < 500)) |> 
+  arrange(desc(n_early))
+```
+
+### Exercises
+
+1.  For each plane, count the number of flights before the first delay of greater than 1 hour.
+2.  What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
+
+## Transformations
+
+### Cumulative functions

 Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
 `cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
-These are particularly useful in conjunction with `filter()` because they allow you to select:

-   `cumall(x)`: all cases until the first `FALSE`.
-   `cumall(!x)`: all cases until the first `TRUE`.
-   `cumany(x)`: all cases after the first `TRUE`.
-   `cumany(!x)`: all cases after the first `FALSE`.
+```{r}
+cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
+cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
+```
+
+These are particularly useful in conjunction with `filter()` because they allow you to select rows:
+
+-   Before the first `FALSE` with `cumall(x)`.
+-   Before the first `TRUE` with `cumall(!x)`.
+-   After the first `TRUE` with `cumany(x)`.
+-   After the first `FALSE` with `cumany(!x)`.
+
+If you imagine some data about a bank balance, then these functions allow you t

 ```{r}
 df <- data.frame(
@@ -147,11 +248,11 @@ df |> filter(cumany(balance < 0))
 df |> filter(cumall(!(balance < 0)))
 ```

-## Conditional outputs
+### Conditional outputs

-If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-numbers-1].
+If you want to use one value when a condition is `TRUE` and another value when it's `FALSE`, you can use `if_else()`[^logicals-1].

-[^logicals-numbers-1]: This is equivalent to the base R function `ifelse`.
+[^logicals-1]: This is equivalent to the base R function `ifelse()`.
    There are two main advantages of `if_else()`over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.

 ```{r}
@@ -206,36 +307,6 @@ case_when(
 )
 ```

-## Summaries
-
-When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0, and when you use a numeric vector in a logical context, 0 becomes `FALSE` and everything else becomes `TRUE`.
-
-There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.
-
-`any()` and `all()` --- `any()` will return if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`.
-Like all summary functions, they'll return `NA` if there are any missing values present, and like usual you can make the missing values go away with `na.rm = TRUE`.
-
-`sum()` and `mean()` are particularly useful with logical vectors because `TRUE` is converted to 1 and `FALSE` to 0.
-This means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s:
-
-```{r}
-not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))
-
-# How many flights left before 5am? (these usually indicate delayed
-# flights from the previous day)
-not_cancelled |> 
-  group_by(year, month, day) |> 
-  summarise(n_early = sum(dep_time < 500))
-
-# What proportion of flights are delayed by more than an hour?
-not_cancelled |> 
-  group_by(year, month, day) |> 
-  summarise(hour_prop = mean(arr_delay > 60))
-```
-
-### Exercises
-
-1.  For each plane, count the number of flights before the first delay of greater than 1 hour.
-2.  What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
+## 

 ## 
--- a/numbers.Rmd
+++ b/numbers.Rmd
@@ -1,4 +1,4 @@
-# Numbers {#logicals-numbers}
+# Numbers {#numbers}

 ```{r, results = "asis", echo = FALSE}
 status("drafting")
@@ -19,6 +19,11 @@ library(nycflights13)

 Doesn't quite belong here, but it's really important (and it makes numbers) so I wanted to discuss it first.

+```{r}
+not_cancelled <- flights |> 
+  filter(!is.na(dep_time))
+```
+
 -   Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
    To count the number of non-missing values, use `sum(!is.na(x))`.
    To count the number of distinct (unique) values, use `n_distinct(x)`.