# Logicals and numbers {#logicals-numbers}

```{r, results = "asis", echo = FALSE}
status("drafting")
```

## Introduction

In this chapter, you'll learn useful tools for working with logical vectors.
The elements in a logical vector can have one of three possible values: `TRUE`, `FALSE`, and `NA`.

### Prerequisites

```{r, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Comparisons

Sometimes you'll get data that already includes logical vectors, but in most cases you'll create them by using a comparison.

R provides the usual comparison operators: `<`, `<=`, `>`, `>=`, `!=`, and `==`.
If you're doing a complex sequence of logical operations, it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.

A useful shortcut is `between(x, low, high)`, which is a bit less typing than `x >= low & x <= high`.
If you want an exclusive between, or one that is left-open and right-closed, you'll need to write the comparison by hand.

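As a small sketch of what "by hand" means here, the chunk below compares `between()` with two hand-written variants on a made-up vector `x` (the values are purely illustrative):

```{r}
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# between() includes both endpoints
between(x, 2, 4)

# Exclusive on both sides: neither 2 nor 4 is included
x > 2 & x < 4

# Left-open, right-closed: excludes 2, includes 4
x > 2 & x <= 4
```
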
Beware when using `==` with numbers, as the results might surprise you!

```{r}
(sqrt(2) ^ 2) == 2
(1 / 49 * 49) == 1
```

Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!), so remember that many of the numbers you see are approximations.

```{r}
(sqrt(2) ^ 2) - 2
(1 / 49 * 49) - 1
```

So instead of relying on `==`, use `near()`, which does the comparison with a small amount of tolerance:

```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```

Alternatively, you might want to use `round()` to trim off extra digits.

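For instance, a minimal sketch of the `round()` approach might look like this; the choice of 10 digits is an arbitrary assumption, and you'd pick whatever precision makes sense for your data:

```{r}
x <- sqrt(2) ^ 2
y <- 1 / 49 * 49

# Rounding to a fixed number of digits removes the tiny floating point error
round(x, digits = 10) == 2
round(y, digits = 10) == 1
```
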
## Boolean algebra

To combine multiple comparisons, you'll need to use Boolean operators: `&` is "and", `|` is "or", and `!` is "not".
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.

```{r bool-ops}
#| echo: false
#| fig.cap: >
#|   Complete set of boolean operations. `x` is the left-hand
#|   circle, `y` is the right-hand circle, and the shaded regions show
#|   which parts each operator selects.
#| fig.alt: >
#|   Six Venn diagrams, each explaining a given logical operator. The
#|   circles (sets) in each of the Venn diagrams represent x and y. y &
#|   !x is y but none of x, x & y is the intersection of x and y, x & !y is
#|   x but none of y, x is all of x none of y, xor(x, y) is everything
#|   except the intersection of x and y, y is all of y none of x, and
#|   x | y is everything.
knitr::include_graphics("diagrams/transform-logical.png")
```

The following code finds all flights that departed in November or December:

```{r, eval = FALSE}
flights |> filter(month == 11 | month == 12)
```

Note that the order of operations doesn't work like English.
You can't write `filter(flights, month == 11 | 12)`, which you might read as "find all flights that departed in November or December".
Instead it does something rather confusing.
Because `==` is evaluated before `|`, R first evaluates `month == 11`, which creates a logical vector.
Then it combines that vector with `12` using `|`.
When a number is used with a logical operator, everything apart from 0 is converted to `TRUE`, so this is equivalent to `month == 11 | TRUE`, which is always `TRUE`: the filter keeps every flight, not just those in November and December!

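A quick check on a small vector shows how R parses `month == 11 | 12`: `==` binds more tightly than `|`, and a non-zero number used with `|` is treated as `TRUE`. The vector `month` below is a made-up stand-in for the `flights` column:

```{r}
month <- c(1, 6, 11, 12)

# Parsed as (month == 11) | 12, so every element is TRUE
month == 11 | 12

# The same thing written out explicitly
(month == 11) | TRUE

# What was actually intended
month == 11 | month == 12
```
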
An easy way to solve this problem is to use `%in%`.
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y`.
So we could use it to rewrite the code above:

```{r, eval = FALSE}
nov_dec <- flights |> filter(month %in% c(11, 12))
```

Sometimes you can simplify complicated subsetting by remembering De Morgan's laws: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

```{r, eval = FALSE}
flights |> filter(!(arr_delay > 120 | dep_delay > 120))
flights |> filter(arr_delay <= 120, dep_delay <= 120)
```

As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and you'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.

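To make the difference concrete, here is a minimal sketch with made-up values: `&` and `|` work element by element, which is what dplyr verbs need, while `&&` and `||` are intended for single values.

```{r}
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# `&` and `|` are vectorised: one result per element
x & y
x | y

# `&&` and `||` are meant for single values, as in an `if` condition
a <- TRUE
b <- FALSE
a && b
a || b
```
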
## Missing values {#logical-missing}

`filter()` only selects rows where the logical expression is `TRUE`; it doesn't select rows where it's missing or `FALSE`.
If you want to find rows containing missing values, you'll need to convert missingness into a logical vector using `is.na()`.

```{r}
# flights where either delay is missing
flights |> filter(is.na(dep_delay) | is.na(arr_delay))
# flights where exactly one of the two delays is missing
flights |> filter(is.na(dep_delay) != is.na(arr_delay))
```

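The way `filter()` treats missing values is easier to see on a tiny example. The tibble `df` below is a hypothetical stand-in, not part of `flights`:

```{r}
library(dplyr)

df <- tibble(x = c(1, NA, 3))

# A comparison involving NA is itself NA
df$x > 2

# filter() keeps only rows where the condition is TRUE,
# so the NA row is dropped along with the FALSE row
df |> filter(x > 2)

# To keep the missing rows too, ask for them explicitly
df |> filter(is.na(x) | x > 2)
```
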
## In mutate()

Whenever you start using complicated, multi-part expressions in `filter()`, consider making them explicit variables instead.
That makes it much easier to check your work.
When checking your work, a particularly useful `mutate()` argument is `.keep = "used"`: this will just show you the variables you've used, along with the variables that you created.
This makes it easy to see the variables involved side-by-side.

```{r}
flights |>
  mutate(is_cancelled = is.na(dep_delay) | is.na(arr_delay), .keep = "used") |>
  filter(is_cancelled)
```

## Cumulative functions

Another useful pair of functions is cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` from the first `TRUE` it encounters onwards, and `cumall()` will be `FALSE` from the first `FALSE` it encounters onwards.
These are particularly useful in conjunction with `filter()` because they allow you to select:

- `cumall(x)`: all cases until the first `FALSE`.
- `cumall(!x)`: all cases until the first `TRUE`.
- `cumany(x)`: all cases from the first `TRUE` onwards.
- `cumany(!x)`: all cases from the first `FALSE` onwards.

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows from the first overdraft onwards
df |> filter(cumany(balance < 0))
# all rows until first overdraft
df |> filter(cumall(!(balance < 0)))
```

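It can help to look at what these functions return on a bare logical vector before using them inside `filter()`. The vector `x` here is made up for illustration:

```{r}
library(dplyr)

x <- c(FALSE, FALSE, TRUE, FALSE, TRUE)

# TRUE from the first TRUE onwards
cumany(x)   # FALSE FALSE TRUE TRUE TRUE

# TRUE only until the first TRUE in x
cumall(!x)  # TRUE TRUE FALSE FALSE FALSE
```
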
## Conditional outputs

If you want to use one value when a condition is `TRUE` and another value when it's `FALSE`, you can use `if_else()`[^logicals-numbers-1].

[^logicals-numbers-1]: This is similar to the base R function `ifelse()`.
    There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
df |> mutate(status = if_else(balance < 0, "overdraft", "ok"))
```

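As a sketch of how the two functions differ on missing values, the chunk below uses a made-up `balance` vector containing an `NA` and the `missing` argument of `if_else()`; the label `"unknown"` is just an illustrative choice:

```{r}
library(dplyr)

balance <- c(100, -25, NA)

# ifelse() passes NA through unchanged
ifelse(balance < 0, "overdraft", "ok")

# if_else() lets you choose a replacement for NA with `missing`
if_else(balance < 0, "overdraft", "ok", missing = "unknown")
```
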
If you start to nest multiple `if_else()` calls, I'd suggest switching to `case_when()` instead.
`case_when()` has a special syntax: it takes pairs that look like `condition ~ output`.
`condition` must evaluate to a logical vector; when it's `TRUE`, `output` will be used.

```{r}
df |>
  mutate(
    status = case_when(
      balance == 0 ~ "no money",
      balance < 0  ~ "overdraft",
      balance > 0  ~ "ok"
    )
  )
```

(Note that I usually add spaces to make the outputs line up so it's easier to scan.)

If none of the cases match, the output will be missing:

```{r}
x <- 1:10
case_when(
  x %% 2 == 0 ~ "even",
)
```

You can create a catch-all value by using `TRUE` as the condition:

```{r}
case_when(
  x %% 2 == 0 ~ "even",
  TRUE        ~ "odd"
)
```

If multiple conditions are `TRUE`, the first is used:

```{r}
case_when(
  x < 5 ~ "< 5",
  x < 3 ~ "< 3",
)
```

## Summaries

When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0, and when you use a numeric vector in a logical context, 0 becomes `FALSE` and everything else becomes `TRUE`.

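You can see both conversions directly with `as.numeric()` and `as.logical()`; the vectors below are made up for illustration:

```{r}
x <- c(TRUE, FALSE, TRUE)

# Logical to numeric: TRUE becomes 1, FALSE becomes 0
as.numeric(x)
x + 0

# Numeric to logical: 0 becomes FALSE, everything else becomes TRUE
as.logical(c(0, 1, -2, 0.5))
```
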
There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.

`any()` and `all()`: `any()` will return `TRUE` if there's at least one `TRUE`, and `all()` will return `TRUE` if all values are `TRUE`.
Like other summary functions, they can return `NA` when missing values are present (whenever the answer depends on the missing values), and as usual you can make the missing values go away with `na.rm = TRUE`.

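A short sketch on made-up vectors shows when missing values do and don't affect `any()` and `all()`:

```{r}
x <- c(TRUE, FALSE, NA)

any(x)  # TRUE: one TRUE is enough, whatever the NA is
all(x)  # FALSE: one FALSE is enough, whatever the NA is

any(c(FALSE, NA))  # NA: the answer depends on the missing value
all(c(TRUE, NA))   # NA: the answer depends on the missing value

# Dropping the missing values gives a definite answer
any(c(FALSE, NA), na.rm = TRUE)  # FALSE
all(c(TRUE, NA), na.rm = TRUE)   # TRUE
```
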
`sum()` and `mean()` are particularly useful with logical vectors because `TRUE` is converted to 1 and `FALSE` to 0.
This means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s:

```{r}
not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))

# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled |>
  group_by(year, month, day) |>
  summarise(n_early = sum(dep_time < 500))

# What proportion of flights are delayed by more than an hour?
not_cancelled |>
  group_by(year, month, day) |>
  summarise(hour_prop = mean(arr_delay > 60))
```

### Exercises

1. For each plane, count the number of flights before the first delay of greater than 1 hour.
2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
