# Logicals and numbers {#logicals-numbers}

```{r, results = "asis", echo = FALSE}
status("drafting")
```

## Introduction

In this chapter, you'll learn useful tools for working with logical vectors.
The elements in a logical vector can have one of three possible values: `TRUE`, `FALSE`, and `NA`.

### Prerequisites

```{r, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Comparisons

Sometimes you'll get data that already includes logical vectors, but in most cases you'll create them by using a comparison.

The comparison operators are `<`, `<=`, `>`, `>=`, `!=`, and `==`.
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.

A useful shortcut is `between(x, low, high)`, which is a bit less typing than `x >= low & x <= high`.
`between()` includes both endpoints, so if you want an exclusive range, or one that is left-open and right-closed, you'll need to write the comparison by hand.
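As a minimal sketch of the difference, here is `between()` next to a hand-written half-open range. The vector `x` and the bounds 5 and 10 are made up for illustration.

```{r}
library(dplyr)

x <- c(1, 5, 10, 15)  # hypothetical values

# Inclusive on both ends: 5 and 10 both count
between(x, 5, 10)

# Left-open, right-closed: 5 is excluded, 10 is included
x > 5 & x <= 10
```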

Beware when using `==` with numbers as results might surprise you!

```{r}
(sqrt(2) ^ 2) == 2
(1 / 49 * 49) == 1
```

Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.

```{r}
(sqrt(2) ^ 2) - 2
(1 / 49 * 49) - 1
```

So instead of relying on `==`, use `near()`, which does the comparison with a small amount of tolerance:

```{r}
near(sqrt(2) ^ 2,  2)
near(1 / 49 * 49, 1)
```

Alternatively, you might want to use `round()` to trim off extra digits.
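For instance, rounding both results to a fixed number of decimal places makes the comparison behave as expected. The choice of 10 digits here is arbitrary and only for illustration; pick a precision that suits your data.

```{r}
# Rounding removes the tiny floating point error before comparing
round(sqrt(2) ^ 2, 10) == 2
round(1 / 49 * 49, 10) == 1
```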

## Boolean algebra

To combine multiple logical vectors, you'll need to use Boolean operators: `&` is "and", `|` is "or", and `!` is "not".
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.

```{r bool-ops}
#| echo: false
#| fig.cap: > 
#|    Complete set of Boolean operations. `x` is the left-hand
#|    circle, `y` is the right-hand circle, and the shaded regions show 
#|    which parts each operator selects.
#| fig.alt: >
#|    Six Venn diagrams, each explaining a given logical operator. The
#|    circles (sets) in each of the Venn diagrams represent x and y. y &
#|    !x is y but none of x, x & y is the intersection of x and y, x & !y is
#|    x but none of y, x is all of x none of y, xor(x, y) is everything
#|    except the intersection of x and y, y is all of y none of x, and 
#|    x | y is everything.
knitr::include_graphics("diagrams/transform-logical.png")
```

The following code finds all flights that departed in November or December:

```{r, eval = FALSE}
flights |> filter(month == 11 | month == 12)
```

Note that the order of operations doesn't work like English.
You can't write `filter(flights, month == 11 | 12)`, which you might read as "find all flights that departed in November or December".
Instead it does something rather confusing.
Because `==` binds more tightly than `|`, R first evaluates `month == 11`, which produces a logical vector.
Then it combines that vector with `12` using `|`.
In a logical context every non-zero number becomes `TRUE`, so this is equivalent to `(month == 11) | TRUE`, which is always `TRUE`: the expression keeps every flight, not just those in November and December!

An easy way to solve this problem is to use `%in%`.
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y`.
So we could use it to rewrite the code above:

```{r, eval = FALSE}
nov_dec <- flights |> filter(month %in% c(11, 12))
```
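To see what `%in%` returns on its own, here is a small sketch using a made-up vector of month numbers rather than the `flights` data.

```{r}
month <- c(1, 11, 12, 6)  # hypothetical month numbers

# One TRUE/FALSE per element of month
month %in% c(11, 12)

# The same result, spelled out with |
month == 11 | month == 12
```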

Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

```{r, eval = FALSE}
flights |> filter(!(arr_delay > 120 | dep_delay > 120))
flights |> filter(arr_delay <= 120, dep_delay <= 120)
```
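If you'd like to convince yourself of De Morgan's law, one option is to check it on every combination of `TRUE` and `FALSE`. The vectors `x` and `y` below are illustrative only.

```{r}
# All four combinations of TRUE and FALSE
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

identical(!(x & y), !x | !y)
identical(!(x | y), !x & !y)
```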

As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and you'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.

## Missing values {#logical-missing}

`filter()` only selects rows where the logical expression is `TRUE`; it doesn't select rows where it's missing or `FALSE`.
If you want to find rows containing missing values, you'll need to convert missingness into a logical vector using `is.na()`.

```{r}
# Flights where either delay is missing
flights |> filter(is.na(dep_delay) | is.na(arr_delay))
# Flights where exactly one of the two delays is missing
flights |> filter(is.na(dep_delay) != is.na(arr_delay))
```
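The reason you need `is.na()` is that comparing anything with `NA` gives `NA`. A short sketch with a made-up vector shows the difference:

```{r}
x <- c(10, NA, 3)  # hypothetical values

x > 5      # a comparison involving NA is itself NA
x == NA    # always NA, so it can't be used to find missing values
is.na(x)   # TRUE exactly where x is missing
```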

## In mutate()

Whenever you start using complicated, multi-part expressions in `filter()`, consider making them explicit variables instead.
That makes it much easier to check your work.
When checking your work, a particularly useful `mutate()` argument is `.keep = "used"`: this will just show you the variables you've used, along with the variables that you created.
This makes it easy to see the variables involved side-by-side.

```{r}
flights |> 
  mutate(is_cancelled = is.na(dep_delay) | is.na(arr_delay), .keep = "used") |> 
  filter(is_cancelled)
```

## Cumulative functions

Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
These are particularly useful in conjunction with `filter()` because they allow you to select:

-   `cumall(x)`: all cases until the first `FALSE`.
-   `cumall(!x)`: all cases until the first `TRUE`.
-   `cumany(x)`: all cases after the first `TRUE`.
-   `cumany(!x)`: all cases after the first `FALSE`.

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows after first overdraft
df |> filter(cumany(balance < 0))
# all rows until first overdraft
df |> filter(cumall(!(balance < 0)))
```
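It can also help to look at what these functions return for a bare logical vector, outside of `filter()`. The vector `x` here is made up for illustration.

```{r}
library(dplyr)

x <- c(FALSE, FALSE, TRUE, FALSE, TRUE)  # hypothetical values

cumany(x)   # switches to TRUE at the first TRUE and stays there
cumall(!x)  # stays TRUE until the first TRUE in x, then FALSE
```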

## Conditional outputs

If you want to use one value when a condition is `TRUE` and another value when it's `FALSE`, you can use `if_else()`[^logicals-numbers-1].

[^logicals-numbers-1]: This is similar to the base R function `ifelse()`.
    There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
df |> mutate(status = if_else(balance < 0, "overdraft", "ok"))
```
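As a sketch of how `if_else()` lets you choose what happens to missing values, its `missing` argument supplies the value to use when the condition is `NA`. The vector `x` and the labels are invented for this example.

```{r}
library(dplyr)

x <- c(-2, NA, 3)  # hypothetical values

# Without `missing`, the NA would be carried through to the result
if_else(x < 0, "negative", "non-negative", missing = "unknown")
```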

If you start to nest multiple `if_else()` calls, I'd suggest switching to `case_when()` instead.
`case_when()` has a special syntax: it takes pairs that look like `condition ~ output`.
`condition` must evaluate to a logical vector; when it's `TRUE`, `output` will be used.

```{r}
df |> 
  mutate(
    status = case_when(
      balance == 0 ~ "no money", 
      balance  < 0 ~ "overdraft",
      balance  > 0 ~ "ok"
    )
  )
```

(Note that I usually add spaces to make the outputs line up so it's easier to scan.)

If none of the cases match, the output will be missing:

```{r}
x <- 1:10
case_when(
  x %% 2 == 0 ~ "even",
)
```

You can create a catch-all value by using `TRUE` as the condition:

```{r}
case_when(
  x %% 2 == 0 ~ "even",
  TRUE        ~ "odd"
)
```

If multiple conditions are `TRUE`, the first is used:

```{r}
case_when(
  x < 5 ~ "< 5",
  x < 3 ~ "< 3",
)
```
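Because the first matching condition wins, one way to get the labels you probably intended is to list the most specific condition first. This is a sketch of that ordering; the `">= 5"` catch-all label is an addition for illustration.

```{r}
library(dplyr)

x <- 1:10

case_when(
  x < 3 ~ "< 3",   # most specific condition first
  x < 5 ~ "< 5",   # only reached when x is 3 or 4
  TRUE  ~ ">= 5"   # everything else
)
```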

## Summaries

When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0, and when you use a numeric vector in a logical context, 0 becomes `FALSE` and everything else becomes `TRUE`.
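You can see both conversions directly with a few throwaway values (chosen only for illustration):

```{r}
# Logical to numeric
as.numeric(c(TRUE, FALSE, TRUE))
TRUE + TRUE

# Numeric to logical: only 0 becomes FALSE
as.logical(c(0, 1, -3.5))
```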

There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.

`any()` and `all()` --- `any()` will return `TRUE` if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`.
Like other summary functions, they can return `NA` when missing values are present (whenever the missing values could change the answer), and as usual you can make the missing values go away with `na.rm = TRUE`.
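A brief sketch of how missing values interact with `any()` and `all()`, using made-up vectors:

```{r}
# The missing value could be TRUE, so the answer is unknown
any(c(FALSE, NA))
any(c(FALSE, NA), na.rm = TRUE)

# The missing value could be FALSE, so the answer is unknown
all(c(TRUE, NA))
all(c(TRUE, NA), na.rm = TRUE)
```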

`sum()` and `mean()` are particularly useful with logical vectors because `TRUE` is converted to 1 and `FALSE` to 0.
This means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s:

```{r}
not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))

# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled |> 
  group_by(year, month, day) |> 
  summarise(n_early = sum(dep_time < 500))

# What proportion of flights are delayed by more than an hour?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarise(hour_prop = mean(arr_delay > 60))
```
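The same idea is easier to verify by hand on a tiny vector. The `delay` values below are invented, not taken from `flights`.

```{r}
delay <- c(5, 75, 130, -3)  # hypothetical delays in minutes

sum(delay > 60)   # number of delays over an hour
mean(delay > 60)  # proportion of delays over an hour
```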

### Exercises

1.  For each plane, count the number of flights before the first delay of greater than 1 hour.
2.  What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
