Iterate on logicals chapter

Also figure out how to get figures sizes aligned nicely.
Hadley Wickham 2022-03-21 09:24:42 -05:00
parent 3e3faea9cb
commit 090333be5b
6 changed files with 204 additions and 109 deletions

@ -1,8 +1,9 @@
# R for Data Science # R for Data Science
[![Travis build status](https://travis-ci.org/hadley/r4ds.svg?branch=master)](https://travis-ci.org/hadley/r4ds) [![Travis build status](https://travis-ci.org/hadley/r4ds.svg?branch=master)](https://travis-ci.org/hadley/r4ds)
This repository contains the source of [R for Data Science](http://r4ds.had.co.nz) This repository contains the source of [R for Data Science](http://r4ds.had.co.nz) book.
book. The book is built using [bookdown](https://github.com/rstudio/bookdown). The book is built using [bookdown](https://github.com/rstudio/bookdown).
The R packages used in this book can be installed via The R packages used in this book can be installed via
@ -10,6 +11,24 @@ The R packages used in this book can be installed via
devtools::install_github("hadley/r4ds") devtools::install_github("hadley/r4ds")
``` ```
## Images
### Omnigraffle drawings
- Font: 12pt Ubuntu mono
- Export as 300 dpi png.
- Website font is 18 px = 13.5 pt, so scale dpi to match font sizes: 270 ≈ 300 \* 12 / 13.5
- Verified sizes are visually equivalent by screenshotting.
```{=html}
<!-- -->
```
#| echo: FALSE
#| out.width: NULL
#| fig.retina: 1.5
knitr::include_graphics("diagrams/transform.png", dpi = 270)
## Code of Conduct ## Code of Conduct
Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this book, you agree to abide by its terms. Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
By contributing to this book, you agree to abide by its terms.

Binary file not shown.

Before (size: 155 KiB)

Binary file not shown.

BIN
diagrams/transform.png Normal file

Binary file not shown.

After (size: 83 KiB)

@ -1,4 +1,4 @@
# Logicals and numbers {#logicals} # Logical vectors {#logicals}
```{r, results = "asis", echo = FALSE} ```{r, results = "asis", echo = FALSE}
status("drafting") status("drafting")
@ -8,29 +8,50 @@ status("drafting")
In this chapter, you'll learn useful tools for working with logical vectors. In this chapter, you'll learn useful tools for working with logical vectors.
Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`. Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`.
Despite that simplicity, they're an extremely powerful tool. You'll find logical vectors directly in data relatively rarely, but despite that they're extremely powerful because you'll frequently create them during data analysis.
We'll begin with the most common way of creating logical vectors: numeric comparisons.
Then we'll talk about using Boolean algebra to combine different logical vectors, and some useful summaries for logical vectors.
We'll finish off with some other tools for making conditional changes.
### Prerequisites ### Prerequisites
```{r, message = FALSE} In this chapter, we'll continue to draw inspiration from the nycflights13 dataset.
```{r setup, message = FALSE}
library(tidyverse) library(tidyverse)
library(nycflights13) library(nycflights13)
``` ```
But as we start to discuss more tools, there won't always be a perfect example.
So from this chapter on we'll start to use more abstract examples where we create a vector with `c()`, and then manipulate it in various ways.
This will make it easier to explain the general point without having to construct a full example.
It does make it a little harder to apply directly to your data problems, but remember that you can do these same manipulations with a vector inside a data frame using `mutate()` and friends.
```{r}
x <- c(1, 2, 3, 5, 7, 11, 13)
x * 2
df <- tibble(
x = c(1, 2, 3, 5, 7, 11, 13)
)
df |>
mutate(y = x * 2)
```
## Comparisons ## Comparisons
Some times you'll get data that already includes logical vectors but in most cases you'll create them by using a comparison, like `<`, `<=`, `>`, `>=`, `!=`, and `==`. A very common way to create a logical vector is via a numeric comparison with `<`, `<=`, `>`, `>=`, `!=`, and `==`.
You'll learn other ways to create them in later chapters dealing with strings and dates.
### In `mutate()` So far, we've mostly created logical variables implicitly within `filter()`; they are computed, used, and then thrown away.
For example, the following filter finds all daytime departures that arrive roughly on time:
So far, you've mostly created these new variables implicitly within `filter()`:
```{r} ```{r}
flights |> flights |>
filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20) filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
``` ```
But it's useful to know that this is a shortcut and you can explicitly create perform these operations inside a `mutate()` But it's useful to know that this is a shortcut and you can explicitly create the underlying logical variables with `mutate()`:
```{r} ```{r}
flights |> flights |>
@ -41,9 +62,12 @@ flights |>
) )
``` ```
So the filter above could also be written as: This is useful because it allows you to name components, which can make the code easier to read, and it allows you to double-check the intermediate steps.
This is a particularly useful technique when you're doing more complicated Boolean algebra, as you'll learn about in the next section.
```{r} So the initial filter could also be written as:
```{r, results = FALSE}
flights |> flights |>
mutate( mutate(
daytime = dep_time > 600 & dep_time < 2000, daytime = dep_time > 600 & dep_time < 2000,
@ -52,57 +76,123 @@ flights |>
filter(daytime & approx_ontime) filter(daytime & approx_ontime)
``` ```
This is an important technique when you're are doing complicated subsetting because it allows you to double-check the intermediate steps.
### Floating point comparison ### Floating point comparison
Beware when using `==` with numbers as results might surprise you! Beware when using `==` with numbers as results might surprise you!
You might think that the following two computations yield 1 and 2: It looks like this vector contains the numbers 1 and 2:
```{r} ```{r}
(1 / 49 * 49) x <- c(1 / 49 * 49, sqrt(2) ^ 2)
sqrt(2) ^ 2 x
``` ```
But if you test them for equality, you'll discover that they're not what you expect! But if you test them for equality, you surprisingly get `FALSE`:
```{r} ```{r}
(1 / 49 * 49) == 1 x == c(1, 2)
(sqrt(2) ^ 2) == 2
``` ```
That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so in most cases, the number you see is actually an approximation. That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so in most cases, the number you see is actually an approximation.
R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits. R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits.
You can use the `digits` argument to `format()` to force R to display more:
To see the details you can call `print()` with the `digits`[^logicals-1] argument.
R normally calls print automatically for you (i.e. `x` is a shortcut for `print(x)`), but calling it explicitly is useful if you want to provide other arguments:
[^logicals-1]: A floating point number can hold roughly 16 decimal digits; the precise number is surprisingly complicated and depends on the number.
```{r} ```{r}
format(1 / 49 * 49, digits = 20) print(x, digits = 16)
format(sqrt(2) ^ 2, digits = 20)
``` ```
Instead of relying on `==`, you can use `dplyr::near()`, which does the comparison with a small amount of tolerance: Now that you've seen why `==` is failing, what can you do about it?
One option is to use `round()` to round to any number of digits, or instead of `==`, use `dplyr::near()`, which does the comparison with a small amount of tolerance:
```{r} ```{r}
near(sqrt(2) ^ 2, 2) near(x, c(1, 2))
near(1 / 49 * 49, 1)
``` ```
### Missing values {#na-comparison}
Missing values represent the unknown, so they are "contagious": almost any operation involving an unknown value will also be unknown:
```{r}
NA > 5
10 == NA
```
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why this is true with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
```
So if you want to find all flights where `dep_time` is missing, the following code won't work because `dep_time == NA` will yield an `NA` for every single row, and `filter()` automatically drops missing values:
```{r}
flights |>
filter(dep_time == NA)
```
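The same problem can be seen on a small vector without any data frame. This is a sketch using made-up departure times (the name `dep` and its values are hypothetical):

```{r}
# Hypothetical departure times, one of them unknown
dep <- c(517, NA, 533)

# Comparing with NA gives NA for every element, never TRUE
dep == NA

# So a filter on this condition has no rows to keep
sum(dep == NA, na.rm = TRUE)
```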
Instead we'll need a new tool: `is.na()`.
### `is.na()` ### `is.na()`
Another common way to create logical vector is with `is.na()`. There's one other very useful way to create logical vectors: `is.na()`.
This is particularly important in conjunction with `filter()` because filter only selects rows where the value is `TRUE`; rows where the value is `FALSE` are automatically dropped. This takes any type of vector and returns `TRUE` if the value is `NA`, and `FALSE` otherwise:
```{r} ```{r}
flights |> filter(is.na(dep_delay) | is.na(arr_delay)) is.na(c(TRUE, NA, FALSE))
flights |> filter(is.na(dep_delay) != is.na(arr_delay)) is.na(c(1, NA, 3))
is.na(c("a", NA, "b"))
``` ```
We can use `is.na()` to find all the rows with a missing `dep_time`:
```{r}
flights |>
filter(is.na(dep_time))
```
It can also be useful in `arrange()`, because by default, `arrange()` puts all the missing values at the end.
You can override this default by first sorting by `is.na()`:
```{r}
flights |>
arrange(arr_delay)
flights |>
arrange(desc(is.na(arr_delay)), arr_delay)
```
### Exercises
1. How does `dplyr::near()` work? Read the source code to find out.
2. Use `mutate()`, `is.na()`, and `count()` together to describe how the missing values in `dep_time`, `sched_dep_time` and `dep_delay` are connected.
## Boolean algebra ## Boolean algebra
Once you have multiple logical vectors, you can combine them together using Boolean algebra: `&` is "and", `|` is "or", and `!` is "not". Once you have multiple logical vectors, you can combine them together using Boolean algebra.
`xor()` provides one final useful operation: exclusive or. In R, `&` is "and", `|` is "or", and `!` is "not", and `xor()` is exclusive or[^logicals-2].
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work. Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.
[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
This is how we usually use "or" in English.
Both is not usually an acceptable answer to the question "would you like ice cream or cake?".
```{r bool-ops} ```{r bool-ops}
#| echo: false #| echo: false
#| out.width: NULL #| out.width: NULL
@ -117,7 +207,7 @@ Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how
#| x but none of y, x is all of x none of y, xor(x, y) is everything #| x but none of y, x is all of x none of y, xor(x, y) is everything
#| except the intersection of x and y, y is all of y none of x, and #| except the intersection of x and y, y is all of y none of x, and
#| x | y is everything. #| x | y is everything.
knitr::include_graphics("diagrams/transform-logical.png") knitr::include_graphics("diagrams/transform.png", dpi = 270)
``` ```
As well as `&` and `|`, R also has `&&` and `||`. As well as `&` and `|`, R also has `&&` and `||`.
@ -143,6 +233,11 @@ Since month is numeric, this is equivalent to `month == 1`, so `flights |> filte
An easy way to avoid this issue is to use `%in%`. An easy way to avoid this issue is to use `%in%`.
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y` . `x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y` .
```{r}
letters[1:10] %in% c("a", "e", "i", "o", "u")
```
So we could instead write: So we could instead write:
```{r, eval = FALSE} ```{r, eval = FALSE}
@ -150,17 +245,14 @@ flights |>
filter(month %in% c(11, 12)) filter(month %in% c(11, 12))
``` ```
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. Note that `%in%` obeys different rules for `NA` than `==`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE} ```{r}
flights |> flights |>
filter(!(arr_delay > 120 | dep_delay > 120)) filter(dep_time %in% c(NA, 0800))
flights |>
filter(arr_delay <= 120 & dep_delay <= 120)
``` ```
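To make the difference between `%in%` and `==` concrete, here is a small base R sketch on a made-up vector (the vector `x` is illustrative, not taken from `flights`):

```{r}
x <- c(1, 2, NA)

# == propagates the unknown: every comparison with NA is NA
x == NA

# %in% treats NA as a value that can be matched, so it always
# returns TRUE or FALSE
x %in% NA
```

This is why `%in% c(NA, 0800)` in the filter above can pick up rows with a missing `dep_time`, while a comparison built from `==` would not.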
### Missing values {#logical-missing} ### Missing values {#na-boolean}
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance: The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
@ -169,10 +261,14 @@ NA & c(TRUE, FALSE, NA)
NA | c(TRUE, FALSE, NA) NA | c(TRUE, FALSE, NA)
``` ```
<!-- Draw truth tables? --> To understand what's going on, think about `NA | TRUE`.
If a logical is `NA`, that means it could either be `TRUE` or `FALSE`.
`TRUE | TRUE` and `FALSE | TRUE` are both `TRUE`, so `NA | TRUE` must also be `TRUE`.
Similar reasoning applies with `NA & FALSE`.
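The reasoning above can be checked by substituting both possible values for the unknown and seeing whether the answers agree. This is a small base R sketch of that idea:

```{r}
# Try both values the unknown could take
c(TRUE, FALSE) | TRUE    # both results agree, so NA | TRUE is known
c(TRUE, FALSE) & FALSE   # both results agree, so NA & FALSE is known
c(TRUE, FALSE) | FALSE   # results differ, so NA | FALSE stays unknown
c(TRUE, FALSE) & TRUE    # results differ, so NA & TRUE stays unknown

# R follows the same logic
NA | TRUE
NA & FALSE
NA | FALSE
NA & TRUE
```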
To understand what's going on you need to think about `x | TRUE`, because regardless of whether `x` is `TRUE` or `FALSE` the result is still `TRUE`. ### Exercises
That means even if you don't know what `x` is (i.e. it's missing), the result must still be `TRUE`.
1. Find all flights where `arr_delay` is missing but `dep_delay` is not. Find all flights where neither `arr_time` nor `sched_arr_time` are missing, but `arr_delay` is.
## Summaries ## Summaries
@ -199,7 +295,6 @@ not_cancelled |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise(prop_delayed = mean(arr_delay > 0)) |> summarise(prop_delayed = mean(arr_delay > 0)) |>
arrange(desc(prop_delayed)) arrange(desc(prop_delayed))
``` ```
Or we could ask how many flights left before 5am, which usually are flights that were delayed from the previous day: Or we could ask how many flights left before 5am, which usually are flights that were delayed from the previous day:
@ -211,6 +306,17 @@ not_cancelled |>
arrange(desc(n_early)) arrange(desc(n_early))
``` ```
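Both summaries above rely on logical values being treated as numbers, with `TRUE` counted as 1 and `FALSE` as 0. A minimal sketch on a made-up vector (the name `delayed` is hypothetical):

```{r}
# Hypothetical flags: was each of four flights delayed?
delayed <- c(TRUE, FALSE, TRUE, TRUE)

# sum() counts the TRUEs
sum(delayed)

# mean() gives the proportion of TRUEs
mean(delayed)
```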
You can also use logical vectors inside summaries:
```{r}
not_cancelled |>
group_by(year, month, day) |>
summarise(
behind = mean(arr_delay[arr_delay > 0]),
ahead = mean(arr_delay[arr_delay < 0]),
)
```
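The summary above subsets a vector with a logical condition before averaging it. Here is a sketch of that subsetting step on its own, using made-up delays in minutes (the vector `delay` is hypothetical):

```{r}
# Hypothetical arrival delays; negative values are early arrivals
delay <- c(-10, 5, 30, -2, 0)

# The logical vector picks out only the elements where it is TRUE
delay[delay > 0]

# Average of the positive delays only
mean(delay[delay > 0])

# Average of the negative delays only
mean(delay[delay < 0])
```

Note that a value of exactly 0 falls into neither group.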
### Exercises ### Exercises
1. For each plane, count the number of flights before the first delay of greater than 1 hour. 1. For each plane, count the number of flights before the first delay of greater than 1 hour.
@ -218,41 +324,11 @@ not_cancelled |>
## Transformations ## Transformations
### Cumulative functions
Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
```{r}
cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
```
These are particularly useful in conjunction with `filter()` because they allow you to select rows:
- Before the first `FALSE` with `cumall(x)`.
- Before the first `TRUE` with `cumall(!x)`.
- After the first `TRUE` with `cumany(x)`.
- After the first `FALSE` with `cumany(!x)`.
If you imagine some data about a bank balance, then these functions allow you t
```{r}
df <- data.frame(
date = as.Date("2020-01-01") + 0:6,
balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows after first overdraft
df |> filter(cumany(balance < 0))
# all rows until first overdraft
df |> filter(cumall(!(balance < 0)))
```
### Conditional outputs ### Conditional outputs
If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-1]. If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-3].
[^logicals-1]: This is equivalent to the base R function `ifelse`. [^logicals-3]: This is equivalent to the base R function `ifelse`.
There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable. There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.
```{r} ```{r}
@ -307,6 +383,36 @@ case_when(
) )
``` ```
## ### Cumulative functions
Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
```{r}
cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
```
These are particularly useful in conjunction with `filter()` because they allow you to select rows:
- Before the first `FALSE` with `cumall(x)`.
- Before the first `TRUE` with `cumall(!x)`.
- After the first `TRUE` with `cumany(x)`.
- After the first `FALSE` with `cumany(!x)`.
If you imagine some data about a bank balance, then these functions allow you t
```{r}
df <- data.frame(
date = as.Date("2020-01-01") + 0:6,
balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows after first overdraft
df |> filter(cumany(balance < 0))
# all rows until first overdraft
df |> filter(cumall(!(balance < 0)))
```
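For intuition, the two cumulative functions can be approximated in base R with `cumsum()`. This is only a sketch that assumes the input has no missing values; it is not claimed to be how dplyr implements `cumany()` and `cumall()`.

```{r}
x <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
y <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Like cumany(x): TRUE once at least one TRUE has been seen
cumsum(x) > 0

# Like cumall(y): TRUE only while no FALSE has been seen yet
cumsum(!y) == 0
```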
###
## ##

@ -22,36 +22,6 @@ Missing topics:
### Missing values {#missing-values-filter} ### Missing values {#missing-values-filter}
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why this is true with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
```
If you want to determine if a value is missing, use `is.na()`: If you want to determine if a value is missing, use `is.na()`:
```{r} ```{r}