You'll find logical vectors directly in data relatively rarely, but despite that they're extremely powerful because you'll frequently create them during data analysis.
We'll begin with the most common way of creating logical vectors: numeric comparisons.
Then we'll talk about using Boolean algebra to combine different logical vectors, and some useful summaries for logical vectors.
This is useful because it allows you to name components, which can make the code easier to read, and it allows you to double-check the intermediate steps.
This is a particularly useful technique when you're doing more complicated Boolean algebra, as you'll learn about in the next section.
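As a minimal sketch of the idea, assuming the technique in question is storing each comparison as its own named column with `mutate()` before filtering, here is a small made-up table (`toy_flights`, its values, and the column names `daytime` and `approx_ontime` are hypothetical stand-ins, not the real flights data):

```{r}
library(dplyr)

# Hypothetical toy data: departure time (HHMM) and arrival delay in minutes.
toy_flights <- tibble(
  dep_time = c(601, 1359, 2215, 745),
  arr_delay = c(5, 35, 12, -4)
)

# Give each logical condition a name so it can be inspected on its own.
checked <- toy_flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  )
checked

# The filter now reads almost like a sentence.
kept <- checked |> filter(daytime & approx_ontime)
kept
```

Printing `checked` before filtering is the double-check: you can see which condition failed for each row that gets dropped.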
That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!), so in most cases the number you see is actually an approximation.
R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits.
To see the details you can call `print()` with the `digits`[^logicals-1] argument.
R normally calls `print()` automatically for you (i.e. `x` is a shortcut for `print(x)`), but calling it explicitly is useful if you want to provide other arguments:
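A small illustration, using two values chosen for this sketch because they look like whole numbers when printed with the defaults:

```{r}
# Two numbers that look like 1 and 2 when printed with default settings.
x <- c(1 / 49 * 49, sqrt(2)^2)
x

# Yet neither compares equal to the value it appears to be.
x == c(1, 2)

# Asking for 16 significant digits reveals the tiny discrepancies.
print(x, digits = 16)
```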
[^logicals-1]: A floating point number can hold roughly 16 decimal digits; the precise number is surprisingly complicated and depends on the number.
Now that you've seen why `==` is failing, what can you do about it?
One option is to use `round()` to round both sides to a fixed number of digits; another is to replace `==` with `dplyr::near()`, which does the comparison with a small amount of tolerance:
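Both options can be sketched on the same kind of almost-whole numbers (the vector `x` and the choice of 10 digits for `round()` are illustrative assumptions, not recommendations):

```{r}
library(dplyr)

# Values that are mathematically 1 and 2 but are stored with a tiny error.
x <- c(1 / 49 * 49, sqrt(2)^2)

# Exact comparison fails for both elements.
x == c(1, 2)

# near() compares within a small tolerance instead.
near(x, c(1, 2))

# Rounding away the noise before comparing also works here.
round(x, 10) == c(1, 2)
```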
Missing values represent the unknown, so they are "contagious": almost any operation involving an unknown value will also be unknown:
```{r}
NA > 5
10 == NA
```
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why this is true with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
```
So if you want to find all flights where `dep_time` is missing, the following code won't work because `dep_time == NA` will yield `NA` for every single row, and `filter()` automatically drops missing values:
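A sketch of the problem, and of one way around it with `is.na()`, on a small made-up table (`toy_flights` and its values are hypothetical stand-ins for the real flights data):

```{r}
library(dplyr)

# Hypothetical toy data: two of the four flights have no departure time.
toy_flights <- tibble(
  flight = c(1, 2, 3, 4),
  dep_time = c(517, NA, 533, NA)
)

# Returns zero rows: dep_time == NA is NA for every row, and filter() drops NAs.
with_equals <- toy_flights |> filter(dep_time == NA)
with_equals

# is.na() returns TRUE or FALSE for every row, so the two missing rows are kept.
with_is_na <- toy_flights |> filter(is.na(dep_time))
with_is_na
```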
1. Find all flights where `arr_delay` is missing but `dep_delay` is not. Find all flights where neither `arr_time` nor `sched_arr_time` is missing, but `arr_delay` is.
2. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
3. How could you use `arrange()` to sort all missing values to the start? (Hint: use `!is.na()`).
4. Come up with another approach that will give you the same output as `not_cancelled |> count(dest)` and `not_cancelled |> count(tailnum, wt = distance)` (without using `count()`).
5. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.
`any()` and `all()` --- `any()` will return `TRUE` if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`.
Like other summary functions, they'll return `NA` when missing values leave the answer undetermined (for example, `any(c(FALSE, NA))`), and as usual you can make the missing values go away with `na.rm = TRUE`.
We could use this to see if there were any days where every flight was delayed:
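Here is a minimal sketch of that kind of per-day check on made-up data (`toy_delays`, its values, and the output column names are hypothetical; "delayed" is taken here to mean `dep_delay > 0`):

```{r}
library(dplyr)

# Hypothetical toy data: departure delays in minutes over three days.
toy_delays <- tibble(
  day = c(1, 1, 1, 2, 2, 3, 3),
  dep_delay = c(5, -2, NA, 10, 45, -3, -1)
)

# One row per day: was every flight delayed, and was any flight delayed?
daily <- toy_delays |>
  group_by(day) |>
  summarise(
    all_delayed = all(dep_delay > 0, na.rm = TRUE),
    any_delayed = any(dep_delay > 0, na.rm = TRUE)
  )
daily
```

In this toy table only day 2 has every flight delayed, and day 3 has none.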
`sum()` and `mean()` are particularly useful with logical vectors because when you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
That means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s.
That lets us find the day with the highest proportion of delayed flights:
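A sketch of that calculation on made-up data (`toy_delays`, its values, and the output column names are hypothetical; "delayed" is again taken to mean `dep_delay > 0`):

```{r}
library(dplyr)

# Hypothetical toy data: departure delays in minutes over three days.
toy_delays <- tibble(
  day = c(1, 1, 1, 1, 2, 2, 3, 3),
  dep_delay = c(5, -2, 70, NA, 10, 45, -3, -1)
)

# sum() counts the TRUEs and mean() gives their proportion within each day.
prop_by_day <- toy_delays |>
  group_by(day) |>
  summarise(
    n_delayed = sum(dep_delay > 0, na.rm = TRUE),
    prop_delayed = mean(dep_delay > 0, na.rm = TRUE)
  ) |>
  arrange(desc(prop_delayed))
prop_by_day
```

Note that with `na.rm = TRUE` the proportion is computed over the non-missing flights only, so the missing delay on day 1 does not count in the denominator.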
There's another useful way to use logical vectors with summaries: to reduce variables to a subset of interest.
This makes use of the base `[` (pronounced subset) operator.
You'll learn more about this in Section \@ref(vector-subsetting), but this usage works in a similar way to a `filter()` except that instead of applying to an entire data frame it applies to a single variable.
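For instance, a sketch that averages early and late arrivals separately within each day (`toy_delays`, its values, and the output names `ahead` and `behind` are hypothetical):

```{r}
library(dplyr)

# Hypothetical toy data: arrival delays in minutes over two days.
toy_delays <- tibble(
  day = c(1, 1, 1, 1, 2, 2, 2),
  arr_delay = c(-10, 20, 40, NA, -5, -15, 30)
)

# Subset arr_delay with a logical vector inside each summary,
# while n() still counts every row in the group.
by_day <- toy_delays |>
  group_by(day) |>
  summarise(
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    n = n()
  )
by_day
```

Because the subsetting happens per variable, the group sizes in `n` are unaffected, which would not be the case if the rows had been removed with `filter()` first.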
1. For each plane, count the number of flights before the first delay of greater than 1 hour.
2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
One of the most powerful features of logical vectors is their use for conditional transformations, i.e. returning one value for true values, and a different value for false values.
We'll see a couple of different ways to do this, and the
There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.
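Both advantages can be seen in a short sketch (the vector `x` and the labels are made up for illustration; the exact wording of the error depends on the installed dplyr version, so it is not reproduced here):

```{r}
library(dplyr)

x <- c(-3, 0, 4, NA)

# The `missing` argument decides what happens where the condition is NA.
sign_label <- if_else(x >= 0, "non-negative", "negative", missing = "unknown")
sign_label

# Base ifelse() silently coerces the number 0 to the string "0".
loose <- ifelse(x >= 0, "yes", 0)
loose

# if_else() refuses to mix a string with a number and raises an error instead.
strict <- tryCatch(
  if_else(x >= 0, "yes", 0),
  error = function(e) "if_else() raised an error"
)
strict
```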
Before we move on to the next chapter, I want to show you a grab bag of tricks that make use of cumulative functions (i.e. functions that depend on every previous value of a vector in some way).
These all feel a bit magical, and I'm torn on whether or not they should be included in this book.
But in the end, some of them are just so useful I think it's important to mention them --- they don't help with that many problems, but when they do, they provide a substantial advantage.
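As a small sketch of what "cumulative" means here, consider a made-up list of account transactions (the data and variable names are hypothetical; `cumsum()` is base R, while `cumany()` and `cumall()` come from dplyr):

```{r}
library(dplyr)

# Hypothetical deposits (positive) and withdrawals (negative).
transactions <- c(100, -60, -50, 30, -10)

# Running balance: each element is the sum of everything up to that point.
balance <- cumsum(transactions)
balance

# Is the account overdrawn at each step?
overdrawn <- balance < 0

# Has the account been overdrawn at this step or any earlier one?
ever_overdrawn <- cumany(overdrawn)
ever_overdrawn

# Has the account stayed in credit at every step so far?
always_in_credit <- cumall(!overdrawn)
always_in_credit
```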
<!-- TODO: illustration of accumulating function -->