# Missing values {#missing-values}

## Introduction

## Basics

### Missing values {#missing-values-filter}
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
`NA` represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also yield an unknown value.
```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```
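
The same propagation applies element by element in vectors, and to summaries computed from them. A minimal sketch, where `ages` is a hypothetical vector made up for illustration:

```{r}
# Hypothetical data: three ages, one of them unknown.
ages <- c(21, NA, 35)

# Element-wise operations give NA only where the input is NA.
ages + 1
ages > 30

# A summary that depends on every element is unknown as a whole.
sum(ages)
mean(ages)
```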
The most confusing result is this one:
```{r}
NA == NA
```
It's easiest to understand why the result is `NA` with a bit more context:
```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
# We don't know!
```
If you want to determine if a value is missing, use `is.na()`:
```{r}
is.na(x)
```
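
`is.na()` is vectorised, so it also works on whole vectors, which is where `== NA` tends to trip people up. A small sketch, using a made-up vector `heights`:

```{r}
# Hypothetical data with two missing entries.
heights <- c(1.62, NA, 1.80, NA)

# Comparing with == returns NA for every element,
# so it cannot be used to find the missing ones.
heights == NA

# is.na() returns TRUE exactly where a value is missing.
is.na(heights)

# TRUE counts as 1, so summing the result counts the missing values.
sum(is.na(heights))
```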
## dplyr verbs
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:
```{r}
library(dplyr) # provides filter(), arrange(), and tibble()

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
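
One way to see why the first call drops the `NA` row is to look at the logical vector each condition produces. The sketch below uses a plain vector rather than `filter()` itself, and the name `vals` is only illustrative:

```{r}
vals <- c(1, NA, 3)

# The comparison is unknown for the missing element.
vals > 1

# Asking for missing values explicitly turns that NA into TRUE,
# because `TRUE | NA` is TRUE.
is.na(vals) | vals > 1

# which() reports only the positions that are TRUE,
# mirroring which rows would be kept.
which(vals > 1)
which(is.na(vals) | vals > 1)
```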
`arrange()` always sorts missing values to the end, even when you use `desc()`:
```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```
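
If you want missing values first instead, one possible approach is to sort on `is.na()` before the column itself. This is a sketch rather than the only way to do it, and `na_first` is just an illustrative name:

```{r}
library(dplyr)

df <- tibble(x = c(5, 2, NA))

# desc(is.na(x)) puts rows where x is missing (TRUE) ahead of the rest;
# x then orders the remaining rows in ascending order.
na_first <- arrange(df, desc(is.na(x)), x)
na_first
```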
## Exercises
1. Why is `NA ^ 0` not missing?
   Why is `NA | TRUE` not missing?
   Why is `FALSE & NA` not missing?
   Can you figure out the general rule?
   (`NA * 0` is a tricky counterexample!)