We've touched on missing values earlier in the book.
You first saw them in Section \@ref(summarise) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
Now we'll come back to them in more depth, so you can learn more of the details.
We'll start by discussing some general tools for working with missing values that are explicitly recorded as `NA` in your data.
We'll then explore the idea of implicit missing values, values that are simply absent from your data, and show some tools you can use to make them explicit.
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.
Sometimes you'll hit the opposite problem where some value should actually be treated as a missing value.
This typically arises in data generated by older software which doesn't have an explicit way to represent missing values, so it uses some special sentinel value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`.
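As a minimal sketch, assume a hypothetical numeric vector `x` in which `-99` was used as the sentinel for a missing value (both the vector and the sentinel are invented for illustration).
`na_if()` replaces every occurrence of the sentinel with `NA` and leaves the other values alone:

```{r}
library(dplyr)

# Hypothetical data: -99 stands in for "missing"
x <- c(1, 4, 5, 7, -99)

# Replace the sentinel with a real missing value
cleaned <- na_if(x, -99)
cleaned
```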
There's one special type of missing value that you'll encounter from time to time: a `NaN` (pronounced "nan"), or **n**ot **a** **n**umber.
It's not that important because it generally behaves just like `NA`:
```{r}
x <- c(NA, NaN)
x * 10
x == 1
is.na(x)
```
While it's infectious, the NaN'ness isn't always preserved, and this varies from platform to platform and compiler to compiler, so you shouldn't rely on it.
In the rare case you need to distinguish an `NA` from a `NaN`, you can use `is.nan(x)`.
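A short illustration of the difference, using the same kind of two-element vector as above: `is.na()` flags both values, while `is.nan()` flags only the `NaN`.

```{r}
x <- c(NA, NaN)

# Both elements count as missing
both_missing <- is.na(x)
both_missing

# Only the second element is a NaN
only_nan <- is.nan(x)
only_nan
```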
You'll generally encounter a `NaN` when you perform a mathematical operation that doesn't have a well-defined answer:
```{r}
0 / 0
0 * Inf
Inf - Inf
sqrt(-1)
```
## Implicit missing values
So far we've worked with missing values that are **explicitly** missing, i.e. flagged with `NA`.
But missing values can also be **implicitly** missing, if they are simply not present in the data.
Let's illustrate this idea with a simple data set, which records the price of a stock in each quarter.
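One way such a dataset might look is sketched below.
The `stocks` tibble and its values are hypothetical, chosen so that it contains one missing value of each kind.

```{r}
library(tibble)

# Hypothetical stock prices by year and quarter
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks
```

In this sketch the price for the fourth quarter of 2020 is explicitly missing, because its value is `NA`, while the price for the first quarter of 2021 is implicitly missing, because the row simply isn't there.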
Making data longer generally preserves explicit missing values, but you can make them implicit by setting `values_drop_na = TRUE` in `tidyr::pivot_longer()` if they are structural missing values that only exist because the data is not tidy.
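To see the effect of dropping missing values while pivoting, here is a small sketch that assumes a hypothetical wide table, `wide`, with one column per quarter.
It uses the `values_drop_na` argument of `tidyr::pivot_longer()`:

```{r}
library(tidyr)

# Hypothetical wide data: the NA in q1 exists only because of the layout
wide <- tibble::tibble(
  year = c(2020, 2021),
  q1   = c(1.88, NA),
  q2   = c(0.59, 0.92)
)

# Default: the explicit NA is kept as its own row
kept <- pivot_longer(wide, cols = c(q1, q2), names_to = "qtr", values_to = "price")

# values_drop_na = TRUE: the NA row is dropped, so it becomes implicit
dropped <- pivot_longer(
  wide,
  cols = c(q1, q2),
  names_to = "qtr",
  values_to = "price",
  values_drop_na = TRUE
)
nrow(kept)
nrow(dropped)
```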
`tidyr::complete()` allows you to generate explicit missing values in tidy data by providing a set of variables that generates all rows that should exist:
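The sketch below assumes a hypothetical `stocks` tibble with columns `year`, `qtr`, and `price`, in which the row for the first quarter of 2021 is absent.

```{r}
library(tidyr)

# Hypothetical data with one explicit NA and one absent row (2021, qtr 1)
stocks <- tibble::tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Generate every combination of year and qtr found in the data;
# combinations with no matching row get price = NA
completed <- complete(stocks, year, qtr)
completed
```

The result has one row for each of the eight year and quarter combinations, and the previously absent row now appears with an explicit `NA` price.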
If the range is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.
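As a rough sketch, suppose a hypothetical yearly table, `counts`, has no row for 2019.
`full_seq()` fills in the gap in the sequence, and it can be used inside `complete()` to supply the full set of values for a variable:

```{r}
library(tidyr)

# Hypothetical data: 2019 is absent
counts <- tibble::tibble(
  year = c(2018, 2020, 2021),
  n    = c(5, 3, 8)
)

# All years from min(year) to max(year), spaced 1 apart
all_years <- full_seq(counts$year, 1)
all_years

# Use the full sequence when completing the data
filled <- complete(counts, year = full_seq(year, 1))
filled
```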
In some cases, it won't be possible to generate the correct grid of all possible values.
In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist, then combine it with your original dataset with `dplyr::full_join()`.
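The manual approach might look like the following sketch.
Both `stocks` and `expected` are hypothetical: `expected` lists, by hand, every year and quarter that should be present.

```{r}
library(dplyr)

# Hypothetical observed data: (2021, qtr 1) is absent
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Hand-built table of all the rows that should exist
expected <- tibble(
  year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021),
  qtr  = c(   1,    2,    3,    4,    1,    2,    3,    4)
)

# Rows in `expected` without a match in `stocks` get price = NA
joined <- full_join(expected, stocks, by = c("year", "qtr"))
joined
```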
If you're worried about a join failing to reveal the lack of a match, and you have dplyr 1.1.0 or newer, you can use the new `unmatched = "error"` argument to tell joins to error if they find any unmatched rows that would otherwise be dropped.
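Here is a minimal sketch of that safeguard, assuming dplyr 1.1.0 or newer and two hypothetical tables, `prices` and `labels`.
The error is caught with `tryCatch()` only so that the example runs to completion; the replacement message is invented for illustration and is not dplyr's own error text.

```{r}
library(dplyr)

# Hypothetical tables: id 3 has no match in `labels`
prices <- tibble(id = c(1, 2, 3), price = c(10, 20, 30))
labels <- tibble(id = c(1, 2), label = c("a", "b"))

# An inner join would silently drop id 3; unmatched = "error" makes it fail instead
result <- tryCatch(
  inner_join(prices, labels, by = "id", unmatched = "error"),
  error = function(e) "join failed: unmatched rows"
)
result
```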
We get some interesting results here because the summary functions are applied to zero-length vectors.
These are different to vectors containing missing values:
```{r}
x1 <- c(NA, NA)
length(x1)
x2 <- numeric()
length(x2)
```
Summary functions will work with zero-length vectors, but they may return results that are surprising at first glance.
Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age)/length(age)`, which here is 0/0.
`max()` and `min()` return `-Inf` and `Inf` for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you'll get the minimum or maximum of the new data.
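The following sketch shows this behaviour on a bare zero-length vector.
The vector `new_data` is hypothetical, and `suppressWarnings()` is used because `max()` and `min()` warn when given no non-missing arguments.

```{r}
empty <- numeric()

# 0 / 0, so the mean is NaN
empty_mean <- mean(empty)
empty_mean

# -Inf and Inf: the identity values for max and min
empty_max <- suppressWarnings(max(empty))
empty_min <- suppressWarnings(min(empty))

# Combining with new data and recomputing gives the max and min of the new data
new_data <- c(3, 7, 5)
combined_max <- max(c(empty_max, new_data))
combined_min <- min(c(empty_min, new_data))
combined_max
combined_min
```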