r4ds/missing-values.Rmd

# Missing values {#missing-values}

```{r, results = "asis", echo = FALSE}
status("restructuring")
```

## Introduction

We've touched on missing values in earlier in the the book.
You first saw them in Section \@ref(summarise) where they interfered with computing summary statistics, and you learned about their their infectious nature and how to check for their presence in Section \@ref(na-comparison).
Now we'll come back to them in more depth, so you can learn more of the details.

We'll start by discussing some general tools for working with missing values that are explicitly recorded as `NA` in your data.
We'll explore the idea of implicit missing values, values are that are simply absent from your data, and show some tools you can use to make them explicit.
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.

### Prerequisites

Most of the functions for working with missing values live in tidyr, but some are also in dplyr.
So we'll load the whole tidyverse ando t

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Explicit missing values

To begin, let's

### Last observation carried forward

Another place that missing values arise is as a data entry convenience.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

```{r}
treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         9,
  "Katherine Burke",  1,         4
)
```

You can fill in these missing values with `tidyr::fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

```{r}
treatment |>
  fill(person)
```

You can use the `direction` argument to fill in missing values that have been generated in more exotic ways.

### Fixed values

Some times missing values represent some fixed known value, mostly commonly 0.
You can use `dplyr::coalesce()` to replace.

```{r}
x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
```

You could apply this to every numeric column in a data frame with:

```{r, eval = FALSE}
df |> mutate(across(where(is.numeric), coalesce, 0))
```

### Sentinel values

Sometimes you'll hit the opposite problem because some older software doesn't have an explicit way to represent missing values, so it might be recorded using some special sentinel value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the `na` argument to `read::read_csv()`.
If you discover later, or from a data source that doesn't provide a way to handle on read, you can use `na_if()`

```{r}
x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
```

You could apply this to every numeric column in a data frame with:

```{r, eval = FALSE}
df |> mutate(across(where(is.numeric), na_if, -99))
```

### NaN

There's one special type of missing value that you'll encounter from time-to-time, a `NaN` (pronounced "nan"), or **n**ot **a** **n**umber.
It's not that important because it generally behaves just like `NA`:

```{r}
x <- c(NA, NaN)
x * 10
x == 1
is.na(x)
```

While it's infectious, the NaN'ness isn't always preserved, and this varies from platform to platform and compiler to compiler, so you shouldn't rely on it.
In the rare case you need to distinguish an `NA` from a `NaN`, you can use `is.nan(x)`.

You'll generally encounter a `NaN` when you perform a mathematical operation that don't have a well defined answer:

```{r}
0 / 0 
0 * Inf
Inf - Inf
sqrt(-1)
```

## Implicit missing values

So far we've worked with missing values that are **explicitly** missing, i.e. flagged with `NA`.
But missing values can also be **implicitly** missing, if they are simply not present in the data.
Let's illustrate this idea with a simple data set, which records the price of a stock in each quarter.

```{r}
stocks <- tibble(
  year  = c(2022, 2022, 2022, 2022, 2023, 2023, 2023),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset:

-   The return for the fourth quarter of 2022 is explicitly missing, because the cell where its value should be instead contains `NA`.

-   The return for the first quarter of 2023 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan:

> An explicit missing value is the presence of an absence.\
>
> An implicit missing value is the absence of a presence.

It's often useful to make implicit missings explicit so you have something physical that you can work with.
In other cases, explicit missings are forced upon you by the structure of the data.
The following sections discuss some tools for moving between implicit and explict.

### Pivoting

You've already learned about one tool that can make implicit missings explicit and vice versa: pivoting.
Making data wider can make implicit missing values become explicit.
For example, if we pivot `stocks` to put the `year` in the columns pivoting, we can make both missing values explicit:

```{r}
stocks |>
  pivot_wider(
    names_from = year, 
    values_from = price
  )
```

Making data longer generally preserves explicit missing values, but you can make them implicit by setting `drop_na` if they are structural missing values that only exist because the data is not tidy.
See the examples in Chapter \@ref(tidy-data) for more details.

### Complete

`tidyr::complete()` allows you to generate explicit missing values in tidy data by providing a set of variables that generates all rows that should exist:

```{r}
stocks |>
  complete(year, qtr)
```

Typically, you'll call `complete()` with the names of variables that already existing, just filling in missing combinations.
However, sometimes the individual variables are themselves incomplete, so you can also provide your own data.
For example, you might know that this dataset is supposed to run from 2021 to 2023, so you could explicitly supply those values for `year`:

```{r}
stocks |>
  complete(year = 2021:2023, qtr)
```

If the range is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.

In some cases, it won't be possible to generate the correct grid of all possible values.
In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist, then combine it with your original dataset with `dplyr::full_join()`.

### Joins

This brings us to another important way of revealing implicitly missing observations: joins.
Often you can only know that values are missing from one dataset when you go to join it to another dataset.
`dplyr::anti_join()` is particularly useful here.
The following example shows how two `anti_join()`s reveals that we're missing information for four airports and 722 planes.

```{r}
flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)

flights |> 
  distinct(tailnum) |> 
  anti_join(planes)
```

If you're worried about a join failing to reveal the lack of a match, and you have dplyr 1.1.0 or newer, you can use the new `unmatched = "error"` argument to tell joins to error if they find any missing values.

### Exercises

1.  Can you find any relationship between the carrier and the missing planes?

## Factors and empty groups

Another sort of missing value arises with factors.
For example, imagine we have a dataset that contains some health information about people:

```{r}
health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34L, 88L, 75L, 47L, 56L),
)
```

And we want to count the number of smokers:

```{r}
health |> count(smoker)
```

This dataset only contains non-smokers, but we know that smokers exist.
We can request to keep all the value, even if not seen in the data with `.drop = FALSE`:

```{r}
health |> count(smoker, .drop = FALSE)
```

Similarly, ggplot2's discrete axes will also drop levels that don't have any values.
You can force them to display with by supplying `drop = FALSE` to the appropriate discrete axis:

```{r}
#| fig.align: default
#| out.width: "50%"
#| fig.width: 3
#| fig.alt: 
#| - >
#|   A bar chart with a single value on the x-axis, "no".
#| - >
#|   The same bar chart as the last plot, but now with two values on
#|   the x-axis, "yes" and "no". There is no bar for the "yes" category.
ggplot(health, aes(smoker)) +
  geom_bar() +
  scale_x_discrete()

ggplot(health, aes(smoker)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```

`.drop = TRUE` also works with `group_by()`:

```{r}
health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
```

We get some interesting results here because the summary functions are applied to zero-length vectors.
These are different to vectors containing missing values;

```{r}
x1 <- c(NA, NA)
length(x1)

x2 <- numeric()
length(x2)
```

Summary functions will work with zero-length vectors, but they may return results that are surprising at first glance.
Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age)/length(age)` which here is 0/0.
`max()` and `min()` return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get min or max of the new data.

A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.

```{r}
health |> 
  group_by(smoker) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  ) |> 
  complete(smoker)
```

The main drawback of this approach is that you get an `NA` for the count, even though you know that's zero.
Data transformation (#940) * Minor edit + link to style guide * Fix reference * If you don't know order of operations, not clear * Alt text + minor edits * Add median and fix reference * Move up mult groups up to discuss summarise msg * Go over grouping again * Part rename * Chapter rename * Clean up section labels to avoid dups * Update comment * Switch part order * Move columnwise to transform 2021-03-29 21:58:27 +08:00			`# Missing values {#missing-values}`
Second crack and 2e structure 2021-03-04 01:13:14 +08:00
Add chapter status 2021-05-04 21:10:39 +08:00			```{r, results = "asis", echo = FALSE}
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`status("restructuring")`
Add chapter status 2021-05-04 21:10:39 +08:00			```

Second crack and 2e structure 2021-03-04 01:13:14 +08:00			`## Introduction`
Break up data-transform content 2021-04-19 20:56:29 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`We've touched on missing values in earlier in the the book.`
			`You first saw them in Section \@ref(summarise) where they interfered with computing summary statistics, and you learned about their their infectious nature and how to check for their presence in Section \@ref(na-comparison).`
			`Now we'll come back to them in more depth, so you can learn more of the details.`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			We'll start by discussing some general tools for working with missing values that are explicitly recorded as `NA` in your data.
			`We'll explore the idea of implicit missing values, values are that are simply absent from your data, and show some tools you can use to make them explicit.`
			`We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
A little thinking about missing values 2022-03-31 21:10:52 +08:00			`### Prerequisites`
Continuing to rewrite data-transform 2021-04-21 21:25:39 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`Most of the functions for working with missing values live in tidyr, but some are also in dplyr.`
			`So we'll load the whole tidyverse ando t`

A little thinking about missing values 2022-03-31 21:10:52 +08:00			```{r setup, message = FALSE}
			`library(tidyverse)`
			`library(nycflights13)`
			```
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`## Explicit missing values`

			`To begin, let's`

			`### Last observation carried forward`

			`Another place that missing values arise is as a data entry convenience.`
			`Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			```{r}
			`treatment <- tribble(`
			`~person, ~treatment, ~response,`
			`"Derrick Whitmore", 1, 7,`
			`NA, 2, 10,`
			`NA, 3, 9,`
			`"Katherine Burke", 1, 4`
			`)`
			```

			You can fill in these missing values with `tidyr::fill()`.
			`It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).`

			```{r}
			`treatment \|>`
			`fill(person)`
			```

			You can use the `direction` argument to fill in missing values that have been generated in more exotic ways.

			`### Fixed values`

			`Some times missing values represent some fixed known value, mostly commonly 0.`
			You can use `dplyr::coalesce()` to replace.

			```{r}
			`x <- c(1, 4, 5, 7, NA)`
			`coalesce(x, 0)`
			```

			`You could apply this to every numeric column in a data frame with:`

			```{r, eval = FALSE}
			`df \|> mutate(across(where(is.numeric), coalesce, 0))`
			```

			`### Sentinel values`

			`Sometimes you'll hit the opposite problem because some older software doesn't have an explicit way to represent missing values, so it might be recorded using some special sentinel value like 99 or -999.`
			If possible, handle this when reading in the data, for example, by using the `na` argument to `read::read_csv()`.
			If you discover later, or from a data source that doesn't provide a way to handle on read, you can use `na_if()`

			```{r}
			`x <- c(1, 4, 5, 7, -99)`
			`na_if(x, -99)`
			```

			`You could apply this to every numeric column in a data frame with:`

			```{r, eval = FALSE}
			`df \|> mutate(across(where(is.numeric), na_if, -99))`
			```

			`### NaN`

			There's one special type of missing value that you'll encounter from time-to-time, a `NaN` (pronounced "nan"), or not a number.
			It's not that important because it generally behaves just like `NA`:

			```{r}
			`x <- c(NA, NaN)`
			`x * 10`
			`x == 1`
			`is.na(x)`
			```

			`While it's infectious, the NaN'ness isn't always preserved, and this varies from platform to platform and compiler to compiler, so you shouldn't rely on it.`
			In the rare case you need to distinguish an `NA` from a `NaN`, you can use `is.nan(x)`.

			You'll generally encounter a `NaN` when you perform a mathematical operation that don't have a well defined answer:

			```{r}
			`0 / 0`
			`0 * Inf`
			`Inf - Inf`
			`sqrt(-1)`
			```

			`## Implicit missing values`

			So far we've worked with missing values that are explicitly missing, i.e. flagged with `NA`.
			`But missing values can also be implicitly missing, if they are simply not present in the data.`
			`Let's illustrate this idea with a simple data set, which records the price of a stock in each quarter.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
			```{r}
			`stocks <- tibble(`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`year = c(2022, 2022, 2022, 2022, 2023, 2023, 2023),`
			`qtr = c( 1, 2, 3, 4, 2, 3, 4),`
			`price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)`
Pull content out of tidying 2021-04-19 20:59:07 +08:00			`)`
			```

			`There are two missing values in this dataset:`

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			- The return for the fourth quarter of 2022 is explicitly missing, because the cell where its value should be instead contains `NA`.

			`- The return for the first quarter of 2023 is implicitly missing, because it simply does not appear in the dataset.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`One way to think about the difference is with this Zen-like koan:`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`> An explicit missing value is the presence of an absence.\`
Finish up with factors Fixes #728 2022-05-04 21:41:40 +08:00			`>`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`> An implicit missing value is the absence of a presence.`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`It's often useful to make implicit missings explicit so you have something physical that you can work with.`
			`In other cases, explicit missings are forced upon you by the structure of the data.`
			`The following sections discuss some tools for moving between implicit and explict.`

Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`### Pivoting`
A little thinking about missing values 2022-03-31 21:10:52 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`You've already learned about one tool that can make implicit missings explicit and vice versa: pivoting.`
			`Making data wider can make implicit missing values become explicit.`
			For example, if we pivot `stocks` to put the `year` in the columns pivoting, we can make both missing values explicit:
A little thinking about missing values 2022-03-31 21:10:52 +08:00
			```{r}
			`stocks \|>`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`pivot_wider(`
			`names_from = year,`
			`values_from = price`
			`)`
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			Making data longer generally preserves explicit missing values, but you can make them implicit by setting `drop_na` if they are structural missing values that only exist because the data is not tidy.
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`See the examples in Chapter \@ref(tidy-data) for more details.`

			`### Complete`

Hack missing values into shape 2022-05-05 00:57:28 +08:00			`tidyr::complete()` allows you to generate explicit missing values in tidy data by providing a set of variables that generates all rows that should exist:
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00
			```{r}
			`stocks \|>`
			`complete(year, qtr)`
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			Typically, you'll call `complete()` with the names of variables that already existing, just filling in missing combinations.
			`However, sometimes the individual variables are themselves incomplete, so you can also provide your own data.`
			For example, you might know that this dataset is supposed to run from 2021 to 2023, so you could explicitly supply those values for `year`:
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00
A little thinking about missing values 2022-03-31 21:10:52 +08:00			```{r}
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`stocks \|>`
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`complete(year = 2021:2023, qtr)`
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			If the range is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.
A little thinking about missing values 2022-03-31 21:10:52 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`In some cases, it won't be possible to generate the correct grid of all possible values.`
			In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist, then combine it with your original dataset with `dplyr::full_join()`.
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00
			`### Joins`

Hack missing values into shape 2022-05-05 00:57:28 +08:00			`This brings us to another important way of revealing implicitly missing observations: joins.`
			`Often you can only know that values are missing from one dataset when you go to join it to another dataset.`
			`dplyr::anti_join()` is particularly useful here.
			The following example shows how two `anti_join()`s reveals that we're missing information for four airports and 722 planes.
A little thinking about missing values 2022-03-31 21:10:52 +08:00
			```{r}
			`flights \|>`
			`distinct(faa = dest) \|>`
			`anti_join(airports)`

			`flights \|>`
			`distinct(tailnum) \|>`
			`anti_join(planes)`
			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			If you're worried about a join failing to reveal the lack of a match, and you have dplyr 1.1.0 or newer, you can use the new `unmatched = "error"` argument to tell joins to error if they find any missing values.
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`### Exercises`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`1. Can you find any relationship between the carrier and the missing planes?`
Pull content out of tidying 2021-04-19 20:59:07 +08:00
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`## Factors and empty groups`
Polishing data transformation 2022-02-16 01:59:19 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			`Another sort of missing value arises with factors.`
			`For example, imagine we have a dataset that contains some health information about people:`

			```{r}
			`health <- tibble(`
			`name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),`
			`smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),`
			`age = c(34L, 88L, 75L, 47L, 56L),`
			`)`
			```

			`And we want to count the number of smokers:`

			```{r}
			`health \|> count(smoker)`
			```

			`This dataset only contains non-smokers, but we know that smokers exist.`
			We can request to keep all the value, even if not seen in the data with `.drop = FALSE`:

			```{r}
			`health \|> count(smoker, .drop = FALSE)`
			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			`Similarly, ggplot2's discrete axes will also drop levels that don't have any values.`
			You can force them to display with by supplying `drop = FALSE` to the appropriate discrete axis:

			```{r}
			`#\| fig.align: default`
			`#\| out.width: "50%"`
			`#\| fig.width: 3`
			`#\| fig.alt:`
			`#\| - >`
			`#\| A bar chart with a single value on the x-axis, "no".`
			`#\| - >`
			`#\| The same bar chart as the last plot, but now with two values on`
			`#\| the x-axis, "yes" and "no". There is no bar for the "yes" category.`
			`ggplot(health, aes(smoker)) +`
			`geom_bar() +`
			`scale_x_discrete()`

			`ggplot(health, aes(smoker)) +`
			`geom_bar() +`
			`scale_x_discrete(drop = FALSE)`
			```

			`.drop = TRUE` also works with `group_by()`:
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00
			```{r}
			`health \|>`
			`group_by(smoker, .drop = FALSE) \|>`
			`summarise(`
			`n = n(),`
			`mean_age = mean(age),`
			`min_age = min(age),`
			`max_age = max(age),`
			`sd_age = sd(age)`
			`)`
			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			`We get some interesting results here because the summary functions are applied to zero-length vectors.`
			`These are different to vectors containing missing values;`

			```{r}
			`x1 <- c(NA, NA)`
			`length(x1)`

			`x2 <- numeric()`
			`length(x2)`
			```

			`Summary functions will work with zero-length vectors, but they may return results that are surprising at first glance.`
			Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age)/length(age)` which here is 0/0.
			`max()` and `min()` return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get min or max of the new data.
Placeholder for NaNs 2022-04-30 04:01:54 +08:00
Sketch out main structure of missing values chapter 2022-04-13 21:23:44 +08:00			A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.

			```{r}
			`health \|>`
			`group_by(smoker) \|>`
			`summarise(`
			`n = n(),`
			`mean_age = mean(age),`
			`min_age = min(age),`
			`max_age = max(age),`
			`sd_age = sd(age)`
			`) \|>`
Hack missing values into shape 2022-05-05 00:57:28 +08:00			`complete(smoker)`
Placeholder for NaNs 2022-04-30 04:01:54 +08:00			```

Hack missing values into shape 2022-05-05 00:57:28 +08:00			The main drawback of this approach is that you get an `NA` for the count, even though you know that's zero.