# Missing values {#missing-values}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
## Introduction
A value can be missing in one of two possible ways.
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly** missing, i.e. simply not present in the data.

This chapter explores the tools that turn implicit missing values into explicit ones, and explicit missing values into implicit ones.

### Prerequisites
```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```
## Motivation
Let's illustrate this idea with a very simple dataset.
```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```
There are two missing values in this dataset:

-   The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.

-   The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
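To make the distinction concrete, here is a minimal sketch that counts both kinds of missingness in `stocks`. It assumes that every observed year should contain every observed quarter; `n_explicit`, `expected`, and `implicit` are hypothetical names introduced only for this illustration.

```{r}
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit missing values are visible to is.na()
n_explicit <- sum(is.na(stocks$return))

# Implicit missing values only show up when you compare the data against
# the rows you expect: here, every combination of observed year and quarter
expected <- expand_grid(
  year = unique(stocks$year),
  qtr  = unique(stocks$qtr)
)
implicit <- anti_join(expected, stocks, by = c("year", "qtr"))

n_explicit
implicit
```

The explicit `NA` can be found by inspecting the data alone, whereas the implicit one only appears once you state which rows ought to exist.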
## Complete and joins
If a dataset has a regular structure, you can make implicit missing values explicit with `complete()`:
```{r}
stocks |>
  complete(year, qtr)
```
If you know that the observed values don't cover the full range, you can supply the complete set of values yourself:
```{r}
stocks |>
  complete(year = 2015:2017, qtr)
```
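`complete()` takes a set of columns and finds all their unique combinations.
It then ensures the original dataset contains all those combinations, filling in explicit `NA`s where necessary.
For this dataset, you get the same result by generating the combinations with `expand()` and joining the original data back on: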
```{r}
stocks |>
  expand(year, qtr) |>
  left_join(stocks)
```
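As a rough check of how the two approaches relate, the sketch below builds both results and compares them. The names `via_complete` and `via_join` are hypothetical, and the explicit `by` argument is added here only to silence the join message; this illustrates the idea for this small dataset rather than describing how `complete()` is implemented.

```{r}
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# One step: complete() adds the missing year-quarter rows
via_complete <- stocks |>
  complete(year, qtr)

# Two steps: build every combination, then join the observed data back on
via_join <- stocks |>
  expand(year, qtr) |>
  left_join(stocks, by = c("year", "qtr"))

# Both contain one row per year-quarter combination
nrow(via_complete)
nrow(via_join)

# The original explicit NA plus the newly explicit one
sum(is.na(via_complete$return))
```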
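Other times, the set of values that should be present is defined by another dataset.
For example, `anti_join()` reveals the destinations in `flights` that have no match in `airports`, and the tail numbers that have no match in `planes`: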
```{r}
flights |>
  distinct(faa = dest) |>
  anti_join(airports)

flights |>
  distinct(tailnum) |>
  anti_join(planes)
```
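The same pattern is easier to follow on a small, made-up example. The `trips` and `stations` tibbles below are hypothetical toy data, not part of nycflights13; the sketch only shows how `anti_join()` returns the keys that appear in one table but not in the other.

```{r}
library(tidyverse)

# Hypothetical toy data: each trip refers to a station by its code
trips <- tibble(
  trip_id = 1:4,
  station = c("A", "B", "C", "B")
)

# Hypothetical lookup table that should describe every station
stations <- tibble(
  station = c("A", "B"),
  name    = c("Central", "Harbour")
)

# Station codes used in `trips` that have no row in `stations`
missing_stations <- trips |>
  distinct(station) |>
  anti_join(stations, by = "station")

missing_stations
```

Here station `"C"` is implicitly missing from `stations`: nothing in `stations` itself signals the gap, and it only becomes visible when compared against `trips`.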
## Pivoting {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values: the way that a dataset is represented can make implicit missing values explicit.
For example, we can make the implicit missing value for the first quarter of 2016 explicit by putting years in the columns:
```{r}
stocks |>
  pivot_wider(names_from = year, values_from = return)
```
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
```{r}
stocks |>
  pivot_wider(names_from = year, values_from = return) |>
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )
```
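To see what `values_drop_na` changes, the following sketch runs the round trip both ways and compares the row counts. The names `wide`, `long_keep`, and `long_drop` are hypothetical and used only for this illustration.

```{r}
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

wide <- stocks |>
  pivot_wider(names_from = year, values_from = return)

# Default: every cell of the wide table becomes a row, NA or not
long_keep <- wide |>
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return"
  )

# values_drop_na = TRUE: rows whose value is NA are dropped
long_drop <- wide |>
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )

nrow(long_keep)
nrow(long_drop)
```

Note that dropping `NA`s removes both the value that was explicitly missing in the original data and the one that only became explicit after widening.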
## Last observation carried forward
There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          9,
  "Katherine Burke",  1,          4
)
```
You can fill in these missing values with `fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
```{r}
treatment |>
  fill(person)
```
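By default `fill()` carries values downwards, but its `.direction` argument lets you fill in other directions too. The sketch below contrasts the default with `.direction = "up"`; `filled_down` and `filled_up` are hypothetical names, and filling upwards is shown only to illustrate the argument, not because it suits this dataset.

```{r}
library(tidyverse)

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          9,
  "Katherine Burke",  1,          4
)

# Default: each NA takes the most recent non-missing value above it
filled_down <- treatment |>
  fill(person)

# .direction = "up": each NA takes the next non-missing value below it
filled_up <- treatment |>
  fill(person, .direction = "up")

filled_down$person
filled_up$person
```

For data-entry sheets like this one, the default downward fill is usually the right choice, because the name is written once at the start of each person's block of rows.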
## Factors
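Empty factor levels are another source of implicit missing values: by default, `group_by()` drops groups formed by factor levels that don't appear in the data, but `.drop = FALSE` keeps them.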
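The note above on `group_by()` and `.drop = FALSE` can be sketched as follows. The `health` tibble is hypothetical toy data invented for this illustration; the only behaviour relied on is the `.drop` argument of `group_by()`.

```{r}
library(tidyverse)

# Hypothetical toy data: the factor has a level ("yes") with no observations
health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah"),
  smoker = factor(c("no", "no", "no"), levels = c("yes", "no"))
)

# Default: the empty level is dropped, so "yes" is implicitly missing
dropped <- health |>
  group_by(smoker) |>
  summarize(n = n())

# .drop = FALSE keeps a group for every level, so "yes" appears with n = 0
kept <- health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(n = n())

dropped
kept
```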