A little thinking about missing values
parent
61d8a75908
commit
27507a8bf2
@ -6,27 +6,21 @@ status("drafting")
## Introduction

```{r}
A value can be missing in one of two possible ways.
It can be **explicitly** missing, i.e. flagged with `NA`, or it can be **implicitly** missing, i.e. simply not present in the data.

This chapter will explore cases where implicit missing values can become explicit.

### Prerequisites

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```

Missing topics:
## Motivation

- Missing values generated from matching data frames (i.e. `left_join()` and `anti_join()`)

- Last observation carried forward and `tidyr::fill()`

- `coalesce()` and `na_if()`

## Explicit vs implicit missing values {#missing-values-tidy}

Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:

- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.

Let's illustrate this idea with a very simple data set:

```{r}
stocks <- tibble(
@ -44,6 +38,47 @@ There are two missing values in this dataset:
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

## Complete and joins

If a dataset has a regular structure, you can make implicit missing values explicit with `complete()`:

```{r}
stocks |>
  complete(year, qtr)
```

If you know that the observed range is incomplete, you can supply the full set of values yourself:

```{r}
stocks |>
  complete(year = 2015:2017, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
This is roughly what you get by combining `expand()` with a join:

```{r}
stocks |>
  expand(year, qtr) |>
  left_join(stocks)
```

Other times, missing values might be defined by another dataset.

```{r}
flights |>
  distinct(faa = dest) |>
  anti_join(airports)

flights |>
  distinct(tailnum) |>
  anti_join(planes)
```

## Pivoting {#missing-values-tidy}

Changing the representation of a dataset brings up an important subtlety of missing values.

The way that a dataset is represented can make implicit missing values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns:

@ -65,15 +100,7 @@ stocks |>
)
```

Another important tool for making missing values explicit in tidy data is `complete()`:

```{r}
stocks |>
  complete(year, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
## Last observation carried forward

There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

@ -96,41 +123,8 @@ treatment |>
  fill(person)
```

`group_by` + `.drop = FALSE`
## Factors
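
The note about `group_by` + `.drop = FALSE` could be illustrated along the following lines. This is a minimal sketch, not the chapter's own example: the `health` tibble and its values are hypothetical, and the factor is deliberately given a level (`"c"`) that never occurs in the data.

```r
library(dplyr)

# Hypothetical data: the factor has three levels, but "c" never occurs.
health <- tibble(
  smoker = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  age = c(34, 88, 75)
)

# By default, groups for unused factor levels are dropped,
# so the missing level stays implicit.
health |> count(smoker)

# With .drop = FALSE the empty group is kept, with a count of 0.
health |> count(smoker, .drop = FALSE)

# The same argument is available on group_by().
health |>
  group_by(smoker, .drop = FALSE) |>
  summarise(n = n())
```

Keeping the empty group turns an implicit gap (a level with no rows) into an explicit row in the summary.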

### Exercises
- factors: `group_by` + `.drop = FALSE`

1. Compare and contrast the `values_fill` argument to `pivot_wider()` with the `fill` argument to `complete()`.

2. What does the `.direction` argument to `fill()` do?

## dplyr verbs

`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
If you want to preserve missing values, ask for them explicitly:

```{r}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)              # drops the row where x is NA
filter(df, is.na(x) | x > 1)   # keeps it
```

Missing values are always sorted at the end:

```{r}
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))   # NA is still last
```

Explain the warning here

```{r, eval = FALSE}
flights |>
  group_by(dest) |>
  summarise(max_delay = max(arr_delay, na.rm = TRUE))
```
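
One plausible source of the warning, assuming at least one destination has no non-missing `arr_delay` values, is that `na.rm = TRUE` leaves `max()` with nothing to work on. The sketch below reproduces that situation on a small vector; `all_missing` is a hypothetical stand-in for such a group, not data taken from `flights`.

```r
# A vector in which every value is missing, standing in for a group
# where every arr_delay is NA (hypothetical values).
all_missing <- c(NA_real_, NA_real_)

# Without na.rm, the result is simply NA.
max(all_missing)

# With na.rm = TRUE, every value is removed first, so max() is applied
# to an empty vector: it emits a warning and returns -Inf.
max(all_missing, na.rm = TRUE)
```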

## Exercises

1. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
##
@ -344,6 +344,10 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
These are often used with numbers, but can be applied to most other column types.

### Missing values

`coalesce()`
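
As a minimal sketch of how `coalesce()` might be used here, together with its counterpart `na_if()`: the vectors `x` and `y` are made up for illustration, and treating `-99` as a code for "missing" is an assumption, not something taken from the chapter.

```r
library(dplyr)

x <- c(1, NA, 5, NA, 10)

# coalesce() returns the first non-missing value at each position,
# so here every NA in x is replaced by 0.
coalesce(x, 0)

# na_if() goes the other way: it turns a specific value into NA.
# Treating -99 as a "missing" code is an assumption for illustration.
y <- c(1, -99, 5, -99, 10)
na_if(y, -99)
```

`coalesce()` also accepts several vectors of the same length, in which case it picks the first non-missing value across them at each position.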

### Ranks

dplyr provides a number of ranking functions, but you should start with `dplyr::min_rank()`.