Polishing missing values
parent 721ba68ac4
commit 0ea0ce5e14
|
@ -1,18 +1,17 @@
|
|||
# Missing values {#missing-values}
|
||||
|
||||
```{r, results = "asis", echo = FALSE}
|
||||
status("restructuring")
|
||||
status("polishing")
|
||||
```
|
||||
|
||||
## Introduction
|
||||
|
||||
You've already learned the basics of missing values earlier in the book.
|
||||
You first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
|
||||
Now we'll come back to them in more depth, so you can learn more of the details.
|
||||
You've already learned the basics of missing values earlier in the book: you first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
|
||||
In this chapter, we'll come back to missing values in more depth, so you can learn more of the details.
|
||||
|
||||
We'll start by discussing some general tools for working with missing values recorded as `NA`s.
|
||||
We'll start by discussing some general tools for explicitly missing values that are recorded as `NA`.
|
||||
We'll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit.
|
||||
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.
|
||||
We'll finish off with a discussion of empty groups, caused by factor levels that don't appear in the data.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -24,12 +23,12 @@ library(tidyverse)
|
|||
|
||||
## Explicit missing values
|
||||
|
||||
To begin, let's explore a few handy tools for creating or eliminating missing explicit values, i.e. cells where you see an `NA`.
|
||||
To begin, let's explore a few handy tools for creating or eliminating explicit `NA`s.
|
||||
In the following sections you'll learn how to carry the last observation forward, convert `NA`s to fixed values, convert some fixed value to `NA`s, and learn about the special variant of `NA` known as "not a number".
|
||||
|
||||
### Last observation carried forward
|
||||
|
||||
A common use for missing values is as a data entry convenience.
|
||||
Sometimes data that has been entered by hand, missing values indicate that the value in the previous row has been repeated:
|
||||
Missing values are commonly used as a data entry convenience, where they indicate a repeat of the value in the previous row:
|
||||
|
||||
```{r}
|
||||
treatment <- tribble(
|
||||
|
@ -42,18 +41,19 @@ treatment <- tribble(
|
|||
```
|
||||
|
||||
You can fill in these missing values with `tidyr::fill()`.
|
||||
It works like `select()`, taking a set of columns where you want missing values to be replaced by last observation carried forward:
|
||||
It works like `select()`, taking a set of columns:
|
||||
|
||||
```{r}
|
||||
treatment |>
|
||||
fill(everything())
|
||||
```
|
||||
|
||||
This treatment is sometimes called "last observation carried forward", or **locf** for short.
|
||||
You can use the `.direction` argument to fill in missing values that have been generated in more exotic ways.
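
As a minimal sketch of filling in a different direction, assume a hypothetical table (not part of the chapter's data) where the name was recorded only on the last row of a group, so values need to be carried backward rather than forward. Note that the argument is spelled `.direction` in `tidyr::fill()`:

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: the person is recorded only on the final row
late_labels <- tribble(
  ~person,           ~response,
  NA,                7,
  NA,                10,
  "Katherine Burke", 4
)

# .direction = "up" fills each NA from the next non-missing value below it
filled <- late_labels |>
  fill(person, .direction = "up")
filled
```

The other accepted values, `"downup"` and `"updown"`, fill in one direction first and then the other, which may help when missing values sit at both ends of a group.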
|
||||
|
||||
### Fixed values
|
||||
|
||||
Sometimes missing values represent some fixed known value, most commonly 0.
|
||||
Sometimes missing values represent some fixed and known value, most commonly 0.
|
||||
You can use `dplyr::coalesce()` to replace them:
|
||||
|
||||
```{r}
|
||||
|
@ -61,7 +61,7 @@ x <- c(1, 4, 5, 7, NA)
|
|||
coalesce(x, 0)
|
||||
```
|
||||
|
||||
You could use `mutate()` together with `across()` to apply to every numeric column in a data frame:
|
||||
You could use `mutate()` together with `across()` to apply this treatment to (say) every numeric column in a data frame:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
df |>
|
||||
|
@ -70,8 +70,8 @@ df |>
|
|||
|
||||
### Sentinel values
|
||||
|
||||
Sometimes you'll hit the opposite problem where some value should actually be treated as a missing value.
|
||||
This typically arises in data generated by older software which doesn't have an explicit way to represent missing values, so it uses some special sentinel value like 99 or -999.
|
||||
Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
|
||||
This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special value like 99 or -999.
|
||||
|
||||
If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
|
||||
If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`:
|
||||
|
@ -81,7 +81,7 @@ x <- c(1, 4, 5, 7, -99)
|
|||
na_if(x, -99)
|
||||
```
|
||||
|
||||
And you could apply this transformation to every numeric column in a data frame with the following code.
|
||||
You could apply this transformation to every numeric column in a data frame with the following code.
|
||||
|
||||
```{r, eval = FALSE}
|
||||
df |>
|
||||
|
@ -113,9 +113,9 @@ sqrt(-1)
|
|||
|
||||
## Implicit missing values
|
||||
|
||||
So far we've talked with missing values that are **explicitly** missing, i.e. you can see them in your data as an `NA`.
|
||||
So far we've talked about missing values that are **explicitly** missing, i.e. you can see an `NA` in your data.
|
||||
But missing values can also be **implicitly** missing, if an entire row of data is simply absent from the data.
|
||||
Let's illustrate this idea with a simple data set, which records the price of a stock in each quarter:
|
||||
Let's illustrate the difference with a simple data set that records the price of some stock each quarter:
|
||||
|
||||
```{r}
|
||||
stocks <- tibble(
|
||||
|
@ -137,9 +137,9 @@ One way to think about the difference is with this Zen-like koan:
|
|||
>
|
||||
> An implicit missing value is the absence of a presence.
|
||||
|
||||
It's often useful to make implicit missings explicit so you have something physical that you can work with.
|
||||
In other cases, explicit missings are forced upon you by the structure of the data.
|
||||
The following sections discuss some tools for moving between implicit and explict.
|
||||
Sometimes you want to make implicit missings explicit in order to have something physical to work with.
|
||||
In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them.
|
||||
The following sections discuss some tools for moving between implicit and explicit missingness.
|
||||
|
||||
### Pivoting
|
||||
|
||||
|
@ -160,16 +160,17 @@ See the examples in Chapter \@ref(tidy-data) for more details.
|
|||
|
||||
### Complete
|
||||
|
||||
`tidyr::complete()` allows you to generate explicit missing values in tidy data by providing a set of variables that generates all rows that should exist:
|
||||
`tidyr::complete()` allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.
|
||||
For example, we know that all combinations of `year` and `qtr` should exist in the `stocks` data:
|
||||
|
||||
```{r}
|
||||
stocks |>
|
||||
complete(year, qtr)
|
||||
```
|
||||
|
||||
Typically, you'll call `complete()` with names of variables that already exist, filling in their missing combinations.
|
||||
However, sometimes the individual variables are themselves incomplete, so you can also provide your own data.
|
||||
For example, you might know that this dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`:
|
||||
Typically, you'll call `complete()` with names of existing variables, filling in the missing combinations.
|
||||
However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data.
|
||||
For example, you might know that the `stocks` dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for `year`:
|
||||
|
||||
```{r}
|
||||
stocks |>
|
||||
|
@ -178,7 +179,7 @@ stocks |>
|
|||
|
||||
If the range of a variable is correct, but not all values are present, you could use `full_seq(x, 1)` to generate all values from `min(x)` to `max(x)` spaced out by 1.
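
A minimal sketch of that idea, using a hypothetical `sales` tibble (an assumption for illustration, not the chapter's `stocks` data) in which the endpoints of `year` are right but one year in the middle is absent:

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: 2021 is missing from the middle of the range
sales <- tibble(
  year = c(2019, 2020, 2022),
  amount = c(10, 12, 15)
)

# All values from min(year) to max(year), spaced out by 1
full_seq(sales$year, 1)

# Use the full sequence inside complete() to add the missing year
completed <- sales |>
  complete(year = full_seq(year, 1))
completed
```

The added row for 2021 carries an explicit `NA` in `amount`, since no value was observed for it.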
|
||||
|
||||
In some cases, the complete set of observations can't be generated by a simple combination of variables with `complete()`.
|
||||
In some cases, the complete set of observations can't be generated by a simple combination of variables.
|
||||
In that case, you can do manually what `complete()` does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with `dplyr::full_join()`.
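
The manual approach might look like the following sketch. Both tibbles are hypothetical: we assume a company that reported every quarter of 2020 but only the first quarter of 2021, so crossing `year` with `qtr` would create rows that should not exist.

```{r}
library(dplyr)
library(tibble)

# Hypothetical observations: the 2020 Q3 report is absent
observed <- tribble(
  ~year, ~qtr, ~price,
  2020,  1,    1.88,
  2020,  2,    0.59,
  2020,  4,    0.92,
  2021,  1,    0.17
)

# Hypothetical rule: every quarter of 2020, but only Q1 of 2021
expected <- tribble(
  ~year, ~qtr,
  2020,  1,
  2020,  2,
  2020,  3,
  2020,  4,
  2021,  1
)

# Rows in `expected` with no match in `observed` get an explicit NA price
completed <- expected |>
  full_join(observed, by = c("year", "qtr"))
completed
```

Using `full_join()` rather than `left_join()` also keeps any observed rows that fall outside the expected set, which can be a useful check on the assumed rule.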
|
||||
|
||||
### Joins
|
||||
|
@ -209,7 +210,7 @@ If you're worried about this, and you have dplyr 1.1.0 or newer, you can use the
|
|||
|
||||
## Factors and empty groups
|
||||
|
||||
A final type of missingness is empty groups, groups that don't contain any observation, which can arise when working with factors.
|
||||
A final type of missingness is the empty group, a group that doesn't contain any observations, which can arise when working with factors.
|
||||
For example, imagine we have a dataset that contains some health information about people:
|
||||
|
||||
```{r}
|
||||
|
@ -226,8 +227,7 @@ And we want to count the number of smokers with `dplyr::count()`:
|
|||
health |> count(smoker)
|
||||
```
|
||||
|
||||
This dataset only contains non-smokers, but we know that smokers exist.
|
||||
The group of smokers is empty.
|
||||
This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty.
|
||||
We can request `count()` to keep all the groups, even those not seen in the data by using `.drop = FALSE`:
|
||||
|
||||
```{r}
|
||||
|
@ -271,20 +271,24 @@ health |>
|
|||
)
|
||||
```
|
||||
|
||||
We get some interesting results here because we are a summarizing an empty group, so the summary functions are applied to zero-length vectors.
|
||||
Zero-length vectors are empty, not missing:
|
||||
We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors.
|
||||
There's an important distinction between empty vectors, which have length 0, and missing values, which each have length 1.
|
||||
|
||||
```{r}
|
||||
# A vector containing two missing values
|
||||
x1 <- c(NA, NA)
|
||||
length(x1)
|
||||
|
||||
# A vector containing nothing
|
||||
x2 <- numeric()
|
||||
length(x2)
|
||||
```
|
||||
|
||||
Summary functions do work with zero-length vectors, but they may return results that are surprising at first glance.
|
||||
All summary functions work with zero-length vectors, but they may return results that are surprising at first glance.
|
||||
Here we see `mean(age)` returning `NaN` because `mean(age)` = `sum(age)/length(age)` which here is 0/0.
|
||||
`max()` and `min()` return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get min or max of the new data.
|
||||
`max()` and `min()` return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you'll get the minimum or maximum of the new data[^missing-values-1].
|
||||
|
||||
[^missing-values-1]: In other words, `min(c(x, y))` is always equal to `min(min(x), min(y))`.
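
The behavior of summary functions on empty input can be checked directly. This is a small sketch using made-up vectors; `empty` and `new_data` are illustrative names only. `min()` and `max()` emit a warning on empty input, which is suppressed here:

```{r}
empty <- numeric()
new_data <- c(3, 1, 2)

sum(empty)   # 0
mean(empty)  # NaN, because 0 / 0 is undefined

# min() and max() warn on zero-length input, so suppress the warnings
running_min <- suppressWarnings(min(empty))  # Inf
running_max <- suppressWarnings(max(empty))  # -Inf

# Combining with new data recovers the summaries of the new data
min(running_min, min(new_data)) == min(c(empty, new_data))
max(running_max, max(new_data)) == max(c(empty, new_data))
```

Both comparisons are `TRUE`: `Inf` and `-Inf` act as neutral starting values for `min()` and `max()`.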
|
||||
|
||||
A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with `complete()`.
|
||||
|
||||