Pull content out of tidying

Hadley Wickham 2021-04-19 07:59:07 -05:00
parent 861e27026e
commit 78ab61f284
3 changed files with 209 additions and 209 deletions

@@ -1,7 +1,5 @@
# Data tidying {#data-tidy}
<!--# Take out bit on missing values and move to missing values chapter. Maybe also move case study elsewhere? -->
## Introduction
> "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
@@ -440,213 +438,6 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
pivot_wider(names_from = drv, values_from = n)
```
## Separating
So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
To fix this problem, we'll need the `separate()` function.
You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:
```{r}
table3
```
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
knitr::include_graphics("images/tidy-17.png")
```
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
For example, we could rewrite the code above as:
```{r eval = FALSE}
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
This is the default behaviour in `separate()`: it leaves the type of the column as is.
Here, however, it's not very useful as those really are numbers.
We can ask `separate()` to try to convert the new columns to better types using `convert = TRUE`:
```{r}
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
### Unite
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
That data is saved as `tidyr::table1`.
`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table1 %>%
unite(rate, cases, population)
```
In this case we also need to use the `sep` argument.
The default will place an underscore (`_`) between the values from different columns.
Here we want `"/"` instead:
```{r}
table1 %>%
unite(rate, cases, population, sep = "/")
```
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?
```{r, eval = FALSE}
events <- tribble(
~month, ~day,
1 , 20,
1 , 21,
1 , 22
)
events %>%
unite("date", month:day, sep = "-", remove = FALSE)
```
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings.
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
Do this in two ways: using a positive and a negative value for `sep`.
```{r}
baker <- tribble(
~location,
"FLBaker County",
"GABaker County",
"ORBaker County",
)
baker
```
## Missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.
Let's illustrate this idea with a very simple dataset:
```{r}
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
```
There are two missing values in this dataset:
- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit missing values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)
```
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
```
Another important tool for making missing values explicit in tidy data is `complete()`:
```{r}
stocks %>%
complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- tribble(
~person, ~treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
```
You can fill in these missing values with `fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
```{r}
treatment %>%
fill(person)
```
### Exercises
1. Compare and contrast the `values_fill` argument to `pivot_wider()` and the `fill` argument to `complete()`.
2. What does the `.direction` argument to `fill()` do?
## Case study
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.

@@ -42,6 +42,90 @@ If you want to determine if a value is missing, use `is.na()`:
is.na(x)
```
## Explicit vs implicit missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.
Let's illustrate this idea with a very simple dataset:
```{r}
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
```
There are two missing values in this dataset:
- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit missing values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)
```
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
```
Another important tool for making missing values explicit in tidy data is `complete()`:
```{r}
stocks %>%
complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
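As a rough check of the distinction between explicit and implicit missing values, the sketch below rebuilds `stocks` and counts the `NA`s that `is.na()` can see before and after `complete()`. It assumes the tibble and tidyr packages are installed; the names `explicit_before`, `completed`, and `explicit_after` are illustrative only.

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Only the explicit NA (2015, quarter 4) is visible to is.na()
explicit_before <- sum(is.na(stocks$return))

# complete() adds a row for the absent combination (2016, quarter 1),
# which turns the implicit missing value into an explicit NA
completed <- complete(stocks, year, qtr)
explicit_after <- sum(is.na(completed$return))

c(before = explicit_before, after = explicit_after)
```

The implicit missing value cannot be counted until the data has been completed, because before that there is no row for it at all.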
There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- tribble(
~person, ~treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
```
You can fill in these missing values with `fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
```{r}
treatment %>%
fill(person)
```
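To make "last observation carried forward" more concrete, here is a minimal base R sketch of the idea. The helper `locf()` is hypothetical and written only for illustration; it is not how tidyr implements `fill()`.

```r
# Hypothetical helper: carry the last non-missing value forward (base R only)
locf <- function(x) {
  # Position of the most recent non-missing element at each index (0 if none yet)
  last_seen <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs have no earlier value to borrow, so they stay NA
  last_seen[last_seen == 0] <- NA
  x[last_seen]
}

person <- c("Derrick Whitmore", NA, NA, "Katherine Burke")
filled <- locf(person)
filled
```

Each `NA` is replaced by the nearest non-missing value above it, which is why the order of the rows matters when carrying values forward.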
### Exercises
1. Compare and contrast the `values_fill` argument to `pivot_wider()` and the `fill` argument to `complete()`.
2. What does the `.direction` argument to `fill()` do?
## dplyr verbs
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.

@@ -1048,3 +1048,128 @@ The main difference is the prefix: `str_` vs. `stri_`.
c. Generate random text.
2. How do you control the language that `stri_sort()` uses for sorting?
## tidyr
So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
To fix this problem, we'll need the `separate()` function.
You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:
```{r}
table3
```
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
knitr::include_graphics("images/tidy-17.png")
```
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
For example, we could rewrite the code above as:
```{r eval = FALSE}
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
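Because `sep` is a regular expression, it can describe a class of separators rather than a single character. The sketch below uses a made-up `stamps` tibble (not part of the original data) and assumes tibble and tidyr are installed. It splits on runs of whitespace only, so the hyphens and colons that the default would also split on are left alone.

```r
library(tibble)
library(tidyr)

# Toy data for illustration: the gap between date and time varies in width
stamps <- tibble(stamp = c("2021-04-19 07:59", "2021-04-20   08:15"))

# "\\s+" matches one or more whitespace characters
split_stamps <- separate(stamps, stamp, into = c("date", "time"), sep = "\\s+")
split_stamps
```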
Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
This is the default behaviour in `separate()`: it leaves the type of the column as is.
Here, however, it's not very useful as those really are numbers.
We can ask `separate()` to try to convert the new columns to better types using `convert = TRUE`:
```{r}
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
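One way to confirm the effect of `convert = TRUE` is to compare the column classes of the two results directly. This is a small sketch assuming tidyr is installed; `as_text` and `as_numbers` are illustrative names.

```r
library(tidyr)

as_text <- separate(table3, rate, into = c("cases", "population"))
as_numbers <- separate(table3, rate, into = c("cases", "population"), convert = TRUE)

# Column classes without and with conversion
class(as_text$cases)
class(as_numbers$cases)
```

Without conversion the new columns stay character, so arithmetic such as `cases / population` would fail until they are converted.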
### Unite
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
That data is saved as `tidyr::table1`.
`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table1 %>%
unite(rate, cases, population)
```
In this case we also need to use the `sep` argument.
The default will place an underscore (`_`) between the values from different columns.
Here we want `"/"` instead:
```{r}
table1 %>%
unite(rate, cases, population, sep = "/")
```
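Since `unite()` and `separate()` are inverses, a round trip should recover the starting columns. The sketch below checks this on a two-row toy tibble (values borrowed from the first rows of `table3`, column layout assumed for illustration), with tibble and tidyr installed.

```r
library(tibble)
library(tidyr)

# Toy data for illustration
original <- tibble(
  country    = c("Afghanistan", "Afghanistan"),
  cases      = c(745L, 2666L),
  population = c(19987071L, 20595360L)
)

# Combine the two columns into one character column ...
united <- unite(original, rate, cases, population, sep = "/")

# ... then split it again, converting back to numbers
round_trip <- separate(united, rate, into = c("cases", "population"),
                       sep = "/", convert = TRUE)
round_trip
```

Note that the round trip only restores the numeric types because of `convert = TRUE`; `unite()` always produces a character column.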
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?
```{r, eval = FALSE}
events <- tribble(
~month, ~day,
1 , 20,
1 , 21,
1 , 22
)
events %>%
unite("date", month:day, sep = "-", remove = FALSE)
```
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings.
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
Do this in two ways: using a positive and a negative value for `sep`.
```{r}
baker <- tribble(
~location,
"FLBaker County",
"GABaker County",
"ORBaker County",
)
baker
```
##