Pull content out of tidying
								data-tidy.Rmd
							@@ -1,7 +1,5 @@
 | 
				
			|||||||
# Data tidying {#data-tidy}
 | 
					# Data tidying {#data-tidy}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!--# Take out bit on missing values and move to missing values chapter. Maybe also move case study elsewhere? -->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
## Introduction
 | 
					## Introduction
 | 
				
			||||||
 | 
					
 | 
				
			||||||
> "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
 | 
					> "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
 | 
				
			||||||
@@ -440,213 +438,6 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
 | 
				
			|||||||
      pivot_wider(names_from = drv, values_from = n)
 | 
					      pivot_wider(names_from = drv, values_from = n)
 | 
				
			||||||
    ```
 | 
					    ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Separating
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
 | 
					 | 
				
			||||||
`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
 | 
					 | 
				
			||||||
To fix this problem, we'll need the `separate()` function.
 | 
					 | 
				
			||||||
You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
### Separate
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
 | 
					 | 
				
			||||||
Take `table3`:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
table3
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
 | 
					 | 
				
			||||||
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
table3 %>%
 | 
					 | 
				
			||||||
  separate(rate, into = c("cases", "population"))
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
 | 
					 | 
				
			||||||
knitr::include_graphics("images/tidy-17.png")
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
 | 
					 | 
				
			||||||
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
 | 
					 | 
				
			||||||
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
 | 
					 | 
				
			||||||
For example, we could rewrite the code above as:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r eval = FALSE}
 | 
					 | 
				
			||||||
table3 %>%
 | 
					 | 
				
			||||||
  separate(rate, into = c("cases", "population"), sep = "/")
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
 | 
					 | 
				
			||||||
This is the default behaviour in `separate()`: it leaves the type of the column as is.
 | 
					 | 
				
			||||||
Here, however, it's not very useful as those really are numbers.
 | 
					 | 
				
			||||||
We can ask `separate()` to try and convert to better types using `convert = TRUE`:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
table3 %>%
 | 
					 | 
				
			||||||
  separate(rate, into = c("cases", "population"), convert = TRUE)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
### Unite
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
 | 
					 | 
				
			||||||
You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
 | 
					 | 
				
			||||||
That data is saved as `tidyr::table1`.
 | 
					 | 
				
			||||||
`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
table1 %>%
 | 
					 | 
				
			||||||
  unite(rate, cases, population)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
In this case we also need to use the `sep` argument.
 | 
					 | 
				
			||||||
The default will place an underscore (`_`) between the values from different columns.
 | 
					 | 
				
			||||||
Here we want `"/"` instead:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
table1 %>%
 | 
					 | 
				
			||||||
  unite(rate, cases, population, sep = "/")
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
### Exercises
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
1.  What do the `extra` and `fill` arguments do in `separate()`?
 | 
					 | 
				
			||||||
    Experiment with the various options for the following two toy datasets.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    ```{r, eval = FALSE}
 | 
					 | 
				
			||||||
    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
 | 
					 | 
				
			||||||
      separate(x, c("one", "two", "three"))
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
 | 
					 | 
				
			||||||
      separate(x, c("one", "two", "three"))
 | 
					 | 
				
			||||||
    ```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
2.  Both `unite()` and `separate()` have a `remove` argument.
 | 
					 | 
				
			||||||
    What does it do?
 | 
					 | 
				
			||||||
    Why would you set it to `FALSE`?
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
3.  Compare and contrast `separate()` and `extract()`.
 | 
					 | 
				
			||||||
    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
 | 
					 | 
				
			||||||
    How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    ```{r, eval = FALSE}
 | 
					 | 
				
			||||||
    events <- tribble(
 | 
					 | 
				
			||||||
      ~month, ~day,
 | 
					 | 
				
			||||||
      1     , 20,
 | 
					 | 
				
			||||||
      1     , 21,
 | 
					 | 
				
			||||||
      1     , 22
 | 
					 | 
				
			||||||
    )
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    events %>%
 | 
					 | 
				
			||||||
      unite("date", month:day, sep = "-", remove = FALSE)
 | 
					 | 
				
			||||||
    ```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
 | 
					 | 
				
			||||||
    Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
 | 
					 | 
				
			||||||
    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
 | 
					 | 
				
			||||||
    Do this in two ways: using a positive and a negative value for `sep`.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    ```{r}
 | 
					 | 
				
			||||||
    baker <- tribble(
 | 
					 | 
				
			||||||
      ~location,
 | 
					 | 
				
			||||||
      "FLBaker County",
 | 
					 | 
				
			||||||
      "GABaker County",
 | 
					 | 
				
			||||||
      "ORBaker County",
 | 
					 | 
				
			||||||
    )
 | 
					 | 
				
			||||||
    baker
 | 
					 | 
				
			||||||
    ```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
## Missing values {#missing-values-tidy}
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Changing the representation of a dataset brings up an important subtlety of missing values.
 | 
					 | 
				
			||||||
Surprisingly, a value can be missing in one of two possible ways:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
-   **Explicitly**, i.e. flagged with `NA`.
 | 
					 | 
				
			||||||
-   **Implicitly**, i.e. simply not present in the data.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Let's illustrate this idea with a very simple data set:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
stocks <- tibble(
 | 
					 | 
				
			||||||
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
 | 
					 | 
				
			||||||
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
 | 
					 | 
				
			||||||
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
 | 
					 | 
				
			||||||
)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
There are two missing values in this dataset:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
-   The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
-   The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
The way that a dataset is represented can make implicit values explicit.
 | 
					 | 
				
			||||||
For example, we can make the implicit missing value explicit by putting years in the columns:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
stocks %>%
 | 
					 | 
				
			||||||
  pivot_wider(names_from = year, values_from = return)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
stocks %>%
 | 
					 | 
				
			||||||
  pivot_wider(names_from = year, values_from = return) %>%
 | 
					 | 
				
			||||||
  pivot_longer(
 | 
					 | 
				
			||||||
    cols = c(`2015`, `2016`),
 | 
					 | 
				
			||||||
    names_to = "year",
 | 
					 | 
				
			||||||
    values_to = "return",
 | 
					 | 
				
			||||||
    values_drop_na = TRUE
 | 
					 | 
				
			||||||
  )
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Another important tool for making missing values explicit in tidy data is `complete()`:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
stocks %>%
 | 
					 | 
				
			||||||
  complete(year, qtr)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
`complete()` takes a set of columns, and finds all unique combinations.
 | 
					 | 
				
			||||||
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
There's one other important tool that you should know for working with missing values.
 | 
					 | 
				
			||||||
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
treatment <- tribble(
 | 
					 | 
				
			||||||
  ~person,           ~treatment, ~response,
 | 
					 | 
				
			||||||
  "Derrick Whitmore", 1,         7,
 | 
					 | 
				
			||||||
  NA,                 2,         10,
 | 
					 | 
				
			||||||
  NA,                 3,         9,
 | 
					 | 
				
			||||||
  "Katherine Burke",  1,         4
 | 
					 | 
				
			||||||
)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
You can fill in these missing values with `fill()`.
 | 
					 | 
				
			||||||
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
treatment %>%
 | 
					 | 
				
			||||||
  fill(person)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
### Exercises
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
1.  Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
2.  What does the direction argument to `fill()` do?
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
## Case study
 | 
					## Case study
 | 
				
			||||||
 | 
					
 | 
				
			||||||
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.
 | 
					To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.
 | 
				
			||||||
 
 | 
				
			|||||||
@@ -42,6 +42,90 @@ If you want to determine if a value is missing, use `is.na()`:
 | 
				
			|||||||
is.na(x)
 | 
					is.na(x)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					## Explicit vs implicit missing values {#missing-values-tidy}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Changing the representation of a dataset brings up an important subtlety of missing values.
 | 
				
			||||||
 | 
					Surprisingly, a value can be missing in one of two possible ways:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					-   **Explicitly**, i.e. flagged with `NA`.
 | 
				
			||||||
 | 
					-   **Implicitly**, i.e. simply not present in the data.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Let's illustrate this idea with a very simple data set:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					stocks <- tibble(
 | 
				
			||||||
 | 
					  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
 | 
				
			||||||
 | 
					  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
 | 
				
			||||||
 | 
					  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
 | 
				
			||||||
 | 
					)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					There are two missing values in this dataset:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					-   The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					-   The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					The way that a dataset is represented can make implicit values explicit.
 | 
				
			||||||
 | 
					For example, we can make the implicit missing value explicit by putting years in the columns:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					stocks %>%
 | 
				
			||||||
 | 
					  pivot_wider(names_from = year, values_from = return)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					stocks %>%
 | 
				
			||||||
 | 
					  pivot_wider(names_from = year, values_from = return) %>%
 | 
				
			||||||
 | 
					  pivot_longer(
 | 
				
			||||||
 | 
					    cols = c(`2015`, `2016`),
 | 
				
			||||||
 | 
					    names_to = "year",
 | 
				
			||||||
 | 
					    values_to = "return",
 | 
				
			||||||
 | 
					    values_drop_na = TRUE
 | 
				
			||||||
 | 
					  )
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
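As a quick check of the round trip above, the following sketch counts rows and missing values after widening and lengthening again. It rebuilds the same `stocks` tibble so that it runs on its own; the name `round_trip` is only an illustrative choice, and the sketch assumes the tidyr and tibble packages are installed.

```{r}
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Widen, then lengthen again while dropping the explicit NAs.
# `round_trip` is a hypothetical name used only for this sketch.
round_trip <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )

nrow(stocks)                   # 7 rows, one of them with an explicit NA
nrow(round_trip)               # 6 rows: both missing values are now implicit
sum(is.na(round_trip$return))  # 0
```

Note that `year` comes back as a character column, because it was built from column names.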
 | 
					
 | 
				
			||||||
 | 
					Another important tool for making missing values explicit in tidy data is `complete()`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					stocks %>%
 | 
				
			||||||
 | 
					  complete(year, qtr)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					`complete()` takes a set of columns, and finds all unique combinations.
 | 
				
			||||||
 | 
					It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
 | 
				
			||||||
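One way to see what `complete()` adds is to compare row counts and `NA` counts before and after. This is a minimal sketch that rebuilds `stocks` so it is self-contained; `completed` is an illustrative name, and the sketch assumes the tidyr and tibble packages are installed.

```{r}
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Two years crossed with four quarters gives eight combinations.
# `completed` is a hypothetical name used only for this sketch.
completed <- stocks %>%
  complete(year, qtr)

nrow(stocks)                  # 7
nrow(completed)               # 8: the 2016 Q1 row has been added
sum(is.na(stocks$return))     # 1: only the explicit NA for 2015 Q4
sum(is.na(completed$return))  # 2: the implicit missing value is now explicit
```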
 | 
					
 | 
				
			||||||
 | 
					There's one other important tool that you should know for working with missing values.
 | 
				
			||||||
 | 
					Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					treatment <- tribble(
 | 
				
			||||||
 | 
					  ~person,           ~treatment, ~response,
 | 
				
			||||||
 | 
					  "Derrick Whitmore", 1,         7,
 | 
				
			||||||
 | 
					  NA,                 2,         10,
 | 
				
			||||||
 | 
					  NA,                 3,         9,
 | 
				
			||||||
 | 
					  "Katherine Burke",  1,         4
 | 
				
			||||||
 | 
					)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					You can fill in these missing values with `fill()`.
 | 
				
			||||||
 | 
					It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					treatment %>%
 | 
				
			||||||
 | 
					  fill(person)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### Exercises
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					1.  Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					2.  What does the direction argument to `fill()` do?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## dplyr verbs
 | 
					## dplyr verbs
 | 
				
			||||||
 | 
					
 | 
				
			||||||
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
 | 
					`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
 | 
				
			||||||
 
 | 
				
			|||||||
								strings.Rmd
							@@ -1048,3 +1048,128 @@ The main difference is the prefix: `str_` vs. `stri_`.
 | 
				
			|||||||
    c.  Generate random text.
 | 
					    c.  Generate random text.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
2.  How do you control the language that `stri_sort()` uses for sorting?
 | 
					2.  How do you control the language that `stri_sort()` uses for sorting?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					## tidyr
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
 | 
				
			||||||
 | 
					`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
 | 
				
			||||||
 | 
					To fix this problem, we'll need the `separate()` function.
 | 
				
			||||||
 | 
					You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### Separate
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
 | 
				
			||||||
 | 
					Take `table3`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					table3
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
 | 
				
			||||||
 | 
					`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					table3 %>%
 | 
				
			||||||
 | 
					  separate(rate, into = c("cases", "population"))
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
 | 
				
			||||||
 | 
					knitr::include_graphics("images/tidy-17.png")
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
 | 
				
			||||||
 | 
					For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
 | 
				
			||||||
 | 
					If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
 | 
				
			||||||
 | 
					For example, we could rewrite the code above as:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r eval = FALSE}
 | 
				
			||||||
 | 
					table3 %>%
 | 
				
			||||||
 | 
					  separate(rate, into = c("cases", "population"), sep = "/")
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
 | 
				
			||||||
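Because `sep` is a regular expression, a single pattern can match more than one separator character. The sketch below uses a made-up tibble, `rates`, in which the same kind of value is written with either `/` or `:`; the data and the names `rates` and `split_rates` are assumptions for illustration, not part of `table3`.

```{r}
library(tidyr)
library(tibble)

# Hypothetical toy data: two rows that use different separators.
rates <- tibble(rate = c("745/19987071", "2666:20595360"))

# The character class [/:] matches either a forward slash or a colon.
split_rates <- rates %>%
  separate(rate, into = c("cases", "population"), sep = "[/:]", convert = TRUE)

split_rates
```

Passing `convert = TRUE` here turns both new columns into integers rather than leaving them as character columns.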
 | 
					
 | 
				
			||||||
 | 
					Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
 | 
				
			||||||
 | 
					This is the default behaviour in `separate()`: it leaves the type of the column as is.
 | 
				
			||||||
 | 
					Here, however, it's not very useful as those really are numbers.
 | 
				
			||||||
 | 
					We can ask `separate()` to try and convert to better types using `convert = TRUE`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					table3 %>%
 | 
				
			||||||
 | 
					  separate(rate, into = c("cases", "population"), convert = TRUE)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### Unite
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
 | 
				
			||||||
 | 
					You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
 | 
				
			||||||
 | 
					That data is saved as `tidyr::table1`.
 | 
				
			||||||
 | 
					`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					table1 %>%
 | 
				
			||||||
 | 
					  unite(rate, cases, population)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					In this case we also need to use the `sep` argument.
 | 
				
			||||||
 | 
					The default will place an underscore (`_`) between the values from different columns.
 | 
				
			||||||
 | 
					Here we want `"/"` instead:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					table1 %>%
 | 
				
			||||||
 | 
					  unite(rate, cases, population, sep = "/")
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### Exercises
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					1.  What do the `extra` and `fill` arguments do in `separate()`?
 | 
				
			||||||
 | 
					    Experiment with the various options for the following two toy datasets.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    ```{r, eval = FALSE}
 | 
				
			||||||
 | 
					    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
 | 
				
			||||||
 | 
					      separate(x, c("one", "two", "three"))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
 | 
				
			||||||
 | 
					      separate(x, c("one", "two", "three"))
 | 
				
			||||||
 | 
					    ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					2.  Both `unite()` and `separate()` have a `remove` argument.
 | 
				
			||||||
 | 
					    What does it do?
 | 
				
			||||||
 | 
					    Why would you set it to `FALSE`?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					3.  Compare and contrast `separate()` and `extract()`.
 | 
				
			||||||
 | 
					    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
 | 
				
			||||||
 | 
					    How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    ```{r, eval = FALSE}
 | 
				
			||||||
 | 
					    events <- tribble(
 | 
				
			||||||
 | 
					      ~month, ~day,
 | 
				
			||||||
 | 
					      1     , 20,
 | 
				
			||||||
 | 
					      1     , 21,
 | 
				
			||||||
 | 
					      1     , 22
 | 
				
			||||||
 | 
					    )
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    events %>%
 | 
				
			||||||
 | 
					      unite("date", month:day, sep = "-", remove = FALSE)
 | 
				
			||||||
 | 
					    ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
 | 
				
			||||||
 | 
					    Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
 | 
				
			||||||
 | 
					    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
 | 
				
			||||||
 | 
					    Do this in two ways: using a positive and a negative value for `sep`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    ```{r}
 | 
				
			||||||
 | 
					    baker <- tribble(
 | 
				
			||||||
 | 
					      ~location,
 | 
				
			||||||
 | 
					      "FLBaker County",
 | 
				
			||||||
 | 
					      "GABaker County",
 | 
				
			||||||
 | 
					      "ORBaker County",
 | 
				
			||||||
 | 
					    )
 | 
				
			||||||
 | 
					    baker
 | 
				
			||||||
 | 
					    ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					## 
 | 
				
			||||||
 
 | 
				
			|||||||