Pull content out of tidying

2021-04-19 07:59:07 -05:00
parent 861e27026e
commit 78ab61f284
3 changed files with 209 additions and 209 deletions
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -1048,3 +1048,128 @@ The main difference is the prefix: `str_` vs. `stri_`.
    c.  Generate random text.

 2.  How do you control the language that `stri_sort()` uses for sorting?
+
+## tidyr
+
+So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
+`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
+To fix this problem, we'll need the `separate()` function.
+You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
+
+### Separate
+
+`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
+Take `table3`:
+
+```{r}
+table3
+```
+
+The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
+`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"))
+```
+
+```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
+knitr::include_graphics("images/tidy-17.png")
+```
+
+By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
+For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
+If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
+For example, we could rewrite the code above as:
+
+```{r eval = FALSE}
+table3 %>%
+  separate(rate, into = c("cases", "population"), sep = "/")
+```
+
+(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
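+
+As a rough sketch of what a regular-expression separator allows, the chunk below splits on a hyphen with optional whitespace around it.
+The `measurements` tibble and its `reading` column are made up for illustration and are not part of tidyr.
+
+```{r}
+library(tidyr)
+library(tibble)
+
+# Hypothetical data: the hyphen is padded inconsistently with spaces
+measurements <- tibble(reading = c("12 - 30", "7-15", "40 -8"))
+
+# "\\s*-\\s*" matches a hyphen plus any whitespace on either side
+split_readings <- measurements %>%
+  separate(reading, into = c("low", "high"), sep = "\\s*-\\s*")
+split_readings
+```
+
+A plain `sep = "-"` would leave the stray spaces inside the new columns, so the pattern is doing the trimming here.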
+
+Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
+This is the default behaviour in `separate()`: it leaves the type of the column as is.
+Here, however, it's not very useful as those really are numbers.
+We can ask `separate()` to try to convert the new columns to better types using `convert = TRUE`:
+
+```{r}
+table3 %>%
+  separate(rate, into = c("cases", "population"), convert = TRUE)
+```
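+
+One way to check the effect, sketched here with base R's `class()`, is to compare the column types with and without `convert = TRUE`.
+The names `as_text` and `as_numbers` are placeholders chosen for this example.
+
+```{r}
+library(tidyr)
+
+as_text <- table3 %>%
+  separate(rate, into = c("cases", "population"))
+as_numbers <- table3 %>%
+  separate(rate, into = c("cases", "population"), convert = TRUE)
+
+# Report the class of each new column in both results
+sapply(as_text[c("cases", "population")], class)
+sapply(as_numbers[c("cases", "population")], class)
+```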
+
+### Unite
+
+`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
+You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
+
+We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
+That tidied data is also available as `tidyr::table1`.
+`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population)
+```
+
+In this case we also need to use the `sep` argument.
+The default will place an underscore (`_`) between the values from different columns.
+Here we want `"/"` instead:
+
+```{r}
+table1 %>%
+  unite(rate, cases, population, sep = "/")
+```
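+
+To see the "inverse" relationship concretely, here is a minimal sketch that unites the two columns and then separates them again.
+The name `round_trip` is a placeholder for this example, and the comparison is done on values only, since the exact column types in `table1` may differ between tidyr versions.
+
+```{r}
+library(tidyr)
+
+# Unite into a single "cases/population" string, then split it back apart
+round_trip <- table1 %>%
+  unite(rate, cases, population, sep = "/") %>%
+  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
+
+# Check that the values survive the round trip
+all(round_trip$cases == table1$cases)
+all(round_trip$population == table1$population)
+```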
+
+### Exercises
+
+1.  What do the `extra` and `fill` arguments do in `separate()`?
+    Experiment with the various options for the following two toy datasets.
+
+    ```{r, eval = FALSE}
+    tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
+      separate(x, c("one", "two", "three"))
+
+    tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
+      separate(x, c("one", "two", "three"))
+    ```
+
+2.  Both `unite()` and `separate()` have a `remove` argument.
+    What does it do?
+    Why would you set it to `FALSE`?
+
+3.  Compare and contrast `separate()` and `extract()`.
+    Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
+
+4.  In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
+    How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?
+
+    ```{r, eval = FALSE}
+    events <- tribble(
+      ~month, ~day,
+      1     , 20,
+      1     , 21,
+      1     , 22
+    )
+
+    events %>%
+      unite("date", month:day, sep = "-", remove = FALSE)
+    ```
+
+5.  You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
+    Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings.
+    Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
+    Do this in two ways: using a positive and a negative value for `sep`.
+
+    ```{r}
+    baker <- tribble(
+      ~location,
+      "FLBaker County",
+      "GABaker County",
+      "ORBaker County",
+    )
+    baker
+    ```
+
+##