Start sketching out extract section
parent 35d4eed391
commit 7189136cf5
100
strings.qmd
@@ -258,15 +258,99 @@ df |>

Working from <https://github.com/tidyverse/tidyr/pull/1304>.

It's common for multiple variables' worth of data to be stored in a single string.
In this section you'll learn how to use various tidyr functions to extract them.
It's very common for multiple variables to be crammed together into a single string.
In this section you'll learn how to use four tidyr functions to extract them:

- `separate_by_longer()`
- `separate_at_longer()`
- `separate_by_wider()`
- `separate_at_wider()`
- `df |> separate_by_longer(col, sep)`
- `df |> separate_at_longer(col, width)`
- `df |> separate_by_wider(col, sep, names)`
- `df |> separate_at_wider(col, widths)`

We'll come back to the fifth member of this family, `separate_regex_wider()`, in @sec-regular-expressions, since you need to know regular expressions to use it.
If you look closely, you can see there's a common pattern here: `separate`, followed by `by` or `at`, followed by `longer` or `wider`.
`by` splits up a string with a separator like `", "` or `" "`.
`at` splits at given locations, like 5, 10, and 17.
`longer` makes the input data frame longer, creating new rows; `wider` makes the input data frame wider, adding new columns.
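
To make the `by`/`at` distinction concrete, here is a minimal sketch in base R rather than tidyr, so it doesn't depend on the development version. `strsplit()` and `substring()` are stand-ins chosen for illustration, not necessarily what the `separate_*()` functions use internally, and the example strings and widths are made up.

```r
# Splitting "by" a separator: the pieces are wherever the separator falls.
by_pieces <- strsplit("a, b, c", ", ", fixed = TRUE)[[1]]
by_pieces

# Splitting "at" positions: the pieces are defined by character widths.
# Widths of 4, 2, and 2 characters (illustrative values).
widths <- c(4, 2, 2)
ends <- cumsum(widths)        # last character of each piece
starts <- ends - widths + 1   # first character of each piece
at_pieces <- substring("202215TX", starts, ends)
at_pieces
```

Both calls return a character vector of pieces; the `longer` and `wider` variants differ only in whether those pieces end up in new rows or new columns.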

There's one more member of this family, `separate_regex_wider()`, that we'll come back to in @sec-regular-expressions.
It's the most flexible of the `wider` forms, but you need to know a bit about regular expressions in order to use it.

```{r}
#| include: false
has_dev_tidyr <- packageVersion("tidyr") >= "1.2.1.9001"
```

The next two sections will give you the basic idea behind these separate functions, and then we'll work through a few case studies that require multiple uses.

### Splitting into rows

`separate_by_longer()` and `separate_at_longer()` are most useful when the number of components varies from row to row.
`separate_by_longer()` is the more commonly needed of the two:

```{r}
#| eval: !expr has_dev_tidyr

df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |>
  separate_by_longer(x, sep = ",")
```

(If the separators vary, you can use a regular expression instead, if you're already familiar with them.)
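
As a rough illustration of separators that vary, this base R sketch splits on a comma or semicolon followed by optional spaces. It uses `strsplit()` with a regular expression; whether `separate_by_longer()` accepts the same pattern through `sep` is an assumption about the development API, so it isn't shown here. The data is made up.

```r
# Separators vary: ", " in some places, ";" in others (made-up data).
x <- c("a, b;c", "d;e", "f")

# "[,;] *" matches a comma or semicolon followed by any number of spaces.
pieces <- strsplit(x, "[,;] *")

# One row per piece, mimicking a "longer" result.
data.frame(x = unlist(pieces), stringsAsFactors = FALSE)
```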

It's rarer to see `separate_at_longer()` in the wild, but some older datasets use a very compact format where each character records a value:

```{r}
#| eval: !expr has_dev_tidyr

df2 <- tibble(x = c("1211", "131", "21"))
df2 |>
  separate_at_longer(x, width = 1)
```
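
If you don't have the development version of tidyr installed, the following base R sketch shows roughly what a width-1 split into rows amounts to. The `row` column is an illustrative extra, added here only to show which input string each character came from; it is not claimed to be part of the tidyr output.

```r
x <- c("1211", "131", "21")

# Splitting on the empty string yields one element per character.
chars <- strsplit(x, "")

# One output row per character, tagged with the input row it came from.
longer <- data.frame(
  row = rep(seq_along(x), lengths(chars)),
  value = unlist(chars),
  stringsAsFactors = FALSE
)
longer
```

The three input strings have 4, 3, and 2 characters, so the result has 9 rows.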

### Splitting into columns

`separate_by_wider()` and `separate_at_wider()` are most useful when there is a fixed number of components in each string, and you want to spread them into columns.
They are more complicated than their `longer` equivalents because you need to name the columns.

```{r}
#| eval: !expr has_dev_tidyr

df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
  separate_by_wider(x, sep = ",", names = c("letter", "number", "year"))
```

If a specific value is not useful, you can use `NA` in place of its name to omit it from the results:

```{r}
#| eval: !expr has_dev_tidyr

df3 <- tibble(x = c("a,1,2022", "b,2,2011", "e,5,2015"))
df3 |>
  separate_by_wider(x, sep = ",", names = c("letter", NA, "year"))
```
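
The `NA` convention can be sketched in base R as follows. This is only an approximation of the behavior described above, not tidyr's implementation, and `col_names` is a hypothetical helper variable introduced for the example.

```r
x <- c("a,1,2022", "b,2,2011", "e,5,2015")
col_names <- c("letter", NA, "year")

# Each string has exactly three components, so the pieces form a 3-column matrix.
pieces <- do.call(rbind, strsplit(x, ",", fixed = TRUE))

# Drop the columns whose name is NA, then label the rest.
keep <- !is.na(col_names)
wider <- as.data.frame(pieces[, keep, drop = FALSE], stringsAsFactors = FALSE)
names(wider) <- col_names[keep]
wider
```

Note that every resulting column is still a character vector; converting `year` to a number would be a separate step.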

Alternatively, you can provide `names_sep`, and `separate_by_wider()` will use that separator to generate the column names automatically:

```{r}
#| eval: !expr has_dev_tidyr

df3 |>
  separate_by_wider(x, sep = ",", names_sep = "_")
```

`separate_at_wider()` works a little differently, because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
You can omit values from the output by not naming them:

```{r}
#| eval: !expr has_dev_tidyr

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
  separate_at_wider(x, c(year = 4, age = 2, state = 2))
```
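
The chunk above names all three widths, so nothing is dropped. As a hedged illustration of omitting a value, this base R sketch leaves the middle width unnamed and keeps only the named pieces. It mimics the behavior described in the text; it is not the tidyr implementation.

```r
x <- c("202215TX", "202122LA", "202325CA")

# The middle width is unnamed, so that piece is dropped from the output.
widths <- c(year = 4, 2, state = 2)

ends <- cumsum(widths)        # last character of each piece
starts <- ends - widths + 1   # first character of each piece

# Extract every piece, then keep only the named ones.
pieces <- lapply(seq_along(widths), function(i) substr(x, starts[[i]], ends[[i]]))
names(pieces) <- names(widths)
wider <- as.data.frame(pieces[names(widths) != ""], stringsAsFactors = FALSE)
wider
```

The unnamed width still consumes its two characters, which is what keeps `state` aligned with the last two characters of each string.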

### Case studies

## Letters

@@ -355,7 +439,7 @@ TODO: add example with a plot.

1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?

## Locale dependent operations {#sec-other-languages}
## Locale dependent {#sec-other-languages}

So far all of our examples have been using English.
The details of the many ways other languages differ from English are too diverse to detail here, but we wanted to give a quick outline of the functions whose behavior differs based on your **locale**, the set of settings that vary from country to country.