More string stuff

This commit is contained in:
Hadley Wickham
2022-01-05 22:49:31 -06:00
parent fd2a95d4dc
commit 3c97cfed3f
2 changed files with 99 additions and 97 deletions

View File

@@ -376,6 +376,33 @@ There are three ways we could fix this:
This is pretty typical when working with strings --- there are often multiple ways to reach your goal, either making your pattern more complicated or by doing some preprocessing on your string.
If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
### Replace matches
Sometimes there are inconsistencies in the formatting that are easier to fix before you start extracting; easier to make the data more regular and check your work than coming up with a more complicated regular expression in `str_*` and friends.
`str_replace_all()` allow you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector.
The name gives a regular expression to match, and the value gives the replacement.
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
`str_remove_all()` is a short cut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- don't need formal tests, but useful to build up a set of positive and negative test cases as you.
### Pattern control
Now that you've learn about regular expressions, you might be worried about them working when you don't want them to.
@@ -420,33 +447,6 @@ In this section you'll learn how to use various functions tidyr to extract them.
Waiting on: <https://github.com/tidyverse/tidyups/pull/15>
### Replace matches
Sometimes there are inconsistencies in the formatting that are easier to fix before you start extracting; easier to make the data more regular and check your work than coming up with a more complicated regular expression in `str_*` and friends.
`str_replace_all()` allow you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector.
The name gives a regular expression to match, and the value gives the replacement.
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
`str_remove_all()` is a short cut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- don't need formal tests, but useful to build up a set of positive and negative test cases as you.
## Locale dependent operations {#other-languages}
So far all of our examples have been using English.
@@ -499,9 +499,9 @@ Fortunately there are three sets of functions where the locale matters:
[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.
## Handy functions
## Letters
Before we study three useful families of string functions, I want to
Functions that work with the letters inside of the string.
### Length
@@ -527,24 +527,6 @@ babynames %>%
count(name, wt = n, sort = TRUE)
```
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
stringr provides two useful tools for cases where your string is too long:
- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing any thing too long with `…`.
- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer time)
```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
str_trunc(x, 30)
str_view(str_wrap(x, 30))
```
TODO: add example with a plot.
### Subsetting
You can extract parts of a string using `str_sub(string, start, end)`.
@@ -577,6 +559,24 @@ babynames %>%
)
```
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
stringr provides two useful tools for cases where your string is too long:
- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing any thing too long with `…`.
- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer time)
```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
str_trunc(x, 30)
str_view(str_wrap(x, 30))
```
TODO: add example with a plot.
### Exercises
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?