More string stuff

Hadley Wickham 2022-01-05 22:49:31 -06:00
parent fd2a95d4dc
commit 3c97cfed3f
2 changed files with 99 additions and 97 deletions


@ -14,6 +14,10 @@ Here we'll focus mostly on pattern language itself, not the functions that use i
That means we'll mostly work with character vectors, showing the results with `str_view()` and `str_view_all()`.
You'll need to take what you learn and apply it to data frames with tidyr functions or by combining dplyr and stringr functions.
We'll first expand your knowledge of patterns.
Then we'll talk about "grouping" and "capturing" and how they work with `str_separate_group()`.
We'll finish up with a few important details for how regexps work, and then discuss some useful strategies.
### Prerequisites
This chapter will use regular expressions as provided by the **stringr** package.
@ -31,7 +35,27 @@ You can learn more about these advanced features in `vignette("regular-expressio
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but it includes a lot more information about how regular expressions actually work.
## Escaping {#regexp-escaping}
### Exercises
1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
2. How would you match the sequence `"'\`?
3. What patterns will the regular expression `\..\..\..` match?
How would you represent it as a string?
## More patterns
- Anchors, which allow you to ensure the match is at the start or end of a string.
- Alternation and parentheses, which allow you to match "this" or "that", and allow you to control which
- ???
- Character classes, which allow you to assemble
- Quantifiers, which control the number of times a pattern matches
- Grouping and backreferences
Here I used the technical names for each component, even when they're not that evocative of the purpose, because it's helpful to know the correct terms if you later want to Google for more information.
### Escaping {#regexp-escaping}
In Chapter \@ref(strings), you learned how to match a literal `.` by using `fixed(".")`.
What if you want to match a literal `.` as part of a regular expression?
@ -76,38 +100,10 @@ str_view(x, r"(\\)")
The full set of characters with special meanings that need to be escaped is `.^$\|*+?{}[]()`.
In general, look at punctuation characters with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.
As we'll see shortly, escapes can also convert exact matches into special matches.
For example, `s` matches the letter "s", but `\s` matches any whitespace.
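As a small illustration of that difference, the sketch below compares the two patterns on a made-up vector (`x`, `has_s`, and `has_space` are names invented for this example). Note that the regexp `\s` has to be written as the string `"\\s"`.

```{r}
library(stringr)

# Made-up example data: one string with an "s", two with whitespace
x <- c("sun", "a b", "tab\there")

# "s" is an exact match for the letter s
has_s <- str_detect(x, "s")
has_s

# "\\s" is the string form of the regexp \s, which matches any whitespace
# (here a space and a tab)
has_space <- str_detect(x, "\\s")
has_space
```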
### Exercises
1. Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
2. How would you match the sequence `"'\`?
3. What patterns will the regular expression `\..\..\..` match?
How would you represent it as a string?
## More patterns
With the most important topic of escaping under your belt, now it's time to learn a grab bag of useful patterns.
The following sections will teach you about:
- Anchors, which allow you to ensure the match is at the start or end of a string.
- Alternation and parentheses, which allow you to match "this" or "that", and allow you to control which
- ???
- Character classes, which allow you to assemble
- Quantifiers, which control the number of times a pattern matches
- Grouping and backreferences
I've tried to use the technical names for these various components.
They're not always super informative, but they'll usually at least seem somewhat related, and it's helpful to know the correct terms if you later want to google for more information.
### Anchors
By default, regular expressions will match any part of a string.
It's often useful to **anchor** the regular expression so that it matches from the start or to the end of the string.
You can use:
If you want to match at the start or end you need to **anchor** the regular expression using `^` or `$`.
- `^` to match the start of the string.
- `$` to match the end of the string.
@ -120,6 +116,7 @@ str_view(x, "a$") # match "a" at end
```
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
It's tempting to put `$` at the start, because that's how we write sums of money, but it's not what regular expressions want.
To force a regular expression to only match the full string, anchor it with both `^` and `$`:
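A minimal sketch of the difference, using a made-up vector and `str_detect()` rather than `str_view()` so the result is a plain logical vector (`anywhere` and `whole` are names invented here):

```{r}
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Unanchored: "apple" may appear anywhere in the string
anywhere <- str_detect(x, "apple")
anywhere

# Anchored at both ends: the whole string must be exactly "apple"
whole <- str_detect(x, "^apple$")
whole
```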
@ -149,8 +146,11 @@ You can also create your own collections of characters using `[]`:
- `[^abc]`: matches anything except a, b, or c.
- `[\^\-]`: matches `^` or `-`.
A character class containing a single character can be a nice alternative to escapes when you want to include a single special character (i.e. `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`, but not `]` `\` `^`).
This can be more readable because there are fewer slashes, but it also requires a deeper understanding of regular expressions.
Remember that these are case sensitive.
If you want to match any Latin letter or number, you could write `[a-zA-Z0-9]`.
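To see the case sensitivity in action, here is a short sketch on made-up input (`lower_only` and `alnum` are names invented for this example):

```{r}
library(stringr)

x <- c("abc", "ABC", "a1!", "---")

# [a-z] only covers lower case letters, so "ABC" does not match
lower_only <- str_detect(x, "[a-z]")
lower_only

# [a-zA-Z0-9] covers both cases plus digits
alnum <- str_detect(x, "[a-zA-Z0-9]")
alnum
```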
A character class containing a single character can be a nice alternative to escapes when you want to include a single special character (i.e. `$` `.` `|` `?` `*` `+` `(` `)` `[` `{` `}`, but not `]` `\` `^`).
This can be more readable because there are fewer slashes, but it does require a deeper understanding of regular expressions.
```{r}
# Look for a literal character that normally has special meaning in a regex
@ -247,37 +247,39 @@ str_view(c("grey", "gray"), "gr(e|a)y")
Since `words` is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
3. Create regular expressions to find all words that:
3. What strings will `$a` match?
4. Create regular expressions to find all words that:
a. Start with a vowel.
b. Contain only consonants. (Hint: think about matching "not"-vowels.)
c. End with `ed`, but not with `eed`.
d. End with `ing` or `ise`.
4. Empirically verify the rule "i before e except after c".
5. Empirically verify the rule "i before e except after c".
5. Is "q" always followed by a "u"?
6. Is "q" always followed by a "u"?
6. Write a regular expression that matches a `word` if it's probably written in British English, not American English.
7. Write a regular expression that matches a `word` if it's probably written in British English, not American English.
7. Create a regular expression that will match telephone numbers as commonly written in your country.
8. Create a regular expression that will match telephone numbers as commonly written in your country.
8. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
9. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
9. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
10. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
a. `^.*$`
b. `"\\{.+\\}"`
c. `\d{4}-\d{2}-\d{2}`
d. `"\\\\{4}"`
10. Create regular expressions to find all words that:
11. Create regular expressions to find all words that:
a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.
11. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
12. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
## Grouping and capturing
@ -408,18 +410,18 @@ You can use the other arguments of `regex()` to control details of the match:
```
- `comments = TRUE` allows you to use comments and white space to make complex regular expressions more understandable.
Spaces are ignored, as is everything after `#`.
Spaces and new lines are ignored, as is everything after `#`.
To match a literal space, you'll need to escape it: `"\\ "`.
```{r}
phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{4}) # four more numbers
", comments = TRUE)
phone <- regex(r"(
\(? # optional opening parens
(\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\d{3}) # another three numbers
[ -]? # optional space or dash
(\d{4}) # four more numbers
)", comments = TRUE)
str_match("514-791-8141", phone)
```
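Because unescaped spaces are ignored when `comments = TRUE`, a pattern that needs a real space has to escape it. The sketch below illustrates this with a made-up pattern (`count_noun` and `found` are hypothetical names) that looks for digits, a single space, and then a lower case word:

```{r}
library(stringr)

count_noun <- regex(r"(
  \d+      # one or more digits
  \        # an escaped, and therefore literal, space
  [a-z]+   # a lower case word
)", comments = TRUE)

# Only the first string has a space between the number and the word
found <- str_detect(c("3 people", "3people"), count_noun)
found
```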


@ -376,6 +376,33 @@ There are three ways we could fix this:
This is pretty typical when working with strings --- there are often multiple ways to reach your goal, either making your pattern more complicated or by doing some preprocessing on your string.
If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
### Replace matches
Sometimes there are inconsistencies in the formatting that are easier to fix before you start extracting: it's often easier to make the data more regular and check your work than to come up with a more complicated regular expression for `str_*` and friends.
`str_replace_all()` allows you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector.
The name gives a regular expression to match, and the value gives the replacement.
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
`str_remove_all()` is a shortcut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
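A quick check of that equivalence, reusing the fruit vector from the earlier example (`removed` and `replaced` are names invented here):

```{r}
library(stringr)

x <- c("apple", "pear", "banana")

# Remove every vowel
removed <- str_remove_all(x, "[aeiou]")
removed

# The same thing spelled out as a replacement with the empty string
replaced <- str_replace_all(x, "[aeiou]", "")
identical(removed, replaced)
```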
Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- don't need formal tests, but useful to build up a set of positive and negative test cases as you.
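One possible way to follow that recommendation is sketched below. The helper `clean_label()` and its test cases are hypothetical, invented for illustration only: the helper chains a few stringr calls, and `stopifnot()` checks a handful of inputs that should change and one that should not.

```{r}
library(stringr)

# Hypothetical helper: turn runs of "_" or "." into a single space,
# tidy up whitespace, and lower case the result
clean_label <- function(x) {
  x %>%
    str_replace_all("[_.]+", " ") %>%
    str_squish() %>%
    str_to_lower()
}

# Positive cases: inputs the function should change
stopifnot(
  clean_label("First_Name") == "first name",
  clean_label("  last..name ") == "last name"
)

# Negative case: an already clean input should come back unchanged
stopifnot(clean_label("age") == "age")
```

Because `clean_label()` is vectorised, it could then be called on a column inside `mutate()` like any other function.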
### Pattern control
Now that you've learned about regular expressions, you might be worried about them working when you don't want them to.
@ -420,33 +447,6 @@ In this section you'll learn how to use various functions tidyr to extract them.
Waiting on: <https://github.com/tidyverse/tidyups/pull/15>
### Replace matches
Sometimes there are inconsistencies in the formatting that are easier to fix before you start extracting: it's often easier to make the data more regular and check your work than to come up with a more complicated regular expression for `str_*` and friends.
`str_replace_all()` allows you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can perform multiple replacements by supplying a named vector.
The name gives a regular expression to match, and the value gives the replacement.
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
`str_remove_all()` is a shortcut for `str_replace_all(x, pattern, "")` --- it removes matching patterns from a string.
Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- don't need formal tests, but useful to build up a set of positive and negative test cases as you.
## Locale dependent operations {#other-languages}
So far all of our examples have been using English.
@ -499,9 +499,9 @@ Fortunately there are three sets of functions where the locale matters:
[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.
## Handy functions
## Letters
Before we study three useful families of string functions, I want to
Functions that work with the letters inside of the string.
### Length
@ -527,24 +527,6 @@ babynames %>%
count(name, wt = n, sort = TRUE)
```
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
stringr provides two useful tools for cases where your string is too long:
- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing anything too long with `…`.
- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer line).
```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
str_trunc(x, 30)
str_view(str_wrap(x, 30))
```
TODO: add example with a plot.
### Subsetting
You can extract parts of a string using `str_sub(string, start, end)`.
@ -577,6 +559,24 @@ babynames %>%
)
```
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
stringr provides two useful tools for cases where your string is too long:
- `str_trunc(x, 20)` ensures that no string is longer than 20 characters, replacing anything too long with `…`.
- `str_wrap(x, 20)` wraps a string introducing new lines so that each line is at most 20 characters (it doesn't hyphenate, however, so any word longer than 20 characters will make a longer line).
```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
str_trunc(x, 30)
str_view(str_wrap(x, 30))
```
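The caveat about long words can be checked directly. The sketch below uses a made-up label containing a 34-letter word (`label`, `truncated`, and `line_lengths` are names invented here): truncation still respects the width, while wrapping leaves the long word intact on a line of its own that exceeds it.

```{r}
library(stringr)

label <- "A supercalifragilisticexpialidocious label for a plot axis"

# Truncation counts the ellipsis as part of the width
truncated <- str_trunc(label, 20)
str_length(truncated)

# Wrapping never splits a word, so one line is longer than 20 characters
line_lengths <- str_length(str_split(str_wrap(label, 20), "\n")[[1]])
line_lengths
```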
TODO: add example with a plot.
### Exercises
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?