[^regexps-1]: With a hard g, sounding like "reg-x".
The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis.
We'll then expand your knowledge of patterns, to cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping).
Next we'll talk about some of the other types of pattern that stringr functions can work with, and the various "flags" that allow you to tweak the operation of regular expressions.
We'll finish up with a survey of other places in stringr, the tidyverse, and base R where you might use regexes.
In this chapter, we'll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.
Learning regular expressions requires learning two things at once: learning how regular expressions work in general, and learning about the various functions that use them.
We'll start with a basic intro to both, learning some simple patterns and some useful stringr and tidyr functions.
Through this chapter we'll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:
- `fruit` contains the names of 80 fruits.
- `words` contains 980 common English words.
- `sentences` contains 720 short sentences.
To learn how to regex patterns work, we'll start with `str_view()`.
We used `str_view()` in the last chapter to better understand a string vs its printed representation.
Now we'll use it with its second argument which is a regular expression.
When supplied, `str_view()` will show only the elements of the string the match, as well as surrounding the match with `<>` and highlighting in blue, where possible.
In general, any letter or number will match exactly, but punctuation characters like `.`, `+`, `*`, `[`, `]`, `?`, often have special meanings[^regexps-2].
**Quantifiers** control how many times an element that can be applied to other pattern: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
```{r}
# ab? matches an "a", optionally followed by a "b".
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1, so `sum(str_detect(x, pattern))` tells you the number of observations that match and `mean(str_detect(x, pattern))` tells you the proportion that match.
(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies with a name containing an x, you'd need to perform a weighted mean.)
If you look closely, you'll notice that there's something off with our calculations: "Aaban" contains three "a"s, but our summary reports only two vowels.
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`. You learned about this function in @sec-other-languages.
This plethora of options is pretty typical when working with strings --- there are often multiple ways to reach your goal, either making your pattern more complicated or by doing some preprocessing on your string.
We'll concentrate on showing how these patterns work with `str_view()`; remember that you can use them with any of the functions that you learned above.
In general, look at punctuation characters with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.
To remember which is which, try this mnemonic which we learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace.
The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school for what `a + b * c`.
You already know that `a + b * c` is equivalent to `a + (b * c)` not `(a + b) * c` because `*` has higher precedence and `+` has lower precedence: you compute `*` before `+`.
Technically the escape, character classes, and parentheses are all operators that also have precedence.
But these tend to be less likely to cause confusion because they mostly behave how you expect: it's unlikely that you'd think that `\(s|d)` would mean `(\s)|(\d)`.
Parentheses are an important tool for controlling the order in which pattern operations are applied but they also have an important additional effect: they create **capturing groups** that allow you to use to sub-components of the match.
You can refer back to previously matched text inside parentheses by using **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
For example, the following pattern finds all fruits that have a repeated pair of letters:
```{r}
str_view(fruit, "(..)\\1")
```
And this one finds all words that start and end with the same pair of letters:
```{r}
str_view(words, "^(..).*\\1$")
```
You can also use backreferences in `str_replace()`:
If you want extract the matches for each group you can use `str_match()`.
But it returns a matrix, so isn't as easy to work with:
```{r}
sentences |>
str_match("the (\\w+) (\\w+)") |>
head()
```
You could convert to a tibble and name the columns:
```{r}
sentences |>
str_match("the (\\w+) (\\w+)") |>
as_tibble(.name_repair = "minimal") |>
set_names("match", "word1", "word2")
```
But then you've basically recreated your own simple version of `separate_regex_wider()`.
Indeed, behind the scenes `separate_regexp_wider()` converts your vector of patterns to a single regexp that uses grouping to capture only the named components.
Occasionally, you'll want to use parentheses without creating matching groups.
5. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
7. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `multiline` and `dotall` can also be useful.
`dotall = TRUE` allows `.` to match everything, including `\n`:
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".L")
str_view(x, regex(".L", dotall = TRUE))
```
And `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string:
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
str_view(x, regex("^Line", multiline = TRUE))
```
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be extremely useful.
It allows you to use comments and whitespace to make complex regular expressions more understandable.
Spaces and new lines are ignored, as is everything after `#`.
(Note that we use a raw string here to minimize the number of escapes needed.)
```{r}
phone <- regex(r"(
\(? # optional opening parens
(\d{3}) # area code
[)\ -]? # optional closing parens, space, or dash
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{3}) # three more numbers
)", comments = TRUE)
str_match("514-791-8141", phone)
```
If you're using comments and want to match a space, newline, or `#`, you'll need to escape it:
```{r}
str_view("x x #", regex("x #", comments = TRUE))
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
The following three sections help you practice the components of a pattern by discussing three general techniques: checking you work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
### Check your work
First, let's find all sentences that start with "The".
It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.)
Imagine we want to find words that only contain consonants.
One technique is to create a character class that contains all letters except for the vowels (`[^aeiou]`), then allow that to match any number of letters (`[^aeiou]+`), then force it to match the whole string by anchoring to the beginning and the end (`^[^aeiou]+$`):
This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
For example, imagine if you want to find all words that contain "a" and "b".
There's no "and" operator built in to regular expressions so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":
In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
### Creating a pattern with code
What if we wanted to find all `sentences` that mention a color?
The basic idea is simple: we just combine alternation with word boundaries.
In this example `cols` only contains numbers and letters so you don't need to worry about metacharacters.
But in general, when creating patterns from existing strings it's good practice to run through `str_escape()` which will automatically add `\` in front of otherwise special characters.
### Exercises
1. Construct patterns to find evidence for and against the rule "i before e except after c"?
2. `colors()` contains a number of modifiers like "lightgray" and "darkblue". How could you automatically identify these modifiers? (Think about how you might detect and removed what colors are being modified).
3. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`. Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
- `matches()`: a "tidyselect" function that you can use anywhere in the tidyverse when selecting variables (e.g. `dplyr::select()`, `rename_with()`, `across()`, ...).
The regular expressions used by stringr are very slightly different to those of base R.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and we'll point them out where important).
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.