[^regexps-1]: With a hard g, sounding like "reg-x".
The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis.
We'll then expand your knowledge of patterns, to cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping).
Next we'll talk about some of the other types of pattern that stringr functions can work with, and the various "flags" that allow you to tweak the operation of regular expressions.
We'll finish up with a survey of other places in stringr, the tidyverse, and base R where you might use regexes.
In general, any letter or number will match exactly, but punctuation characters like `.`, `+`, `*`, `[`, `]`, and `?` often have special meanings[^regexps-2].
**Quantifiers** control how many times a pattern can match: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
```{r}
# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
```
Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming at first, as if a cat has walked across your keyboard.
So don't worry if they're hard to understand at first; you'll get better with practice.
Let's start that practice with some useful stringr functions.
We can also use `str_detect()` with `summarize()` by pairing it with `sum()` or `mean()`.
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1, so `sum(str_detect(x, pattern))` tells you the number of observations that match and `mean(str_detect(x, pattern))` tells you the proportion of observations that match.
(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies with a name containing an x, you'd need to perform a weighted mean.)
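As a small illustration of the counting and proportion idioms, here is a sketch on a made-up vector (`names_vec` and its values are invented for this example, not taken from any dataset used in this chapter):

```{r}
library(stringr)

# Hypothetical data, invented for illustration
names_vec <- c("Alex", "Max", "Jordan", "Xavier")

# TRUE/FALSE for each name: does it contain a lower-case "x"?
has_x <- str_detect(names_vec, "x")

sum(has_x)   # number of names containing "x"
mean(has_x)  # proportion of names containing "x"
```

Note that "Xavier" does not count here because the pattern only matches a lower-case "x".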
### Count matches
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "p")
```
Note that regular expression matches never overlap so `str_count()` only starts looking for a new match after the end of the last match.
For example, in `"abababa"`, how many times will the pattern `"aba"` match?
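One way to check your answer is to count the matches directly. This is a minimal sketch; the variable name `n_matches` is just for illustration:

```{r}
library(stringr)

# Matches never overlap, so the search for the next match
# resumes after the end of the previous one
n_matches <- str_count("abababa", "aba")
n_matches
```

The result is 2, not 3: the first match uses up characters 1 to 3, so the next match can only start at character 5.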
If you look closely, you'll notice that there's something off with our calculations: "Aaban" contains three "a"s, but our summary reports only two vowels.
That's because we've forgotten to tell you that regular expressions are case sensitive.
- Add the upper case vowels to the character class: `str_count(name, "[aeiouAEIOU]")`.
- Tell the regular expression to ignore case: `str_count(name, regex("[aeiou]", ignore_case = TRUE))`. We'll talk about this more in @sec-flags.
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`. You learned about this function in @sec-other-languages.
This is pretty typical when working with strings: there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string.
If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
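To make the three case-handling fixes concrete, the sketch below applies each of them to a small made-up vector (`name` here is invented for illustration, not the column from the original data) and checks that they agree:

```{r}
library(stringr)

# Hypothetical names, invented for illustration
name <- c("Aaban", "Olivia")

# The case-sensitive count misses the upper-case vowels
str_count(name, "[aeiou]")

# Three equivalent fixes
by_class <- str_count(name, "[aeiouAEIOU]")
by_flag  <- str_count(name, regex("[aeiou]", ignore_case = TRUE))
by_lower <- str_count(str_to_lower(name), "[aeiou]")

identical(by_class, by_flag) && identical(by_flag, by_lower)
```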
We'll concentrate on showing how these patterns work with `str_view()` but remember that you can use them with any of the functions that you learned above.
In general, look at punctuation characters with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.
To remember which is which, try this mnemonic which Hadley learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace.
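Here is a short sketch of those three patterns in action. The input strings are made up for illustration, and the backslashes are doubled because the patterns are written as ordinary R strings:

```{r}
library(stringr)

# colou?r: the "u" is optional, but it cannot repeat
spelling <- str_detect(c("color", "colour", "colouur"), "colou?r")
spelling

# \d+: one or more digits
digits <- str_extract("order 66 shipped", "\\d+")
digits

# \s?: an optional single whitespace character
spaced <- str_detect(c("data base", "database"), "data\\s?base")
spaced
```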
The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school for evaluating `a + b * c`.
You already know that `a + b * c` is equivalent to `a + (b * c)` not `(a + b) * c` because `*` has high precedence and `+` has lower precedence: you compute `*` before `+`.
In regular expressions, quantifiers have high precedence and alternation has low precedence.
That means `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
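The sketch below contrasts the default precedence with explicit parentheses, using small made-up inputs (`x` and `y` are invented for illustration):

```{r}
library(stringr)

x <- c("ab", "abb", "abab")

# Quantifiers bind tightly: in ab+ only the "b" repeats
tight <- str_extract(x, "ab+")
tight
# Parentheses make the quantifier apply to the whole "ab"
grouped <- str_extract(x, "(ab)+")
grouped

y <- c("axx", "xxb", "xabx")

# Alternation binds loosely: ^a|b$ means "starts with a" OR "ends with b"
loose <- str_detect(y, "^a|b$")
loose
# With parentheses, the whole string must be exactly "a" or "b"
strict <- str_detect(y, "^(a|b)$")
strict
```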
Technically the escape, character classes, and parentheses are all operators that also have precedence.
But these tend to be less likely to cause confusion because they mostly behave how you expect: it's unlikely that you'd think that `\(s|d)` would mean `(\s)|(\d)`.
Parentheses are an important tool for controlling the order in which pattern operations are applied, but they also have an important additional effect: they create **capturing groups** that allow you to use sub-components of the match.
You can refer back to previously matched text inside parentheses by using a **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
For example, the following pattern finds all fruits that have a repeated pair of letters:
```{r}
str_view(fruit, "(..)\\1")
```
And this one finds all words that start and end with the same pair of letters:
```{r}
str_view(words, "^(..).*\\1$")
```
You can also use backreferences in `str_replace()`:
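As a minimal sketch on a made-up string (`x` is invented for this example), the replacement below swaps the first two words by referring to the two groups as `\\1` and `\\2`:

```{r}
library(stringr)

# Hypothetical sentence, invented for illustration
x <- "apple pie is good"

# Group 1 is the first word, group 2 is the second word;
# the replacement puts them back in the opposite order
swapped <- str_replace(x, "(\\w+) (\\w+)", "\\2 \\1")
swapped
```

Because `str_replace()` only replaces the first match, the rest of the string is left untouched.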
If you want to extract the matches for each group you can use `str_match()`.
But it returns a matrix, so it isn't as easy to work with:
```{r}
sentences |>
  str_match("the (\\w+) (\\w+)") |>
  head()
```
You could convert to a tibble and name the columns:
```{r}
sentences |>
  str_match("the (\\w+) (\\w+)") |>
  as_tibble(.name_repair = "minimal") |>
  set_names("match", "word1", "word2")
```
But then you've basically recreated your own simple version of `separate_wider_regex()`.
Indeed, behind the scenes `separate_wider_regex()` converts your vector of patterns to a single regexp that uses grouping to capture only the named components.
Occasionally, you'll want to use parentheses without creating matching groups.
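The usual regex syntax for a group that does not capture is `(?:...)`. The sketch below, on made-up input, compares the number of columns `str_match()` returns with and without capturing (the variable names are just for illustration):

```{r}
library(stringr)

x <- c("gray", "grey")

# Capturing group: the full match plus one column for the group
with_group <- str_match(x, "gr(e|a)y")
with_group

# Non-capturing group: only the full match is returned
without_group <- str_match(x, "gr(?:e|a)y")
without_group
```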
3. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
6. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `multiline` and `dotall` can also be useful.
`dotall = TRUE` allows `.` to match everything, including `\n`:
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".L")
str_view(x, regex(".L", dotall = TRUE))
```
And `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string:
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
str_view(x, regex("^Line", multiline = TRUE))
```
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be extremely useful.
It allows you to use comments and whitespace to make complex regular expressions more understandable.
Spaces and new lines are ignored, as is everything after `#`.
(Note that we use a raw string here to minimize the number of escapes needed.)
```{r}
phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\ -]? # optional closing parens, space, or dash
    (\d{3}) # another three numbers
    [\ -]?  # optional space or dash
    (\d{4}) # four more numbers
  )",
  comments = TRUE
)
str_match("514-791-8141", phone)
```
If you're using comments and want to match a space, newline, or `#`, you'll need to escape it:
```{r}
str_view("x x #", regex("x #", comments = TRUE))
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
```
The following three sections help you practice the components of a pattern by discussing three general techniques: checking your work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.
### Check your work
First, let's find all sentences that start with "The".
It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless, they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
(If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.)
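A minimal sketch of this workflow for the "starts with The" problem might look like the following. The test strings in `pos` and `neg` are invented for illustration, and `"^The\\b"` is one possible pattern, not necessarily the only good one:

```{r}
library(stringr)

# Hypothetical controls, invented for illustration
pos <- c("The cat sat.", "The")
neg <- c("They left.", "Then it rained.", "At the end. The end.")

# \b requires a word boundary after "The", which rules out "They" and "Then"
pattern <- "^The\\b"

all(str_detect(pos, pattern))   # every positive control should match
!any(str_detect(neg, pattern))  # no negative control should match
```

Dropping the `\\b` makes the negative controls fail, which is exactly the kind of mistake they are there to catch.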
Imagine we want to find words that only contain consonants.
One technique is to create a character class that contains all letters except for the vowels (`[^aeiou]`), then allow that to match any number of letters (`[^aeiou]+`), then force it to match the whole string by anchoring to the beginning and the end (`^[^aeiou]+$`):
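The sketch below runs that pattern on a few made-up words (`w` is invented for illustration). For comparison it also shows an alternative that is an assumption on our part about how you might simplify the problem: detect words that contain a vowel and negate the result.

```{r}
library(stringr)

# Hypothetical words, invented for illustration
w <- c("rhythm", "apple", "by", "sky", "tree")

# Single pattern: every character from start to end is a non-vowel
single_regex <- str_detect(w, "^[^aeiou]+$")
single_regex

# Alternative: find words with any vowel, then flip the logical vector
negated <- !str_detect(w, "[aeiou]")
negated

identical(single_regex, negated)
```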
This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
For example, imagine you want to find all words that contain "a" and "b".
There's no "and" operator built in to regular expressions so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":
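Here is a sketch of that alternation on made-up words (`w` is invented for illustration), alongside a simpler version that combines two `str_detect()` calls with `&`:

```{r}
library(stringr)

# Hypothetical words, invented for illustration
w <- c("about", "baby", "cat", "bob")

# One pattern: "a" then later "b", or "b" then later "a"
one_pattern <- str_detect(w, "a.*b|b.*a")
one_pattern

# Two simple patterns combined with a logical "and"
two_calls <- str_detect(w, "a") & str_detect(w, "b")
two_calls

identical(one_pattern, two_calls)
```

The second form scales more gracefully: adding a third required letter means one more `str_detect()` call rather than six orderings in a single pattern.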
In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
### Creating a pattern with code
What if we wanted to find all `sentences` that mention a color?
The basic idea is simple: we just combine alternation with word boundaries.
In this example `cols` only contains numbers and letters so you don't need to worry about metacharacters.
But in general, when creating patterns from existing strings it's good practice to run them through `str_escape()`, which will automatically add `\` in front of otherwise special characters so that they match literally.
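The approach can be sketched as follows. The short `cols` vector here is a stand-in invented for illustration, and `str_escape()` is assumed to be available (it was added in stringr 1.5.0):

```{r}
library(stringr)

# Hypothetical color list, invented for illustration
cols <- c("red", "green", "blue")

# Join the colors with "|" and wrap them in word boundaries
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
pattern

# "reduced" contains "red" but not as a whole word
whole_word <- str_detect(c("The red barn", "A reduced price"), pattern)
whole_word

# Strings containing metacharacters should be escaped first
escaped <- str_escape("a.b")
literal <- str_detect(c("a.b", "axb"), escaped)
literal
```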
### Exercises
1. Construct patterns to find evidence for and against the rule "i before e except after c".
2. `colors()` contains a number of modifiers like "lightgray" and "darkblue". How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are being modified.)
3. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`. Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.
- `matches()`: a "tidyselect" function that you can use anywhere in the tidyverse when selecting variables (e.g. `dplyr::select()`, `rename_with()`, `across()`, ...).
The regular expressions used by stringr are very slightly different to those of base R.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and we'll point them out where important).
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.