26
regexps.qmd
@@ -34,7 +34,7 @@ library(tidyverse)
|
||||
|
||||
It's worth noting that the regular expressions used by stringr are very slightly different to those of base R.
|
||||
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
|
||||
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and I'll point them out where important).
|
||||
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and we'll point them out where important).
|
||||
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
|
||||
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
|
||||
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
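
As a quick check of the claim that the basics behave the same in both engines, here is a minimal sketch that runs one simple pattern through stringr and through base R. The variable names are our own, chosen for illustration; only `str_detect()` and `grepl()` come from the packages themselves.

```r
library(stringr)

# Small illustrative input, matching the style of the examples in this chapter
x <- c("abc", "a.c", "bef")

# stringr (ICU engine): string first, pattern second
stringr_result <- str_detect(x, "a.c")

# base R (TRE engine by default): pattern first, string second
base_result <- grepl("a.c", x)

# For a basic pattern like this, both approaches should agree
identical(stringr_result, base_result)
```

Note the reversed argument order: stringr functions take the string first, while `grepl()` takes the pattern first.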
|
||||
@@ -57,10 +57,10 @@ Next you'll learn about **anchors**, which allow you to match the start or end o
|
||||
Then you'll learn about **character classes** and their shortcuts, which allow you to match any character from a set.
|
||||
We'll finish up with **quantifiers**, which control how many times a pattern can match, and **alternation**, which allows you to match either *this* or *that.*
|
||||
|
||||
The terms I use here are the technical names for each component.
|
||||
The terms we use here are the technical names for each component.
|
||||
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
|
||||
|
||||
I'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
|
||||
We'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()`, but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
|
||||
|
||||
- `str_detect(x, pattern)` returns a logical vector the same length as `x`, indicating whether each element matches (`TRUE`) or doesn't match (`FALSE`) the pattern.
|
||||
- `str_count(x, pattern)` returns the number of times `pattern` matches in each element of `x`.
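
The two functions described above can be tried on a small vector. This is only a sketch: the input `x` and the patterns are assumptions made up for illustration, not examples from the chapter.

```r
library(stringr)

# Illustrative input (an assumption, not taken from the chapter)
x <- c("apple", "banana", "pear")

# One logical value per element: does it contain "an"?
detected <- str_detect(x, "an")
detected

# One integer per element: how many times does "a" occur?
counts <- str_count(x, "a")
counts
```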
|
||||
@@ -87,7 +87,7 @@ str_view(dot)
|
||||
str_view(c("abc", "a.c", "bef"), "a\\.c")
|
||||
```
|
||||
|
||||
In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`.
|
||||
In this book, we'll write the regular expression as `\.` and strings that represent the regular expression as `"\\."`.
|
||||
|
||||
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
|
||||
Well you need to escape it, creating the regular expression `\\`.
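
To make the double layer of escaping concrete, here is a small sketch. The string `x` is our own illustrative input; the point is that the regular expression `\\` has to be written as the R string `"\\\\"`, because the string itself also uses `\` as an escape character.

```r
library(stringr)

# The R string "a\\b" holds three characters: a, \ and b
x <- "a\\b"
n_chars <- str_length(x)
n_chars

# The regular expression \\ is written as the string "\\\\"
has_backslash <- str_detect(x, "\\\\")
has_backslash
```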
|
||||
@@ -125,7 +125,7 @@ str_view(x, "^a") # match "a" at start
|
||||
str_view(x, "a$") # match "a" at end
|
||||
```
|
||||
|
||||
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||
To remember which is which, try this mnemonic which Hadley learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||
It's tempting to put `$` at the start, because that's how we write sums of money, but it's not what regular expressions want.
|
||||
|
||||
To force a regular expression to only match the full string, anchor it with both `^` and `$`:
|
||||
@@ -137,9 +137,9 @@ str_view(x, "^apple$")
|
||||
```
|
||||
|
||||
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
|
||||
I don't often use this in my R code, but I'll sometimes use it when I'm doing a search in RStudio.
|
||||
This is not that useful in R code, but it can be handy when searching in RStudio.
|
||||
It's useful to find the name of a function that's a component of other functions.
|
||||
For example, if I want to find all uses of `sum()`, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||
|
||||
```{r}
|
||||
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
@@ -266,7 +266,7 @@ But these tend to be less likely to cause confusion because they mostly behave h
|
||||
|
||||
6. Write the equivalents of `?`, `+`, `*` in `{m,n}` form.
|
||||
|
||||
7. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
|
||||
7. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
|
||||
|
||||
a. `^.*$`
|
||||
b. `"\\{.+\\}"`
|
||||
@@ -310,7 +310,7 @@ str_view(sentences, "^She|He|It|They\\b", match = TRUE)
|
||||
```
|
||||
|
||||
A quick inspection of the results shows that we're getting some spurious matches.
|
||||
That's because I've forgotten to use parentheses:
|
||||
That's because we've forgotten to use parentheses:
|
||||
|
||||
```{r}
|
||||
str_view(sentences, "^(She|He|It|They)\\b", match = TRUE)
|
||||
@@ -356,7 +356,7 @@ There's no "and" operator built in to regular expressions so we have to tackle i
|
||||
words[str_detect(words, "a.*b|b.*a")]
|
||||
```
|
||||
|
||||
I think its simpler to combine the results of two calls to `str_detect()`:
|
||||
It's simpler to combine the results of two calls to `str_detect()`:
|
||||
|
||||
```{r}
|
||||
words[str_detect(words, "a") & str_detect(words, "b")]
|
||||
@@ -490,7 +490,7 @@ sentences |>
|
||||
head(10)
|
||||
```
|
||||
|
||||
But I think you're generally better off using `str_match()` or `tidyr::separate_groups()`, which you'll learn about next.
|
||||
But you're generally better off using `str_match()` or `tidyr::separate_groups()`, which you'll learn about next.
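
As a brief preview of `str_match()`, the sketch below uses a made-up sentence and a simple two-group pattern (both are assumptions chosen for illustration). `str_match()` returns a character matrix with the full match in the first column and one further column per capturing group.

```r
library(stringr)

# Illustrative input and pattern (assumptions, not from the chapter)
sentence <- "the cat sat"
pattern <- "(a|the) ([^ ]+)"

# Column 1 is the full match; columns 2 and 3 are the two groups
m <- str_match(sentence, pattern)
m
```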
|
||||
|
||||
### Extracting groups
|
||||
|
||||
@@ -503,7 +503,7 @@ sentences |>
|
||||
head()
|
||||
```
|
||||
|
||||
Instead I recommend using tidyr's `separate_groups()` which creates a column for each capturing group.
|
||||
Instead, we recommend using tidyr's `separate_groups()` which creates a column for each capturing group.
|
||||
|
||||
### Named groups
|
||||
|
||||
@@ -601,7 +601,7 @@ str_view_all(x, regex("^Line", multiline = TRUE))
|
||||
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be extremely useful.
|
||||
It allows you to use comments and whitespace to make complex regular expressions more understandable.
|
||||
Spaces and new lines are ignored, as is everything after `#`.
|
||||
(Note that I'm using a raw string here to minimize the number of escapes needed)
|
||||
(Note that we use a raw string here to minimize the number of escapes needed.)
|
||||
|
||||
```{r}
|
||||
phone <- regex(r"(
|
||||
|
||||