More noodling on regexps
parent
011f8cceee
commit
fd2a95d4dc
99
regexps.Rmd
|
@ -14,8 +14,6 @@ Here we'll focus mostly on pattern language itself, not the functions that use i
|
|||
That means we'll mostly work with character vectors, showing the results with `str_view()` and `str_view_all()`.
|
||||
You'll need to take what you learn and apply it to data frames with tidyr functions or by combining dplyr and stringr functions.
|
||||
|
||||
The full language of regular expression includes some
|
||||
|
||||
### Prerequisites
|
||||
|
||||
This chapter will use regular expressions as provided by the **stringr** package.
|
||||
|
@ -30,6 +28,9 @@ Fortunately, the basics of regular expressions are so well established that you'
|
|||
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
|
||||
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
|
||||
|
||||
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
|
||||
It's not R specific, but it includes a lot more information about how regular expressions actually work.
|
||||
|
||||
## Escaping {#regexp-escaping}
|
||||
|
||||
In Chapter \@ref(strings), you learned how to match a literal `.` by using `fixed(".")`.
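
As a small sketch of the difference (the vector `x` here is made up for illustration), compare a literal match with `fixed()` against an unescaped and an escaped `.` in a regular expression:

```{r}
library(stringr)

x <- c("a.c", "abc", "a-c")

# fixed() treats the pattern as literal text, so "." only matches a period.
str_detect(x, fixed("."))
#> [1]  TRUE FALSE FALSE

# As a regular expression, an unescaped "." matches any character.
str_detect(x, "a.c")
#> [1] TRUE TRUE TRUE

# Escaping the period with a backslash (itself escaped inside the R string)
# makes it match only a literal ".".
str_detect(x, "a\\.c")
#> [1]  TRUE FALSE FALSE
```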
|
||||
|
@ -113,13 +114,14 @@ You can use:
|
|||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_view(x, "^a")
|
||||
str_view(x, "a$")
|
||||
str_view(x, "a") # match "a" anywhere
|
||||
str_view(x, "^a") # match "a" at start
|
||||
str_view(x, "a$") # match "a" at end
|
||||
```
|
||||
|
||||
To remember which is which, try this mnemonic, which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
|
||||
|
||||
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
|
||||
To force a regular expression to only match the full string, anchor it with both `^` and `$`:
|
||||
|
||||
```{r}
|
||||
x <- c("apple pie", "apple", "apple cake")
|
||||
|
@ -138,46 +140,12 @@ str_view(x, "sum")
|
|||
str_view(x, "\\bsum\\b")
|
||||
```
|
||||
|
||||
### Alternation and parentheses
|
||||
|
||||
You can use **alternation** to pick between one or more alternative patterns.
|
||||
For example, `abc|def` will match either `"abc"` or `"def"`.
|
||||
Note that the precedence for `|` is low: `abc|def` means `(abc)|(def)`, not `ab(c|d)ef`, so you'll often need parentheses to limit the alternation to part of a pattern.
|
||||
|
||||
`abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`.
|
||||
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
|
||||
|
||||
### Matching multiple characters
|
||||
|
||||
There are a number of special patterns that match more than one character.
|
||||
You've already seen `.`, which matches any character apart from a newline.
|
||||
There are three escaped pairs that match narrower classes of characters:
|
||||
|
||||
- `\d`: matches any digit. `\D` matches anything that isn't a digit.
|
||||
- `\s`: matches any whitespace (e.g. space, tab, newline). `\S` matches anything that isn't whitespace.
|
||||
- `\w`: matches any "word" character, i.e. letters, digits, and the underscore. `\W` matches any non-word character.
|
||||
|
||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
||||
|
||||
```{r}
|
||||
str_view_all("abcd12345!@#%. ", "\\d+")
|
||||
str_view_all("abcd12345!@#%. ", "\\D+")
|
||||
str_view_all("abcd12345!@#%. ", "\\w+")
|
||||
str_view_all("abcd12345!@#%. ", "\\W+")
|
||||
str_view_all("abcd12345!@#%. ", "\\s+")
|
||||
str_view_all("abcd12345!@#%. ", "\\S+")
|
||||
```
|
||||
|
||||
### Character classes
|
||||
|
||||
You can also create your own collections of characters using `[]`:
|
||||
|
||||
- `[abc]`: matches a, b, or c.
|
||||
- `[a-z]`: matches every character between a and z.
|
||||
- `[a-z]`: matches every character between a and z. `[0-9]` matches any digit.
|
||||
- `[^abc]`: matches anything except a, b, or c.
|
||||
- `[\^\-]`: matches `^` or `-`.
|
||||
|
||||
|
@ -191,6 +159,28 @@ str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
|||
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
|
||||
```
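
The same kinds of classes can be checked without the viewer by using `str_detect()`. This is only a sketch; the vector `x` is an assumed example, not part of the chapter's data:

```{r}
library(stringr)

x <- c("abc", "a.c", "a*c", "a c", "a^c", "a-c")

# Inside [], "." and "*" lose their special meaning and match literally.
str_detect(x, "a[.*]c")
#> [1] FALSE  TRUE  TRUE FALSE FALSE FALSE

# A leading "^" negates the class: any middle character except "b".
str_detect(x, "a[^b]c")
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# "^" and "-" are special inside [], so they are escaped to match literally.
str_detect(x, "a[\\^\\-]c")
#> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE
```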
|
||||
|
||||
### Shorthand character classes
|
||||
|
||||
There are a few character classes that are used so commonly that they get their own shortcut.
|
||||
You've already seen `.`, which matches any character apart from a newline.
|
||||
There are three other useful pairs:
|
||||
|
||||
- `\d`: matches any digit; `\D` matches anything that isn't a digit.
|
||||
- `\s`: matches any whitespace (e.g. space, tab, newline); `\S` matches anything that isn't whitespace.
|
||||
- `\w`: matches any "word" character, i.e. letters, digits, and the underscore; `\W` matches any non-word character.
|
||||
|
||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
|
||||
The following code demonstrates the different matches with a selection of letters, numbers, and punctuation characters.
|
||||
|
||||
```{r}
|
||||
str_view_all("abcd12345!@#%. ", "\\d+")
|
||||
str_view_all("abcd12345!@#%. ", "\\D+")
|
||||
str_view_all("abcd12345!@#%. ", "\\w+")
|
||||
str_view_all("abcd12345!@#%. ", "\\W+")
|
||||
str_view_all("abcd12345!@#%. ", "\\s+")
|
||||
str_view_all("abcd12345!@#%. ", "\\S+")
|
||||
```
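
Because `str_view_all()` renders its result in a viewer, it can help to see the same matches as plain character vectors. The sketch below reuses the string from the chunk above with `str_extract_all()`; the variable name `x` is just for illustration:

```{r}
library(stringr)

x <- "abcd12345!@#%. "

str_extract_all(x, "\\d+")[[1]]  # runs of digits
#> [1] "12345"
str_extract_all(x, "\\D+")[[1]]  # runs of non-digits
#> [1] "abcd"   "!@#%. "
str_extract_all(x, "\\w+")[[1]]  # runs of word characters
#> [1] "abcd12345"
str_extract_all(x, "\\s+")[[1]]  # runs of whitespace
#> [1] " "
str_extract_all(x, "\\S+")[[1]]  # runs of non-whitespace
#> [1] "abcd12345!@#%."
```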
|
||||
|
||||
### Quantifiers
|
||||
|
||||
The next step up in power involves controlling how many times a pattern matches, using the so-called **quantifiers**.
|
||||
|
@ -222,6 +212,27 @@ str_view(x, 'C+[LX]+')
|
|||
str_view(x, 'C+[LX]+?')
|
||||
```
|
||||
|
||||
### Alternation
|
||||
|
||||
You can use **alternation** to pick between one or more alternative patterns.
|
||||
Alternation is a more general form of a character class: each alternative can be a whole pattern rather than a single character.
|
||||
I recommend always pairing `|` with parentheses, to make it very clear what the alternatives are.
|
||||
Here are a few examples:
|
||||
|
||||
- Match apple, pear, or banana: `"(apple)|(pear)|(banana)"`
|
||||
- Match three word characters or two digits: `"(\\w{3})|(\\d{2})"`
|
||||
|
||||
We'll come back to parentheses very shortly in more detail.
|
||||
|
||||
For example, `abc|def` will match either `"abc"` or `"def"`.
|
||||
Note that the precedence for `|` is low: `abc|def` means `(abc)|(def)`, not `ab(c|d)ef`, so you'll often need parentheses to limit the alternation to part of a pattern.
|
||||
`abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`.
|
||||
As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
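
To see how the low precedence of `|` interacts with anchors and parentheses, here is a minimal sketch using `str_detect()` on a made-up vector:

```{r}
library(stringr)

x <- c("abc", "def", "abcef", "abdef")

# Without parentheses, each anchor binds to only one alternative:
# this reads as (^abc)|(def$), so every element matches.
str_detect(x, "^abc|def$")
#> [1] TRUE TRUE TRUE TRUE

# Parentheses around the whole alternation: the full string is "abc" or "def".
str_detect(x, "^(abc|def)$")
#> [1]  TRUE  TRUE FALSE FALSE

# Parentheses around part of the pattern: "ab", then "c" or "d", then "ef".
str_detect(x, "^ab(c|d)ef$")
#> [1] FALSE FALSE  TRUE  TRUE
```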
|
||||
|
||||
### Exercises
|
||||
|
||||
1. How would you match the literal string `"$^$"`?
|
||||
|
@ -268,7 +279,7 @@ str_view(x, 'C+[LX]+?')
|
|||
|
||||
11. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||
|
||||
## Parentheses, grouping and backreferences
|
||||
## Grouping and capturing
|
||||
|
||||
Earlier, you learned about parentheses as a way to disambiguate complex expressions.
|
||||
Parentheses also create numbered capturing groups (1, 2, and so on), numbered in the order of their opening parentheses.
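
As a rough sketch of what the group numbers give you (the example words are assumptions chosen for illustration), a group can be referred back to inside the pattern with `\\1`, pulled out with `str_match()`, or reused in a replacement:

```{r}
library(stringr)

x <- c("banana", "coconut", "apple")

# (..) captures any two characters; \\1 requires the same two again.
str_extract(x, "(..)\\1")
#> [1] "anan" "coco" NA

# str_match() returns the full match in column 1 and group 1 in column 2.
m <- str_match(x, "(..)\\1")
m[, 2]
#> [1] "an" "co" NA

# Groups can also be referenced in the replacement string.
str_replace("apple pie", "(\\w+) (\\w+)", "\\2 \\1")
#> [1] "pie apple"
```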
|
||||
|
@ -358,6 +369,10 @@ str_view_all("this is a sentence", "\\b")
|
|||
str_view_all("this is a sentence", "^")
|
||||
```
|
||||
|
||||
### Greediness
|
||||
|
||||
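By default, quantifiers are greedy: they match as many characters as they can while still allowing the overall pattern to match. Adding `?` after a quantifier makes it lazy, so it matches as few characters as possible.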
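
A minimal sketch of the difference, using a made-up string with two tag-like pieces:

```{r}
library(stringr)

x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
str_extract(x, "<.+>")
#> [1] "<b>bold</b> and <i>italic</i>"

# Lazy: .+? stops at the first ">" it can.
str_extract(x, "<.+?>")
#> [1] "<b>"
```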
|
||||
|
||||
### Multi-line strings
|
||||
|
||||
- `dotall = TRUE` allows `.` to match everything, including `\n`.
|
||||
|
@ -370,7 +385,7 @@ str_view_all("this is a sentence", "^")
|
|||
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
|
||||
```
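
The effect of `dotall` and `multiline` can be compared side by side. This is a sketch on an assumed three-line string (named `lines3` here so it does not clash with the chapter's own `x`):

```{r}
library(stringr)

lines3 <- "Line 1\nLine 2\nLine 3"

# By default "." does not match a newline, so this finds nothing.
str_extract(lines3, "Line 1.Line 2")
#> [1] NA

# With dotall = TRUE, "." also matches "\n".
str_extract(lines3, regex("Line 1.Line 2", dotall = TRUE))
#> [1] "Line 1\nLine 2"

# By default "^" matches only at the start of the whole string.
str_count(lines3, "^Line")
#> [1] 1

# With multiline = TRUE, "^" matches at the start of every line.
str_count(lines3, regex("^Line", multiline = TRUE))
#> [1] 3
```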
|
||||
|
||||
## Options
|
||||
## Flags
|
||||
|
||||
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
|
||||
|
||||
|