Working on pattern language
parent
e1375dfb18
commit
8ec221c922
110
regexps.Rmd
|
@ -6,7 +6,7 @@ status("restructuring")
|
|||
|
||||
## Introduction
|
||||
|
||||
You learned the basics of regular expressions in Chapter \@ref(strings), but because regular expressions are a miniature language it's worth spending some extra time on the details.
|
||||
You learned the basics of regular expressions in Chapter \@ref(strings), but regular expressions are a fairly rich language, so it's worth spending some extra time on the details.
|
||||
|
||||
The chapter starts by expanding your knowledge of patterns, to cover six important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, and alternation).
|
||||
Here we'll focus mostly on the language itself, not the functions that use it.
|
||||
|
@ -16,7 +16,8 @@ You'll need to take what you learn here and apply it to data frames with tidyr f
|
|||
Next we'll talk about the important concepts of "grouping" and "capturing" which give you new ways to extract variables out of strings using `tidyr::separate_group()`.
|
||||
Grouping also allows you to use backreferences, which let you do things like match repeated patterns.
|
||||
|
||||
We'll finish by discussing the various "flags" that allow you to tweak the operation of regular expressions and then cover a details about how regular expressions work that , and then discuss some useful strategies .
|
||||
We'll finish by discussing the various "flags" that allow you to tweak the operation of regular expressions and cover a few final details about how regular expressions work.
|
||||
These aren't particularly important in day-to-day usage, but a little extra understanding of the underlying tools is often helpful.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -58,19 +59,18 @@ They're not always the most evocative of their purpose, but it's very helpful to
|
|||
### Escaping {#regexp-escaping}
|
||||
|
||||
In Chapter \@ref(strings), you learned how to match a literal `.` by using `fixed(".")`.
|
||||
What if you want to match a literal `.` as part of a regular expression?
|
||||
You'll need to use an escape, which tells the regular expression you want it to match exactly, not use its special behavior.
|
||||
Like strings, regexps use the backslash, `\`, to escape special behavior.
|
||||
So to match a `.`, you need the regexp `\.`.
|
||||
But what if you want to match a literal `.` as part of a bigger regular expression?
|
||||
You'll need to use an **escape**, which tells the regular expression to match the character literally instead of using its special behavior.
|
||||
Like strings, regexps use the backslash for escaping, so to match a `.`, you need the regexp `\.`.
|
||||
Unfortunately this creates a problem.
|
||||
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
|
||||
So to create the regular expression `\.` we need the string `"\\."`.
|
||||
So, as the following example shows, to create the regular expression `\.` we need the string `"\\."`.
|
||||
|
||||
```{r}
|
||||
# To create the regular expression \., we need to use \\.
|
||||
dot <- "\\."
|
||||
|
||||
# But the expression itself only contains one:
|
||||
# But the expression itself only contains one \
|
||||
str_view(dot)
|
||||
|
||||
# And this tells R to look for an explicit .
|
||||
|
@ -91,14 +91,14 @@ str_view(x, "\\\\")
|
|||
```
|
||||
|
||||
Alternatively, you might find it easier to use the raw strings you learned about in Section \@ref(raw-strings).
|
||||
That allows you to avoid one layer of escaping:
|
||||
That lets you avoid one layer of escaping:
|
||||
|
||||
```{r}
|
||||
str_view(x, r"(\\)")
|
||||
```
|
||||
|
||||
The full set of characters with special meanings that need to be escaped is `.^$\|*+?{}[]()`.
|
||||
In general, look at punctuation character with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.
|
||||
In general, look at punctuation characters with suspicion; if your regular expression isn't matching what you think it should, check if you've used any of these characters.
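To make the effect of an unescaped special character concrete, here is a minimal sketch. It assumes base R's `grepl()` as a stand-in for the stringr functions used elsewhere in the chapter, and the sample strings are made up for illustration.

```r
# Illustrative strings (not from the chapter's data).
x <- c("a.c", "abc", "1+1=2", "cost: $5")

# An unescaped . matches any character, so "abc" matches as well as "a.c".
unescaped_dot <- grepl("a.c", x, perl = TRUE)

# Escaping the dot restricts the match to a literal ".".
escaped_dot <- grepl("a\\.c", x, perl = TRUE)

# + and $ are also special, so they need escaping to be matched literally.
literal_plus <- grepl("1\\+1", x, perl = TRUE)
literal_dollar <- grepl("\\$5", x, perl = TRUE)

print(unescaped_dot)
print(escaped_dot)
```

If a pattern matches more than you expect, comparing the escaped and unescaped versions side by side like this is a quick way to find the culprit.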
|
||||
|
||||
### Anchors
|
||||
|
||||
|
@ -128,8 +128,8 @@ str_view(x, "^apple$")
|
|||
|
||||
You can also match the boundary between words with `\b`.
|
||||
I don't often use this in my R code, but I'll sometimes use it when I'm doing a search in RStudio.
|
||||
It's use to find the name of a function that's a component of other functions.
|
||||
For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||
It's useful to find the name of a function that's a component of other functions.
|
||||
For example, if I want to find all uses of `sum()`, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||
|
||||
```{r}
|
||||
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
|
@ -139,29 +139,26 @@ str_view(x, "\\bsum\\b")
|
|||
|
||||
### Character classes
|
||||
|
||||
You can also create your own collections of characters using `[]`:
|
||||
A **character class**, or character **set**, allows you to match any character in a set.
|
||||
The basic syntax lists each character you want to match inside of `[]`, so `[abc]` will match a, b, or c.
|
||||
Inside of `[]` only `-`, `^`, and `\` have special meanings:
|
||||
|
||||
- `[abc]`: matches a, b, or c.
|
||||
- `[a-z]`: matches every character between a and z. `[0-9]` matches any number.
|
||||
- `[^abc]`: matches anything except a, b, or c.
|
||||
- `[\^\-]`: matches `^` or `-`.
|
||||
|
||||
Remember that these are case sensitive.
|
||||
If you want to match any Latin letter or number, you could write `[a-zA-Z0-9]`.
|
||||
|
||||
A character class containing a single character can be a nice alternative to escapes when you want to include a single special character (i.e. `$` `.` `|` `?` `*` `+` `(` `)` `[` `{` `}`, but not `]` `\` `^`).
|
||||
This can be more readable because there are fewer slashes, but it does require a deeper understanding of regular expressions.
|
||||
- `-` defines a range. `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
|
||||
- `^`, when it is the first character, takes the inverse of the set. `[^abc]` matches anything except a, b, or c.
|
||||
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
|
||||
|
||||
```{r}
|
||||
# Look for a literal character that normally has special meaning in a regex
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
|
||||
str_view_all("abcd12345-!@#%. [", "[abc]")
|
||||
str_view_all("abcd12345-!@#%. [", "[a-z]")
|
||||
str_view_all("abcd12345-!@#%. [", "[^a-z0-9]")
|
||||
str_view_all("abcd12345-!@#%. []", "[\\-]")
|
||||
```
|
||||
|
||||
Remember that regular expressions are case sensitive, so if you want to match any lowercase or uppercase letter or any digit, you'd need to write `[a-zA-Z0-9]`.
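A small sketch of the case-sensitivity point, assuming base R's `grepl()` with `perl = TRUE` (so that ranges are compared by character code and not by locale) in place of stringr. The input vector is invented for illustration.

```r
# Single-character test strings (illustrative only).
x <- c("a", "Z", "7", "_")

# [a-z] covers lowercase letters only, so "Z" is not matched.
lower_only <- grepl("^[a-z]$", x, perl = TRUE)

# Listing several ranges in one class covers both cases and digits.
alnum <- grepl("^[a-zA-Z0-9]$", x, perl = TRUE)

print(lower_only)
print(alnum)
```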
|
||||
|
||||
### Shorthand character classes
|
||||
|
||||
There are a few character classes that are used so commonly that they get their own shortcut.
|
||||
There are a few character classes that are used so commonly that they get their own single character shortcut.
|
||||
You've already seen `.`, which matches any character apart from a newline.
|
||||
There are three other useful pairs:
|
||||
|
||||
|
@ -183,11 +180,9 @@ str_view_all("abcd12345!@#%. ", "\\S+")
|
|||
|
||||
### Quantifiers
|
||||
|
||||
The next step up in power involves controlling how many times a pattern matches, the so called **quantifiers**.
|
||||
We discussed `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches) in the last chapter.
|
||||
Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings.
|
||||
That means most uses will need parentheses, like `bana(na)+`.
|
||||
|
||||
The **quantifiers** control how many times a pattern matches.
|
||||
In Chapter \@ref(strings) we discussed `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
|
||||
So `colou?r` will match both the American and British spellings, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace character.
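The three quantifiers can be tried out with a short sketch like the following. It uses base R (`grepl()`, `gregexpr()`, `regmatches()`) as an assumed stand-in for the stringr functions, and the example strings are hypothetical.

```r
# ? makes the preceding "u" optional: zero or one occurrence.
spellings <- c("color", "colour", "colouur")
optional_u <- grepl("^colou?r$", spellings, perl = TRUE)

# + requires one or more digits; each run of digits is one match.
address <- "room 101, floor 3"
digit_runs <- regmatches(address, gregexpr("\\d+", address, perl = TRUE))[[1]]

# \s? allows at most one whitespace character between "a" and "b".
optional_space <- grepl("^a\\s?b$", c("ab", "a b", "a  b"), perl = TRUE)

print(optional_u)
print(digit_runs)
print(optional_space)
```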
|
||||
You can also specify the number of matches precisely:
|
||||
|
||||
- `{n}`: exactly n
|
||||
|
@ -212,26 +207,23 @@ str_view(x, 'C+[LX]+')
|
|||
str_view(x, 'C+[LX]+?')
|
||||
```
|
||||
|
||||
### Parentheses
|
||||
|
||||
Quantifiers apply to the preceding pattern, so `a+` matches one or more "a"s, `\d+` matches one or more digits, and `[aeiou]+` matches one or more vowels.
|
||||
You can use parentheses to define a more complex pattern.
|
||||
For example, `([aeiou].)+` matches a vowel followed by any character, repeated one or more times.
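To see what the parentheses change, this sketch compares the grouped pattern with an ungrouped one on a single made-up word. It assumes base R's `regexpr()` and `regmatches()` in place of stringr.

```r
word <- "banana"

# With parentheses, + repeats the whole "vowel then any character" pair.
grouped <- regmatches(word, regexpr("([aeiou].)+", word, perl = TRUE))

# Without parentheses, + applies only to the preceding ".",
# so the pattern is "one vowel, then one or more of any character".
ungrouped <- regmatches(word, regexpr("[aeiou].+", word, perl = TRUE))

print(grouped)
print(ungrouped)
```

The grouped pattern stops at `"anan"` because the final `a` has no character after it to complete another pair, while the ungrouped pattern runs to the end of the string.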
|
||||
|
||||
### Alternation
|
||||
|
||||
You can use **alternation** to pick between one or more alternative patterns.
|
||||
This is a more general form of character classes that's not limited to matching single characters.
|
||||
I recommend always pairing `|` with parentheses, to make it very clear what the alternatives are.
|
||||
Here are a few examples:
|
||||
|
||||
- Match apple, pear, or banana: `"(apple)|(pear)|(banana)"`
|
||||
- Match 3 letters or two digits: `"(\\w{3})|(\\d{3})"`
|
||||
- Match apple, pear, or banana: `apple|pear|banana`.
|
||||
- Match three word characters (letters, digits, or underscores) or two digits: `\w{3}|\d{2}`.
|
||||
|
||||
We'll come back to parentheses very shortly in more detail.
|
||||
|
||||
For example, `abc|def` will match either `"abcef"`, or `"abdef"`.
|
||||
Note that the precedence for `|` is low, so you'll often need to use it with parentheses: `(abc)|(def)` will match either `"abc"`, or `"def"`.
|
||||
`abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`.
|
||||
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_view(c("grey", "gray"), "gr(e|a)y")
|
||||
```
|
||||
`|` has very low precedence, so if you want to use it inside a bigger pattern you'll need to wrap it in parentheses.
|
||||
For example, if you want to match only a complete string, you'll need `^(apple|pear|banana)$`.
|
||||
`^apple|pear|banana$` will match apple at the start of the string, pear anywhere, and banana at the end.
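The difference between the two patterns is easiest to see on a few inputs. This is a sketch that assumes base R's `grepl()` instead of stringr, with an invented vector of strings.

```r
fruit <- c("apple", "pear", "banana", "pineapple", "pears", "apple pie")

# Without parentheses the anchors bind only to the first and last alternatives:
# "apple" at the start, "pear" anywhere, or "banana" at the end.
loose <- grepl("^apple|pear|banana$", fruit, perl = TRUE)

# With parentheses the anchors apply to every alternative,
# so only complete strings match.
strict <- grepl("^(apple|pear|banana)$", fruit, perl = TRUE)

print(fruit[loose])
print(fruit[strict])
```

Here `"pears"` and `"apple pie"` slip through the loose pattern but are rejected by the strict one.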
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -247,43 +239,45 @@ str_view(c("grey", "gray"), "gr(e|a)y")
|
|||
|
||||
Since `words` is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
|
||||
|
||||
3. What strings will `$a` match?
|
||||
3. Create regular expressions that match the British or American spellings of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
|
||||
|
||||
4. Create regular expressions to find all words that:
|
||||
4. What strings will `$a` match?
|
||||
|
||||
5. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with a vowel.
|
||||
    b. That only contain consonants. (Hint: think about matching "not"-vowels.)
|
||||
c. End with `ed`, but not with `eed`.
|
||||
d. End with `ing` or `ise`.
|
||||
|
||||
5. Empirically verify the rule "i before e except after c".
|
||||
6. Empirically verify the rule "i before e except after c".
|
||||
|
||||
6. Is "q" always followed by a "u"?
|
||||
7. Is "q" always followed by a "u"?
|
||||
|
||||
7. Write a regular expression that matches a `word` if it's probably written in British English, not American English.
|
||||
8. Write a regular expression that matches a `word` if it's probably written in British English, not American English.
|
||||
|
||||
8. Create a regular expression that will match telephone numbers as commonly written in your country.
|
||||
9. Create a regular expression that will match telephone numbers as commonly written in your country.
|
||||
|
||||
9. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
|
||||
10. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
|
||||
|
||||
10. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
|
||||
11. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
|
||||
|
||||
a. `^.*$`
|
||||
b. `"\\{.+\\}"`
|
||||
c. `\d{4}-\d{2}-\d{2}`
|
||||
d. `"\\\\{4}"`
|
||||
|
||||
11. Create regular expressions to find all words that:
|
||||
12. Create regular expressions to find all words that:
|
||||
|
||||
a. Start with three consonants.
|
||||
b. Have three or more vowels in a row.
|
||||
c. Have two or more vowel-consonant pairs in a row.
|
||||
|
||||
12. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||
13. Solve the beginner regexp crosswords at <https://regexcrossword.com/challenges/beginner>.
|
||||
|
||||
## Grouping and capturing
|
||||
|
||||
Earlier, you learned about parentheses as a way to disambiguate complex expressions.
|
||||
Earlier, you learned about parentheses as a way to create complex patterns.
|
||||
Parentheses also create a numbered capturing group (number 1, 2 etc.).
|
||||
A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses.
|
||||
You can refer to the same text as previously matched by a capturing group with **backreferences**, like `\1`, `\2` etc.
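As a rough illustration of a backreference, the pattern `(..)\1` below looks for any two characters that are immediately repeated. The sketch assumes base R's `grepl()`, `regexpr()`, and `regmatches()` in place of stringr, and the words are chosen only for demonstration.

```r
words_sample <- c("banana", "coconut", "apple")

# (..) captures any two characters; \1 requires the same two characters again.
# In an R string the backslash itself must be escaped, hence "\\1".
has_repeat <- grepl("(..)\\1", words_sample, perl = TRUE)

# Extract the repeated portion from the words that match.
repeats <- regmatches(words_sample, regexpr("(..)\\1", words_sample, perl = TRUE))

print(has_repeat)
print(repeats)
```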
|
||||
|
|
|
@ -617,6 +617,8 @@ Don't forget that you're in a programming language and you have other tools at y
|
|||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
||||
|
||||
A regular expression is a program that must be written in a single string, and it has no debugger and no built-in documentation.
|
||||
|
||||
### Using multiple regular expressions
|
||||
|
||||
When you have complex logical conditions (e.g. match `a` or `b` but not `c` unless `d`) it's often easier to combine multiple `str_detect()` calls with logical operators instead of trying to create a single regular expression.
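One possible reading of "match `a` or `b` but not `c` unless `d`" can be sketched by combining simple tests with logical operators. This uses base R's `grepl()` as an assumed stand-in for `str_detect()`. The input vector and the exact interpretation of the condition are illustrative assumptions.

```r
x <- c("abd", "abc", "b", "cd", "xyz")

# Each simple condition gets its own, easily checked test.
has_a_or_b <- grepl("a|b", x, perl = TRUE)
has_c <- grepl("c", x, fixed = TRUE)
has_d <- grepl("d", x, fixed = TRUE)

# Keep strings with a or b, and reject those with c unless d is also present.
keep <- has_a_or_b & (!has_c | has_d)

print(x[keep])
```

Each intermediate vector can be inspected on its own, which is much harder to do with the pieces of a single large pattern.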
|
||||
|
@ -634,4 +636,3 @@ The results are identical, but I think the first approach is significantly easie
|
|||
If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
|
||||
|
||||
### Repeated `str_replace()`
|
||||
|
||||
|
|