2021-04-21 21:43:19 +08:00
# Regular expressions
2021-05-04 21:10:39 +08:00
```{r, results = "asis", echo = FALSE}
status("restructuring")
```
2021-04-22 01:30:25 +08:00
## Introduction
2021-12-14 04:44:52 +08:00
We touched on regular expressions in Chapter \@ref(strings), but regular expressions really are their own miniature language so it's worth spending some extra time on them.
Regular expressions can be overwhelming at first, and you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
2021-04-22 01:30:25 +08:00
2021-12-14 04:44:52 +08:00
More details in `vignette("regular-expressions", package = "stringr")`.
2021-04-21 21:43:19 +08:00
2021-12-14 04:44:52 +08:00
Here we'll focus mostly on pattern language itself, not the functions that use it.
That means we'll mostly work with simple vectors showing the results with `str_view()` and `str_view_all()`.
You'll need to take what you learn and apply it to data frames either with tidyr functions or by combining dplyr functions with stringr functions.
2021-04-21 21:43:19 +08:00
### Prerequisites
This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
```{r setup, message = FALSE}
library(tidyverse)
```
2021-12-14 04:44:52 +08:00
## Escaping {#regexp-escaping}
2021-04-21 21:43:19 +08:00
But if "`.`" matches any character, how do you match the character "`.`"?
You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour.
Like strings, regexps use the backslash, `\`, to escape special behaviour.
So to match an `.`, you need the regexp `\.`.
Unfortunately this creates a problem.
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So to create the regular expression `\.` we need the string `"\\."`.
```{r}
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
2021-12-14 04:44:52 +08:00
str_view(dot)
2021-04-21 21:43:19 +08:00
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
Well you need to escape it, creating the regular expression `\\`.
To create that regular expression, you need to use a string, which also needs to escape `\`.
That means to match a literal `\` you need to write `"\\\\"` --- you need four backslashes to match one!
```{r}
x <- "a\\b"
2021-12-14 04:44:52 +08:00
str_view(x)
2021-04-21 21:43:19 +08:00
str_view(x, "\\\\")
```
2021-12-14 04:44:52 +08:00
Alternatively, you might find it easier to use the raw strings we discussed in Section \@ref(raw-strings) as that allows you to avoid one layer of escaping:
```{r}
str_view(x, r"(\\)")
```
2021-04-21 21:43:19 +08:00
In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`.
### Exercises
1. Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`.
2. How would you match the sequence `"'\`?
3. What patterns will the regular expression `\..\..\..` match?
How would you represent it as a string?
## Anchors
By default, regular expressions will match any part of a string.
2021-12-14 04:44:52 +08:00
It's often useful to **anchor** the regular expression so that it matches from the start or end of the string.
2021-04-21 21:43:19 +08:00
You can use:
- `^` to match the start of the string.
- `$` to match the end of the string.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
You can also match the boundary between words with `\b`.
I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio when I want to find the name of a function that's a component of other functions.
For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
2021-12-14 04:44:52 +08:00
```{r}
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
str_view_all(x, "\\bsum\\b")
```
2021-04-21 21:43:19 +08:00
### Exercises
1. How would you match the literal string `"$^$"`?
2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
a. Start with "y".
b. End with "x"
c. Are exactly three letters long. (Don't cheat by using `str_length()`!)
d. Have seven letters or more.
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
2021-12-14 04:44:52 +08:00
## Matching multiple characters
2021-04-21 21:43:19 +08:00
There are a number of special patterns that match more than one character.
You've already seen `.`, which matches any character apart from a newline.
There are four other useful tools:
2021-12-14 04:44:52 +08:00
- `\d`: matches any digit. `\D` matches anything that isn't a digit.
- `\s`: matches any whitespace (e.g. space, tab, newline). `\S` matches anything that isn't whitespace.
2021-04-21 21:43:19 +08:00
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
Many people find this more readable.
```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```
This works for most (but not all) regex metacharacters: `$` `.` `|` `?` `*` `+` `(` `)` `[` `{`.
Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: `]` `\` `^` and `-`.
2021-04-27 03:49:14 +08:00
When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression.
For example, here are two ways to find all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```
The results are identical, but I think the first approach is significantly easier to understand.
If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.
2021-04-21 21:43:19 +08:00
### Exercises
1. Create regular expressions to find all words that:
a. Start with a vowel.
b. That only contain consonants. (Hint: thinking about matching "not"-vowels.)
c. End with `ed`, but not with `eed`.
d. End with `ing` or `ise`.
2. Empirically verify the rule "i before e except after c".
3. Is "q" always followed by a "u"?
4. Write a regular expression that matches a word if it's probably written in British English, not American English.
5. Create a regular expression that will match telephone numbers as commonly written in your country.
## Repetition / Quantifiers
2021-12-14 04:44:52 +08:00
The next step up in power involves controlling how many times a pattern matches, the so called **quantifiers**.
We discussed `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches) in the last chapter.
2021-04-21 21:43:19 +08:00
Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings.
That means most uses will need parentheses, like `bana(na)+`.
You can also specify the number of matches precisely:
- `{n}`: exactly n
- `{n,}`: n or more
- `{n,m}`: between n and m
```{r}
2021-12-14 04:44:52 +08:00
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
2021-04-21 21:43:19 +08:00
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{1,3}")
str_view(x, "C{2,3}")
```
2021-12-14 04:44:52 +08:00
By default these matches are **greedy**: they will match the longest string possible.
You can make them **lazy**, matching the shortest string possible by putting a `?` after them.
2021-04-21 21:43:19 +08:00
This is an advanced feature of regular expressions, but it's useful to know that it exists:
```{r}
str_view(x, 'C{2,3}?')
2021-12-14 04:44:52 +08:00
str_view(x, 'C+[LX]+')
str_view(x, 'C+[LX]+?')
2021-04-21 21:43:19 +08:00
```
### Exercises
1. Describe the equivalents of `?`, `+`, `*` in `{m,n}` form.
2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
a. `^.*$`
b. `"\\{.+\\}"`
c. `\d{4}-\d{2}-\d{2}`
d. `"\\\\{4}"`
3. Create regular expressions to find all words that:
a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.
2021-12-06 00:52:47 +08:00
4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.
2021-04-21 21:43:19 +08:00
## Grouping and backreferences
Earlier, you learned about parentheses as a way to disambiguate complex expressions.
Parentheses also create a *numbered* capturing group (number 1, 2 etc.).
A capturing group stores *the part of the string* matched by the part of the regular expression inside the parentheses.
You can refer to the same text as previously matched by a capturing group with *backreferences*, like `\1`, `\2` etc.
For example, the following regular expression finds all fruits that have a repeated pair of letters.
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```
2021-04-22 01:30:25 +08:00
Also use for replacement:
```{r}
sentences %>%
2021-12-14 04:44:52 +08:00
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") %>%
2021-04-22 01:30:25 +08:00
head(5)
```
2021-04-23 21:07:16 +08:00
Names that start and end with the same letter.
Implement with `str_sub()` instead.
2021-12-14 04:44:52 +08:00
Can create non-capturing groups with `(?:)`.
2021-04-21 21:43:19 +08:00
### Exercises
1. Describe, in words, what these expressions will match:
a. `(.)\1\1`
b. `"(.)(.)\\2\\1"`
c. `(..)\1`
d. `"(.).\\1.\\1"`
e. `"(.)(.)(.).*\\3\\2\\1"`
2. Construct regular expressions to match words that:
a. Start and end with the same character.
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
## Other uses of regular expressions
There are two useful function in base R that also use regular expressions:
2021-04-22 21:41:06 +08:00
## Options
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
You can use the other arguments of `regex()` to control details of the match:
- `ignore_case = TRUE` allows characters to match either their uppercase or lowercase forms.
This always uses the current locale.
```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
- `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string.
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
```
- `comments = TRUE` allows you to use comments and white space to make complex regular expressions more understandable.
Spaces are ignored, as is everything after `#`.
To match a literal space, you'll need to escape it: `"\\ "`.
```{r}
phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{3}) # three more numbers
", comments = TRUE)
str_match("514-791-8141", phone)
```
- `dotall = TRUE` allows `.` to match everything, including `\n`.
2021-12-14 04:44:52 +08:00
## Some details
### Overlapping
Matches never overlap, and the regular expression engine only starts looking for a new match after the end of the last match.
For example, in `"abababa"`, how many times will the pattern `"aba"` match?
Regular expressions say two, not three:
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
### Zero width matches
It's possible for a regular expression to match no character, i.e. the space between too characters.
This typically happens when you use a quantifier that allows zero matches:
```{r}
str_view_all("abcdef", "c?")
```
But `\b` also creatse a match:
```{r}
str_view_all("this is a sentence", "\\b")
```
### Operator precedence
You can use *alternation* to pick between one or more alternative patterns.
For example, `abc|d..f` will match either '"abc"', or `"deaf"`.
Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`.
Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
2021-04-21 21:43:19 +08:00
## A caution
A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
In the words of Jamie Zawinski:
> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
As a cautionary tale, check out this regular expression that checks if a email address is valid:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
This is a somewhat pathological example (because email addresses are actually surprisingly complex), but is used in real code.
See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for more details.
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
2021-12-06 00:52:47 +08:00
### Exercises
1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.