More regexp polish
parent 40a56c55ed
commit f97f5479e3
regexps.qmd (59 lines changed)
@@ -273,8 +273,6 @@ And we'll finish off with some details of **grouping** components of the pattern
The terms we use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.

We'll concentrate on showing how these patterns work with `str_view()`; remember that you can use them with any of the functions that you learned above.

### Escaping {#sec-regexp-escaping}

In order to match a literal `.`, you need an **escape**, which tells the regular expression to ignore the special behavior and match exactly.
@@ -491,7 +489,7 @@ sentences |>
```

But then you've basically recreated your own version of `separate_wider_regex()`.
And indeed, behind the scenes `separate_wider_regex()` converts your vector of patterns to a single regex that uses grouping to capture the named components.
Indeed, behind the scenes, `separate_wider_regex()` converts your vector of patterns to a single regex that uses grouping to capture the named components.

Occasionally, you'll want to use parentheses without creating matching groups.
You can create a non-capturing group with `(?:)`.
@@ -553,25 +551,27 @@ str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```

If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` can also be useful.
`dotall = TRUE` lets `.` match everything, including `\n`:
If you're doing a lot of work with multiline strings (i.e. strings that contain `\n`), `dotall` and `multiline` can also be useful:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))
```
- `dotall = TRUE` lets `.` match everything, including `\n`:

And `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))
```

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
str_view(x, regex("^Line", multiline = TRUE))
```
- `multiline = TRUE` makes `^` and `$` match the start and end of each line rather than the start and end of the complete string:

Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might find `comments = TRUE` to be useful.
It ignores spaces and new lines, as well as everything after `#`, allowing you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7].
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
str_view(x, regex("^Line", multiline = TRUE))
```

Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, you might try `comments = TRUE`.
It tweaks the pattern language to ignore spaces and new lines, as well as everything after `#`.
This allows you to use comments and whitespace to make complex regular expressions more understandable[^regexps-7], as in the following example:

[^regexps-7]: `comments = TRUE` is particularly effective in combination with a raw string, as we use here.
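The chapter's own example sits outside this hunk, so here is a minimal sketch of `comments = TRUE` combined with a raw string. The name `date_pattern` and the input string are made up for illustration, and the raw string syntax assumes R 4.0 or later.

```{r}
library(stringr)

# Hypothetical pattern: whitespace and everything after `#` are ignored
# because comments = TRUE, so each piece can sit on its own line.
date_pattern <- regex(
  r"(
    (\d{4})  # year: four digits
    -        # literal dash
    (\d{2})  # month: two digits
    -        # literal dash
    (\d{2})  # day: two digits
  )",
  comments = TRUE
)

# Column 1 holds the full match; columns 2 to 4 hold the captured groups.
str_match("Released on 2023-01-15.", date_pattern)
```

Without the raw string, every backslash in the pattern would need to be doubled, which tends to defeat the readability that the comments are meant to buy.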
@@ -614,7 +614,7 @@ str_view("x X", fixed("X", ignore_case = TRUE))
```

If you're working with non-English text, you should generally use `coll()` instead, as it implements the full rules for capitalization as used by the `locale` you specify.
See \@#sec-other-languages for more details.
See @sec-other-languages for more details on locales.

```{r}
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
@@ -667,8 +667,8 @@ str_detect(pos, pattern)
str_detect(neg, pattern)
```

It's typically much easier to come up with positive examples than negative examples, because it takes some time until you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
It's typically much easier to come up with positive examples than negative examples, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are.
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on the problem.
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.
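One lightweight way to turn such examples into checks, sketched here with base R's `stopifnot()` rather than a full unit-testing framework. The names `pattern`, `pos`, and `neg` mirror the chunk above, but the pattern and the example strings below are hypothetical.

```{r}
library(stringr)

# Hypothetical pattern and examples, for illustration only:
# three digits, a dash, then four digits, and nothing else.
pattern <- "^\\d{3}-\\d{4}$"
pos <- c("555-1234", "867-5309")              # should all match
neg <- c("5551234", "555-12345", "abc-defg")  # should never match

# Errors if any positive example fails to match
# or any negative example matches.
stopifnot(
  all(str_detect(pos, pattern)),
  !any(str_detect(neg, pattern))
)
```

Each time the pattern misbehaves on real data, the offending string can be added to `pos` or `neg` so the same mistake is caught the next time the check runs.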

### Boolean operations {#sec-boolean-operations}
@@ -684,7 +684,7 @@ But we can make this problem a bit easier by flipping the problem around.
Instead of looking for words that contain only consonants, we could look for words that don't contain any vowels:

```{r}
words[!str_detect(words, "[aeiou]")]
str_view(words[!str_detect(words, "[aeiou]")])
```

This is a useful technique whenever you're dealing with logical combinations, particularly those involving "and" or "not".
@@ -692,7 +692,7 @@ For example, imagine if you want to find all words that contain "a" and "b".
There's no "and" operator built in to regular expressions so we have to tackle it by looking for all words that contain an "a" followed by a "b", or a "b" followed by an "a":

```{r}
words[str_detect(words, "a.*b|b.*a")]
str_view(words, "a.*b|b.*a")
```

It's simpler to combine the results of two calls to `str_detect()`:
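A minimal sketch of that combination, assuming stringr's built-in `words` vector as in the chunks above; the helper names `has_a` and `has_b` are introduced here only for illustration.

```{r}
library(stringr)

# Keep words that contain an "a" AND a "b", in either order.
has_a <- str_detect(words, "a")
has_b <- str_detect(words, "b")
words[has_a & has_b]
```

This should select the same words as the `a.*b|b.*a` alternation, while each condition stays readable on its own and further conditions can be added with `&`.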
@@ -735,8 +735,8 @@ The basic idea is simple: we just combine alternation with word boundaries.
str_view(sentences, "\\b(red|green|blue)\\b")
```

But as the number of colours grows, it would quickly get tedious to construct this pattern by hand.
Wouldn't it be nice if we could store the colours in a vector?
But as the number of colors grows, it would quickly get tedious to construct this pattern by hand.
Wouldn't it be nice if we could store the colors in a vector?

```{r}
rgb <- c("red", "green", "blue")
@@ -750,7 +750,7 @@ str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
```

We could make this pattern more comprehensive if we had a good list of colors.
One place we could start from is the list of built-in colours that R can use for plots:
One place we could start from is the list of built-in colors that R can use for plots:

```{r}
str_view(colors())
@@ -786,15 +786,16 @@ But generally, when creating patterns from existing strings it's wise to run the

3. `colors()` contains a number of modifiers like "lightgray" and "darkblue".
How could you automatically identify these modifiers?
(Think about how you might detect and remove what colors are being modified).
(Think about how you might detect and then remove the colors that are modified).

4. Create a regular expression that finds any base R dataset.
You can get a list of these datasets via a special use of the `data()` function: `data(package = "datasets")$results[, "Item"]`.
Note that a number of old datasets are individual vectors; these contain the name of the grouping "data frame" in parentheses, so you'll need to also strip these off.

## Elsewhere
## Regular expressions

There are a bunch of other places you can use regular expressions outside of stringr.
As well as the stringr and tidyr functions we discussed at the very start of this chapter, there are many other places where you can use regular expressions.
The following sections describe some other useful stringr functions, some other places in the tidyverse that use regular expressions, and some handy base R functions.

### stringr