More on strings

hadley 2015-11-05 08:10:27 -06:00
parent 2a62c53d56
commit 99338eece0
1 changed files with 161 additions and 60 deletions


@ -408,21 +408,23 @@ Because regular expressions are so powerful, it's easy to try and solve every pr
### Detect matches
To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input:
```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
```{r}
# How many common words start with t?
sum(str_detect(common, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(common, "[aeiou]$"))
```
When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
@ -432,33 +434,52 @@ no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining with logical operations.
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
```{r}
common[str_detect(common, "x$")]
str_subset(common, "x$")
```
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
# On average, how many vowels per word?
mean(str_count(common, "[aeiou]"))
```
Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
### Exercises
1. For each of the following challenges, try solving it with both a single
   regular expression, and a combination of multiple `str_detect()` calls.
1. Find all words that start or end with `x`.
1. Find all words that start with a vowel and end with a consonant.
1. Are there any words that contain at least one of each different
vowel?
1. What word has the highest number of vowels? What word has the highest
   proportion of vowels? (Hint: what is the denominator?)
### Extract matches
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexps.
```{r}
length(sentences)
@ -481,26 +502,23 @@ matches <- str_extract(has_colour, colour_match)
head(matches)
```
Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than one match:
```{r}
table(str_count(sentences, colour_match))
more <- sentences[str_count(sentences, colour_match) > 1]
more
str_view_all(more, colour_match)
str_extract(more, colour_match)
```
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns either a list or a matrix, based on the value of the `simplify` argument:
```{r}
str_extract_all(more, colour_match)
str_extract_all(more, colour_match, simplify = TRUE)
```
You'll learn more about working with lists in Chapter XYZ. If you use `simplify = TRUE`, note that short matches are expanded to the same length as the longest:
```{r}
x <- c("a", "a b", "a b c")
@ -509,18 +527,19 @@ str_extract_all(x, "[a-z]", simplify = TRUE)
#### Exercises
1.  In the previous example, you might have noticed that our regular expression
    matched "flickered", which is not a colour. Modify the regex to prevent
    this problematic match.
1.  From the Harvard sentences data, extract:
    1. The first word from each sentence.
    1. All words ending in `ing`.
    1. All plurals.
### Grouped matches
Earlier in this chapter we talked about using parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky; here I use a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
@ -531,47 +550,64 @@ has_noun <- sentences %>%
str_extract(has_noun, noun)
```
(Defining a "word" in a regular expression is a little tricky. I've decided to go for a sequence of any characters except for a space.)
`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
str_match(has_noun, noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
For example, here's a more complicated pattern that matches a number word followed by a plural noun:
```{r}
num <- str_c("one", "two", "three", "four", "five", "six",
"seven", "eight", "nine", "ten", sep = "|")
match <- str_interp("(${num}) ([^ ]+s)\\b")
sentences %>%
str_subset(match) %>%
head(10) %>%
str_match(match)
```
Like `str_extract()`, if you want all matches for each string, you'll need `str_match_all()`.
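As a quick sketch of what `str_match_all()` returns (each list element is the match matrix for one input string; the pattern here reuses the noun heuristic from above):

```{r}
# One match matrix per input string: full match, then one column per group
x <- c("the cat sat on the mat", "a big dog")
str_match_all(x, "(a|the) ([^ ]+)")
```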
#### Exercises
### Replacing matches
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings:
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can also perform multiple replacements by supplying a named vector:
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
You can refer to groups with backreferences:
```{r}
sentences %>%
head(5) %>%
str_replace("([^ ]+) ([^ ]+)", "\\2 \\1")
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
```
Like `str_extract()` and `str_match()`, `str_replace()` only affects the first match. To replace every match, use `str_replace_all()`. Compared to the other two `all()` functions, the output from `str_replace_all()` is simpler because it can stay as a character vector.
<!-- Replacing with a function call (hopefully) -->
#### Exercises
1. Replace all `/` in a string with `\`.
### Splitting
Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:
```{r}
sentences %>%
@ -579,7 +615,7 @@ sentences %>%
str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
@ -587,27 +623,40 @@ Note that this function has to return a list: the number of pieces each element
.[[1]]
```
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
```{r}
c("Name: Hadley", "County: NZ", "Age: 35") %>%
str_split_fixed(": ", 2)
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
```
<!-- Add comment to stringi issue that split should also preserve names -->
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")
str_split(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
#### Exercises
1. Split up a string like `"apples, pears, and bananas"` into individual
components.
1. Why is it better to split up by `boundary("word")` than `" "`?
1. What does splitting with an empty string (`""`) do?
### Find matches
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want: you can use `str_locate()` to find the matching pattern, then `str_sub()` to extract and/or modify it.
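For example, here's a minimal sketch of that pairing: `str_locate()` returns a matrix with `start` and `end` columns, which slot straight into `str_sub()`:

```{r}
x <- c("apple", "banana", "pear")
# Locate the first run of vowels in each word...
loc <- str_locate(x, "[aeiou]+")
loc
# ...then use the positions to extract it with str_sub()
str_sub(x, loc[, "start"], loc[, "end"])
```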
@ -652,9 +701,10 @@ You can use the other arguments of `regex()` to control details of the match:
There are three other functions you can use instead of `regex()`:
* `fixed()`: matches exactly the specified sequence of bytes. It ignores
all special regular expression patterns and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than
regular expressions:
```{r}
microbenchmark::microbenchmark(
@ -663,9 +713,29 @@ There are three other functions you can use instead of `regex()`:
)
```
Here the fixed match is almost 3x faster than the regular
expression match. But note the units: it's only about 200 µs faster.
However, if you're working with non-English data
`fixed()` can lead to unreliable matches because there are often
multiple ways of representing the same character. For example, there
are two ways to define "á": either as a single character or as an "a"
plus an accent:
```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2
```
They render identically, but because they're defined differently,
`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
next, which respects human character comparison rules:
```{r}
str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```
* `coll()`: compare strings using standard **coll**ation rules. This is
useful for doing case insensitive matching. Note that `coll()` takes a
`locale` parameter that controls which rules are used for comparing
@ -689,6 +759,10 @@ There are three other functions you can use instead of `regex()`:
```{r}
stringi::stri_locale_info()
```
The downside of `coll()` is speed: because the rules for recognising which
characters are the same are complicated, `coll()` is relatively slow
compared to `regex()` and `fixed()`.
* As you saw with `str_split()` you can use `boundary()` to match boundaries.
You can also use it with the other functions, although
@ -699,23 +773,41 @@ There are three other functions you can use instead of `regex()`:
str_extract_all(x, boundary("word"))
```
### Exercises
1. How would you find all strings containing `\` with `regex()` vs.
with `fixed()`?
1. What are the five most common words in `sentences`?
## Other uses of regular expressions
There are a few other functions in base R that accept regular expressions:
* `apropos()` searches all objects available from the global environment. This
is useful if you can't quite remember the name of the function.
```{r}
apropos("replace")
```
* `ls()` is similar to `apropos()` but only works in the current
environment. However, if you have so many objects in your environment
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
* `dir()` lists all the files in a directory. The `pattern` argument takes
a regular expression and only returns file names that match the pattern.
For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
(If you're more comfortable with "globs" like `*.csv`, you can convert
them to regular expressions with `glob2rx()`.)
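If you're curious what regular expression a given glob corresponds to, you can call `glob2rx()` directly (it lives in the always-loaded utils package):

```{r}
# Translate a glob into the equivalent regular expression
glob2rx("*.csv")
```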
## Advanced topics
@ -746,3 +838,12 @@ str_conv(x, "ISO-8859-1")
as.data.frame(stringi::stri_enc_detect(x))
str_conv(x, "ISO-8859-2")
```
### UTF-8
<http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes>
<http://www.joelonsoftware.com/articles/Unicode.html>
Homoglyph attacks: <https://github.com/reinderien/mimic>.