This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically contain unstructured or semi-structured data, so you need some tools to make order from the madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
In R, strings are stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
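For example, something like:

```{r}
string1 <- "This is a string"
string2 <- 'To include a "double quote" inside a string, use single quotes'
```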
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
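A small illustration (the vector `x` here is just an example):

```{r}
x <- c("\"", "\\")
x             # the printed representation shows the escapes
writeLines(x) # the raw contents of each string
```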
There are a handful of other special characters. The most commonly used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`; this is a way of writing non-English characters that works on all platforms:
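For instance (the example strings are illustrative):

```{r}
x <- c("one\ttwo", "first line\nsecond line", "\u00b5")
writeLines(x)
```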
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For example, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`).
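A quick comparison (assuming stringr is attached, as elsewhere in this chapter):

```{r}
nchar(NA)       # base R coerces NA to the string "NA" and returns 2
str_length(NA)  # stringr propagates the missing value
```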
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` triggers autocomplete, so you can easily see all of the stringr functions.
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) positions of the substring:
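For example (with a toy vector):

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# negative numbers count backwards from the end of the string
str_sub(x, -3, -1)
```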
Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
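A sketch of what that looks like; Turkish is a common illustration because it distinguishes dotted and dotless i:

```{r}
# Turkish has two i's, with and without a dot, and different capitalisation rules
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```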
The locale is specified as an ISO 639 language code, which is a two or three letter abbreviation. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
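For example (the words chosen here are just an illustration):

```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")   # English
str_sort(x, locale = "haw")  # Hawaiian sorts differently
```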
Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
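For a first taste (toy vector; the second pattern uses `.`, which matches any character):

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")   # match an exact sequence of characters
str_view(x, ".a.")  # "." matches any character (apart from a newline)
```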
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
If `\` is used as an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
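A small demonstration:

```{r}
x <- "a\\b"
writeLines(x)       # the string contains a single backslash
str_view(x, "\\\\") # four backslashes in the string, two in the regexp
```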
By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:

* `^` to match the start of the string.
* `$` to match the end of the string.
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
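For example (toy vector):

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")  # starts with "a"
str_view(x, "a$")  # ends with "a"
```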
You can also match the boundary between words with `\b`. I don't often use this in R, but I sometimes use it when doing a find-all in RStudio, when I want to find the name of a function that's a component of other function names. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
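As a sketch of the same idea in R:

```{r}
# \b matches a word boundary, so "sum" is found only as a whole word
str_view(c("summary(x)", "sum(x)", "rowsum(x)"), "\\bsum\\b")
```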
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
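For example:

```{r}
str_view(c("abc", "deaf"), "abc|d..f")
# parentheses make the scope of | explicit
str_view(c("grey", "gray"), "gr(e|a)y")
```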
By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:
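A sketch of the difference (the example sentence is made up):

```{r}
x <- "A sentence with (some) parentheses (in it)."
str_view(x, "\\(.+\\)")   # greedy: spans from the first "(" to the last ")"
str_view(x, "\\(.+?\\)")  # lazy: stops at the first ")"
```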
Note that the precedence of these operators is high, so you can write `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
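For example:

```{r}
str_view(c("color", "colour"), "colou?r")
str_view("banananana", "bana(na)+")
```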
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
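For example, using the `fruit` vector that ships with stringr:

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```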
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but they also create capturing groups. If you're using one set of parentheses for capturing and another set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only groups (disambiguating precedence) and doesn't capture. These are called non-capturing parentheses.
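A small comparison, using `str_match()` (covered later in this chapter) to show what gets captured:

```{r}
str_match(c("grey", "gray"), "gr(e|a)y")    # full match plus a column for the group
str_match(c("grey", "gray"), "gr(?:e|a)y")  # non-capturing: only the full match
```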
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:
Because regular expressions are so powerful, it's tempting to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easier to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
When you have complicated logical conditions (e.g. match this or that but not these), it's often easier to combine multiple `str_detect()` calls with logical operators than to write a single regular expression. A simple example is if you want to find all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(common, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
If you find your regular expression is getting hard to understand, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operators.
`str_count()` is similar to `str_detect()` but it returns an integer count of the number of matches, instead of a simple `TRUE`/`FALSE`:
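For example:

```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
# on average, how many vowels per word?
mean(str_count(x, "[aeiou]"))
```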
To extract the actual text of a match, use `str_extract()`. For that to be useful, we need a somewhat more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences): these are sentences designed to test VoIP systems, but we're going to use them as random data.
```{r}
length(sentences)
head(sentences)
```
Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
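One way to set this up (the exact set of colour names is an assumption):

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match

has_colour <- str_subset(sentences, colour_match)
str_extract(head(has_colour), colour_match)      # first match in each sentence
str_extract_all(head(has_colour), colour_match)  # all matches
```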
`str_extract_all()` returns a list, which is a little harder to work with; that's why it's not the default. You'll learn more about working with lists in Chapter XYZ. Note that matches are always non-overlapping: the second match starts after the first is complete.
We talked earlier about the use of parentheses. You can use them if you want to extract parts of a match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the":
(Defining a "word" in a regular expression is a little tricky. I've decided to go for a sequence of any characters except for a space.)
`str_extract()` gives us the complete match, but we'd like to be able to dig into the pieces. That's the job of `str_match()`. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
str_match(has_noun, noun)
```
(You can see our heuristic for finding nouns isn't that good as it also picks up adjectives like smooth and parked.)
Like `str_extract()`, if you want all matches, you'll need to use `str_match_all()` and then work with the list that it returns.
Like `str_extract()` and `str_match()`, `str_replace()` only affects the first match. To replace every match, use `str_replace_all()`. Compared to the other two `_all()` functions, the output from `str_replace_all()` is simpler because it can stay as a character vector.
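For example:

```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```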
Another useful application is to split strings up into pieces. For example, we could split sentences up into words:
```{r}
sentences %>%
  head(5) %>%
  str_split(" ")
```
Note that this function has to return a list: the number of pieces each element is split up into might be different, so there's no way to put them in a vector. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
```
You'll learn other techniques in the lists chapter.
If you want all strings to be split up into the same number of pieces, you can use `str_split_fixed()`. This outputs a matrix with one row for each string and one column for each piece:
```{r}
c("Name: Hadley", "County: NZ", "Age: 35") %>%
str_split_fixed(": ", 2)
```
Instead of splitting up strings by patterns, you can also split up by a predefined set of boundaries with `boundary()`: by character, by line, by sentence and by word.
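For instance, splitting by word boundaries (the example sentence is made up):

```{r}
x <- "This is a sentence.  This is another sentence."
str_split(x, boundary("word"))[[1]]
```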
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it:
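A brief sketch of that workflow:

```{r}
x <- c("apple", "banana", "pear")
loc <- str_locate(x, "[aeiou]")  # position of the first vowel in each string
loc
str_sub(x, loc[, "start"], loc[, "end"])
```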
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, carefully picked to handle the most common string manipulation challenges. stringi, on the other hand, is designed to be comprehensive: it contains almost every function you might ever need. stringi has `r length(ls("package:stringi"))` functions to stringr's `r length(ls("package:stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The two packages work very similarly, so you should be able to translate your stringr knowledge naturally; the main difference is the prefix: `str_` vs. `stri_`.
### Encoding
Encodings are complicated and fraught with difficulty. The best approach is to convert to UTF-8 as soon as possible: all stringr and stringi functions do this, and readr always reads as UTF-8. The encodings you'll typically encounter are:
* UTF-8
* Latin1
* bytes: everything else
Generally, you should fix encoding problems during the data import phase.
Encoding detection operates statistically, by comparing the frequency of byte fragments across languages and encodings. It's fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."