---
layout: default
title: String manipulation
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
```
# String manipulation
When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.
In this chapter, you'll learn the basics of regular expressions using the stringr package.
```{r}
library(stringr)
```
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
## String basics
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
To include a literal single or double quote in a string you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
```{r}
x <- c("\"", "\\")
x
writeLines(x)
```
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
str_length(NA)
```
### Combining strings
To combine two or more strings, use `str_c()`:
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
Use the `sep` argument to control how they're separated:
```{r}
str_c("x", "y", sep = ", ")
```
As with most other functions in R, missing values are infectious: combining anything with `NA` gives `NA`. If you want them to print as `"NA"`, use `str_replace_na()`:
```{r}
str_c("x", NA, "y")
str_c("x", str_replace_na(NA), "y")
```
`str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
To collapse vectors into a single string, use `collapse`:
```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```
When creating strings you might also find `str_pad()` and `str_dup()` useful:
```{r}
x <- c("apple", "banana", "pear")
str_pad(x, 10)
str_c("Na ", str_dup("na ", 4), "batman!")
```
### Subsetting strings
You can extract parts of a string using `str_sub()`:
```{r}
x <- c("apple", "banana", "pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
You can also use `str_sub()` to modify strings:
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
### Exercises
1. In your own words, describe the difference between `sep` and `collapse`.
## Regular expressions
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to resources where you can learn more about regular expressions.
### Matching anything and escaping
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
```{r}
str_subset(c("abc", "adc", "bef"), "a.c")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need the regular expression `\.`, which you write as the string `"\\."`.
```{r}
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
cat(dot, "\n")
# And this tells R to look for an explicit .
str_subset(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
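To see the four-backslash rule in action, here is a minimal sketch. The input string and the variable name `has_backslash` are made up for illustration.

```{r}
library(stringr)

# The R string "a\\b" holds three characters: a, one backslash, b
x <- "a\\b"
writeLines(x)

# Four backslashes in the R string give the regular expression \\,
# which matches a single literal backslash
has_backslash <- str_detect(x, "\\\\")
has_backslash
```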
### Character classes and alternatives
As well as `.` there are a number of other special patterns that match more than one character:
* `\d`: any digit
* `\s`: any whitespace (space, tab, newline)
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
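As a small illustration of these classes, the sketch below checks a few made-up strings. Note that a negated class is written with `^` as the first character inside the brackets.

```{r}
library(stringr)

x <- c("abc", "abd", "a 1")

# \d and \s must be written with a doubled backslash inside an R string
has_digit <- str_detect(x, "\\d")
has_space <- str_detect(x, "\\s")

# [^abc] matches any single character other than a, b, or c
has_other <- str_detect(x, "[^abc]")

has_digit
has_space
has_other
```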
A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
```{r}
str_detect(c("abc", "xyz"), "abc|xyz")
```
As in mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
Unfortunately, parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)`, which are called non-capturing parentheses. Most of the time this won't make any difference, so it's easier to use `()`, but it's sometimes helpful to be aware of `(?:)`.
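One place where the difference shows up is `str_match()`, which returns one column for the whole match plus one column per capturing group. This is only a preview sketch; the variable names are made up.

```{r}
library(stringr)

# Capturing parentheses add an extra column holding the group's text
capturing <- str_match("grey", "gr(e|a)y")
capturing

# Non-capturing parentheses group without creating that column
non_capturing <- str_match("grey", "gr(?:e|a)y")
non_capturing
```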
### Repetition
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
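Note that the precedence of these operators is high: they apply only to the single character or group immediately before them, so you can write `colou?r` to match either spelling. That means you'll need to use parentheses to repeat anything longer: `bana(na)+` or `ba(na){2,}`.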
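The sketch below tries these patterns on a few made-up strings, and also contrasts a greedy match with its lazy counterpart.

```{r}
library(stringr)

x <- c("color", "colour", "banana", "bananana")

# ? applies only to the u immediately before it
either_spelling <- str_detect(x, "colou?r")

# Parentheses make {2,} apply to the whole group "na"
two_or_more_na <- str_detect(x, "ba(na){2,}")

# Greedy matching takes as many repeats as possible; adding ? makes it lazy
greedy <- str_extract("bananana", "ba(na)+")
lazy <- str_extract("bananana", "ba(na)+?")

either_spelling
two_or_more_na
c(greedy = greedy, lazy = lazy)
```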
### Anchors
* `^` match the start of the line
* `$` match the end of the line
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
To force a regular expression to only match a complete string:
```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
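The same search can be sketched in R. The vector of function names below is just the set mentioned above, and the variable names are made up.

```{r}
library(stringr)

funs <- c("sum", "summarise", "summary", "rowsum")

# Without word boundaries, every name containing "sum" matches
loose <- str_subset(funs, "sum")

# With \b on both sides, only the standalone word matches
strict <- str_subset(funs, "\\bsum\\b")

loose
strict
```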
### Exercises
1. Replace all `/` in a string with `\`.
## Detecting matches
`str_detect()`, `str_subset()`, `str_count()`
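As a rough sketch of how these three functions differ (the input vector and result names are made up): `str_detect()` returns a logical vector, `str_subset()` returns the matching elements, and `str_count()` returns the number of matches in each element.

```{r}
library(stringr)

x <- c("apple", "banana", "pear")

detected <- str_detect(x, "an")  # one TRUE/FALSE per string
subsetted <- str_subset(x, "p")  # only the strings that match
counted <- str_count(x, "a")     # number of matches per string

detected
subsetted
counted
```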
## Extracting matches
`str_extract()`, `str_extract_all()`
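A minimal sketch with made-up inputs: `str_extract()` returns the first match in each string, while `str_extract_all()` returns a list containing every match.

```{r}
library(stringr)

x <- c("apple pie", "banana", "kiwi")

# First vowel in each string
first_vowel <- str_extract(x, "[aeiou]")

# Every vowel in each string, as a list with one element per input
all_vowels <- str_extract_all(x, "[aeiou]")

first_vowel
all_vowels
```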
### Groups
`str_match()`, `str_match_all()`
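As an illustrative sketch, `str_match()` returns a character matrix: the first column holds the complete match and each following column holds one parenthesised group. The date string below is made up.

```{r}
library(stringr)

# Three groups: year, month, day
parts <- str_match("2015-10-21", "(\\d+)-(\\d+)-(\\d+)")
parts

# Individual groups can be picked out by column
year <- parts[1, 2]
year
```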
## Replacing patterns
`str_replace()`, `str_replace_all()`
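A short sketch of the difference, using a made-up input: `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match.

```{r}
library(stringr)

first_only <- str_replace("banana", "a", "o")
every_match <- str_replace_all("banana", "a", "o")

first_only
every_match
```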
## Splitting
`str_split()`, `str_split_fixed()`.
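As a minimal sketch with a made-up input, `str_split()` returns a list (one element per input string), whereas `str_split_fixed()` returns a character matrix with a fixed number of columns.

```{r}
library(stringr)

# A list: each input string becomes a character vector of pieces
pieces <- str_split("a,b,c", ",")
pieces

# A matrix: splitting stops once n pieces have been produced
two_pieces <- str_split_fixed("a,b,c", ",", n = 2)
two_pieces
```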
## Other types of pattern
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control how the match is performed. There are also other functions you can use in place of `regex()`:
* `fixed()`: matches exactly that sequence of characters (i.e. it ignores
all special regular expression characters).
* `coll()`: compare strings using standard **coll**ation rules. This is
useful for doing case insensitive matching. Note that `coll()` takes a
`locale` parameter that controls which rules are used for comparing
characters. Unfortunately different parts of the world use different rules!
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
str_subset(i, fixed("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```
## Other uses of regular expressions
There are a few other functions in base R that accept regular expressions:
* `apropos()` searches all objects available from the global environment. This
is useful if you can't quite remember the name of the function.
* `ls()` is similar to `apropos()` but only works in the current
environment. However, if you have so many objects in your environment
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
* `dir()` lists all the files in a directory. The `pattern` argument takes
a regular expression and only returns file names that match the pattern.
For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
(If you're more comfortable with "globs" like `*.csv`, you can convert
them to regular expressions with `glob2rx()`.)
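The `dir()` and `glob2rx()` behaviour described above can be tried safely in a temporary directory. This is only a sketch: the directory and file names are made up for illustration.

```{r}
# Create a scratch directory with a few empty files (hypothetical names)
tmp <- tempfile("regex-demo-")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt", "csv-readme.md"))))

# Only names that end in ".csv" match; "csv-readme.md" does not
csv_files <- dir(tmp, pattern = "\\.csv$")
csv_files

# glob2rx() turns a glob into an equivalent regular expression
glob_regex <- glob2rx("*.csv")
glob_regex
csv_from_glob <- dir(tmp, pattern = glob_regex)

# Clean up the scratch directory
unlink(tmp, recursive = TRUE)
```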