More about strings

hadley 2015-10-22 13:17:00 -05:00
parent 88626be626
commit ec529ef1fa
1 changed files with 189 additions and 24 deletions

@ -6,6 +6,7 @@ output: bookdown::html_chapter
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
```
# String manipulation
@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is
In this chapter, you'll learn the basics of regular expressions using the stringr package.
```{r}
library(stringr)
```
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr and includes many more functions. stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
## String basics
@ -28,56 +33,153 @@ x
writeLines(x)
```
### String length
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
stringr::str_length(NA)
str_length(NA)
```
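As a small sketch of how `str_length()` behaves on a vector (the input strings here are made up for illustration), note that it works element by element and keeps missing values missing:

```{r}
library(stringr)

# One length per element; NA stays NA instead of becoming 2
n_chars <- str_length(c("a", "R for data science", NA))
n_chars
```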
## Introduction to stringr
### Combining strings
To combine two or more strings, use `str_c()`:
```{r}
library(stringr)
str_c("x", "y")
str_c("x", "y", "z")
```
The stringr package contains functions for working with strings and patterns. We'll focus on three:
Use the `sep` argument to control how they're separated:
* `str_detect(string, pattern)`: does string match a pattern?
* `str_extract(string, pattern)`: extract matching pattern from string
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
* `str_split(string, pattern)`.
```{r}
str_c("x", "y", sep = ", ")
```
## Extracting patterns
As with most other functions in R, missing values are infectious: combining any string with `NA` gives `NA`. If you want them to print as `"NA"`, use `str_replace_na()`:
## Introduction to regular expressions
```{r}
str_c("x", NA, "y")
str_c("x", str_replace_na(NA), "y")
```
Goal is not to be exhaustive.
`str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
### Character classes and alternative
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
* `.`: any character
* `\d`: a digit
* `\s`: whitespace
To collapse vectors into a single string, use `collapse`:
* `x|y`: match x or y
```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```
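The two arguments can also be combined. The following is a minimal sketch (the vectors `first` and `second` are invented for this example) showing that `sep` is applied within each element, while `collapse` is applied afterwards across elements:

```{r}
library(stringr)

first <- c("a", "b", "c")
second <- c("1", "2", "3")

# sep goes between the pieces of each element: still three strings
pairs <- str_c(first, second, sep = "-")
pairs

# collapse then joins those three strings into a single string
joined <- str_c(first, second, sep = "-", collapse = ", ")
joined
```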
When creating strings you might also find `str_pad()` and `str_dup()` useful:
```{r}
x <- c("apple", "banana", "pear")
str_pad(x, 10)
str_c("Na ", str_dup("na ", 4), "batman!")
```
### Subsetting strings
You can extract parts of a string using `str_sub()`:
```{r}
x <- c("apple", "banana", "pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
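It can also help to know what happens when a string is shorter than the range you ask for. A quick sketch, using a one-character string chosen for illustration: `str_sub()` does not fail, it just returns as much as it can.

```{r}
library(stringr)

# Asking for characters 1 to 5 of a 1-character string
short <- str_sub("a", 1, 5)
short
```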
You can also use `str_sub()` to modify strings:
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
### Exercises
1. In your own words, describe the difference between `sep` and `collapse`.
## Regular expressions
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
```{r}
```
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to other resources where you can learn more about regular expressions.
### Matching anything and escaping
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
```{r}
str_subset(c("abc", "adc", "bef"), "a.c")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
```{r}
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
cat(dot, "\n")
# And this tells R to look for explicit .
str_subset(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
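Here is a short sketch of the four-backslash rule in action. The test string `x` is made up for this example; `writeLines()` shows what it really contains.

```{r}
library(stringr)

# In R source this is typed with two backslashes,
# but the string itself contains only one: a\b
x <- "a\\b"
writeLines(x)

# Four backslashes in the string make the regular expression \\,
# which matches a single literal backslash
has_backslash <- str_detect(c(x, "ab"), "\\\\")
has_backslash
```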
### Character classes and alternatives
As well as `.`, there are a number of other special patterns that can each match more than one possible character:
* `\d`: any digit
* `\s`: any whitespace (space, tab, newline)
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c
### Escaping
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
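To make the character classes concrete, here is a small sketch with invented inputs. It relies on the fact that inside square brackets a leading `^` negates the class.

```{r}
library(stringr)

x <- c("room 101", "tab\there", "nospace")

# \d needs a doubled backslash inside an R string
has_digit <- str_detect(x, "\\d")
has_digit

# \s matches the space in the first string and the tab in the second
has_space <- str_detect(x, "\\s")
has_space

# First character between a and e
first_a_to_e <- str_extract("grey", "[a-e]")
first_a_to_e

# First character that is not a, b, or c
first_not_abc <- str_extract("cabs", "[^abc]")
first_not_abc
```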
You may have noticed that since `.` is a special regular expression character, you'll need to escape `.`
A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
```{r}
str_detect(c("abc", "xyz"), "abc|xyz")
```
As in mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
Unfortunately, parentheses have some other side effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)`, which are called non-capturing parentheses. Most of the time this won't make any difference, so it's easier to use `()`, but it's sometimes helpful to be aware of `(?:)`.
### Repetition
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
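A minimal sketch of the difference between greedy and lazy matching, using a made-up snippet of HTML as the input:

```{r}
library(stringr)

x <- "<b>bold</b>"

# Greedy: .+ grabs as much as it can, so the match runs to the last >
greedy <- str_extract(x, "<.+>")
greedy

# Lazy: .+? stops at the first > it can
lazy <- str_extract(x, "<.+?>")
lazy
```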
Note that the precedence of these operators is high, so you can write `colou?r` and the `?` applies only to the `u`. That means you'll need parentheses to repeat more than one character: `bana(na)+` or `ba(na){2,}`.
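The following sketch (inputs chosen for illustration) shows how that precedence plays out: without parentheses a repetition operator applies only to the single character before it.

```{r}
library(stringr)

# ? applies only to the u
spellings <- str_detect(c("color", "colour", "colr"), "colou?r")
spellings

# + applies only to the final a, so the match stops at the next n
no_parens <- str_extract("bananana", "bana+")
no_parens

# With parentheses, + repeats the whole group "na"
with_parens <- str_extract("bananana", "bana(na)+")
with_parens
```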
### Anchors
* `^` match the start of the line
* `$` match the end of the line
* `\b` match boundary between words
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
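The same search can be sketched in R. The vector of function names below is just the set mentioned above, and remember that `\b` needs a doubled backslash inside an R string:

```{r}
library(stringr)

funs <- c("sum", "summarise", "summary", "rowsum")

# Without word boundaries, every name contains "sum"
loose <- str_subset(funs, "sum")
loose

# With \b on both sides, only the standalone word matches
strict <- str_subset(funs, "\\bsum\\b")
strict
```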
### Exercises
1. Replace all `/` in a string with `\`.
## Detecting matches
`str_detect()`, `str_subset()`, `str_count()`
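As a rough sketch of how these three functions differ (the input vector is made up for illustration): `str_detect()` returns a logical vector, `str_subset()` returns the matching strings themselves, and `str_count()` returns the number of matches in each string.

```{r}
library(stringr)

x <- c("apple", "banana", "pear")

# Does each string contain "an"?
detected <- str_detect(x, "an")
detected

# Keep only the strings that match
kept <- str_subset(x, "an")
kept

# How many a's are in each string?
counts <- str_count(x, "a")
counts
```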
## Extracting matches
`str_extract()`, `str_extract_all()`
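A small sketch of the difference between the two, assuming some invented input strings: `str_extract()` returns the first match in each string (or `NA` if there is none), while `str_extract_all()` returns a list with every match.

```{r}
library(stringr)

x <- c("3 apples and 12 pears", "no numbers")

# First run of digits in each string; NA where nothing matches
first_match <- str_extract(x, "\\d+")
first_match

# All runs of digits: a list with one character vector per string
all_matches <- str_extract_all(x, "\\d+")
all_matches
```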
### Groups
@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/
## Replacing patterns
`str_replace()`, `str_replace_all()`
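A minimal sketch of the two functions on a made-up string: `str_replace()` changes only the first match, while `str_replace_all()` changes every match.

```{r}
library(stringr)

x <- "banana"

# Only the first a is replaced
first_only <- str_replace(x, "a", "-")
first_only

# Every a is replaced
every <- str_replace_all(x, "a", "-")
every
```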
## Splitting
`str_split()`, `str_split_fixed()`.
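Here is a short sketch of how the two differ, using invented comma-separated strings. `str_split()` returns a list because each string can produce a different number of pieces; `str_split_fixed()` returns a character matrix with a fixed number of columns, padding with empty strings where needed.

```{r}
library(stringr)

x <- c("a,b,c", "d,e")

# A list: three pieces for the first string, two for the second
pieces <- str_split(x, ",")
pieces

# A 2 x 3 character matrix; the missing piece is filled with ""
mat <- str_split_fixed(x, ",", 3)
mat
```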
## Other types of pattern
* `fixed()`
* `coll()`
* `boundary()`
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the
* `fixed()`: matches exactly that sequence of characters (i.e. it ignores
all special regular expression patterns).
* `coll()`: compare strings using standard **coll**ation rules. This is
useful for doing case insensitive matching. Note that `coll()` takes a
`locale` parameter that controls which rules are used for comparing
characters. Unfortunately different parts of the world use different rules!
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
str_subset(i, fixed("i", TRUE))
str_subset(i, coll("i", TRUE))
str_subset(i, coll("i", TRUE, locale = "tr"))
```
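To see the difference between the default `regex()` wrapper and `fixed()`, here is a small sketch with made-up inputs. It also calls `regex()` explicitly with its `ignore_case` argument:

```{r}
library(stringr)

x <- c("a.c", "abc")

# As a regular expression, . matches any character
as_regex <- str_detect(x, ".")
as_regex

# With fixed(), . only matches a literal full stop
as_fixed <- str_detect(x, fixed("."))
as_fixed

# Calling regex() explicitly lets you set options such as ignore_case
any_case <- str_detect("ABC", regex("abc", ignore_case = TRUE))
any_case
```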
## Other uses of regular expressions
There are a few other functions in base R that accept regular expressions:
* `apropos()` searches all objects available from the global environment. This
is useful if you can't quite remember the name of the function.
* `ls()` is similar to `apropos()` but only works in the current
environment. However, if you have so many objects in your environment
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
* `dir()` lists all the files in a directory. The `pattern` argument takes
a regular expression and only returns file names that match the pattern.
For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
(If you're more comfortable with "globs" like `*.csv`, you can convert
them to regular expressions with `glob2rx()`.)
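As a quick sketch of what `glob2rx()` produces, the example below converts a glob and applies the result to a made-up vector of file names with base R's `grepl()`, so it does not depend on the contents of any real directory:

```{r}
# Convert a glob into the equivalent regular expression
rx <- utils::glob2rx("*.csv")
writeLines(rx)

# Hypothetical file names, standing in for the output of dir()
files <- c("data.csv", "notes.txt", "csv_readme.md")
is_csv <- grepl(rx, files)
is_csv
```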