More about strings
parent
88626be626
commit
ec529ef1fa
213
strings.Rmd
|
@ -6,6 +6,7 @@ output: bookdown::html_chapter
|
|||
|
||||
```{r setup, include=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE)
|
||||
library(stringr)
|
||||
```
|
||||
|
||||
# String manipulation
|
||||
|
@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is
|
|||
|
||||
In this chapter, you'll learn the basics of regular expressions using the stringr package.
|
||||
|
||||
```{r}
|
||||
library(stringr)
|
||||
```
|
||||
|
||||
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
|
||||
|
||||
## String basics
|
||||
|
@ -28,56 +33,153 @@ x
|
|||
writeLines(x)
|
||||
```
|
||||
|
||||
### String length
|
||||
|
||||
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
|
||||
|
||||
```{r}
|
||||
# (Will be fixed in R 3.3.0)
|
||||
nchar(NA)
|
||||
|
||||
stringr::str_length(NA)
|
||||
str_length(NA)
|
||||
```
|
||||
|
||||
## Introduction to stringr
|
||||
### Combining strings
|
||||
|
||||
To combine two or more strings, use `str_c()`:
|
||||
|
||||
```{r}
|
||||
library(stringr)
|
||||
str_c("x", "y")
|
||||
str_c("x", "y", "z")
|
||||
```
|
||||
|
||||
The stringr package contains functions for working with strings and patterns. We'll focus on three:
|
||||
Use the `sep` argument to control how they're separated:
|
||||
|
||||
* `str_detect(string, pattern)`: does string match a pattern?
|
||||
* `str_extract(string, pattern)`: extract matching pattern from string
|
||||
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
|
||||
* `str_split(string, pattern)`.
|
||||
```{r}
|
||||
str_c("x", "y", sep = ", ")
|
||||
```
|
||||
|
||||
## Extracting patterns
|
||||
As in most other R functions, missing values are contagious: combining anything with `NA` gives `NA`. If you want them to print as `"NA"`, use `str_replace_na()`:
|
||||
|
||||
## Introduction to regular expressions
|
||||
```{r}
|
||||
str_c("x", NA, "y")
|
||||
str_c("x", str_replace_na(NA), "y")
|
||||
```
|
||||
|
||||
Goal is not to be exhaustive.
|
||||
`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
|
||||
|
||||
### Character classes and alternative
|
||||
```{r}
|
||||
str_c("prefix-", c("a", "b", "c"), "-suffix")
|
||||
```
|
||||
|
||||
* `.`: any character
|
||||
* `\d`: a digit
|
||||
* `\s`: whitespace
|
||||
To collapse vectors into a single string, use `collapse`:
|
||||
|
||||
* `x|y`: match x or y
|
||||
```{r}
|
||||
str_c(c("x", "y", "z"), collapse = ", ")
|
||||
```
|
||||
|
||||
When creating strings you might also find `str_pad()` and `str_dup()` useful:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_pad(x, 10)
|
||||
|
||||
str_c("Na ", str_dup("na ", 4), "batman!")
|
||||
```
|
||||
|
||||
### Subsetting strings
|
||||
|
||||
You can extract parts of a string using `str_sub()`:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "banana", "pear")
|
||||
str_sub(x, 1, 3)
|
||||
# negative numbers count backwards from end
|
||||
str_sub(x, -3, -1)
|
||||
```
|
||||
|
||||
You can also use `str_sub()` to modify strings:
|
||||
|
||||
```{r}
|
||||
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
|
||||
x
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. In your own words, describe the difference between `sep` and `collapse`.
|
||||
|
||||
## Regular expressions
|
||||
|
||||
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
|
||||
|
||||
* What matches the pattern?
|
||||
* Does a string match a pattern?
|
||||
* How can you replace a pattern with text?
|
||||
* How can you split a string into pieces?
|
||||
|
||||
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
|
||||
|
||||
```{r}
|
||||
|
||||
```
|
||||
|
||||
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to resources where you can learn more about regular expressions.
|
||||
|
||||
### Matching anything and escaping
|
||||
|
||||
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
|
||||
|
||||
```{r}
|
||||
str_subset(c("abc", "adc", "bef"), "a.c")
|
||||
```
|
||||
|
||||
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
|
||||
|
||||
```{r}
|
||||
# To create the regular expression, we need \\
|
||||
dot <- "\\."
|
||||
|
||||
# But the expression itself only contains one:
|
||||
cat(dot, "\n")
|
||||
|
||||
# And this tells R to look for explicit .
|
||||
str_subset(c("abc", "a.c", "bef"), "a\\.c")
|
||||
```
|
||||
|
||||
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
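As a minimal sketch of the four-backslash rule (the example strings are made up for illustration), the chunk below builds a string that contains one literal backslash and then searches for it:

```{r}
library(stringr)

# A string containing a single literal backslash: a\b
x <- "a\\b"
writeLines(x)

# The regular expression \\ has to be typed as the R string "\\\\"
has_backslash <- str_detect(c(x, "ab"), "\\\\")
has_backslash
```

`writeLines()` shows the string as R stores it, which is a handy way to check how many backslashes actually made it into the pattern.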
|
||||
|
||||
### Character classes and alternatives
|
||||
|
||||
As well as `.`, there are a number of other special patterns that match any one of a set of characters:
|
||||
|
||||
* `\d`: any digit
|
||||
* `\s`: any whitespace (space, tab, newline)
|
||||
* `[abc]`: match a, b, or c
|
||||
* `[a-e]`: match any character between a and e
|
||||
* `[^abc]`: match anything except a, b, or c
|
||||
|
||||
### Escaping
|
||||
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
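A short sketch of these classes in use, assuming stringr is loaded; the input vectors are hypothetical and chosen only to show one match and one non-match per pattern:

```{r}
library(stringr)

# \d and \s need a doubled backslash inside an R string
has_digit <- str_detect(c("a1", "b"), "\\d")
has_space <- str_detect(c("a b", "ab"), "\\s")

# A bracketed class matches any single character listed in it
gray_grey <- str_subset(c("gray", "grey", "groy"), "gr[ae]y")

# A range matches any single character between its endpoints
has_a_to_e <- str_detect(c("cab", "xyz"), "[a-e]")
```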
|
||||
|
||||
You may have noticed that since `.` is a special regular expression character, you'll need to escape `.`
|
||||
A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
|
||||
|
||||
```{r}
|
||||
str_detect(c("abc", "xyz"), "abc|xyz")
|
||||
```
|
||||
|
||||
As in mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||
|
||||
```{r}
|
||||
str_detect(c("grey", "gray"), "gr(e|a)y")
|
||||
str_detect(c("grey", "gray"), "gr(?:e|a)y")
|
||||
```
|
||||
|
||||
Unfortunately, parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)`, which are called non-capturing parentheses. Most of the time this won't make any difference, so it's easier to use `()`, but it's sometimes helpful to be aware of `(?:)`.
|
||||
|
||||
### Repetition
|
||||
|
||||
* `?`: 0 or 1
|
||||
* `+`: 1 or more
|
||||
* `*`: 0 or more
|
||||
|
||||
* `{n}`: exactly n
|
||||
* `{n,}`: n or more
|
||||
* `{,m}`: at most m
|
||||
|
@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y
|
|||
|
||||
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
|
||||
|
||||
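Note that the precedence of these operators is high, so each one applies only to the character immediately before it: in `colou?r` the `?` makes just the `u` optional. That means you'll need to use parentheses to repeat anything longer: `bana(na)+` or `ba(na){2,}`.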
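The chunk below is a small sketch of repetition, grouping, and greedy versus lazy matching. The inputs are invented for illustration, and `str_extract()` is used here only to show which part of the string was matched:

```{r}
library(stringr)

# `?` applies only to the preceding "u"
optional_u <- str_detect(c("color", "colour", "colr"), "colou?r")

# Parentheses make the quantifier apply to the whole "na"
repeated_na <- str_extract("bananana", "ba(na){2,}")

# Greedy (the default) takes as much as possible; a trailing `?` takes as little
greedy <- str_extract("aaaa", "a{2,}")
lazy   <- str_extract("aaaa", "a{2,}?")
```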
|
||||
|
||||
### Anchors
|
||||
|
||||
* `^` match the start of the line
|
||||
* `$` match the end of the line
|
||||
* `\b` match boundary between words
|
||||
|
||||
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
|
||||
|
||||
To force a regular expression to match only a complete string, anchor it with both `^` and `$`:
|
||||
|
||||
```{r}
|
||||
str_detect(c("abcdef", "bcd"), "^bcd$")
|
||||
```
|
||||
|
||||
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
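The same search can be sketched in R. The vector of names below is hypothetical and stands in for the text you would search in an editor:

```{r}
library(stringr)

# Hypothetical function names and code fragments
candidates <- c("sum", "summarise", "rowsum", "sum(x)")

# \b has to be written as "\\b" inside an R string
whole_word <- str_subset(candidates, "\\bsum\\b")
whole_word
```

Only the entries where `sum` stands alone as a word are kept; `(` is not a word character, so `sum(x)` still counts.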
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Replace all `/` in a string with `\`.
|
||||
|
||||
## Detecting matches
|
||||
|
||||
`str_detect()`, `str_subset()`, `str_count()`
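As a rough illustration of how the three functions differ (the `fruit` vector is made up for this sketch): `str_detect()` returns a logical vector, `str_subset()` keeps the matching elements, and `str_count()` counts the matches in each element.

```{r}
library(stringr)

fruit <- c("apple", "banana", "pear")

detected <- str_detect(fruit, "an")   # one TRUE/FALSE per string
matching <- str_subset(fruit, "an")   # only the strings that match
n_a      <- str_count(fruit, "a")     # number of matches per string
```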
|
||||
|
||||
## Extracting matches
|
||||
|
||||
`str_extract()`, `str_extract_all()`
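A brief sketch of the difference between the two, using invented input: `str_extract()` returns the first match in each string (or `NA` if there is none), while `str_extract_all()` returns a list with every match.

```{r}
library(stringr)

x <- c("a1b22", "none")

first_match <- str_extract(x, "\\d+")      # character vector, same length as x
all_matches <- str_extract_all(x, "\\d+")  # list with one element per string
```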
|
||||
|
||||
### Groups
|
||||
|
||||
|
@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/
|
|||
|
||||
## Replacing patterns
|
||||
|
||||
`str_replace()`, `str_replace_all()`
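A minimal sketch with a made-up input: `str_replace()` changes only the first match in each string, whereas `str_replace_all()` changes every match.

```{r}
library(stringr)

first_only <- str_replace("banana", "a", "o")
every_one  <- str_replace_all("banana", "a", "o")
```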
|
||||
|
||||
## Splitting
|
||||
|
||||
`str_split()`, `str_split_fixed()`.
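The following sketch, on an invented comma-separated string, shows the two return types: `str_split()` gives a list, because each input can produce a different number of pieces, while `str_split_fixed()` gives a character matrix with a set number of columns.

```{r}
library(stringr)

pieces       <- str_split("a,b,c", ",")           # list of character vectors
pieces_fixed <- str_split_fixed("a,b,c", ",", 2)  # 1 x 2 character matrix
```

With `n = 2`, the remainder of the string after the first split is kept intact in the last column.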
|
||||
|
||||
## Other types of pattern
|
||||
|
||||
* `fixed()`
|
||||
* `coll()`
|
||||
* `boundary()`
|
||||
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the
|
||||
|
||||
* `fixed()`: matches exactly that sequence of characters (i.e. ignores
|
||||
  all special regular expression patterns).
|
||||
|
||||
* `coll()`: compare strings using standard **coll**ation rules. This is
|
||||
useful for doing case insensitive matching. Note that `coll()` takes a
|
||||
`locale` parameter that controls which rules are used for comparing
|
||||
characters. Unfortunately different parts of the world use different rules!
|
||||
|
||||
```{r}
|
||||
# Turkish has two i's: with and without a dot, and it
|
||||
# has a different rule for capitalising them:
|
||||
str_to_upper(c("i", "ı"))
|
||||
str_to_upper(c("i", "ı"), locale = "tr")
|
||||
|
||||
# That means you also need to be aware of the difference
|
||||
# when doing case insensitive matches:
|
||||
i <- c("I", "İ", "i", "ı")
|
||||
i
|
||||
|
||||
str_subset(i, fixed("i", TRUE))
|
||||
str_subset(i, coll("i", TRUE))
|
||||
str_subset(i, coll("i", TRUE, locale = "tr"))
|
||||
```
|
||||
|
||||
## Other uses of regular expressions
|
||||
|
||||
There are a few other functions in base R that accept regular expressions:
|
||||
|
||||
* `apropos()` searches all objects available from the global environment. This
|
||||
is useful if you can't quite remember the name of the function.
|
||||
|
||||
* `ls()` is similar to `apropos()` but only works in the current
|
||||
environment. However, if you have so many objects in your environment
|
||||
that you have to use a regular expression to filter them all, you
|
||||
need to think about what you're doing! (And probably use a list instead).
|
||||
|
||||
* `dir()` lists all the files in a directory. The `pattern` argument takes
|
||||
  a regular expression and only returns file names that match the pattern.
|
||||
For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
|
||||
(If you're more comfortable with "globs" like `*.csv`, you can convert
|
||||
them to regular expressions with `glob2rx()`)
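To try this without touching the file system, the sketch below applies the same patterns to a hypothetical vector of file names using base R's `grep()`; `dir(pattern = ...)` should filter real file names in the same way.

```{r}
# Hypothetical file names, standing in for the output of dir()
files <- c("data.csv", "notes.txt", "csv_readme.md")

# A hand-written regular expression: a literal "." followed by "csv" at the end
by_regex <- grep("\\.csv$", files, value = TRUE)

# The same filter, starting from a glob
rx      <- glob2rx("*.csv")
by_glob <- grep(rx, files, value = TRUE)
```

Printing `rx` shows the regular expression that `glob2rx()` generated, which is a useful way to learn how a glob translates.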
|
||||
|
|