---
layout: default
title: String manipulation
output: bookdown::html_chapter
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# String manipulation

When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.

In this chapter, you'll learn the basics of regular expressions using the stringr package.

The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.

## String basics

In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.

To include a literal single or double quote in a string, use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).

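```{r}
# Two strings: a single double quote and a single backslash
x <- c("\"", "\\")
# Printing shows the escaped representation
x
# writeLines() shows the raw contents
writeLines(x)
```
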
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):

```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)

stringr::str_length(NA)
```

## Introduction to stringr

```{r}
library(stringr)
```

The stringr package contains functions for working with strings and patterns. We'll focus on four:

* `str_detect(string, pattern)`: does string match a pattern?
* `str_extract(string, pattern)`: extract matching pattern from string
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
* `str_split(string, pattern)`: split string into pieces at each match of pattern

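As a rough illustration of the functions listed above, here is a minimal sketch. The `fruit` vector and the patterns are made up for this example; they are not part of the chapter's data.

```{r}
library(stringr)

# Hypothetical example data
fruit <- c("apple", "banana", "pear")

str_detect(fruit, "an")        # does each string contain "an"?
str_extract(fruit, "[aeiou]")  # first vowel in each string
str_replace(fruit, "a", "A")   # replace the first "a" in each string
str_split("a,b,c", ",")        # split on commas; returns a list
```

Note that `str_split()` returns a list with one element per input string, because each string can split into a different number of pieces.
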
## Extracting patterns

## Introduction to regular expressions

The goal here is not to be exhaustive.

### Character classes and alternatives

* `.`: any character (except a newline)
* `\d`: a digit
* `\s`: whitespace

* `x|y`: match x or y

* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c

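The sketch below tries each kind of pattern on a small, made-up vector. Because `\` is also the escape character in R strings, `\d` and `\s` are written as `"\\d"` and `"\\s"`. The negated class is written `[^abc]`.

```{r}
library(stringr)

# Hypothetical example data
x <- c("abc", "a1c", "a c", "xyz")

str_detect(x, "a.c")      # `.` matches any single character between a and c
str_detect(x, "\\d")      # contains a digit?
str_detect(x, "\\s")      # contains whitespace?
str_detect(x, "b|y")      # contains b or y?
str_extract(x, "[x-z]")   # first character in the range x to z, else NA
str_extract(x, "[^abc]")  # first character that is not a, b, or c, else NA
```
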
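### Escaping

Because `.` is a special character in regular expressions, you need to escape it to match a literal `.`: the regular expression is `\.`. Since `\` is also the escape character in R strings, you write that pattern as the string `"\\."`.
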
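A minimal sketch of the difference, using two made-up strings. `writeLines()` is used to show the pattern as the regular expression engine receives it.

```{r}
library(stringr)

# Hypothetical example data
x <- c("a.c", "abc")

str_detect(x, "a.c")    # unescaped: `.` matches any character, so both match
str_detect(x, "a\\.c")  # escaped: only the string with a literal dot matches
writeLines("a\\.c")     # the pattern after R has processed the string: a\.c
```
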
### Repetition

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m
* `{n,m}`: between n and m

(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible, by putting a `?` after them.)

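To make the greedy and lazy behaviour concrete, here is a small sketch on made-up input. The strings are purely illustrative.

```{r}
library(stringr)

# Hypothetical example data
x <- "<b>bold</b>"

str_extract(x, "<.+>")           # greedy: runs to the last ">"
str_extract(x, "<.+?>")          # lazy: stops at the first ">"

str_extract("aaaa", "a{2,3}")    # greedy: takes three
str_extract("aaaa", "a{2,3}?")   # lazy: takes two

str_detect(c("color", "colour"), "colou?r")  # `?` makes the "u" optional
```
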
### Anchors

* `^`: match the start of the string
* `$`: match the end of the string
* `\b`: match a boundary between words

My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).

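A short sketch of the anchors `^`, `$`, and `\b` in use, on a made-up vector. As before, `\b` is written `"\\b"` inside an R string.

```{r}
library(stringr)

# Hypothetical example data
x <- c("apple pie", "pineapple", "apple")

str_detect(x, "^apple")        # starts with "apple"
str_detect(x, "apple$")        # ends with "apple"
str_detect(x, "^apple$")       # is exactly "apple"
str_detect(x, "\\bapple\\b")   # contains "apple" as a whole word
```
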
## Detecting matches

### Groups

`str_match()`, `str_match_all()`

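As a minimal sketch of how these two functions report groups, assume a made-up date pattern with three parenthesised groups. `str_match()` returns a character matrix with the full match in the first column and one column per group; `str_match_all()` returns a list of such matrices, one per input string.

```{r}
library(stringr)

# Hypothetical example data
x <- c("2015-11-03", "no date here")

# One row per input string; a non-matching string gives a row of NA
m <- str_match(x, "(\\d{4})-(\\d{2})-(\\d{2})")
m

# One matrix per input string, with one row per match
all_m <- str_match_all("a1 b2", "([a-z])(\\d)")
all_m
```
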
## Replacing patterns

## Other types of pattern

* `fixed()`
* `coll()`
* `boundary()`
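
The section above only names these helpers, so the following is a tentative sketch rather than the chapter's own treatment. It assumes the usual stringr behaviour: `fixed()` matches the pattern literally, `coll()` compares using locale-aware collation rules, and `boundary()` matches boundaries such as those between words. The inputs are made up.

```{r}
library(stringr)

# fixed(): the dot is treated as a literal character, not "any character"
str_detect(c("a.c", "abc"), fixed("."))

# coll(): collation-based comparison, here ignoring case
str_detect("I", coll("i", ignore_case = TRUE))

# boundary("word"): split a sentence into words
words <- str_split("This is a sentence.", boundary("word"))
words
```
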