r4ds/strings.Rmd

2015-10-21 22:31:15 +08:00
---
layout: default
title: String manipulation
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# String manipulation
When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they let you express complex patterns in very little space.
In this chapter, you'll learn the basics of regular expressions using the stringr package.
The chapter concludes with a brief look at the stringi package. stringr uses this package internally: it's more complex than stringr and includes many more functions. stringr gives you tools for the most common 90% of string manipulation challenges; stringi contains functions for the last 10%.
## String basics
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
To include a literal single or double quote in a string, use `\` to escape it. Note that the printed representation of a string shows the escapes. To see the raw contents of the string, use `writeLines()` (or, for a length-1 character vector, `cat(x, "\n")`).
```{r}
x <- c("\"", "\\")
x
writeLines(x)
```
Base R contains many functions for working with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that `nchar()`, the function that computes the number of characters in a string, returns 2 for `NA` (instead of `NA`):
```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
stringr::str_length(NA)
```
## Introduction to stringr
```{r}
library(stringr)
```
The stringr package contains functions for working with strings and patterns. We'll focus on four:
* `str_detect(string, pattern)`: does string match a pattern?
* `str_extract(string, pattern)`: extract matching pattern from string
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
* `str_split(string, pattern)`: split string into pieces at each match of pattern
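As a quick illustration of these functions, here is a minimal sketch. The vector `fruit_names` and the patterns are made up for this example.

```{r}
library(stringr)

# Hypothetical example data
fruit_names <- c("apple pie", "banana split", "cherry tart")

# Does each string contain "an"?
has_an <- str_detect(fruit_names, "an")
has_an

# Extract the first run of lower-case letters (here, the first word)
first_word <- str_extract(fruit_names, "[a-z]+")
first_word

# Replace the first space with an underscore
underscored <- str_replace(fruit_names, " ", "_")
underscored

# Split each string at the space; the result is a list of character vectors
pieces <- str_split(fruit_names, " ")
pieces
```

Note that `str_replace()` only changes the first match in each string, and that `str_split()` returns a list because each string can produce a different number of pieces.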
## Extracting patterns
## Introduction to regular expressions
The goal here is not to be exhaustive, but to cover the building blocks you'll use most often.
### Character classes and alternatives
* `.`: any character
* `\d`: a digit
* `\s`: whitespace
* `x|y`: match x or y
* `[abc]`: match a, b, or c
* `[a-e]`: match any character between a and e
* `[^abc]`: match anything except a, b, or c
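The sketch below tries several of these building blocks on a small, made-up vector `x`. Because `\` is also the escape character in R strings, the regular expressions `\d` and `\s` are written as `"\\d"` and `"\\s"`.

```{r}
library(stringr)

# Hypothetical example data
x <- c("a1", "b 2", "cat", "dog")

has_digit <- str_detect(x, "\\d")       # contains a digit
has_space <- str_detect(x, "\\s")       # contains whitespace
cat_or_dog <- str_detect(x, "cat|dog")  # matches either alternative
in_range <- str_extract(x, "[a-c]")     # first character in the range a to c
not_in_range <- str_extract(x, "[^a-c]") # first character outside that range

has_digit
has_space
cat_or_dog
in_range
not_in_range
```

`str_extract()` returns `NA` when a string has no match, which is what happens for `"dog"` with the pattern `[a-c]`.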
### Escaping
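Because `.` is a special character in regular expressions, you need to escape it when you want to match a literal `.`. The regular expression for that is `\.`, but since `\` is also the escape character in R strings, you have to write it as the string `"\\."`.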
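Here is a small sketch of the difference between an unescaped and an escaped `.`. The vector `files` is made up for the example; `fixed()` is stringr's helper for matching a pattern literally.

```{r}
library(stringr)

# Hypothetical example data
files <- c("a.txt", "abtxt")

# Unescaped, `.` matches any character, so both strings match
unescaped <- str_detect(files, "a.t")
unescaped

# The regular expression \. is written as the R string "\\."
escaped <- str_detect(files, "a\\.t")
escaped

# fixed() treats the pattern as literal text, so no escaping is needed
literal <- str_detect(files, fixed("a.t"))
literal
```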
### Repetition
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
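To make the quantifiers and the greedy/lazy distinction concrete, here is a short sketch on a made-up string `x`.

```{r}
library(stringr)

# Hypothetical example data
x <- "aaaa <b>bold</b>"

exactly_two <- str_extract(x, "a{2}")      # exactly two a's
two_or_more <- str_extract(x, "a{2,}")     # two or more a's
one_to_three <- str_extract(x, "a{1,3}")   # between one and three a's
greedy <- str_extract(x, "<.+>")           # longest possible match
lazy <- str_extract(x, "<.+?>")            # shortest possible match

exactly_two
two_or_more
one_to_three
greedy
lazy
```

The greedy pattern runs from the first `<` to the last `>`, while the lazy version stops at the first `>` it can.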
### Anchors
* `^`: match the start of the line
* `$`: match the end of the line
* `\b`: match the boundary between words
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
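A minimal sketch of the anchors in action, using a made-up vector `x`. As before, `\b` has to be written as `"\\b"` inside an R string.

```{r}
library(stringr)

# Hypothetical example data
x <- c("apple pie", "pineapple", "snapple")

starts_with_apple <- str_detect(x, "^apple")      # "apple" at the start
ends_with_apple <- str_detect(x, "apple$")        # "apple" at the end
whole_word_apple <- str_detect(x, "\\bapple\\b")  # "apple" as a whole word

starts_with_apple
ends_with_apple
whole_word_apple
```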
## Detecting matches
### Groups
`str_match()`, `str_match_all()`
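As a rough sketch of how these two functions behave: parentheses in a pattern define groups, `str_match()` returns a character matrix with the complete match in the first column and one further column per group, and `str_match_all()` returns a list of such matrices, one per input string. The data and patterns below are made up for illustration.

```{r}
library(stringr)

# Hypothetical example data
dates <- c("2015-10-21", "no date here")

# Three groups: year, month, day
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
m

year <- m[, 2]  # the first group
year

# All matches within a single string, one row per match
all_m <- str_match_all("a1 b2", "([a-z])(\\d)")
all_m[[1]]
```

Strings without a match produce a row of `NA`s in the `str_match()` result.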
## Replacing patterns
## Other types of pattern
* `fixed()`
* `coll()`
* `boundary()`
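A short sketch of what each of these pattern helpers does, using made-up inputs: `fixed()` matches the exact characters of the pattern, `coll()` compares strings using collation rules (here with its `ignore_case` argument), and `boundary()` matches boundaries between characters, words, lines, or sentences.

```{r}
library(stringr)

# fixed(): "." is treated as a literal dot, not as "any character"
has_dot <- str_detect(c("a.b", "ab"), fixed("."))
has_dot

# coll(): collation-based comparison, ignoring case in this example
same_ignoring_case <- str_detect("STRING", coll("string", ignore_case = TRUE))
same_ignoring_case

# boundary(): count and split on word boundaries
n_words <- str_count("a few words", boundary("word"))
n_words
words <- str_split("a few words", boundary("word"))[[1]]
words
```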