Start on strings
parent
da42f0d571
commit
88626be626
|
@ -22,7 +22,7 @@ install:
|
|||
|
||||
# Install R packages
|
||||
- ./travis-tool.sh r_binary_install knitr png
|
||||
-  - ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr
+  - ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr stringr
|
||||
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
|
||||
|
||||
script: jekyll build
|
||||
|
|
|
@ -3,8 +3,8 @@
|
|||
<li><a href="visualize.html">Visualize</a></li>
|
||||
-->
|
||||
<li><a href="transform.html">Transform</a></li>
|
||||
<li><a href="strings.html">String manipulation</a></li>
|
||||
<!--
|
||||
<li><a href="strings.html">Regular expressions</a></li>
|
||||
<li><a href="dates.html">Dates and times</a></li>
|
||||
-->
|
||||
<li><a href="tidy.html">Tidy</a></li>
|
||||
|
|
|
@ -0,0 +1,110 @@
|
|||
---
|
||||
layout: default
|
||||
title: String manipulation
|
||||
output: bookdown::html_chapter
|
||||
---
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE)
|
||||
```
|
||||
|
||||
# String manipulation
|
||||
|
||||
When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.
|
||||
|
||||
In this chapter, you'll learn the basics of regular expressions using the stringr package.
|
||||
|
||||
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr and includes many more functions. stringr provides tools for the most common 90% of string manipulation challenges; stringi contains functions for the last 10%.
|
||||
|
||||
## String basics
|
||||
|
||||
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
|
||||
|
||||
To include a literal single or double quote in a string, you can use `\` to escape it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
|
||||
|
||||
```{r}
|
||||
x <- c("\"", "\\")
|
||||
x
|
||||
writeLines(x)
|
||||
```
|
||||
|
||||
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`).
|
||||
|
||||
```{r}
|
||||
# (Will be fixed in R 3.3.0)
|
||||
nchar(NA)
|
||||
|
||||
stringr::str_length(NA)
|
||||
```
|
||||
|
||||
## Introduction to stringr
|
||||
|
||||
```{r}
|
||||
library(stringr)
|
||||
```
|
||||
|
||||
The stringr package contains functions for working with strings and patterns. We'll focus on four:
|
||||
|
||||
* `str_detect(string, pattern)`: does string match a pattern?
|
||||
* `str_extract(string, pattern)`: extract matching pattern from string
|
||||
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
|
||||
* `str_split(string, pattern)`: split string into pieces at each match of pattern
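
As a rough illustration of the functions listed above, here is a minimal sketch. The vector `x` and the patterns are made up for this example; they are not part of the chapter's data.

```{r}
library(stringr)

# Hypothetical example data
x <- c("apple pie", "banana split", "cherry")

# Does each string contain "an"?
str_detect(x, "an")

# Extract the first vowel from each string
str_extract(x, "[aeiou]")

# Replace the first vowel in each string with "-"
str_replace(x, "[aeiou]", "-")

# Split each string at spaces; the result is a list with one element per string
str_split(x, " ")
```

Note that `str_extract()` and `str_replace()` only act on the first match in each string.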
|
||||
|
||||
## Extracting patterns
|
||||
|
||||
## Introduction to regular expressions
|
||||
|
||||
The goal here is not to be exhaustive.
|
||||
|
||||
### Character classes and alternatives
|
||||
|
||||
* `.`: any character
|
||||
* `\d`: a digit
|
||||
* `\s`: whitespace
|
||||
|
||||
* `x|y`: match x or y
|
||||
|
||||
* `[abc]`: match a, b, or c
|
||||
* `[a-e]`: match any character between a and e
|
||||
* `[^abc]`: match anything except a, b, or c
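
A short sketch of these building blocks in action, using a made-up vector `x`. Two things to keep in mind: backslashes are doubled because each pattern is written as an R string, and negation inside a character class is written with a leading `^`.

```{r}
library(stringr)

# Hypothetical example data
x <- c("abc", "a1c", "a c", "xyz")

str_detect(x, "a.c")      # "a", then any character, then "c"
str_detect(x, "\\d")      # contains a digit
str_detect(x, "\\s")      # contains whitespace
str_detect(x, "b|y")      # contains "b" or "y"

str_extract(x, "[a-e]")   # first character between a and e
str_extract(x, "[^abc]")  # first character that is not a, b, or c
```

When nothing matches, `str_extract()` returns `NA` for that element.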
|
||||
|
||||
### Escaping
|
||||
|
||||
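Because `.` is a special regular expression character, you need to escape it to match a literal `.`. The regular expression for a literal dot is `\.`, and since `\` is also the escape character in R strings, you write it as `"\\."`.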
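
A minimal sketch of the difference between an unescaped and an escaped dot, using a made-up vector `x`:

```{r}
library(stringr)

# Hypothetical example data
x <- c("a.c", "abc")

# Unescaped: "." matches any character, so both strings match
str_detect(x, "a.c")

# Escaped: only the string containing a literal "." matches
str_detect(x, "a\\.c")

# writeLines() shows the regular expression that the regex engine receives
writeLines("a\\.c")
```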
|
||||
|
||||
### Repetition
|
||||
|
||||
* `?`: 0 or 1
|
||||
* `+`: 1 or more
|
||||
* `*`: 0 or more
|
||||
|
||||
* `{n}`: exactly n
|
||||
* `{n,}`: n or more
|
||||
* `{0,m}`: at most m
|
||||
* `{n,m}`: between n and m
|
||||
|
||||
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
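
To make the greedy versus lazy distinction concrete, here is a small sketch. The input strings are invented for illustration.

```{r}
library(stringr)

# "?" makes the preceding "u" optional
str_detect(c("color", "colour"), "colou?r")

# Greedy: ".+" grabs as much as it can, up to the last ">"
str_extract("<b>bold</b>", "<.+>")

# Lazy: ".+?" stops at the first ">"
str_extract("<b>bold</b>", "<.+?>")

# The same idea applies to counted repetition
str_extract("aaaa", "a{2,3}")   # greedy: takes three
str_extract("aaaa", "a{2,3}?")  # lazy: takes two
```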
|
||||
|
||||
### Anchors
|
||||
|
||||
* `^` match the start of the line
|
||||
* `$` match the end of the line
|
||||
* `\b` match boundary between words
|
||||
|
||||
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
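
A brief sketch of anchors at work, with `^` for the start, `$` for the end, and `\b` for a word boundary. The vector `x` is made up for this example.

```{r}
library(stringr)

# Hypothetical example data
x <- c("apple", "pineapple", "apple pie")

str_detect(x, "^apple")        # starts with "apple"
str_detect(x, "apple$")        # ends with "apple"
str_detect(x, "^apple$")       # is exactly "apple"
str_detect(x, "\\bapple\\b")   # contains "apple" as a whole word
```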
|
||||
|
||||
|
||||
## Detecting matches
|
||||
|
||||
|
||||
### Groups
|
||||
|
||||
`str_match()`, `str_match_all()`
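
As a sketch of what these two functions return, assuming a simple invented date pattern: `str_match()` gives a character matrix with the full match in the first column and one column per parenthesised group, while `str_match_all()` gives a list of such matrices, with one row per match.

```{r}
library(stringr)

# Hypothetical example data
x <- c("2015-10-21", "no date")

# Three groups: year, month, day
m <- str_match(x, "(\\d{4})-(\\d{2})-(\\d{2})")
m

# Strings without a match produce a row of NAs
m[2, ]

# All matches in a single string: one matrix row per match
str_match_all("a1 b2", "([a-z])(\\d)")
```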
|
||||
|
||||
## Replacing patterns
|
||||
|
||||
## Other types of pattern
|
||||
|
||||
* `fixed()`
|
||||
* `coll()`
|
||||
* `boundary()`
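
A minimal sketch of how these three pattern modifiers might be used. The inputs are invented, and the choice of `ignore_case = TRUE` for `coll()` is only one possible illustration.

```{r}
library(stringr)

# fixed(): match the pattern literally, so "." is not special
str_detect(c("a.c", "abc"), fixed("."))

# coll(): compare using collation rules, here ignoring case
str_detect("ABC", coll("abc", ignore_case = TRUE))

# boundary(): match boundaries between characters, words, lines, or sentences
str_split("The quick fox.", boundary("word"))[[1]]
str_count("The quick fox.", boundary("word"))
```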
|