Start on strings

2015-10-21 09:31:15 -05:00
parent da42f0d571
commit 88626be626
3 changed files with 112 additions and 2 deletions
--- a/.travis.yml
+++ b/.travis.yml
@@ -22,7 +22,7 @@ install:

  # Install R packages
  - ./travis-tool.sh r_binary_install knitr png
-  - ./travis-tool.sh r_install        ggplot2 dplyr tidyr pryr
+  - ./travis-tool.sh r_install        ggplot2 dplyr tidyr pryr stringr
  - ./travis-tool.sh github_package   hadley/bookdown garrettgman/DSR hadley/readr

 script: jekyll build
--- a/_includes/package-nav.html
+++ b/_includes/package-nav.html
@@ -3,8 +3,8 @@
 <li><a href="visualize.html">Visualize</a></li>
 -->
 <li><a href="transform.html">Transform</a></li>
+<li><a href="strings.html">String manipulation</a></li>
 <!--
-<li><a href="strings.html">Regular expresssions</a></li>
 <li><a href="dates.html">Dates and times</a></li>
 -->
 <li><a href="tidy.html">Tidy</a></li>
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -0,0 +1,110 @@
+---
+layout: default
+title: String manipulation
+output: bookdown::html_chapter
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+# String manipulation
+
+When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they let you express complex patterns in very little space.
+
+In this chapter, you'll learn the basics of regular expressions using the stringr package. 
+
+The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr and includes many more functions. stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
+
+## String basics
+
+In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
+
+To include a literal single or double quote in a string, use `\` to escape it. Note that when you print a string, the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
+
+```{r}
+x <- c("\"", "\\")
+x
+writeLines(x)
+```
+
+Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
+
+```{r}
+# (Will be fixed in R 3.3.0)
+nchar(NA)
+
+stringr::str_length(NA)
+```
+
+## Introduction to stringr
+
+```{r}
+library(stringr)
+```
+
+The stringr package contains functions for working with strings and patterns. We'll focus on four:
+
+* `str_detect(string, pattern)`: does string match a pattern?
+* `str_extract(string, pattern)`: extract matching pattern from string
+* `str_replace(string, pattern, replacement)`: replace pattern with replacement
+* `str_split(string, pattern)`: split string into pieces at each match of pattern
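+
+Here is a minimal sketch of these four functions in action. The vector `fruit` and the patterns are made up for illustration.
+
+```{r}
+library(stringr)
+
+# Hypothetical example data
+fruit <- c("apple pie", "banana split", "pear tart")
+
+str_detect(fruit, "an")        # which strings contain "an"?
+str_extract(fruit, "[a-z]+")   # first run of lower-case letters in each string
+str_replace(fruit, " ", "_")   # replace the first space with an underscore
+str_split(fruit, " ")          # split at each space; returns a list
+```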
+
+## Extracting patterns
+
+## Introduction to regular expressions
+
+The goal is not to be exhaustive.
+
+### Character classes and alternatives
+
+* `.`: any character
+* `\d`: a digit
+* `\s`: whitespace
+
+* `x|y`: match x or y
+
+* `[abc]`: match a, b, or c
+* `[a-e]`: match any character between a and e
+* `[^abc]`: match anything except a, b, or c
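+
+To see a few of these patterns at work, here is a small sketch using `str_detect()` and `str_extract()`. The input vector `x` is invented for the example. Note that a backslash in a regular expression has to be doubled inside an R string, so `\d` is written `"\\d"`.
+
+```{r}
+library(stringr)
+
+x <- c("abc", "a1c", "xyz")   # hypothetical input
+
+str_detect(x, "a.c")          # `.` matches any character
+str_detect(x, "\\d")          # contains a digit?
+str_detect(x, "abc|xyz")      # alternation: "abc" or "xyz"
+str_extract(x, "[a-e]+")      # first run of letters between a and e
+```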
+
+### Escaping
+
+Since `.` is a special regular expression character, you need to escape it with a backslash (`\.`) to match a literal `.`.
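+
+A short sketch of the difference between an unescaped and an escaped `.`, using a made-up vector `files`. The backslash is doubled because `\` is also the escape character in R strings.
+
+```{r}
+library(stringr)
+
+files <- c("a.c", "abc")      # hypothetical input
+
+str_detect(files, "a.c")      # unescaped: `.` matches any character
+str_detect(files, "a\\.c")    # escaped: matches only a literal "."
+```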
+
+### Repetition
+
+* `?`: 0 or 1
+* `+`: 1 or more
+* `*`: 0 or more
+
+* `{n}`: exactly n
+* `{n,}`: n or more
+* `{,m}`: at most m
+* `{n,m}`: between n and m
+
+(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
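+
+The difference between greedy and lazy matching is easiest to see side by side. In this sketch the string `x` is invented for illustration.
+
+```{r}
+library(stringr)
+
+x <- "aaaa"                # hypothetical input
+
+str_extract(x, "a{2}")     # exactly two
+str_extract(x, "a{2,}")    # two or more, greedy
+str_extract(x, "a{2,}?")   # two or more, lazy
+str_extract(x, "a+")       # one or more, greedy
+str_extract(x, "a+?")      # one or more, lazy
+```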
+
+### Anchors
+
+* `^`: match the start of the line
+* `$`: match the end of the line
+* `\b`: match the boundary between words
+
+My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
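+
+A quick sketch of the anchors, with an invented vector `x`:
+
+```{r}
+library(stringr)
+
+x <- c("apple", "pineapple", "apple pie")   # hypothetical input
+
+str_detect(x, "^apple")       # starts with "apple"
+str_detect(x, "apple$")       # ends with "apple"
+str_detect(x, "^apple$")      # is exactly "apple"
+str_detect(x, "\\bpie\\b")    # contains "pie" as a whole word
+```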
+
+
+## Detecting matches
+
+
+### Groups
+
+`str_match()`, `str_match_all()`
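+
+As a rough sketch of how groups work: parentheses in a pattern define groups, and `str_match()` returns a character matrix with the full match in the first column and one further column per group. The phone-number-like strings below are invented.
+
+```{r}
+library(stringr)
+
+x <- c("555-1234", "no number")   # hypothetical input
+
+m <- str_match(x, "(\\d{3})-(\\d{4})")
+m          # one row per string; NA where there is no match
+m[, 2]     # contents of the first group
+
+# str_match_all() returns a list with one matrix per input string,
+# each holding every match found in that string
+all_m <- str_match_all("555-1234 and 555-9876", "(\\d{3})-(\\d{4})")
+all_m
+```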
+
+## Replacing patterns
+
+## Other types of pattern
+
+* `fixed()`
+* `coll()`
+* `boundary()`
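+
+A brief sketch of what each of these does, assuming their usual stringr behaviour: `fixed()` matches the pattern as a literal string with no regular expression interpretation, `coll()` compares using collation rules, and `boundary()` matches boundaries between characters, words, lines, or sentences. The inputs are invented.
+
+```{r}
+library(stringr)
+
+# fixed(): "." is treated literally, not as "any character"
+str_count("a.b.c", fixed("."))
+
+# coll(): collation-aware comparison, here ignoring case
+str_detect("A", coll("a", ignore_case = TRUE))
+
+# boundary(): split a sentence into words
+words <- str_split("the quick brown fox", boundary("word"))
+words
+```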