More about strings

2015-10-22 13:17:00 -05:00
parent 88626be626
commit ec529ef1fa
1 changed files with 189 additions and 24 deletions
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -6,6 +6,7 @@ output: bookdown::html_chapter

 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
+library(stringr)
 ```

 # String manipulation
@@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is

 In this chapter, you'll learn the basics of regular expressions using the stringr package. 

+```{r}
+library(stringr)
+```
+
 The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.

 ## String basics
@@ -28,56 +33,153 @@ x
 writeLines(x)
 ```

+### String length
+
 Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)

 ```{r}
 # (Will be fixed in R 3.3.0)
 nchar(NA)
-
-stringr::str_length(NA)
+str_length(NA)
 ```

-## Introduction to stringr
+### Combining strings
+
+To combine two or more strings, use `str_c()`:

 ```{r}
-library(stringr)
+str_c("x", "y")
+str_c("x", "y", "z")
 ```

-The stringr package contains functions for working with strings and patterns. We'll focus on three:
+Use the `sep` argument to control how they're separated:

-* `str_detect(string, pattern)`: does string match a pattern?
-* `str_extract(string, pattern)`: extact matching pattern from string
-* `str_replace(string, pattern, replacement)`: replace pattern with replacement
-* `str_split(string, pattern)`.
+```{r}
+str_c("x", "y", sep = ", ")
+```

-## Extracting patterns
+As with most other functions in R, missing values are infectious: combining anything with `NA` gives `NA`. If you want them to appear as the string `"NA"` instead, use `str_replace_na()`:

-## Introduction to regular expressions
+```{r}
+str_c("x", NA, "y")
+str_c("x", str_replace_na(NA), "y")
+```

-Goal is not to be exhaustive.
+`str_c()` is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

-### Character classes and alternative
+```{r}
+str_c("prefix-", c("a", "b", "c"), "-suffix")
+```

-* `.`: any character
-* `\d`: a digit
-* `\s`: whitespace
+To collapse vectors into a single string, use `collapse`:

-* `x|y`: match x or y
+```{r}
+str_c(c("x", "y", "z"), collapse = ", ")
+```

+When creating strings you might also find `str_pad()` and `str_dup()` useful:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_pad(x, 10)
+
+str_c("Na ", str_dup("na ", 4), "batman!") 
+```
+
+### Subsetting strings
+
+You can extract parts of a string using `str_sub()`:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_sub(x, 1, 3)
+# negative numbers count backwards from end
+str_sub(x, -3, -1)
+```
+
+You can also use `str_sub()` to modify strings:
+
+```{r}
+str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
+x
+```
+
+### Exercises
+
+1.  In your own words, describe the difference between `sep` and `collapse`.
+
+## Regular expressions
+
+The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
+
+* What matches the pattern?
+* Does a string match a pattern? 
+* How can you replace a pattern with text?
+* How can you split a string into pieces?
+
+Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
+
+```{r}
+
+```
+
+The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to other resources where you can learn more about regular expressions.
+
+### Matching anything and escaping
+
+Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
+
+```{r}
+str_subset(c("abc", "adc", "bef"), "a.c")
+```
+
+But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to create the regular expression `\.` you need to write the string `"\\."`.
+
+```{r}
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+cat(dot, "\n")
+
+# And this tells R to look for explicit .
+str_subset(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
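+
+As a quick check of the four-backslash rule, here is a minimal sketch (the test string `x` is made up for illustration):
+
+```{r}
+library(stringr)
+
+# A string containing one literal backslash between a and b
+x <- "a\\b"
+writeLines(x)
+
+# The string "\\\\" is the regular expression \\, which matches one backslash
+str_detect(x, "\\\\")
+str_detect("ab", "\\\\")
+```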
+
+### Character classes and alternatives
+
+As well as `.` there are a number of other special patterns that match more than one character:
+
+* `\d`: any digit
+* `\s`: any whitespace (space, tab, newline)
 * `[abc]`: match a, b, or c
 * `[a-e]`: match any character between a and e
 * `[^abc]`: match anything except a, b, or c

-### Escaping
+Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
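+
+A small sketch of these classes in use (the example strings are invented for illustration):
+
+```{r}
+library(stringr)
+
+x <- c("room 101", "no digits", "tab\there")
+
+# "\\d" is the string form of the regular expression \d
+str_detect(x, "\\d")
+
+# "\\s" matches the spaces in the first two strings and the tab in the third
+str_detect(x, "\\s")
+
+# [abc] matches a single a, b, or c anywhere in the string
+str_subset(c("apple", "kiwi", "fig"), "[abc]")
+```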

-You may have noticed that since `.` is a special regular expression character, you'll need to escape `.`
+A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
+
+```{r}
+str_detect(c("abc", "xyz"), "abc|xyz")
+```
+
+As in mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
+
+```{r}
+str_detect(c("grey", "gray"), "gr(e|a)y")
+str_detect(c("grey", "gray"), "gr(?:e|a)y")
+```
+
+Unfortunately parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:...)`, which are called non-capturing parentheses. Most of the time this won't make any difference so it's easier to use `()`, but it's sometimes helpful to be aware of `(?:...)`.
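+
+One place the difference shows up is `str_match()`, which returns a column for each capturing group. This is only a brief sketch to illustrate the point:
+
+```{r}
+library(stringr)
+
+# Capturing parentheses: one column for the full match, one for the group
+m1 <- str_match("grey", "gr(e|a)y")
+m1
+
+# Non-capturing parentheses: only the full match is returned
+m2 <- str_match("grey", "gr(?:e|a)y")
+m2
+```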

 ### Repetition

 * `?`: 0 or 1
 * `+`: 1 or more
 * `*`: 0 or more
-
 * `{n}`: exactly n
 * `{n,}`: n or more
 * `{,m}`: at most m
@@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y

 (By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
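+
+To see greedy and lazy matching side by side, here is a small sketch (the input string is invented for illustration):
+
+```{r}
+library(stringr)
+
+x <- "<a><b>"
+
+# Greedy: .+ grabs as much as possible, so the match runs to the last >
+greedy <- str_extract(x, "<.+>")
+greedy
+
+# Lazy: .+? grabs as little as possible, so the match stops at the first >
+lazy <- str_extract(x, "<.+?>")
+lazy
+```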

+Note that the precedence of these operators is high, so in `colou?r` the `?` applies only to the `u`. That means you'll need to use parentheses to repeat more than one character: `bana(na)+` or `ba(na){2,}`.
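+
+A short sketch of how this precedence plays out (the example words are chosen for illustration):
+
+```{r}
+library(stringr)
+
+# ? applies only to the u, so both spellings match
+spellings <- str_detect(c("color", "colour", "colr"), "colou?r")
+spellings
+
+# Without parentheses, + repeats only the final "a"
+no_parens <- str_extract("bananana", "bana+")
+no_parens
+
+# With parentheses, + repeats the whole "na"
+with_parens <- str_extract("bananana", "ba(na)+")
+with_parens
+```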
+
 ### Anchors

 * `^` match the start of the line
 * `$` match the end of the line
-* `\b` match boundary between words

 My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).

+To force a regular expression to only match a complete string:
+
+```{r}
+str_detect(c("abcdef", "bcd"), "^bcd$")
+```
+
+You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
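+
+The same search can be sketched in R, assuming a made-up vector of function names:
+
+```{r}
+library(stringr)
+
+fns <- c("sum", "summarise", "summary", "rowsum", "sum(x)")
+
+# \b matches only where a word character meets a non-word character
+# (or the start or end of the string)
+whole_word <- str_subset(fns, "\\bsum\\b")
+whole_word
+```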
+
+### Exercises
+
+1.   Replace all `/` in a string with `\`.

 ## Detecting matches

+`str_detect()`, `str_subset()`, `str_count()`
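+
+As a rough sketch of how the three differ (the vector `x` is just an example):
+
+```{r}
+library(stringr)
+
+x <- c("apple", "banana", "pear")
+
+# Logical vector: does each string contain the pattern?
+str_detect(x, "an")
+
+# Only the strings that match
+str_subset(x, "an")
+
+# Number of matches in each string
+str_count(x, "a")
+```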
+
+## Extracting matches
+
+`str_extract()`, `str_extract_all()`
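+
+A minimal sketch of the difference between the two (the example strings are invented):
+
+```{r}
+library(stringr)
+
+x <- c("apple pie", "banana split")
+
+# First match in each string, returned as a character vector
+first_vowel <- str_extract(x, "[aeiou]")
+first_vowel
+
+# Every match in each string, returned as a list with one element per string
+all_vowels <- str_extract_all(x, "[aeiou]")
+all_vowels
+```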

 ### Groups

@@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/

 ## Replacing patterns

+`str_replace()`, `str_replace_all()`
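+
+A brief sketch of the two, using made-up input:
+
+```{r}
+library(stringr)
+
+x <- c("apple", "banana")
+
+# Replace only the first match in each string
+first_only <- str_replace(x, "a", "-")
+first_only
+
+# Replace every match in each string
+every <- str_replace_all(x, "a", "-")
+every
+```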
+
+## Splitting
+
+`str_split()`, `str_split_fixed()`.
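+
+A small sketch of how the return values differ (the input is illustrative only):
+
+```{r}
+library(stringr)
+
+x <- "a,b,c"
+
+# A list with one character vector of pieces per input string
+pieces <- str_split(x, ",")
+pieces
+
+# A character matrix with exactly n columns; the last column keeps the remainder
+fixed_pieces <- str_split_fixed(x, ",", 2)
+fixed_pieces
+```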
+
 ## Other types of pattern

-* `fixed()`
-* `coll()`
-* `boundary()`
+When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the 
+
+* `fixed()`: matches exactly that sequence of characters (i.e. ignores
+  all special regular expression patterns).
+  
+* `coll()`: compare strings using standard **coll**ation rules. This is 
+  useful for doing case insensitive matching. Note that `coll()` takes a
+  `locale` parameter that controls which rules are used for comparing
+  characters. Unfortunately different parts of the world use different rules!
+
+```{r}
+# Turkish has two i's: with and without a dot, and it
+# has a different rule for capitalising them:
+str_to_upper(c("i", "ı"))
+str_to_upper(c("i", "ı"), locale = "tr")
+
+# That means you also need to be aware of the difference
+# when doing case insensitive matches:
+i <- c("I", "İ", "i", "ı")
+i
+
+str_subset(i, fixed("i", TRUE))
+str_subset(i, coll("i", TRUE))
+str_subset(i, coll("i", TRUE, locale = "tr"))
+```
+
+## Other uses of regular expressions
+
+There are a few other functions in base R that accept regular expressions:
+
+*   `apropos()` searches all objects available from the global environment. This
+    is useful if you can't quite remember the name of the function.
+   
+*   `ls()` is similar to `apropos()` but only works in the current 
+    environment. However, if you have so many objects in your environment
+    that you have to use a regular expression to filter them all, you 
+    need to think about what you're doing! (And probably use a list instead).
+
+*   `dir()` lists all the files in a directory. The `pattern` argument takes
+    a regular expression and only returns file names that match the pattern.
+    For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
+    (If you're more comfortable with "globs" like `*.csv`, you can convert
+    them to regular expressions with `glob2rx()`.)
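+
+To try the csv pattern without touching the file system, here is a sketch that applies it to a made-up vector of file names with base R's `grepl()`:
+
+```{r}
+# Hypothetical file names, used only for illustration
+files <- c("data.csv", "notes.txt", "csv_readme.md", "old.csv.bak")
+
+# The same pattern you would pass to dir(pattern = ...)
+by_regex <- grepl("\\.csv$", files)
+by_regex
+
+# Convert a glob to a regular expression and apply it the same way
+rx <- glob2rx("*.csv")
+rx
+by_glob <- grepl(rx, files)
+by_glob
+```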