This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically contain unstructured or semi-structured data, so you need some tools to make order from the madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
In R, strings are stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
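For example, something like:

```{r}
string1 <- "This is a string"
string2 <- 'To include a "double quote" inside a string, use single quotes'
```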
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
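A small illustration (the vector `x` here is just an example):

```{r}
x <- c("\"", "\\")
x             # the printed representation shows the escapes
writeLines(x) # the raw contents of each string
```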
There are a handful of other special characters. The most commonly used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`; this is a way of writing non-English characters that works on all platforms:
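For instance (the example strings are illustrative):

```{r}
x <- c("one\ttwo", "first line\nsecond line", "\u00b5")
writeLines(x)
```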
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For example, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`).
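A quick comparison (assuming stringr is attached, as elsewhere in this chapter):

```{r}
nchar(NA)       # base R coerces NA to the string "NA" and returns 2
str_length(NA)  # stringr propagates the missing value
```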
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` triggers autocomplete, so you can easily see all of the stringr functions.
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) positions of the substring:
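For example (with a toy vector):

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# negative numbers count backwards from the end of the string
str_sub(x, -3, -1)
```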
Above I used `str_to_lower()` to change the text to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
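A sketch of what that looks like; Turkish is a common illustration because it distinguishes dotted and dotless i:

```{r}
# Turkish has two i's, with and without a dot, and different capitalisation rules
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```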
The locale is specified as an ISO 639 language code, which is a two or three letter abbreviation. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
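For example (the words chosen here are just an illustration):

```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")   # English
str_sort(x, locale = "haw")  # Hawaiian sorts differently
```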
Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
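For a first taste (toy vector; the second pattern uses `.`, which matches any character):

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")   # match an exact sequence of characters
str_view(x, ".a.")  # "." matches any character (apart from a newline)
```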
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
If `\` is used as an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
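A small demonstration:

```{r}
x <- "a\\b"
writeLines(x)       # the string contains a single backslash
str_view(x, "\\\\") # four backslashes in the string, two in the regexp
```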
By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:

* `^` to match the start of the string.
* `$` to match the end of the string.
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
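For example (toy vector):

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")  # starts with "a"
str_view(x, "a$")  # ends with "a"
```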
You can also match the boundary between words with `\b`. I don't often use this in R, but I sometimes use it when doing a find-all in RStudio, when I want to find the name of a function that's a component of other function names. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
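As a sketch of the same idea in R:

```{r}
# \b matches a word boundary, so "sum" is found only as a whole word
str_view(c("summary(x)", "sum(x)", "rowsum(x)"), "\\bsum\\b")
```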
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
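For example:

```{r}
str_view(c("abc", "deaf"), "abc|d..f")
# parentheses make the scope of | explicit
str_view(c("grey", "gray"), "gr(e|a)y")
```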
By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:
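A sketch of the difference (the example sentence is made up):

```{r}
x <- "A sentence with (some) parentheses (in it)."
str_view(x, "\\(.+\\)")   # greedy: spans from the first "(" to the last ")"
str_view(x, "\\(.+?\\)")  # lazy: stops at the first ")"
```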
Note that the precedence of these operators is high, so you can write `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
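For example:

```{r}
str_view(c("color", "colour"), "colou?r")
str_view("banananana", "bana(na)+")
```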
You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
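For example, using the `fruit` vector that ships with stringr:

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```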
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but they also create capturing groups. If you're using one set of parentheses for capturing and another set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only groups (disambiguating precedence) and doesn't capture. These are called non-capturing parentheses.
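A small comparison, using `str_match()` (covered later in this chapter) to show what gets captured:

```{r}
str_match(c("grey", "gray"), "gr(e|a)y")    # full match plus a column for the group
str_match(c("grey", "gray"), "gr(?:e|a)y")  # non-capturing: only the full match
```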
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:
Because regular expressions are so powerful, it's tempting to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easier to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
When you have complicated logical conditions (e.g. match this or that but not these), it's often easier to combine multiple `str_detect()` calls with logical operators than to write a single regular expression. A simple example is if you want to find all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(common, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
If you find your regular expression is getting hard to understand, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operators.
`str_count()` is similar to `str_detect()` but it returns an integer count of the number of matches, instead of a simple `TRUE`/`FALSE`:
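For example:

```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
# on average, how many vowels per word?
mean(str_count(x, "[aeiou]"))
```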
To extract the actual text of a match, use `str_extract()`. For that to be useful, we need a somewhat more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences): these are sentences designed to test VoIP systems, but we're going to use them as random data.
```{r}
length(sentences)
head(sentences)
```
Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
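One way to set this up (the exact set of colour names is an assumption):

```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match

has_colour <- str_subset(sentences, colour_match)
str_extract(head(has_colour), colour_match)      # first match in each sentence
str_extract_all(head(has_colour), colour_match)  # all matches
```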
`str_extract_all()` returns a list, which is a little harder to work with; that's why it's not the default. You'll learn more about working with lists in Chapter XYZ. Note that matches are always non-overlapping: the second match starts after the first is complete.
We talked earlier about the use of parentheses. You can use them if you want to extract parts of a match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the":
(Defining a "word" in a regular expression is a little tricky. I've decided to go for a sequence of any characters except for a space.)
`str_extract()` gives us the complete match, but we'd like to be able to dig into the pieces. That's the job of `str_match()`. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
str_match(has_noun, noun)
```
(You can see our heuristic for finding nouns isn't that good as it also picks up adjectives like smooth and parked.)
Like `str_extract()`, if you want all matches, you'll need to use `str_match_all()` and then work with the list that it returns.
Like `str_extract()` and `str_match()`, `str_replace()` only affects the first match. To replace every match, use `str_replace_all()`. Compared to the other two `_all()` functions, the output from `str_replace_all()` is simpler because it can stay as a character vector.
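For example:

```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```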
Another useful application is to split strings up into pieces. For example, we could split sentences up into words:
```{r}
sentences %>%
  head(5) %>%
  str_split(" ")
```
Note that this function has to return a list: the number of pieces each element is split up into might be different, so there's no way to put them in a vector. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
```
You'll learn other techniques in the lists chapter.
If you want all strings to be split up into the same number of pieces, you can use `str_split_fixed()`. This outputs a matrix with one row for each string and one column for each piece:
```{r}
c("Name: Hadley", "County: NZ", "Age: 35") %>%
str_split_fixed(": ", 2)
```
Instead of splitting up strings by patterns, you can also split up by a predefined set of boundaries with `boundary()`: by character, by line, by sentence and by word.
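For instance, splitting by word boundaries (the example sentence is made up):

```{r}
x <- "This is a sentence.  This is another sentence."
str_split(x, boundary("word"))[[1]]
```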
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it:
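A brief sketch of that workflow:

```{r}
x <- c("apple", "banana", "pear")
loc <- str_locate(x, "[aeiou]")  # position of the first vowel in each string
loc
str_sub(x, loc[, "start"], loc[, "end"])
```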
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, carefully picked to handle the most common string manipulation challenges. stringi, on the other hand, is designed to be comprehensive: it contains almost every function you might ever need. stringi has `r length(ls("package:stringi"))` functions to stringr's `r length(ls("package:stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The two packages work very similarly, so you should be able to translate your stringr knowledge naturally; the main difference is the prefix: `str_` vs. `stri_`.
### Encoding
Encodings are complicated and fraught with difficulty. The best approach is to convert to UTF-8 as soon as possible: all stringr and stringi functions do this, and readr always reads as UTF-8. The encodings you'll typically encounter are:
* UTF-8
* Latin1
* bytes: everything else
Generally, you should fix encoding problems during the data import phase.
Encoding detection operates statistically, by comparing the frequency of byte fragments across languages and encodings. It's fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."