Writing about strings
parent
c0daa382c1
commit
26ab1cc1eb
222
strings.Rmd
|
@@ -8,35 +8,38 @@ status("restructuring")
|
||||||
|
|
||||||
This chapter introduces you to strings.
|
This chapter introduces you to strings.
|
||||||
You'll learn the basics of how strings work in R and how to create them "by hand".
|
You'll learn the basics of how strings work in R and how to create them "by hand".
|
||||||
Big topic so spread over three chapters: here we'll focus on the basic mechanics, in Chapter \@ref(regular-expressions) we'll dive into the details of regular expressions the sometimes cryptic language for describing patterns in strings, and we'll return to strings later in Chapter \@ref(programming-with-strings) when we think about them about from a programming perspective (rather than a data analysis perspective).
|
You'll also learn the basics of regular expressions, a powerful, but sometimes cryptic language for describing string patterns.
|
||||||
|
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions) to discuss more of the details.
|
||||||
We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
|
We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
|
||||||
|
|
||||||
While base R contains functions that allow us to perform pretty much all of the operations described in this chapter, here we're going to use the **stringr** package.
|
We'll come back to strings again in Chapter \@ref(programming-with-strings), where we'll think about them more from a programming perspective than a data analysis perspective.
|
||||||
stringr has been carefully designed to be as consistent as possible so that knowledge gained about one function can be more easily transferred to the next.
|
|
||||||
stringr functions all start with the same `str_` prefix.
|
|
||||||
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr's functions:
|
|
||||||
|
|
||||||
```{r, echo = FALSE}
|
|
||||||
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
This chapter will focus on the **stringr** package for string manipulation, which is part of the core tidyverse.
|
In this chapter, we'll use functions from the stringr package.
|
||||||
We'll also work with the babynames dataset.
|
The equivalent functionality is available in base R (through functions like `grepl()`, `gsub()`, and `regmatches()`) but we think you'll find stringr easier to use because it's been carefully designed to be as consistent as possible.
|
||||||
|
We'll also work with the babynames dataset since it provides some fun data to apply string manipulation to.
|
||||||
|
|
||||||
```{r setup, message = FALSE}
|
```{r setup, message = FALSE}
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
library(babynames)
|
library(babynames)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
|
||||||
|
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to jog your memory of which functions are available.
|
||||||
|
|
||||||
|
```{r, echo = FALSE}
|
||||||
|
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
||||||
|
```
|
||||||
|
|
||||||
## Creating a string
|
## Creating a string
|
||||||
|
|
||||||
To begin, let's discuss the mechanics of creating a string.
|
To begin, let's discuss the mechanics of creating a string[^strings-1].
|
||||||
We've created strings in passing earlier in the book, but didn't discuss the details.
|
We've created strings in passing earlier in the book, but didn't discuss the details.
|
||||||
First, there are two basic ways to create a string: using either single quotes (`'`) or double quotes (`"`).
|
First, there are two basic ways to create a string: using either single quotes (`'`) or double quotes (`"`).
|
||||||
Unlike other languages, there is no difference in behaviour.
|
Unlike other languages, there is no difference in behaviour, but the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
|
||||||
I recommend always using `"`, unless you want to create a string that contains multiple `"`.
|
|
||||||
|
[^strings-1]: A string is a length-1 character vector.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
string1 <- "This is a string"
|
string1 <- "This is a string"
|
||||||
|
@@ -50,7 +53,14 @@ If you forget to close a quote, you'll see `+`, the continuation character:
|
||||||
+
|
+
|
||||||
+ HELP I'M STUCK
|
+ HELP I'M STUCK
|
||||||
|
|
||||||
If this happen to you, press Escape and try again.
|
If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, then try again.
|
||||||
|
|
||||||
|
You can combine multiple strings into a character vector by using `c()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- c("first string", "second string", "third string")
|
||||||
|
x
|
||||||
|
```
|
||||||
|
|
||||||
### Escapes
|
### Escapes
|
||||||
|
|
||||||
|
@@ -61,14 +71,16 @@ double_quote <- "\"" # or '"'
|
||||||
single_quote <- '\'' # or "'"
|
single_quote <- '\'' # or "'"
|
||||||
```
|
```
|
||||||
|
|
||||||
Which means if you want to include a literal backslash, you'll need to double it up: `"\\"`:
|
This means if you want to include a literal backslash in your string, you'll need to double it up: `"\\"`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
backslash <- "\\"
|
backslash <- "\\"
|
||||||
```
|
```
|
||||||
|
|
||||||
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes.
|
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes.
|
||||||
To see the raw contents of the string, use `str_view()`:
|
To see the raw contents of the string, use `str_view()` [^strings-2]:
|
||||||
|
|
||||||
|
[^strings-2]: You can also use the base R function `writeLines()`.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- c(single_quote, double_quote, backslash)
|
x <- c(single_quote, double_quote, backslash)
|
||||||
|
@@ -79,7 +91,7 @@ str_view(x)
|
||||||
### Raw strings
|
### Raw strings
|
||||||
|
|
||||||
Creating a string with multiple quotes or backslashes gets confusing quickly.
|
Creating a string with multiple quotes or backslashes gets confusing quickly.
|
||||||
For example, lets create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
|
To illustrate the problem, let's create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
tricky <- "double_quote <- \"\\\"\" # or '\"'
|
tricky <- "double_quote <- \"\\\"\" # or '\"'
|
||||||
|
@@ -87,9 +99,11 @@ single_quote <- '\\'' # or \"'\""
|
||||||
str_view(tricky)
|
str_view(tricky)
|
||||||
```
|
```
|
||||||
|
|
||||||
You can instead use a **raw string**[^strings-1] to reduce the amount of escaping:
|
That's a lot of backslashes!
|
||||||
|
|
||||||
[^strings-1]: Available in R 4.0.0 and above.
|
To eliminate the escaping you can instead use a **raw string**[^strings-3]:
|
||||||
|
|
||||||
|
[^strings-3]: Available in R 4.0.0 and above.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
tricky <- r"(double_quote <- "\"" # or '"'
|
tricky <- r"(double_quote <- "\"" # or '"'
|
||||||
|
@@ -98,13 +112,12 @@ single_quote <- '\'' # or "'"
|
||||||
str_view(tricky)
|
str_view(tricky)
|
||||||
```
|
```
|
||||||
|
|
||||||
A raw string starts with `r"(` and finishes with `)"`.
|
A raw string usually starts with `r"(` and finishes with `)"`.
|
||||||
If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``,etc.
|
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `r"--()--"`, `r"---()---"`, etc. Raw strings are flexible enough to handle any text.
|
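As a small illustration of the alternative delimiters, the sketch below writes the same text three ways. The variable names are made up for this example, and it assumes R 4.0.0 or later.

```{r}
# The text we want is: a )" b  (it contains the default closing pair).
escaped  <- "a )\" b"         # ordinary string with an escaped quote
brackets <- r"[a )" b]"       # raw string using square brackets
dashes   <- r"--(a )" b)--"   # raw string with dashes added to the pair

# All three spellings should describe the same string.
identical(escaped, brackets)
identical(escaped, dashes)
```

Which delimiter you pick is a matter of taste; the only requirement is that the closing pair doesn't appear inside the text.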
||||||
|
|
||||||
### Other special characters
|
### Other special characters
|
||||||
|
|
||||||
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list in `?'"'`.
|
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab, but you can see the complete list in `?'"'`.
|
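To make the difference between the escaped form and the actual characters concrete, here is a minimal sketch using only base R (`writeLines()` and `nchar()`). The variable name is arbitrary.

```{r}
x <- "a\tb\nc"
x              # printing shows the escapes
writeLines(x)  # shows the actual contents: a tab, then a line break
nchar(x)       # each escape sequence counts as a single character
```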
||||||
|
|
||||||
You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`.
|
You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`.
|
||||||
This is a way of writing non-English characters that works on all systems:
|
This is a way of writing non-English characters that works on all systems:
|
||||||
|
|
||||||
|
@@ -114,72 +127,19 @@ x
|
||||||
str_view(x)
|
str_view(x)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Combining strings
|
|
||||||
|
|
||||||
Use `str_c()`[^strings-2] to join together multiple character vectors into a single vector:
|
|
||||||
|
|
||||||
[^strings-2]: `str_c()` is very similar to the base `paste0()`.
|
|
||||||
There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_c("x", "y")
|
|
||||||
str_c("x", "y", "z")
|
|
||||||
```
|
|
||||||
|
|
||||||
`str_c()` obeys the usual recycling rules:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
names <- c("Timothy", "Dewey", "Mable")
|
|
||||||
str_c("Hi ", names, "!")
|
|
||||||
```
|
|
||||||
|
|
||||||
And like most other functions in R, missing values are contagious.
|
|
||||||
You can use `coalesce()` to replace missing values with a value of your choosing:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x <- c("abc", NA)
|
|
||||||
str_c("|-", x, "-|")
|
|
||||||
str_c("|-", coalesce(x, ""), "-|")
|
|
||||||
```
|
|
||||||
|
|
||||||
Since `str_c()` creates a vector, you'll usually use it with a `mutate()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
starwars %>%
|
|
||||||
mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
|
|
||||||
```
|
|
||||||
|
|
||||||
Another powerful way of combining strings is with the glue package.
|
|
||||||
You can either use `glue::glue()` directly or call it via the `str_glue()` wrapper that stringr provides for you.
|
|
||||||
Glue works a little differently to the other methods: you give it a single string then within the string use `{}` to indicate where existing variables should be evaluated:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_glue("|-{x}-|")
|
|
||||||
```
|
|
||||||
|
|
||||||
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
starwars %>%
|
|
||||||
mutate(
|
|
||||||
intro = str_glue("Hi! My is {name} and I'm a {species} from {homeworld}"),
|
|
||||||
.keep = "none"
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.
|
|
||||||
|
|
||||||
## Length and subsetting
|
## Length and subsetting
|
||||||
|
|
||||||
It's natural to think about the letters that make up an individual string.
|
It's natural to think about the letters that make up an individual string.
|
||||||
(But note that the idea of a "letter" isn't a natural fit to every language, we'll come back to that in Section \@ref(other-languages)).
|
(Not every language uses letters, which we'll talk about more in Section \@ref(other-languages)).
|
||||||
For example, `str_length()` tells you the length of a string in characters:
|
For example, `str_length()` tells you the length of a string in characters:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_length(c("a", "R for data science", NA))
|
str_length(c("a", "R for data science", NA))
|
||||||
```
|
```
|
||||||
|
|
||||||
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names:
|
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-4]:
|
||||||
|
|
||||||
|
[^strings-4]: Looking at these entries, I'd say the babynames data removes spaces or hyphens from names and truncates after 15 letters.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
babynames %>%
|
babynames %>%
|
||||||
|
@@ -257,7 +217,7 @@ Later on, we'll come back two related problems: the components have varying leng
|
||||||
|
|
||||||
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
|
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
|
||||||
|
|
||||||
## Long strings
|
### Long strings
|
||||||
|
|
||||||
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label.
|
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label.
|
||||||
stringr provides two useful tools for cases where your string is too long:
|
stringr provides two useful tools for cases where your string is too long:
|
||||||
|
@@ -273,15 +233,87 @@ str_trunc(x, 30)
|
||||||
str_view(str_wrap(x, 30))
|
str_view(str_wrap(x, 30))
|
||||||
```
|
```
|
||||||
|
|
||||||
## String summaries
|
##
|
||||||
|
|
||||||
|
## Combining strings
|
||||||
|
|
||||||
|
There are two ways in which you might want to combine strings.
|
||||||
|
You might have a few character vectors that you want to combine, creating a new vector.
|
||||||
|
Or you might have a single vector that you want to collapse down into a single string.
|
||||||
|
|
||||||
|
### str_c()
|
||||||
|
|
||||||
|
Use `str_c()`[^strings-5] to join together multiple character vectors into a single vector:
|
||||||
|
|
||||||
|
[^strings-5]: `str_c()` is very similar to the base `paste0()`.
|
||||||
|
There are two main reasons I use it here: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
str_c("x", "y")
|
||||||
|
str_c("x", "y", "z")
|
||||||
|
```
|
||||||
|
|
||||||
|
`str_c()` obeys the tidyverse recycling rules so any length-1 vectors (aka strings) will be recycled to the length of the longest vector[^strings-6]:
|
||||||
|
|
||||||
|
[^strings-6]: If the other vectors don't have the same length, `str_c()` will error.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
names <- c("Timothy", "Dewey", "Mable")
|
||||||
|
str_c("Hi ", names, "!")
|
||||||
|
```
|
||||||
|
|
||||||
|
Like most other functions in R, missing values are contagious, so any missing input will cause the output to be missing.
|
||||||
|
If you don't want this behaviour, use `coalesce()` to replace missing values with something else:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- c("abc", NA)
|
||||||
|
str_c("|-", x, "-|")
|
||||||
|
str_c("|-", coalesce(x, ""), "-|")
|
||||||
|
```
|
||||||
|
|
||||||
|
Since `str_c()` creates a vector, you'll usually use it with `mutate()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
starwars %>%
|
||||||
|
mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Glue
|
||||||
|
|
||||||
|
Another powerful way of combining strings is with the glue package.
|
||||||
|
You can either use `glue::glue()` directly or call it via the `str_glue()` wrapper that stringr provides for you.
|
||||||
|
Glue works a little differently to the other methods: you give it a single string then within the string use `{}` to indicate where existing variables should be evaluated:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- c("abc", NA)
|
||||||
|
str_glue("|-{x}-|")
|
||||||
|
```
|
||||||
|
|
||||||
|
Like `str_c()`, `str_glue()` pairs well with `mutate()`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
starwars %>%
|
||||||
|
mutate(
|
||||||
|
intro = str_glue("Hi! My name is {name} and I'm a {species} from {homeworld}"),
|
||||||
|
.keep = "none"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.
|
||||||
|
|
||||||
|
Differences with `NA` handling.
|
||||||
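One way to see the difference in `NA` handling is to run both functions on the same input. This is a sketch that assumes stringr is attached and relies on glue's default behaviour of turning missing values into the text `NA`.

```{r}
library(stringr)

x <- c("abc", NA)
str_c("|-", x, "-|")   # the missing value propagates to the output
str_glue("|-{x}-|")    # the missing value is written out as the text NA
```

If you want glue-style templates but need missing values to stay missing, it's worth checking the result with `is.na()` before relying on it.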
|
|
||||||
|
### `str_flatten()`
|
||||||
|
|
||||||
`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
|
`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
|
||||||
An related function is `str_flatten()`: it takes a character vector and returns a single string:
|
A related function is `str_flatten()`:[^strings-7] it takes a character vector and returns a single string:
|
||||||
|
|
||||||
|
[^strings-7]: The base R equivalent is `paste()` with the `collapse` argument set.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_flatten(c("x", "y", "z"))
|
str_flatten(c("x", "y", "z"))
|
||||||
str_flatten(c("x", "y", "z"), ", ")
|
str_flatten(c("x", "y", "z"), ", ")
|
||||||
str_flatten(c("x", "y", "z"), ", ", ", and ")
|
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
|
||||||
```
|
```
|
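For comparison, the first two calls can be written in base R with `paste()` and its `collapse` argument. This is only a sketch of the equivalence; it doesn't cover the `last` argument.

```{r}
paste(c("x", "y", "z"), collapse = "")
paste(c("x", "y", "z"), collapse = ", ")
```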
||||||
|
|
||||||
Just like `sum()` and `mean()` take a vector of numbers and return a single number, `str_flatten()` takes a character vector and returns a single string.
|
Just like `sum()` and `mean()` take a vector of numbers and return a single number, `str_flatten()` takes a character vector and returns a single string.
|
||||||
|
@@ -302,6 +334,18 @@ df %>%
|
||||||
summarise(fruits = str_flatten(fruit, ", "))
|
summarise(fruits = str_flatten(fruit, ", "))
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Exercises
|
||||||
|
|
||||||
|
1. Compare the results of `paste0()` with `str_c()` for the following inputs:
|
||||||
|
|
||||||
|
```{r, eval = FALSE}
|
||||||
|
str_c("hi ", NA)
|
||||||
|
str_c("hi ", character())
|
||||||
|
str_c(letters[1:2], letters[1:3])
|
||||||
|
```
|
||||||
|
|
||||||
|
## Splitting apart strings
|
||||||
|
|
||||||
## Detect matches
|
## Detect matches
|
||||||
|
|
||||||
To determine if a character vector matches a pattern, use `str_detect()`.
|
To determine if a character vector matches a pattern, use `str_detect()`.
|
||||||
|
@@ -474,7 +518,7 @@ tibble(sentence = sentences) %>%
|
||||||
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
|
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
|
||||||
3. Find all contractions. Separate out the pieces before and after the apostrophe.
|
3. Find all contractions. Separate out the pieces before and after the apostrophe.
|
||||||
|
|
||||||
## Strings -\> Columns
|
## Strings -> Columns
|
||||||
|
|
||||||
## Separate
|
## Separate
|
||||||
|
|
||||||
|
@@ -509,7 +553,7 @@ table3 %>%
|
||||||
|
|
||||||
`separate_rows()`
|
`separate_rows()`
|
||||||
|
|
||||||
## Strings -\> Rows
|
## Strings -> Rows
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
starwars %>%
|
starwars %>%
|
||||||
|
@@ -546,11 +590,11 @@ Maybe things you think are true, but aren't list?
|
||||||
You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
|
You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
|
||||||
And typically the problem is that the declared encoding is wrong.
|
And typically the problem is that the declared encoding is wrong.
|
||||||
|
|
||||||
The tidyverse follows best practices[^strings-3] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
|
The tidyverse follows best practices[^strings-8] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
|
||||||
It's still possible to have problems, but they'll typically arise during data import.
|
It's still possible to have problems, but they'll typically arise during data import.
|
||||||
Once you've diagnosed you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
|
Once you've diagnosed you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
|
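As a rough sketch of what a wrongly declared encoding looks like, the base R example below builds a string from Latin-1 bytes and re-encodes it with `iconv()`. The string is made up for illustration; for files, the `encoding` argument to `readr::locale()` plays the same role at import time.

```{r}
x <- "caf\xe9"   # the bytes for "café" in Latin-1, with no encoding declared
fixed <- iconv(x, from = "latin1", to = "UTF-8")
fixed            # should now print with the accented character
Encoding(fixed)  # what R believes the encoding to be after conversion
```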
||||||
|
|
||||||
[^strings-3]: <http://utf8everywhere.org>
|
[^strings-8]: <http://utf8everywhere.org>
|
||||||
|
|
||||||
### Length and subsetting
|
### Length and subsetting
|
||||||
|
|
||||||
|
|