So far, you've used a bunch of strings without learning much about the details.
Now it's time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tool you have at your disposal.
We'll begin with the details of creating strings and character vectors.
You'll then dive into creating strings from data, then the opposite; extracting strings from data.
The chapter finishes up with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you jog your memory of which functions are available.
Firstly, you can create a string using either single quotes (`'`) or double quotes (`"`).
There's no difference in behavior between the two so in the interests of consistency the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
To see the raw contents of the string, use `str_view()`[^strings-1]:
To illustrate the problem, lets create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping you can instead use a **raw string**[^strings-2]:
A raw string usually starts with `r"(` and finishes with `)"`.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in `?'"'`.
Note that `str_view()` uses a blue background for tabs to make them easier to spot.
One of the challenges of working with text is that there's a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can you use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarise()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
If you are mixing many fixed and variable strings with `str_c()`, you'll notice that you type a lot of `"`s, making it hard to see the overall goal of the code. An alternative approach is provided by the [glue package](https://glue.tidyverse.org) via `str_glue()`[^strings-4]. You give it a single string that has a special feature: anything inside `{}` will be evaluated like it's outside of the quotes:
If you guess that you'll need to somehow escape it, you're on the right track.
The trick is that glue uses a slightly different escaping technique; instead of prefixing with special character like `\`, you double up the special characters:
The next two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating in to columns.
We'll finish off my discussing the tools that the `wider` functions give you to diagnose problems.
It's rarer to see `separate_longer_position()` in the wild, but some older datasets do use very compact format where each character is used to record a value:
Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are slightly more complicated than their `longer` equivalents because you need to name the columns.
For example, in this following dataset `x` is made up of a code, an edition number, and a year, separated by `"."`.
To use `separate_wider_delim()` we supply the delimiter and the names in two arguments:
`separate_wider_delim()`[^strings-6] requires a fixed and known set of columns.
What happens if some of the rows don't have the expected number of pieces?
There are two possible problems, too few or too many pieces, so `separate_wider_delim()` provides two arguments to help: `too_few` and `too_many`. Let's first look at the `too_few` case with the following sample dataset:
[^strings-6]: The same principles apply to `separate_wider_position()` and `separate_wider_regex()`.
You'll notice that we get an error, but the error gives us some suggestions as to how you might proceed.
Let's start by debugging the problem:
```{r}
debug <- df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "debug"
)
debug
```
When you use the debug mode you get three extra columns add to the output: `x_ok`, `x_pieces`, and `x_remainder` (if you separate variable with a different name, you'll get a different prefix).
Here, `x_ok` lets you quickly find the inputs that failed:
```{r}
debug |> filter(!x_ok)
```
`x_pieces` tells us how many pieces were found, compared to the expected 3 (the length of `names`).
`x_remainder` isn't useful when there are too few pieces, but we'll see it again shortly.
Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating.
In that case, fix the problem upstream and make sure to remove `too_few = "debug"` to ensure that new problem become errors.
In other cases you may just want to fill in the missing pieces with `NA`s and move on.
That's the job of `too_few = "align_start"` and `too_few = "align_end"` which allow you to control where the `NA`s should go:
But now when we debug the result, you can see the purpose of `x_remainder`:
```{r}
debug <- df |>
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "debug"
)
debug |> filter(!x_ok)
```
You have a slightly different set of options for handling too many pieces: you can either silently "drop" any additional pieces or "merge" them all into the final column:
This section discusses stringr functions that work with individual letters.
This is straightforward for English because it uses an alphabet with 26 letters, but things rapidly get complicated when you move beyond English.
Even languages that use the same alphabet but add additional accents (e.g. å, é, ï, ô, ū) are non-trivial because those letters might be represented as an individual character or by combing an unaccented letter (e.g. e) with a diacritic mark (e.g. ´).
And other languages "letters" look quite different: in Japanese each "letter" is a syllable, in Chinese each "letter" is a complex logogram, and in Arabic letters look radically different depending on their location in the word.
In this section, we'll assume that you're working with English text as we introduce to functions for finding the length of a string, extracting substrings, and handling long strings in plots and tables.
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-7]:
You can extract parts of a string using `str_sub(string, start, end)`, where `start` and `end` are the letters where the substring should start and end.
- `str_wrap(x, 30)` wraps a string introducing new lines so that each line is at most 30 characters (it doesn't hyphenate, however, so any word longer than 30 characters will make a longer line)
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
There are a handful of stringr functions whose behavior depends on your **locale**.
Locale is similar to language, but includes an optional region specifier to handle the fact that (e.g.) many countries speak Spanish, but with regional variations.
A locale is specified by lower-case language abbreviation, optionally followed by a `_` and a upper-case region identifier.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at `stringi::stri_locale_list()`.
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions usually use the rules associated with your native language, but such might work differently when you share it with someone who lives in different country.
In this chapter you've learned a wide of tools for working with strings, but you haven't learned one of the most important and powerful tools: regular expressions.
Regular expressions are very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.