Noodling on strings

2021-04-23 08:07:16 -05:00
parent 2505136477
commit 9091a1484d
3 changed files with 99 additions and 41 deletions
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -2,12 +2,17 @@

 ## Introduction

-This chapter introduces you to string manipulation in R.
+This chapter introduces you to strings in R.
 You'll learn the basics of how strings work and how to create them by hand.
 Big topic so spread over three chapters.

 Base R contains many functions to work with strings but we'll generally avoid them here because they can be inconsistent, which makes them hard to remember.
 Instead, we'll use stringr which is designed to be as consistent as possible, and all of its functions start with `str_`.
+The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
+
+```{r, echo = FALSE}
+knitr::include_graphics("screenshots/stringr-autocomplete.png")
+```

 ### Prerequisites

@@ -15,6 +20,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic

 ```{r setup, message = FALSE}
 library(tidyverse)
+library(babynames)
 ```

 ## Creating a string
@@ -86,7 +92,7 @@ If your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that

 ### Other special characters

-As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
+As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"` with `?'"'` or `?"'"`.

 You'll also sometimes see strings containing Unicode escapes like `"\u00b5"`.
 This is a way of writing non-English characters that works on all platforms:
@@ -105,12 +111,6 @@ str_c("x", "y")
 str_c("x", "y", "z")
 ```

-The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
-
-```{r, echo = FALSE}
-knitr::include_graphics("screenshots/stringr-autocomplete.png")
-```
-
 Use the `sep` argument to control how they're separated:

 ```{r}
@@ -126,24 +126,24 @@ str_c("|-", x, "-|")
 str_c("|-", coalesce(x, ""), "-|")
 ```

-`str_c()` is vectorised which means that it automatically recycles individual strings to the same length as the longest vector input:
-
-```{r}
-str_c("prefix-", c("a", "b", "c"), "-suffix")
-```
-
 `mutate()`

-## Flattening strings
-
-To collapse a vector of strings into a single string, use `collapse`:
+Another powerful way of combining strings is with the glue package.
+You can either use `glue::glue()` or call it via the `str_glue()` wrapper that stringr provides for you.
+Glue works a little differently to the other methods: you give it a single string that uses `{}` to mark where existing variables should be interpolated:

 ```{r}
-str_flatten(c("x", "y", "z"), ", ")
+str_glue("|-{x}-|")
 ```

-This is a great tool for `summarise()`ing character data.
-Later we'll come back to the inverse of this, `separate_rows()`.
+Like `str_c()`, `str_glue()` pairs well with `mutate()`:
+
+```{r}
+starwars %>% mutate(
+  intro = str_glue("Hi my name is {name} and I'm a {species} from {homeworld}"),
+  .keep = "none"
+)
+```

 ## Length and subsetting

@@ -153,6 +153,13 @@ For example, `str_length()` tells you the length of a string:
 str_length(c("a", "R for data science", NA))
 ```

+You could use this with `count()` to find the distribution of lengths of US baby names:
+
+```{r}
+babynames %>%
+  count(length = str_length(name))
+```
+
 You can extract parts of a string using `str_sub()`.
 As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:

@@ -163,6 +170,16 @@ str_sub(x, 1, 3)
 str_sub(x, -3, -1)
 ```

+We could use this with `mutate()` to find the first and last letter of each name:
+
+```{r}
+babynames %>% 
+  mutate(
+    first = str_sub(name, 1, 1),
+    last = str_sub(name, -1, -1)
+  )
+```
+
 Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:

 ```{r}
@@ -189,6 +206,19 @@ TODO: `separate()`
 4.  Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
    Think carefully about what it should do if given a vector of length 0, 1, or 2.

+## String summaries
+
+You can perform the opposite operation with `summarise()` and `str_flatten()`.
+
+To collapse a vector of strings into a single string, use `str_flatten()` and its `collapse` argument:
+
+```{r}
+str_flatten(c("x", "y", "z"), ", ")
+```
+
+This is a great tool for `summarise()`ing character data.
+Later we'll come back to the inverse of this, `separate_rows()`.
+
 ## Long strings

 `str_wrap()`
@@ -234,15 +264,14 @@ The results are identical, but I think the first approach is significantly easie
 If your regular expression gets overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining the pieces with logical operations.

 A common use of `str_detect()` is to select the elements that match a pattern.
-This makes it a natural pairing with `filter()`:
+This makes it a natural pairing with `filter()`.
+The following regexp finds all names with repeated pairs of letters (you'll learn how that regexp works in the next chapter):

 ```{r}
-df <- tibble(
-  word = words, 
-  i = seq_along(word)
-)
-df %>% 
-  filter(str_detect(word, "x$"))
+babynames %>% 
+  filter(n > 100) %>% 
+  count(name, wt = n) %>% 
+  filter(str_detect(name, "(..).*\\1"))
 ```

 A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
@@ -258,22 +287,13 @@ mean(str_count(words, "[aeiou]"))
 It's natural to use `str_count()` with `mutate()`:

 ```{r}
-df %>% 
+babynames %>% 
  mutate(
-    vowels = str_count(word, "[aeiou]"),
-    consonants = str_count(word, "[^aeiou]")
+    vowels = str_count(name, "[aeiou]"),
+    consonants = str_count(name, "[^aeiou]")
  )
 ```

-Note that matches never overlap.
-For example, in `"abababa"`, how many times will the pattern `"aba"` match?
-Regular expressions say two, not three:
-
-```{r}
-str_count("abababa", "aba")
-str_view_all("abababa", "aba")
-```
-
 ### Exercises

 1.  For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
@@ -383,6 +403,8 @@ tibble(sentence = sentences) %>%
 2.  Find all contractions.
    Separate out the pieces before and after the apostrophe.

+## Strings -\> Columns
+
 ## Separate

 `separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
@@ -416,6 +438,15 @@ table3 %>%

 `separate_rows()`

+## Strings -\> Rows
+
+```{r}
+starwars %>% 
+  select(name, eye_color) %>% 
+  filter(str_detect(eye_color, ", ")) %>% 
+  separate_rows(eye_color)
+```
+
 ### Exercises

 1.  Split up a string like `"apples, pears, and bananas"` into individual components.
@@ -427,11 +458,22 @@ table3 %>%

 ## Other languages {#other-languages}

-### Length
+Encoding, and why not to trust `Encoding`.
+As a general rule, we recommend using UTF-8 everywhere, converting as early as possible (i.e. by using the `encoding` argument to `readr::locale()`).
+
+### Length and subsetting

 This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages.
 Include some examples from <https://gankra.github.io/blah/text-hates-you/>.
-(Maybe better to include a non-English text section later?)
+
+This is a problem even with European languages because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
+
+```{r}
+x <- c("\u00e1", "a\u0301")
+x
+str_length(x)
+str_sub(x, 1, 1)
+```

 ### Locales