Reorganising bigger structure of strings
This commit is contained in: parent 26ab1cc1eb, commit 915ebf4463
@ -10,6 +10,66 @@ library(tidyr)
library(tibble)
```

### Encoding

You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks the string has, not what it really is.
Typically the problem is that the declared encoding is wrong.

The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it in data import (e.g. by using the `encoding` argument to `readr::locale()`).

[^prog-strings-1]: <http://utf8everywhere.org>
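
For example, a minimal sketch (assuming a hypothetical `data.csv` that was saved in Latin-1 rather than UTF-8):

```{r, eval = FALSE}
# Hypothetical file: declare the true encoding at import time so the strings
# arrive in R as UTF-8
df <- readr::read_csv(
  "data.csv",
  locale = readr::locale(encoding = "Latin1")
)
```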

### Length and subsetting

This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.

The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different types of writing system:

- Latin uses an alphabet, where each consonant and vowel gets its own letter.

- Chinese uses logograms.
    Note the contrast between half-width and full-width characters: English letters are roughly twice as high as they are wide, while Chinese characters are roughly square.

- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
    Additionally, it's written from right-to-left, so the first letter is the letter on the far right.

- Devanagari is an abugida, where each symbol represents a consonant-vowel pair, with the vowel notation secondary.

> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>

```{r}
# But note how this splits into characters under a Czech locale:
str_split("check", boundary("character", locale = "cs_CZ"))
```

Things are tricky even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.

```{r}
# One accented character stored as a single codepoint vs. one built from a
# letter plus a combining accent
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)

# stri_width(c("全形", "ab"))
# 0, 1, or 2
# but this assumes no font substitution
```

```{r}
# These two capital As look identical but are different codepoints
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```

### str_c

`NULL`s are silently dropped.

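For example, a quick sketch of that behaviour:

```{r}
# A NULL input simply disappears from the output (compare with NA, which propagates)
str_c("a", NULL, "b")
str_c("a", NA, "b")
```
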
@ -51,8 +111,6 @@ str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```

###

### Extract

```{r}

@ -264,7 +264,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
    b. Have three or more vowels in a row.
    c. Have two or more vowel-consonant pairs in a row.

4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner\>](https://regexcrossword.com/challenges/beginner){.uri}.
4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.

## Grouping and backreferences

@ -475,3 +475,9 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.

### Exercises

1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.

strings.Rmd (410 changed lines)
@ -6,12 +6,14 @@ status("restructuring")

## Introduction

This chapter introduces you to strings.
You'll learn the basics of how strings work in R and how to create them "by hand".
You'll also learn the basics of regular expressions, a powerful, but sometimes cryptic, language for describing string patterns.
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions) to discuss more of the details.
We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
So far, we've used a bunch of strings without really talking about how they work or the powerful tools you have to work with them.
This chapter begins by diving into the details of creating strings, and from strings, character vectors.
You'll then learn a grab bag of handy string functions before we dive into creating strings from data, then extracting data from strings.
We'll then cover the basics of regular expressions, a powerful, but very concise and sometimes cryptic, language for describing patterns in strings.
The chapter concludes with a brief discussion of where your expectations of English might steer you wrong when working with text from other languages.

This chapter is paired with two other chapters.
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions).
We'll come back to strings again in Chapter \@ref(programming-with-strings), where we'll think about them more from a programming perspective than a data analysis perspective.

### Prerequisites

@ -55,13 +57,6 @@ If you forget to close a quote, you'll see `+`, the continuation character:

If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, then try again.

You can combine multiple strings into a character vector by using `c()`:

```{r}
x <- c("first string", "second string", "third string")
x
```

### Escapes

To include a literal single or double quote in a string you can use `\` to "escape" it:

@ -127,7 +122,25 @@ x
str_view(x)
```

## Length and subsetting

Now that you've learned the basics of creating strings by "hand", we'll go into the details of creating strings from other strings, starting with combining strings.

### Vectors

You can combine multiple strings into a character vector by using `c()`:

```{r}
x <- c("first string", "second string", "third string")
x
```

You can create a length zero character vector with `character()`.
This is not usually very useful, but it can help you understand the general principle of functions by giving them an unusual input.

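For example, a small sketch:

```{r}
character()
str_length(character())
# Under tidyverse recycling rules a length-0 input gives a length-0 output
str_c("prefix-", character())
```
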
### Exercises

## Handy functions

### Length

It's natural to think about the letters that make up an individual string.
(Not every language uses letters, which we'll talk about more in Section \@ref(other-languages)).

@ -150,6 +163,8 @@ babynames %>%
  count(name, wt = n, sort = TRUE)
```

### Subsetting

You can extract parts of a string using `str_sub(string, start, end)`.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:

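For instance, a quick sketch of the inclusive endpoints:

```{r}
str_sub("stringr", 1, 4)   # characters 1 to 4, i.e. 4 - 1 + 1 = 4 letters
str_sub("stringr", -3, -1) # negative positions count back from the end
```
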
@ -180,42 +195,7 @@ babynames %>%
  )
```

Sometimes you'll get a column that's made up of individual fixed-length strings that have been joined together:

```{r}
df <- tribble(
  ~ sex_year_age,
  "M200115",
  "F201503",
)
```

You can extract the columns using `str_sub()`:

```{r}
df %>% mutate(
  sex = str_sub(sex_year_age, 1, 1),
  year = str_sub(sex_year_age, 2, 5),
  age = str_sub(sex_year_age, 6, 7),
)
```

Or use the `separate()` helper function:

```{r}
df %>%
  separate(sex_year_age, c("sex", "year", "age"), c(1, 5))
```

Note that you give `separate()` three columns but only two positions --- that's because you're telling `separate()` where to break up the string.

TODO: draw diagram to emphasise that it's the space between the characters.

Later on, we'll come back to two related problems: the components have varying lengths and are separated by a character, or there's a varying number of components and you want to split them up into rows, rather than columns.

### Exercises

1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

Later, we'll come back to the problem of extracting data from strings.

### Long strings

@ -233,7 +213,9 @@ str_trunc(x, 30)
str_view(str_wrap(x, 30))
```

##

### Exercises

1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

## Combining strings

@ -278,6 +260,16 @@ starwars %>%
  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
```

### `str_dup()`

`str_c(a, a, a)` is like `a + a + a`; what's the equivalent of `3 * a`?
That's `str_dup()`:

```{r}
str_dup(letters[1:3], 3)
str_dup("a", 1:3)
```

### Glue

Another powerful way of combining strings is with the glue package.

@ -301,12 +293,13 @@ starwars %>%

You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.

Differences with `NA` handling.
Differences with `NA` handling?

### `str_flatten()`

`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
A related function is `str_flatten()`:[^strings-7] it takes a character vector and returns a single string:
So far I've shown you vectorised functions that work well with `mutate()`: the output of these functions is the same length as the input.
There's one last important function that's a summary function: the output is always length 1, regardless of the length of the input.
That's `str_flatten()`:[^strings-7] it takes a character vector and always returns a single string:

[^strings-7]: The base R equivalent is `paste()` with the `collapse` argument set.

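For example:

```{r}
# Length-3 input, length-1 output
str_flatten(c("x", "y", "z"), collapse = ", ")
```
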
@ -336,7 +329,7 @@ df %>%

### Exercises

1. Compare the results of `paste0()` with `str_c()` for the following inputs:
1. Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:

```{r, eval = FALSE}
str_c("hi ", NA)
@ -344,9 +337,18 @@ df %>%
str_c(letters[1:2], letters[1:3])
```

2. What does `str_flatten()` return if you give it a length 0 character vector?

## Splitting apart strings

## Detect matches

It's common for multiple variables' worth of data to be stored in a single string.
In this section you'll learn how to use various tidyr functions to extract them.

Waiting on: <https://github.com/tidyverse/tidyups/pull/15>

## Working with patterns

### Detect matches

To determine if a character vector matches a pattern, use `str_detect()`.
It returns a logical vector the same length as the input:

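For instance, a minimal illustration:

```{r}
str_detect(c("apple", "banana", "pear"), "an")
```
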
@ -377,6 +379,8 @@ babynames %>%

(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies given a name containing an x, you'd need to perform a weighted mean.)

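A sketch of what that weighted calculation might look like (weighting each name by its count `n`):

```{r}
babynames %>%
  group_by(year) %>%
  summarise(prop_x = weighted.mean(str_detect(name, "x"), n))
```
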
### Count matches

A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:

```{r}
|
|||
)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
You also wonder if any names include special characters like periods:
|
||||
|
||||
1. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
|
||||
```{r}
|
||||
babynames %>%
|
||||
distinct(name) %>%
|
||||
head() %>%
|
||||
mutate(
|
||||
periods = str_count(name, "."),
|
||||
)
|
||||
```
|
||||
|
||||
## Introduction to regular expressions
That's weird!

Before we can continue on we need to discuss the second argument to `str_detect()` --- the pattern that you want to match.
Above, I used a simple string, but the pattern is actually a much richer tool called a **regular expression**.
### Introduction to regular expressions

To understand what's going on, we need to discuss what the second argument to `str_detect()` really is.
It looks like a simple string, but it's actually a much richer tool called a **regular expression**.
A regular expression uses special characters to match string patterns.
For example, `.` will match any character, so `"a."` will match any string that contains an a followed by another character:

@ -426,17 +439,6 @@ There are three useful **quantifiers** that can be applied to other pattern: `?`

- `ab*` matches an "a", followed by any number of bs

You can use `()` to control precedence:

- `(ab)?` optionally matches "ab"

- `(ab)+` matches one or more "ab" repeats

```{r}
str_view(c("aba", "ababab", "abbbbbb"), "ab+")
str_view(c("aba", "ababab", "abbbbbb"), "(ab)+")
```

There are various alternatives to `.` that match a restricted set of characters.
One useful operator is the **character class:** `[abcd]` matches "a", "b", "c", or "d"; `[^abcd]` matches anything **except** "a", "b", "c", or "d".

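For example, a quick sketch:

```{r}
str_view_all("abcd 12345", "[abcd]")
str_view_all("abcd 12345", "[^abcd]")
```
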
@ -457,15 +459,7 @@ str_view_all("x X xy", regex(".Y", ignore_case = TRUE))

We'll come back to case later, because it's not trivial for many languages.

### Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

    a. Find all words that start or end with `x`.
    b. Find all words that start with a vowel and end with a consonant.
    c. Are there any words that contain at least one of each different vowel?

## Replacing matches
### Replacing matches

`str_replace_all()` allows you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:

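For example, a minimal sketch:

```{r}
str_replace_all(c("one house", "two cars", "three people"), "[aeiou]", "-")
```
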
@ -490,226 +484,76 @@ Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- you don't need formal tests, but it's useful to build up a set of positive and negative test cases as you go.

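A sketch of that workflow (the data frame `df` and its `price` column are hypothetical):

```{r, eval = FALSE}
# A small cleaning function is easy to exercise on hand-written test cases
clean_price <- function(x) {
  x %>%
    str_replace_all("[$,]", "") %>%
    as.numeric()
}

clean_price(c("$1,234", "$5", "free"))  # the last one should give NA
df %>% mutate(price_clean = clean_price(price))
```
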
#### Exercises
### Exercises

1. Replace all forward slashes in a string with backslashes.
1. What word has the highest number of vowels?
    What word has the highest proportion of vowels?
    (Hint: what is the denominator?)

2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
2. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

3. Switch the first and last letters in `words`.
    a. Find all words that start or end with `x`.
    b. Find all words that start with a vowel and end with a consonant.
    c. Are there any words that contain at least one of each different vowel?

3. Replace all forward slashes in a string with backslashes.

4. Implement a simple version of `str_to_lower()` using `str_replace_all()`.

5. Switch the first and last letters in `words`.
    Which of those strings are still `words`?

## Extract full matches
## Locale-dependent operations {#other-languages}

If your data is in a tibble, it's often easier to use `tidyr::extract()`.
It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but I wanted to give a quick outline of the functions whose behaviour differs based on your **locale**, the set of settings that vary from country to country.

```{r}
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
```

### Exercises

1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.

## Strings -> Columns

## Separate

`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:

```{r}
table3
```

The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"))
```

```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
knitr::include_graphics("images/tidy-17.png")
```

By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
For example, we could rewrite the code above as:

```{r eval = FALSE}
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```

`separate_rows()`

## Strings -> Rows

```{r}
starwars %>%
  select(name, eye_color) %>%
  filter(str_detect(eye_color, ", ")) %>%
  separate_rows(eye_color)
```

### Exercises

1. Split up a string like `"apples, pears, and bananas"` into individual components.

2. Why is it better to split up by `boundary("word")` than `" "`?

3. What does splitting with an empty string (`""`) do?
    Experiment, and then read the documentation.

## Other writing systems {#other-languages}

Unicode is a system for representing the many writing systems used around the world.
The fundamental unit is a **code point**.
This usually represents something like a letter or symbol, but might also be formatting, like a diacritic mark or (e.g.) the skin tone of an emoji.
Character vs grapheme cluster.

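A small sketch of the difference (assuming a reasonably recent ICU, where an emoji plus a skin-tone modifier forms a single grapheme cluster):

```{r}
thumbs_up <- "\U0001F44D\U0001F3FD"  # thumbs-up emoji + skin-tone modifier
str_length(thumbs_up)                # two code points
stringi::stri_count_boundaries(thumbs_up, type = "character")  # one grapheme cluster
```
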
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.

All stringr functions default to the English locale.
This ensures that your code works the same way on every system, avoiding subtle bugs.

Maybe a list of things you think are true, but aren't?

### Encoding

You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks the string has, not what it really is.
Typically the problem is that the declared encoding is wrong.

The tidyverse follows best practices[^strings-8] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it in data import (e.g. by using the `encoding` argument to `readr::locale()`).

[^strings-8]: <http://utf8everywhere.org>

### Length and subsetting

This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.

The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different types of writing system:

- Latin uses an alphabet, where each consonant and vowel gets its own letter.

- Chinese uses logograms.
    Note the contrast between half-width and full-width characters: English letters are roughly twice as high as they are wide, while Chinese characters are roughly square.

- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
    Additionally, it's written from right-to-left, so the first letter is the letter on the far right.

- Devanagari is an abugida, where each symbol represents a consonant-vowel pair, with the vowel notation secondary.

> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>

```{r}
# But note how this splits into characters under a Czech locale:
str_split("check", boundary("character", locale = "cs_CZ"))
```

Things are tricky even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.

```{r}
# One accented character stored as a single codepoint vs. one built from a
# letter plus a combining accent
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)

# stri_width(c("全形", "ab"))
# 0, 1, or 2
# but this assumes no font substitution
```

```{r}
# These two capital As look identical but are different codepoints
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```

### Collation rules

`coll()`: compare strings using standard **coll**ation rules.
This is useful for doing case-insensitive matching.
Note that `coll()` takes a `locale` parameter that controls which rules are used for comparing characters.
Unfortunately, different parts of the world use different rules!
Both `fixed()` and `regex()` have `ignore_case` arguments, but they do not allow you to pick the locale: they always use the default locale.
You can see what that is with the following code; more on stringi later.

```{r}
# The default locale currently in use:
stringi::stri_locale_info()

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```

The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`.

### Upper and lower case

Relatively few writing systems have upper and lower case: Latin, Greek, and Cyrillic, plus a handful of lesser-known languages.

Above I used `str_to_lower()` to change the text to lower case.
You can also use `str_to_upper()` or `str_to_title()`.
However, changing case is more complicated than it might at first appear because different languages have different rules for changing case.
You can pick which set of rules to use by specifying a locale:

```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
- Words are broken up by spaces.
- Words are composed of individual letters.
- All letters in a word are written down.

The locale is specified as an ISO 639 language code, which is a two or three letter abbreviation.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported with `stringi::stri_locale_list()`.
If you leave the locale blank, it will use English.

The locale also affects case-insensitive matching with `coll(ignore_case = TRUE)`, which you can control with the `locale` argument to `coll()`:
Base R string functions automatically use your current locale, but stringr functions all default to the English locale.
This ensures that your code works the same way on every system, avoiding subtle bugs.
To choose a different locale you'll need to specify the `locale` argument; seeing that a function has a `locale` argument tells you that its behaviour will differ from locale to locale.

Here are a few places where the locale matters:

```{r}
i <- c("Iİiı")

str_view_all(i, coll("i", ignore_case = TRUE))
str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

- Upper and lower case: only relatively few languages have upper and lower case (Latin, Greek, and Cyrillic, plus a handful of lesser-known languages). The rules are not the same in every language that uses these alphabets. For example, Turkish has two i's: with and without a dot, and it has a different rule for capitalising them:

You can also do case-insensitive matching with `fixed(ignore_case = TRUE)`, but this uses a simple approximation which will not work in all cases.

```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```

### Sorting
- This also affects case-insensitive matching with `coll(ignore_case = TRUE)`, which you can control with the `locale` argument to `coll()`:

Unicode collation algorithm: <https://unicode.org/reports/tr10/>

```{r}
i <- c("Iİiı")

str_view_all(i, coll("i", ignore_case = TRUE))
str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

Another important operation that's affected by the locale is sorting.
The base R `order()` and `sort()` functions sort strings using the current locale.
If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument.

You can also control the "strength", which determines how accents are sorted.
- Many characters with diacritics can be recorded in multiple ways: these will print identically but won't match with `fixed()`.

```{r}
str_sort(c("a", "ch", "c", "h"))
str_sort(c("a", "ch", "c", "h"), locale = "cs_CZ")
```

```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

str_view(a1, fixed(a2))
str_view(a1, coll(a2))
```

TODO: add connection to `arrange()`

- Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument. Here's an example: in Czech, "ch" is a digraph that appears after `h` in the alphabet.

```{r}
str_sort(c("a", "ch", "c", "h"))
str_sort(c("a", "ch", "c", "h"), locale = "cs")
```

TODO after dplyr 1.1.0: discuss `arrange()`