Delete unused chapters
This commit is contained in:
parent
e5495cd63a
commit
5da1c82fd8
|
@ -1,26 +0,0 @@
|
||||||
# List columns {#sec-list-columns}
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| results: "asis"
|
|
||||||
#| echo: false
|
|
||||||
source("_common.R")
|
|
||||||
status("drafting")
|
|
||||||
```
|
|
||||||
|
|
||||||
## Introduction
|
|
||||||
|
|
||||||
<!--# TO DO: Write introduction. -->
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
|
|
||||||
tidyr is a member of the core tidyverse.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| label: setup
|
|
||||||
#| message: false
|
|
||||||
|
|
||||||
library(tidyverse)
|
|
||||||
```
|
|
||||||
|
|
||||||
<!--# TO DO: Write chapter around unnest, hoist, etc. -->
|
|
301
prog-strings.qmd
301
prog-strings.qmd
|
@ -1,301 +0,0 @@
|
||||||
# Programming with strings {#sec-programming-with-strings}
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| results: "asis"
|
|
||||||
#| echo: false
|
|
||||||
source("_common.R")
|
|
||||||
status("drafting")
|
|
||||||
```
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
library(stringr)
|
|
||||||
library(tidyr)
|
|
||||||
library(tibble)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Encoding
|
|
||||||
|
|
||||||
You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial) and it only tells you the encoding that R thinks it is, not what it really is.
|
|
||||||
And typically the problem is that the declaring encoding is wrong.
|
|
||||||
|
|
||||||
The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
|
|
||||||
It's still possible to have problems, but they'll typically arise during data import.
|
|
||||||
Once you've diagnosed you have an encoding problem, you should fix it in data import (i.e. by using the `encoding` argument to `readr::locale()`).
|
|
||||||
|
|
||||||
[^prog-strings-1]: <http://utf8everywhere.org>
|
|
||||||
|
|
||||||
### Length and subsetting
|
|
||||||
|
|
||||||
This seems like a straightforward computation if you're only familiar with English, but things get complex quick when working with other languages.
|
|
||||||
|
|
||||||
Four most common are Latin, Chinese, Arabic, and Devangari, which represent three different systems of writing systems:
|
|
||||||
|
|
||||||
- Latin uses an alphabet, where each consonant and vowel gets its own letter.
|
|
||||||
|
|
||||||
- Chinese.
|
|
||||||
Logograms.
|
|
||||||
Half width vs. full width.
|
|
||||||
English letters are roughly twice as high as they are wide.
|
|
||||||
Chinese characters are roughly square.
|
|
||||||
|
|
||||||
- Arabic is an abjad, only consonants are written and vowels are optionally as diacritics.
|
|
||||||
Additionally, it's written from right-to-left, so the first letter is the letter on the far right.
|
|
||||||
|
|
||||||
- Devangari is an abugida where each symbol represents a consonant-vowel pair, , vowel notation secondary.
|
|
||||||
|
|
||||||
> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
|
|
||||||
> --- <http://utf8everywhere.org>
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
# But
|
|
||||||
str_split("check", boundary("character", locale = "cs_CZ"))
|
|
||||||
```
|
|
||||||
|
|
||||||
This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
|
|
||||||
This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x <- c("á", "x́")
|
|
||||||
str_length(x)
|
|
||||||
# str_width(x)
|
|
||||||
str_sub(x, 1, 1)
|
|
||||||
|
|
||||||
# stri_width(c("全形", "ab"))
|
|
||||||
# 0, 1, or 2
|
|
||||||
# but this assumes no font substitution
|
|
||||||
```
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
cyrillic_a <- "А"
|
|
||||||
latin_a <- "A"
|
|
||||||
cyrillic_a == latin_a
|
|
||||||
stringi::stri_escape_unicode(cyrillic_a)
|
|
||||||
stringi::stri_escape_unicode(latin_a)
|
|
||||||
```
|
|
||||||
|
|
||||||
### str_c
|
|
||||||
|
|
||||||
`NULL`s are silently dropped.
|
|
||||||
This is particularly useful in conjunction with `if`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
name <- "Hadley"
|
|
||||||
time_of_day <- "morning"
|
|
||||||
birthday <- FALSE
|
|
||||||
|
|
||||||
str_c(
|
|
||||||
"Good ", time_of_day, " ", name,
|
|
||||||
if (birthday) " and HAPPY BIRTHDAY",
|
|
||||||
"."
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### `str_dup()`
|
|
||||||
|
|
||||||
Closely related to `str_c()` is `str_dup()`.
|
|
||||||
`str_c(a, a, a)` is like `a + a + a`, what's the equivalent of `3 * a`?
|
|
||||||
That's `str_dup()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_dup(letters[1:3], 3)
|
|
||||||
str_dup("a", 1:3)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Performance
|
|
||||||
|
|
||||||
`fixed()`: matches exactly the specified sequence of bytes.
|
|
||||||
It ignores all special regular expressions and operates at a very low level.
|
|
||||||
This allows you to avoid complex escaping and can be much faster than regular expressions.
|
|
||||||
The following microbenchmark shows that it's about 3x faster for a simple example.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
microbenchmark::microbenchmark(
|
|
||||||
fixed = str_detect(sentences, fixed("the")),
|
|
||||||
regex = str_detect(sentences, "the"),
|
|
||||||
times = 20
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
As you saw with `str_split()` you can use `boundary()` to match boundaries.
|
|
||||||
You can also use it with the other functions:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x <- "This is a sentence."
|
|
||||||
str_view_all(x, boundary("word"))
|
|
||||||
str_extract_all(x, boundary("word"))
|
|
||||||
```
|
|
||||||
|
|
||||||
### Extract
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
colors <- c("red", "orange", "yellow", "green", "blue", "purple")
|
|
||||||
color_match <- str_c(colors, collapse = "|")
|
|
||||||
color_match
|
|
||||||
|
|
||||||
more <- sentences[str_count(sentences, color_match) > 1]
|
|
||||||
str_extract_all(more, color_match)
|
|
||||||
```
|
|
||||||
|
|
||||||
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
|
|
||||||
str_extract_all(more, color_match, simplify = TRUE)
|
|
||||||
|
|
||||||
x <- c("a", "a b", "a b c")
|
|
||||||
str_extract_all(x, "[a-z]", simplify = TRUE)
|
|
||||||
```
|
|
||||||
|
|
||||||
We don't talk about matrices here, but they are useful elsewhere.
|
|
||||||
|
|
||||||
### Exercises
|
|
||||||
|
|
||||||
1. From the Harvard sentences data, extract:
|
|
||||||
|
|
||||||
1. The first word from each sentence.
|
|
||||||
2. All words ending in `ing`.
|
|
||||||
3. All plurals.
|
|
||||||
|
|
||||||
## Grouped matches
|
|
||||||
|
|
||||||
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
|
|
||||||
You can also use parentheses to extract parts of a complex match.
|
|
||||||
For example, imagine we want to extract nouns from the sentences.
|
|
||||||
As a heuristic, we'll look for any word that comes after "a" or "the".
|
|
||||||
Defining a "word" in a regular expression is a little tricky, so here we use a simple approximation: a sequence of at least one character that isn't a space.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
noun <- "(a|the) ([^ ]+)"
|
|
||||||
|
|
||||||
has_noun <- sentences |>
|
|
||||||
str_subset(noun) |>
|
|
||||||
head(10)
|
|
||||||
has_noun |>
|
|
||||||
str_extract(noun)
|
|
||||||
```
|
|
||||||
|
|
||||||
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
|
|
||||||
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
has_noun |>
|
|
||||||
str_match(noun)
|
|
||||||
```
|
|
||||||
|
|
||||||
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
|
|
||||||
|
|
||||||
## Splitting
|
|
||||||
|
|
||||||
Use `str_split()` to split a string up into pieces.
|
|
||||||
For example, we could split sentences into words:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
sentences |>
|
|
||||||
head(5) |>
|
|
||||||
str_split(" ")
|
|
||||||
```
|
|
||||||
|
|
||||||
Because each component might contain a different number of pieces, this returns a list.
|
|
||||||
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str_split("a|b|c|d", "\\|")[[1]]
|
|
||||||
```
|
|
||||||
|
|
||||||
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
sentences |>
|
|
||||||
head(5) |>
|
|
||||||
str_split(" ", simplify = TRUE)
|
|
||||||
```
|
|
||||||
|
|
||||||
You can also request a maximum number of pieces:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
|
|
||||||
fields |> str_split(": ", n = 2, simplify = TRUE)
|
|
||||||
```
|
|
||||||
|
|
||||||
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x <- "This is a sentence. This is another sentence."
|
|
||||||
str_view_all(x, boundary("word"))
|
|
||||||
|
|
||||||
str_split(x, " ")[[1]]
|
|
||||||
str_split(x, boundary("word"))[[1]]
|
|
||||||
```
|
|
||||||
|
|
||||||
Show how `separate_rows()` is a special case of `str_split()` + `summarize()`.
|
|
||||||
|
|
||||||
## Replace with function
|
|
||||||
|
|
||||||
## Locations
|
|
||||||
|
|
||||||
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
|
|
||||||
These are particularly useful when none of the other functions does exactly what you want.
|
|
||||||
You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them.
|
|
||||||
|
|
||||||
## stringi
|
|
||||||
|
|
||||||
stringr is built on top of the **stringi** package.
|
|
||||||
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions.
|
|
||||||
stringi, on the other hand, is designed to be comprehensive.
|
|
||||||
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
|
|
||||||
|
|
||||||
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
|
|
||||||
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
|
|
||||||
The main difference is the prefix: `str_` vs. `stri_`.
|
|
||||||
|
|
||||||
### Exercises
|
|
||||||
|
|
||||||
1. Find the stringi functions that:
|
|
||||||
|
|
||||||
a. Count the number of words.
|
|
||||||
b. Find duplicated strings.
|
|
||||||
c. Generate random text.
|
|
||||||
|
|
||||||
2. How do you control the language that `stri_sort()` uses for sorting?
|
|
||||||
|
|
||||||
### Exercises
|
|
||||||
|
|
||||||
1. What do the `extra` and `fill` arguments do in `separate()`?
|
|
||||||
Experiment with the various options for the following two toy datasets.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| eval: false
|
|
||||||
|
|
||||||
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) |>
|
|
||||||
separate(x, c("one", "two", "three"))
|
|
||||||
|
|
||||||
tibble(x = c("a,b,c", "d,e", "f,g,i")) |>
|
|
||||||
separate(x, c("one", "two", "three"))
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Both `unite()` and `separate()` have a `remove` argument.
|
|
||||||
What does it do?
|
|
||||||
Why would you set it to `FALSE`?
|
|
||||||
|
|
||||||
3. Compare and contrast `separate()` and `extract()`.
|
|
||||||
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
|
|
||||||
|
|
||||||
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
|
|
||||||
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| eval: false
|
|
||||||
|
|
||||||
events <- tribble(
|
|
||||||
~month, ~day,
|
|
||||||
1 , 20,
|
|
||||||
1 , 21,
|
|
||||||
1 , 22
|
|
||||||
)
|
|
||||||
|
|
||||||
events |>
|
|
||||||
unite("date", month:day, sep = "-", remove = FALSE)
|
|
||||||
```
|
|
||||||
|
|
||||||
5. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
|
|
||||||
Think carefully about what it should do if given a vector of length 0, 1, or 2.
|
|
Loading…
Reference in New Issue