Reorganising bigger structure of strings
This commit is contained in: parent 26ab1cc1eb, commit 915ebf4463
@ -10,6 +10,66 @@ library(tidyr)
library(tibble)
```

### Encoding

You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks the string has, not what it really is.
Typically the problem is that the declared encoding is wrong.

The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it in data import (e.g. by using the `encoding` argument to `readr::locale()`).

[^prog-strings-1]: <http://utf8everywhere.org>
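
For example, a minimal sketch (assuming a hypothetical `data.csv` that was saved in Latin-1 rather than UTF-8):

```{r, eval = FALSE}
# Hypothetical file: declare the true encoding at import time so the strings
# arrive in R as UTF-8
df <- readr::read_csv(
  "data.csv",
  locale = readr::locale(encoding = "Latin1")
)
```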

### Length and subsetting

This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.

The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different types of writing system:

- Latin uses an alphabet, where each consonant and vowel gets its own letter.

- Chinese uses logograms.
    Note the contrast between half-width and full-width characters: English letters are roughly twice as high as they are wide, while Chinese characters are roughly square.

- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
    Additionally, it's written from right-to-left, so the first letter is the letter on the far right.

- Devanagari is an abugida, where each symbol represents a consonant-vowel pair, with the vowel notation secondary.

> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>

```{r}
# But note how this splits into characters under a Czech locale:
str_split("check", boundary("character", locale = "cs_CZ"))
```

Things are tricky even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.

```{r}
# One accented character stored as a single codepoint vs. one built from a
# letter plus a combining accent
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)

# stri_width(c("全形", "ab"))
# 0, 1, or 2
# but this assumes no font substitution
```

```{r}
# These two capital As look identical but are different codepoints
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```

### str_c

`NULL`s are silently dropped.

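For example, a quick sketch of that behaviour:

```{r}
# A NULL input simply disappears from the output (compare with NA, which propagates)
str_c("a", NULL, "b")
str_c("a", NA, "b")
```
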
@ -51,8 +111,6 @@ str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```

###

### Extract

```{r}

@ -264,7 +264,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
    b. Have three or more vowels in a row.
    c. Have two or more vowel-consonant pairs in a row.

4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner\>](https://regexcrossword.com/challenges/beginner){.uri}.
4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.

## Grouping and backreferences

@ -475,3 +475,9 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.

### Exercises

1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.

strings.Rmd (410 changed lines)
@ -6,12 +6,14 @@ status("restructuring")

## Introduction

This chapter introduces you to strings.
You'll learn the basics of how strings work in R and how to create them "by hand".
You'll also learn the basics of regular expressions, a powerful, but sometimes cryptic, language for describing string patterns.
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions) to discuss more of the details.
We'll finish up with a discussion of some of the new challenges that arise when working with non-English strings.
So far, we've used a bunch of strings without really talking about how they work or the powerful tools you have to work with them.
This chapter begins by diving into the details of creating strings, and from strings, character vectors.
You'll then learn a grab bag of handy string functions before we dive into creating strings from data, then extracting data from strings.
We'll then cover the basics of regular expressions, a powerful, but very concise and sometimes cryptic, language for describing patterns in strings.
The chapter concludes with a brief discussion of where your expectations of English might steer you wrong when working with text from other languages.

This chapter is paired with two other chapters.
Regular expressions are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions).
We'll come back to strings again in Chapter \@ref(programming-with-strings), where we'll think about them more from a programming perspective than a data analysis perspective.

### Prerequisites

@ -55,13 +57,6 @@ If you forget to close a quote, you'll see `+`, the continuation character:

If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, then try again.

You can combine multiple strings into a character vector by using `c()`:

```{r}
x <- c("first string", "second string", "third string")
x
```

### Escapes

To include a literal single or double quote in a string you can use `\` to "escape" it:

@ -127,7 +122,25 @@ x
str_view(x)
```

## Length and subsetting

Now that you've learned the basics of creating strings by "hand", we'll go into the details of creating strings from other strings, starting with combining strings.

### Vectors

You can combine multiple strings into a character vector by using `c()`:

```{r}
x <- c("first string", "second string", "third string")
x
```

You can create a length zero character vector with `character()`.
This is not usually very useful, but it can help you understand the general principle of functions by giving them an unusual input.

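For example, a small sketch:

```{r}
character()
str_length(character())
# Under tidyverse recycling rules a length-0 input gives a length-0 output
str_c("prefix-", character())
```
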
### Exercises

## Handy functions

### Length

It's natural to think about the letters that make up an individual string.
(Not every language uses letters, which we'll talk about more in Section \@ref(other-languages)).

@ -150,6 +163,8 @@ babynames %>%
  count(name, wt = n, sort = TRUE)
```

### Subsetting

You can extract parts of a string using `str_sub(string, start, end)`.
The `start` and `end` arguments are inclusive, so the length of the returned string will be `end - start + 1`:

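For instance, a quick sketch of the inclusive endpoints:

```{r}
str_sub("stringr", 1, 4)   # characters 1 to 4, i.e. 4 - 1 + 1 = 4 letters
str_sub("stringr", -3, -1) # negative positions count back from the end
```
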
@ -180,42 +195,7 @@ babynames %>%
  )
```

Sometimes you'll get a column that's made up of individual fixed-length strings that have been joined together:

```{r}
df <- tribble(
  ~ sex_year_age,
  "M200115",
  "F201503",
)
```

You can extract the columns using `str_sub()`:

```{r}
df %>% mutate(
  sex = str_sub(sex_year_age, 1, 1),
  year = str_sub(sex_year_age, 2, 5),
  age = str_sub(sex_year_age, 6, 7),
)
```

Or use the `separate()` helper function:

```{r}
df %>%
  separate(sex_year_age, c("sex", "year", "age"), c(1, 5))
```

Note that you give `separate()` three columns but only two positions --- that's because you're telling `separate()` where to break up the string.

TODO: draw diagram to emphasise that it's the space between the characters.

Later on, we'll come back to two related problems: the components have varying lengths and are separated by a character, or there's a varying number of components and you want to split them up into rows, rather than columns.

### Exercises

1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

Later, we'll come back to the problem of extracting data from strings.

### Long strings

@ -233,7 +213,9 @@ str_trunc(x, 30)
str_view(str_wrap(x, 30))
```

##

### Exercises

1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

## Combining strings

@ -278,6 +260,16 @@ starwars %>%
  mutate(greeting = str_c("Hi! I'm ", name, "."), .after = name)
```

### `str_dup()`

`str_c(a, a, a)` is like `a + a + a`; what's the equivalent of `3 * a`?
That's `str_dup()`:

```{r}
str_dup(letters[1:3], 3)
str_dup("a", 1:3)
```

### Glue

Another powerful way of combining strings is with the glue package.

@ -301,12 +293,13 @@ starwars %>%

You can use any valid R code inside of `{}`, but it's a good idea to pull complex calculations out into their own variables so you can more easily check your work.

Differences with `NA` handling.
Differences with `NA` handling?

### `str_flatten()`

`str_c()` combines multiple character vectors into a single character vector; the output is the same length as the input.
A related function is `str_flatten()`:[^strings-7] it takes a character vector and returns a single string:
So far I've shown you vectorised functions that work well with `mutate()`: the output of these functions is the same length as the input.
There's one last important function that's a summary function: the output is always length 1, regardless of the length of the input.
That's `str_flatten()`:[^strings-7] it takes a character vector and always returns a single string:

[^strings-7]: The base R equivalent is `paste()` with the `collapse` argument set.

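For example:

```{r}
# Length-3 input, length-1 output
str_flatten(c("x", "y", "z"), collapse = ", ")
```
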
@ -336,7 +329,7 @@ df %>%

### Exercises

1. Compare the results of `paste0()` with `str_c()` for the following inputs:
1. Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:

```{r, eval = FALSE}
str_c("hi ", NA)
@ -344,9 +337,18 @@ df %>%
str_c(letters[1:2], letters[1:3])
```

2. What does `str_flatten()` return if you give it a length 0 character vector?

## Splitting apart strings

## Detect matches

It's common for multiple variables' worth of data to be stored in a single string.
In this section you'll learn how to use various tidyr functions to extract them.

Waiting on: <https://github.com/tidyverse/tidyups/pull/15>

## Working with patterns

### Detect matches

To determine if a character vector matches a pattern, use `str_detect()`.
It returns a logical vector the same length as the input:

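For instance, a minimal illustration:

```{r}
str_detect(c("apple", "banana", "pear"), "an")
```
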
@ -377,6 +379,8 @@ babynames %>%

(Note that this gives us the proportion of names that contain an x; if you wanted the proportion of babies given a name containing an x, you'd need to perform a weighted mean.)

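A sketch of what that weighted calculation might look like (weighting each name by its count `n`):

```{r}
babynames %>%
  group_by(year) %>%
  summarise(prop_x = weighted.mean(str_detect(name, "x"), n))
```
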
### Count matches

A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:

```{r}
|
|||
)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
You also wonder if any names include special characters like periods:
|
||||
|
||||
1. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
|
||||
```{r}
|
||||
babynames %>%
|
||||
distinct(name) %>%
|
||||
head() %>%
|
||||
mutate(
|
||||
periods = str_count(name, "."),
|
||||
)
|
||||
```
|
||||
|
||||
## Introduction to regular expressions
That's weird!

Before we can continue on we need to discuss the second argument to `str_detect()` --- the pattern that you want to match.
Above, I used a simple string, but the pattern is actually a much richer tool called a **regular expression**.
### Introduction to regular expressions

To understand what's going on, we need to discuss what the second argument to `str_detect()` really is.
It looks like a simple string, but it's actually a much richer tool called a **regular expression**.
A regular expression uses special characters to match string patterns.
For example, `.` will match any character, so `"a."` will match any string that contains an a followed by another character:

@ -426,17 +439,6 @@ There are three useful **quantifiers** that can be applied to other pattern: `?`

- `ab*` matches an "a", followed by any number of bs

You can use `()` to control precedence:

- `(ab)?` optionally matches "ab"

- `(ab)+` matches one or more "ab" repeats

```{r}
str_view(c("aba", "ababab", "abbbbbb"), "ab+")
str_view(c("aba", "ababab", "abbbbbb"), "(ab)+")
```

There are various alternatives to `.` that match a restricted set of characters.
One useful operator is the **character class:** `[abcd]` matches "a", "b", "c", or "d"; `[^abcd]` matches anything **except** "a", "b", "c", or "d".

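For example, a quick sketch:

```{r}
str_view_all("abcd 12345", "[abcd]")
str_view_all("abcd 12345", "[^abcd]")
```
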
@ -457,15 +459,7 @@ str_view_all("x X xy", regex(".Y", ignore_case = TRUE))

We'll come back to case later, because it's not trivial for many languages.

### Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

    a. Find all words that start or end with `x`.
    b. Find all words that start with a vowel and end with a consonant.
    c. Are there any words that contain at least one of each different vowel?

## Replacing matches
### Replacing matches

`str_replace_all()` allows you to replace matches with new strings.
The simplest use is to replace a pattern with a fixed string:

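For example, a minimal sketch:

```{r}
str_replace_all(c("one house", "two cars", "three people"), "[aeiou]", "-")
```
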
@ -490,226 +484,76 @@ Use in `mutate()`
Using pipe inside mutate.
Recommendation to make a function, and think about testing it --- you don't need formal tests, but it's useful to build up a set of positive and negative test cases as you go.

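A sketch of that workflow (the data frame `df` and its `price` column are hypothetical):

```{r, eval = FALSE}
# A small cleaning function is easy to exercise on hand-written test cases
clean_price <- function(x) {
  x %>%
    str_replace_all("[$,]", "") %>%
    as.numeric()
}

clean_price(c("$1,234", "$5", "free"))  # the last one should give NA
df %>% mutate(price_clean = clean_price(price))
```
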
#### Exercises
### Exercises

1. Replace all forward slashes in a string with backslashes.
1. What word has the highest number of vowels?
    What word has the highest proportion of vowels?
    (Hint: what is the denominator?)

2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
2. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.

3. Switch the first and last letters in `words`.
    a. Find all words that start or end with `x`.
    b. Find all words that start with a vowel and end with a consonant.
    c. Are there any words that contain at least one of each different vowel?

3. Replace all forward slashes in a string with backslashes.

4. Implement a simple version of `str_to_lower()` using `str_replace_all()`.

5. Switch the first and last letters in `words`.
    Which of those strings are still `words`?

## Extract full matches
## Locale-dependent operations {#other-languages}

If your data is in a tibble, it's often easier to use `tidyr::extract()`.
It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but I wanted to give a quick outline of the functions whose behaviour differs based on your **locale**, the set of settings that vary from country to country.

```{r}
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
```

### Exercises

1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
2. Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
3. Find all contractions. Separate out the pieces before and after the apostrophe.

## Strings -> Columns

## Separate

`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:

```{r}
table3
```

The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"))
```

```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `rate` into `cases` and `population` to make `table3` tidy", fig.alt = "Two panels, one with a data frame with three columns (country, year, and rate) and the other with a data frame with four columns (country, year, cases, and population). Arrows show how the rate variable is separated into two variables: cases and population."}
knitr::include_graphics("images/tidy-17.png")
```

By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
For example, we could rewrite the code above as:

```{r eval = FALSE}
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```

`separate_rows()`

## Strings -> Rows

```{r}
starwars %>%
  select(name, eye_color) %>%
  filter(str_detect(eye_color, ", ")) %>%
  separate_rows(eye_color)
```

### Exercises

1. Split up a string like `"apples, pears, and bananas"` into individual components.

2. Why is it better to split up by `boundary("word")` than `" "`?

3. What does splitting with an empty string (`""`) do?
    Experiment, and then read the documentation.

## Other writing systems {#other-languages}

Unicode is a system for representing the many writing systems used around the world.
The fundamental unit is a **code point**.
This usually represents something like a letter or symbol, but might also be formatting, like a diacritic mark or (e.g.) the skin tone of an emoji.
Character vs grapheme cluster.

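A small sketch of the difference (assuming a reasonably recent ICU, where an emoji plus a skin-tone modifier forms a single grapheme cluster):

```{r}
thumbs_up <- "\U0001F44D\U0001F3FD"  # thumbs-up emoji + skin-tone modifier
str_length(thumbs_up)                # two code points
stringi::stri_count_boundaries(thumbs_up, type = "character")  # one grapheme cluster
```
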
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.

All stringr functions default to the English locale.
This ensures that your code works the same way on every system, avoiding subtle bugs.

Maybe a list of things you think are true, but aren't?

### Encoding

You will not generally find the base R `Encoding()` to be useful because it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks the string has, not what it really is.
Typically the problem is that the declared encoding is wrong.

The tidyverse follows best practices[^strings-8] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed that you have an encoding problem, you should fix it in data import (e.g. by using the `encoding` argument to `readr::locale()`).

[^strings-8]: <http://utf8everywhere.org>

### Length and subsetting

This seems like a straightforward computation if you're only familiar with English, but things get complex quickly when working with other languages.

The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different types of writing system:

- Latin uses an alphabet, where each consonant and vowel gets its own letter.

- Chinese uses logograms.
    Note the contrast between half-width and full-width characters: English letters are roughly twice as high as they are wide, while Chinese characters are roughly square.

- Arabic is an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
    Additionally, it's written from right-to-left, so the first letter is the letter on the far right.

- Devanagari is an abugida, where each symbol represents a consonant-vowel pair, with the vowel notation secondary.

> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>

```{r}
# But note how this splits into characters under a Czech locale:
str_split("check", boundary("character", locale = "cs_CZ"))
```

Things are tricky even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
This is a problem because Unicode provides two ways of representing characters with accents: many common characters have a special codepoint, but others can be built up from individual components.

```{r}
# One accented character stored as a single codepoint vs. one built from a
# letter plus a combining accent
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)

# stri_width(c("全形", "ab"))
# 0, 1, or 2
# but this assumes no font substitution
```

```{r}
# These two capital As look identical but are different codepoints
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```

### Collation rules

`coll()`: compare strings using standard **coll**ation rules.
This is useful for doing case-insensitive matching.
Note that `coll()` takes a `locale` parameter that controls which rules are used for comparing characters.
Unfortunately, different parts of the world use different rules!
Both `fixed()` and `regex()` have `ignore_case` arguments, but they do not allow you to pick the locale: they always use the default locale.
You can see what that is with the following code; more on stringi later.

```{r}
# The default locale currently in use:
stringi::stri_locale_info()

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```

The downside of `coll()` is speed; because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`.

### Upper and lower case

Relatively few writing systems have upper and lower case: Latin, Greek, and Cyrillic, plus a handful of lesser-known languages.

Above I used `str_to_lower()` to change the text to lower case.
You can also use `str_to_upper()` or `str_to_title()`.
However, changing case is more complicated than it might at first appear because different languages have different rules for changing case.
You can pick which set of rules to use by specifying a locale:

```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
- Words are broken up by spaces.
- Words are composed of individual letters.
- All letters in a word are written down.

The locale is specified as an ISO 639 language code, which is a two or three letter abbreviation.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported with `stringi::stri_locale_list()`.
If you leave the locale blank, it will use English.

The locale also affects case-insensitive matching with `coll(ignore_case = TRUE)`, which you can control with the `locale` argument to `coll()`:
Base R string functions automatically use your current locale, but stringr functions all default to the English locale.
This ensures that your code works the same way on every system, avoiding subtle bugs.
To choose a different locale you'll need to specify the `locale` argument; seeing that a function has a `locale` argument tells you that its behaviour will differ from locale to locale.

Here are a few places where the locale matters:

```{r}
i <- c("Iİiı")

str_view_all(i, coll("i", ignore_case = TRUE))
str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

- Upper and lower case: only relatively few languages have upper and lower case (Latin, Greek, and Cyrillic, plus a handful of lesser-known languages). The rules are not the same in every language that uses these alphabets. For example, Turkish has two i's: with and without a dot, and it has a different rule for capitalising them:

You can also do case-insensitive matching with `fixed(ignore_case = TRUE)`, but this uses a simple approximation which will not work in all cases.

```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```

### Sorting
- This also affects case-insensitive matching with `coll(ignore_case = TRUE)`, which you can control with the `locale` argument to `coll()`:

Unicode collation algorithm: <https://unicode.org/reports/tr10/>

```{r}
i <- c("Iİiı")

str_view_all(i, coll("i", ignore_case = TRUE))
str_view_all(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

Another important operation that's affected by the locale is sorting.
The base R `order()` and `sort()` functions sort strings using the current locale.
If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument.

You can also control the "strength", which determines how accents are sorted.
- Many characters with diacritics can be recorded in multiple ways: these will print identically but won't match with `fixed()`.

```{r}
str_sort(c("a", "ch", "c", "h"))
str_sort(c("a", "ch", "c", "h"), locale = "cs_CZ")
```

```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

str_view(a1, fixed(a2))
str_view(a1, coll(a2))
```

TODO: add connection to `arrange()`

- Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument. Here's an example: in Czech, "ch" is a digraph that appears after `h` in the alphabet.

```{r}
str_sort(c("a", "ch", "c", "h"))
str_sort(c("a", "ch", "c", "h"), locale = "cs")
```

TODO after dplyr 1.1.0: discuss `arrange()`