Some string tweaking
This commit is contained in:
parent
ec529ef1fa
commit
979289c50b
90
strings.Rmd
@@ -21,9 +21,17 @@ library(stringr)
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr and includes many more functions. stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
For example, if you have
```{r}
```
The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
## String basics
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no different in behaviour.
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
To include a literal single or double quote in a string, you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
@@ -33,6 +41,17 @@ x
writeLines(x)
```
There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'` or `?"'"`.
You'll also sometimes see strings like `"\u00b5"`. This is a way of writing special characters that works on all platforms:
```R
x <- "\u00b5"
x
```
Remember that the representation of a string is different from the string itself.
### String length
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`).
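To make the inconsistency concrete, here is a minimal sketch comparing `nchar()` with stringr's `str_length()`. The input strings are made up for illustration; the point is only how each function treats a missing value.

```r
library(stringr)

# A bare NA is a logical value; nchar() coerces it to the string "NA"
# and reports two characters
nchar(NA)

# str_length() keeps the missing value missing
str_length(NA)

# Non-missing strings are counted as you would expect
str_length(c("a", "R for data science", NA))
```

If you rely on missing values propagating through a pipeline, `str_length()` is the safer choice.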
@@ -61,11 +80,12 @@ str_c("x", "y", sep = ", ")
As in most other R functions, missing values are infectious: combining a string with `NA` yields `NA`. If you want them to print as `"NA"` instead, use `str_replace_na()`:
```{r}
str_c("x", NA, "y")
str_c("x", str_replace_na(NA), "y")
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
As shown above, `str_c()` is vectorised, automatically recycling the shortest vectors to the same length as the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
@@ -108,25 +128,10 @@ x
1. In your own words, describe the difference between `sep` and `collapse`.
## Regular expressions
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
## Regular expressions basics
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
```{r}
```
The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to resources where you can learn more about regular expressions.
### Matching anything and escaping
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
```{r}
@@ -148,6 +153,14 @@ str_subset(c("abc", "a.c", "bef"), "a\\.c")
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
```{r}
x <- "a\\b"
cat(x, "\n")
y <- str_replace(x, "\\\\", "-slash-")
cat(y, "\n")
```
### Character classes and alternatives
As well as `.` there are a number of other special patterns that match more than one character:
@@ -166,7 +179,7 @@ A similar idea is alternation: `x|y` matches either x or y. Note that the preced
str_detect(c("abc", "xyz"), "abc|xyz")
```
Like with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
@@ -191,42 +204,57 @@ Note that the precedence of these operators are high, so you write: `colou?r`. T
### Anchors
* `^` match the start of the line
* `$` match the end of the line
Regular expressions can also match things that are not characters. The most important non-character matches are:
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
* `^`: the start of the line.
* `$`: the end of the line.
To force a regular expression to only match a complete string:
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
```{r}
str_detect(c("abcdef", "bcd"), "^bcd$")
```
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
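The same idea can be tried from R code. The sketch below reuses the function names mentioned above as plain strings; remember that inside an R string the boundary has to be written as `"\\b"`.

```r
library(stringr)

# Function names from the paragraph above, treated as ordinary strings
funs <- c("sum", "summarise", "summary", "rowsum")

# Without boundaries, "sum" matches anywhere inside a name
str_detect(funs, "sum")

# With \b on both sides, only the standalone word "sum" matches
str_detect(funs, "\\bsum\\b")
```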
### Exercises
1. Replace all `/` in a string with `\`.
## Detecting matches
## Regular expression operations
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
* What matches the pattern?
* Does a string match a pattern?
* How can you replace a pattern with text?
* How can you split a string into pieces?
### Detecting matches
`str_detect()`, `str_subset()`, `str_count()`
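As a minimal sketch of how these three functions differ (the vector `x` is invented for illustration): `str_detect()` answers yes or no for each string, `str_subset()` keeps the strings that match, and `str_count()` counts the matches within each string.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# One logical value per string: does it contain an "e"?
str_detect(x, "e")

# Only the strings that contain an "e"
str_subset(x, "e")

# Number of "a"s in each string
str_count(x, "a")
```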
## Extracting matches
### Extracting matches
`str_extract()`, `str_extract_all()`
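A small sketch, using made-up input, of the difference between the two: `str_extract()` returns the first match in each string (or `NA` if there is none), while `str_extract_all()` returns every match as a list with one element per string.

```r
library(stringr)

x <- c("apple 12", "pear 3 and 45", "none")

# First run of digits in each string; NA where nothing matches
first <- str_extract(x, "\\d+")
first

# Every run of digits, as a list with one character vector per string
every <- str_extract_all(x, "\\d+")
every
```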
### Groups
### Extracting grouped matches
`str_match()`, `str_match_all()`
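Here is a hedged sketch of grouped matching. The `name=value` strings and the pattern are assumptions chosen for illustration. `str_match()` returns a character matrix: the first column holds the complete match and each later column holds one parenthesised group.

```r
library(stringr)

x <- c("width=10", "height=25", "no pair")

# Two groups: the name before "=" and the digits after it
m <- str_match(x, "(\\w+)=(\\d+)")
m

# Column 2 is the first group, column 3 the second
m[, 2]
m[, 3]

# str_match_all() returns one matrix per input string, one row per match
all_pairs <- str_match_all("a=1, b=2", "(\\w+)=(\\d+)")[[1]]
all_pairs
```

Strings with no match, like `"no pair"`, produce a row of `NA`s, so the result always has one row per input string.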
## Replacing patterns
### Replacing patterns
`str_replace()`, `str_replace_all()`
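A minimal sketch with an invented input string: `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match.

```r
library(stringr)

x <- "banana"

# Only the first "a" is replaced
first_only <- str_replace(x, "a", "A")
first_only

# Every "a" is replaced
every_one <- str_replace_all(x, "a", "A")
every_one
```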
## Splitting
### Splitting
`str_split()`, `str_split_fixed()`.
`str_split()`, `str_split_fixed()`.
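As a sketch with made-up comma-separated input: `str_split()` returns a list because each string can yield a different number of pieces, whereas `str_split_fixed()` returns a character matrix with exactly `n` columns, padding with empty strings where needed.

```r
library(stringr)

x <- c("a,b,c", "d,e")

# A list: one character vector of pieces per input string
pieces <- str_split(x, ",")
pieces

# A matrix with exactly three columns; the missing piece becomes ""
fixed <- str_split_fixed(x, ",", n = 3)
fixed
```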
### Finding locations
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern and `str_sub()` to extract and/or modify it.
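The locate-then-modify workflow can be sketched as follows. The sentence and the replacement word are invented for illustration; `str_locate()` returns an integer matrix with columns `start` and `end`, which can be passed on to `str_sub()`.

```r
library(stringr)

x <- "The quick brown fox"

# One row per input string, with "start" and "end" columns
loc <- str_locate(x, "quick")
loc

# Extract the located piece
word <- str_sub(x, loc[1, "start"], loc[1, "end"])
word

# Modify the located piece in place using the assignment form of str_sub()
str_sub(x, loc[1, "start"], loc[1, "end"]) <- "slow"
x
```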
## Other types of pattern