Update regexps (#1237)

parent d8688e8029
commit 01b8566680
@@ -545,7 +545,7 @@ dyears(1)
 Durations always record the time span in seconds.
 Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week.
 Larger time units are more problematic.
-A year is uses the "average" number of days in a year, i.e. 365.25.
+A year uses the "average" number of days in a year, i.e. 365.25.
 There's no way to convert a month to a duration, because there's just too much variation.
 
 You can add and multiply durations:
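The duration constructors mentioned in this hunk all store their span as seconds; a minimal lubridate sketch (illustrative, not taken from the chapter's own chunks):

```r
library(lubridate)

# Each constructor converts its unit to seconds.
dminutes(1)  # 60s
ddays(1)     # 86400s
dyears(1)    # 31557600s, i.e. 365.25 days

# Durations can be added and multiplied like numbers.
2 * dyears(1)
dyears(1) + dweeks(12) + dhours(15)
```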
@@ -565,15 +565,15 @@ last_year <- today() - dyears(1)
 However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:
 
 ```{r}
-one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
+one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
 
-one_pm
-one_pm + ddays(1)
+one_am
+one_am + ddays(1)
 ```
 
-Why is one day after 1pm March 12, 2pm March 13?
+Why is one day after 1am March 8, 2am March 9?
 If you look carefully at the date you might also notice that the time zones have changed.
-March 12 only has 23 hours because it's when DST starts, so if we add a full days worth of seconds we end up with a different time.
+March 8 only has 23 hours because it's when DST starts, so if we add a full days worth of seconds we end up with a different time.
 
 ### Periods
 
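To see why the corrected example lands on 2am, note that `ddays(1)` is exactly 86,400 seconds; a small sketch recomputing the hunk's result (illustrative only):

```r
library(lubridate)

one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

# DST starts on 2026-03-08 in America/New_York, so adding a fixed
# 86,400 seconds moves the wall-clock label forward by 25 hours.
one_am + ddays(1)                    # 2026-03-09 02:00:00 EDT
one_am + ddays(1) == one_am + 86400  # TRUE: ddays(1) is just seconds
```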
@@ -582,8 +582,8 @@ Periods are time spans but don't have a fixed length in seconds, instead they wo
 That allows them to work in a more intuitive way:
 
 ```{r}
-one_pm
-one_pm + days(1)
+one_am
+one_am + days(1)
 ```
 
 Like durations, periods can be created with a number of friendly constructor functions.
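Periods count calendar units rather than seconds, which is why `one_am + days(1)` keeps the wall-clock time; an illustrative sketch of the contrast and of a few period constructors:

```r
library(lubridate)

one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am + ddays(1)  # 2026-03-09 02:00:00 EDT: an exact 86,400 seconds
one_am + days(1)   # 2026-03-09 01:00:00 EDT: one calendar day

# Period constructors mirror the duration ones, minus the d- prefix.
hours(c(12, 24))
days(7)
months(1:6)
```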
@@ -610,8 +610,8 @@ ymd("2024-01-01") + dyears(1)
 ymd("2024-01-01") + years(1)
 
 # Daylight Savings Time
-one_pm + ddays(1)
-one_pm + days(1)
+one_am + ddays(1)
+one_am + days(1)
 ```
 
 Let's use periods to fix an oddity related to our flight dates.
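The leap-year contrast in this hunk's context lines is worth spelling out: 2024 has 366 days, so an "average" year of exactly 365.25 days does not land on the next January 1 (outputs shown as comments, computed independently of the diff):

```r
library(lubridate)

ymd("2024-01-01") + dyears(1)  # 2024-12-31 06:00:00 UTC (365.25 days later)
ymd("2024-01-01") + years(1)   # 2025-01-01 (one calendar year later)
```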
@@ -179,7 +179,7 @@ This brings us to another important way of revealing implicitly missing observat
 You'll learn more about joins in @sec-joins, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it another.
 
 `dplyr::anti_join(x, y)` is a particularly useful tool here because it selects only the rows in `x` that don't have a match in `y`.
-For example, we can use two `anti_join()`s reveal to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:
+For example, we can use two `anti_join()`s to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:
 
 ```{r}
 library(nycflights13)
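The chunk above is cut off after `library(nycflights13)` in this view; a sketch of what the two anti-joins could look like (the exact piping in the book's source may differ):

```r
library(dplyr)
library(nycflights13)

# Destinations in flights with no matching airport in the airports table.
flights |>
  distinct(dest) |>
  anti_join(airports, by = c("dest" = "faa"))

# Tail numbers in flights with no matching row in the planes table.
flights |>
  distinct(tailnum) |>
  anti_join(planes, by = "tailnum")
```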
22 regexps.qmd
@@ -252,8 +252,8 @@ These functions are naturally paired with `mutate()` when doing data cleaning, a
 ### Extract variables {#sec-extract-variables}
 
 The last function we'll discuss uses regular expressions to extract data out of one column into one or more new columns: `separate_wider_regex()`.
-It's a peer of the `separate_wider_location()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
-These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.
+It's a peer of the `separate_wider_position()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
+These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.
 
 Let's create a simple dataset to show how it works.
 Here we have some data derived from `babynames` where we have the name, gender, and age of a bunch of people in a rather weird format[^regexps-5]:
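For readers skimming the diff, here is a sketch of how `separate_wider_regex()` pulls named pieces out of such strings; the rows and the `<name>-gender_age` format are illustrative stand-ins for the dataset the chapter builds:

```r
library(tidyr)

df <- tibble::tibble(str = c("<Sheryl>-F_34", "<Kisha>-F_45", "<Brandon>-N_33"))

# Named elements of `patterns` become columns; unnamed pieces are matched
# and then dropped.
df |>
  separate_wider_regex(
    str,
    patterns = c(
      "<", name = "[A-Za-z]+", ">-",
      gender = ".", "_",
      age = "[0-9]+"
    )
  )
```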
@@ -377,9 +377,9 @@ str_view(fruit, "^a")
 str_view(fruit, "a$")
 ```
 
-It's tempting to think that `$` should matches the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.
+It's tempting to think that `$` should match the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.
 
-To force a regular expression to only the full string, anchor it with both `^` and `$`:
+To force a regular expression to match only the full string, anchor it with both `^` and `$`:
 
 ```{r}
 str_view(fruit, "apple")
@@ -387,7 +387,7 @@ str_view(fruit, "^apple$")
 ```
 
 You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
-This can be particularly when using RStudio's find and replace tool.
+This can be particularly useful when using RStudio's find and replace tool.
 For example, if to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarize`, `summary`, `rowsum` and so on:
 
 ```{r}
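The truncated chunk above demonstrates the `\bsum\b` search; an equivalent standalone check (strings invented for illustration):

```r
library(stringr)

x <- c("sum(x)", "summarize(df)", "rowsum(m)", "cumsum(x)")

# \b only matches at a word/non-word boundary, so sum inside summarize,
# rowsum, and cumsum is skipped.
str_view(x, "\\bsum\\b")
```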
@@ -496,7 +496,7 @@ But unlike algebra you're unlikely to remember the precedence rules for regexes,
 
 ### Grouping and capturing
 
-As well overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
+As well as overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
 
 The first way to use a capturing group is to refer back to it within a match with **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
 For example, the following pattern finds all fruits that have a repeated pair of letters:
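The repeated-pair pattern the hunk refers to is short enough to show inline; a sketch using stringr's built-in `fruit` vector:

```r
library(stringr)

# (..) captures any two characters and \1 demands the same two characters
# immediately afterwards, e.g. "anan" in banana or "coco" in coconut.
str_view(fruit, "(..)\\1")
```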
@@ -594,7 +594,7 @@ This allows you control the so called regex flags and match various types of fix
 
 ### Regex flags {#sec-flags}
 
-There are a number of settings that can use to control the details of the regexp.
+There are a number of settings that can be used to control the details of the regexp.
 These settings are often called **flags** in other programming languages.
 In stringr, you can use these by wrapping the pattern in a call to `regex()`.
 The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
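A minimal sketch of the `ignore_case` flag described above (the vector of bananas is illustrative):

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

str_view(bananas, "banana")                             # only the lowercase form
str_view(bananas, regex("banana", ignore_case = TRUE))  # all three
```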
@@ -680,7 +680,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
 To put these ideas into practice we'll solve a few semi-authentic problems next.
 We'll discuss three general techniques:
 
-1. checking you work by creating simple positive and negative controls
+1. checking your work by creating simple positive and negative controls
 2. combining regular expressions with Boolean algebra
 3. creating complex patterns using string manipulation
 
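A sketch of the first technique, positive and negative controls, using an invented ISO-date pattern and test strings:

```r
library(stringr)

pattern <- "^\\d{4}-\\d{2}-\\d{2}$"

pos <- c("2024-01-01", "1999-12-31")         # should all be TRUE
neg <- c("2024-1-1", "January 1, 2024", "")  # should all be FALSE

str_detect(pos, pattern)
str_detect(neg, pattern)
```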
@@ -830,7 +830,7 @@ str_view(sentences, pattern)
 ```
 
 In this example, `cols` only contains numbers and letters so you don't need to worry about metacharacters.
-But in general, whenever you create create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
+But in general, whenever you create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
 
 ### Exercises
 
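A small sketch of why `str_escape()` matters when patterns are built from existing strings (example strings are invented):

```r
library(stringr)

x <- c("a.c", "abc", "a-c")

str_detect(x, "a.c")              # "." is a metacharacter: all TRUE
str_detect(x, str_escape("a.c"))  # escaped, so only the literal "a.c" matches
```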
@@ -862,10 +862,10 @@ There are three other particularly useful places where you might want to use a r
 - `matches(pattern)` will select all variables whose name matches the supplied pattern.
 It's a "tidyselect" function that you can use anywhere in any tidyverse function that selects variables (e.g. `select()`, `rename_with()` and `across()`).
 
-- `pivot_longer()'s` `names_pattern` argument takes a vector of regular expressions, just like `separate_with_regex()`.
+- `pivot_longer()'s` `names_pattern` argument takes a vector of regular expressions, just like `separate_wider_regex()`.
 It's useful when extracting data out of variable names with a complex structure
 
-- The `delim` argument in `separate_delim_longer()` and `separate_delim_wider()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
+- The `delim` argument in `separate_longer_delim()` and `separate_wider_delim()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
 This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. `regex(", ?")`.
 
 ### Base R
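A sketch of the `regex(", ?")` delimiter mentioned in the last bullet, with a made-up tibble:

```r
library(tidyr)
library(stringr)

df <- tibble::tibble(x = c("a,b", "c, d", "e, f,g"))

# Split on a comma that is optionally followed by a space.
df |>
  separate_longer_delim(x, delim = regex(", ?"))
```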