Update regexps (#1237)

Stephen Balogun 2023-01-23 15:35:09 +01:00 committed by GitHub
parent d8688e8029
commit 01b8566680
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 22 additions and 22 deletions

@ -545,7 +545,7 @@ dyears(1)
Durations always record the time span in seconds.
Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week.
Larger time units are more problematic.
-A year is uses the "average" number of days in a year, i.e. 365.25.
+A year uses the "average" number of days in a year, i.e. 365.25.
There's no way to convert a month to a duration, because there's just too much variation.
You can add and multiply durations:
@ -565,15 +565,15 @@ last_year <- today() - dyears(1)
However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:
```{r}
-one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
+one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
-one_pm
-one_pm + ddays(1)
+one_am
+one_am + ddays(1)
```
-Why is one day after 1pm March 12, 2pm March 13?
+Why is one day after 1am March 8, 2am March 9?
If you look carefully at the date you might also notice that the time zones have changed.
-March 12 only has 23 hours because it's when DST starts, so if we add a full days worth of seconds we end up with a different time.
+March 8 only has 23 hours because it's when DST starts, so if we add a full days worth of seconds we end up with a different time.
### Periods
@ -582,8 +582,8 @@ Periods are time spans but don't have a fixed length in seconds, instead they wo
That allows them to work in a more intuitive way:
```{r}
-one_pm
-one_pm + days(1)
+one_am
+one_am + days(1)
```
Like durations, periods can be created with a number of friendly constructor functions.
@ -610,8 +610,8 @@ ymd("2024-01-01") + dyears(1)
ymd("2024-01-01") + years(1)
# Daylight Savings Time
-one_pm + ddays(1)
-one_pm + days(1)
+one_am + ddays(1)
+one_am + days(1)
```
Let's use periods to fix an oddity related to our flight dates.

@ -179,7 +179,7 @@ This brings us to another important way of revealing implicitly missing observat
You'll learn more about joins in @sec-joins, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.
`dplyr::anti_join(x, y)` is a particularly useful tool here because it selects only the rows in `x` that don't have a match in `y`.
-For example, we can use two `anti_join()`s reveal to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:
+For example, we can use two `anti_join()`s to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:
```{r}
library(nycflights13)

@ -252,8 +252,8 @@ These functions are naturally paired with `mutate()` when doing data cleaning, a
### Extract variables {#sec-extract-variables}
The last function we'll discuss uses regular expressions to extract data out of one column into one or more new columns: `separate_wider_regex()`.
-It's a peer of the `separate_wider_location()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
-These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.
+It's a peer of the `separate_wider_position()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
+These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.
Let's create a simple dataset to show how it works.
Here we have some data derived from `babynames` where we have the name, gender, and age of a bunch of people in a rather weird format[^regexps-5]:
@ -377,9 +377,9 @@ str_view(fruit, "^a")
str_view(fruit, "a$")
```
-It's tempting to think that `$` should matches the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.
+It's tempting to think that `$` should match the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.
-To force a regular expression to only the full string, anchor it with both `^` and `$`:
+To force a regular expression to match only the full string, anchor it with both `^` and `$`:
```{r}
str_view(fruit, "apple")
@ -387,7 +387,7 @@ str_view(fruit, "^apple$")
```
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
-This can be particularly when using RStudio's find and replace tool.
+This can be particularly useful when using RStudio's find and replace tool.
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarize`, `summary`, `rowsum` and so on:
```{r}
@ -496,7 +496,7 @@ But unlike algebra you're unlikely to remember the precedence rules for regexes,
### Grouping and capturing
-As well overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
+As well as overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
The first way to use a capturing group is to refer back to it within a match with a **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
For example, the following pattern finds all fruits that have a repeated pair of letters:
@ -594,7 +594,7 @@ This allows you control the so called regex flags and match various types of fix
### Regex flags {#sec-flags}
-There are a number of settings that can use to control the details of the regexp.
+There are a number of settings that can be used to control the details of the regexp.
These settings are often called **flags** in other programming languages.
In stringr, you can use these by wrapping the pattern in a call to `regex()`.
The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
@ -680,7 +680,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
To put these ideas into practice we'll solve a few semi-authentic problems next.
We'll discuss three general techniques:
-1. checking you work by creating simple positive and negative controls
+1. checking your work by creating simple positive and negative controls
2. combining regular expressions with Boolean algebra
3. creating complex patterns using string manipulation
@ -830,7 +830,7 @@ str_view(sentences, pattern)
```
In this example, `cols` only contains numbers and letters so you don't need to worry about metacharacters.
-But in general, whenever you create create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
+But in general, whenever you create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
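As a minimal sketch of why escaping matters, the example below builds an alternation pattern from a made-up vector, `words`, whose elements happen to contain metacharacters. The vector and the test strings are illustrative assumptions, not data from the chapter.

```{r}
library(stringr)

# Hypothetical input: strings that contain regex metacharacters.
words <- c("a.b", "1+1", "(x)")

# Escape each string so it matches literally, then join with "|".
pattern <- str_c(str_escape(words), collapse = "|")

# Without escaping, "a.b" would also match "axb" because "." matches any character.
str_detect("axb", pattern)
str_detect("a.b", pattern)
str_detect("1+1", pattern)
```

Escaping before collapsing means each element is treated as plain text, while the `|` added afterwards keeps its special meaning.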
### Exercises
@ -862,10 +862,10 @@ There are three other particularly useful places where you might want to use a r
- `matches(pattern)` will select all variables whose name matches the supplied pattern.
It's a "tidyselect" function that you can use anywhere in any tidyverse function that selects variables (e.g. `select()`, `rename_with()` and `across()`).
-- `pivot_longer()'s` `names_pattern` argument takes a vector of regular expressions, just like `separate_with_regex()`.
+- `pivot_longer()'s` `names_pattern` argument takes a vector of regular expressions, just like `separate_wider_regex()`.
It's useful when extracting data out of variable names with a complex structure
-- The `delim` argument in `separate_delim_longer()` and `separate_delim_wider()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
+- The `delim` argument in `separate_longer_delim()` and `separate_wider_delim()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. `regex(", ?")`.
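To illustrate the last point, here is a small sketch of `separate_longer_delim()` with a `regex()` delimiter. The data frame `df` and its columns are invented for this example and assume a recent tidyr (1.3.0 or later) alongside stringr.

```{r}
library(tidyr)
library(stringr)

# Hypothetical data: values separated by a comma that is sometimes followed by a space.
df <- tibble::tibble(
  id = 1:2,
  x = c("a,b, c", "d, e")
)

# regex(", ?") matches a comma optionally followed by one space,
# so both "a,b" and "b, c" are split cleanly.
out <- df |>
  separate_longer_delim(x, delim = regex(", ?"))

out
```

With a fixed `","` delimiter, some of the pieces would keep a leading space; the pattern removes it as part of the split.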
### Base R