Update regexps (#1237)
@@ -545,7 +545,7 @@ dyears(1)
Durations always record the time span in seconds.
Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week.
Larger time units are more problematic.
-A year is uses the "average" number of days in a year, i.e. 365.25.
+A year uses the "average" number of days in a year, i.e. 365.25.
There's no way to convert a month to a duration, because there's just too much variation.

You can add and multiply durations:
@@ -565,15 +565,15 @@ last_year <- today() - dyears(1)
However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:

```{r}
-one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
+one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

-one_pm
-one_pm + ddays(1)
+one_am
+one_am + ddays(1)
```

-Why is one day after 1pm March 12, 2pm March 13?
+Why is one day after 1am March 8, 2am March 9?
If you look carefully at the date you might also notice that the time zones have changed.
-March 12 only has 23 hours because it's when DST starts, so if we add a full days worth of seconds we end up with a different time.
+March 8 only has 23 hours because it's when DST starts, so if we add a full day's worth of seconds we end up with a different time.

### Periods

@@ -582,8 +582,8 @@ Periods are time spans but don't have a fixed length in seconds, instead they wo
That allows them to work in a more intuitive way:

```{r}
-one_pm
-one_pm + days(1)
+one_am
+one_am + days(1)
```

Like durations, periods can be created with a number of friendly constructor functions.
@@ -610,8 +610,8 @@ ymd("2024-01-01") + dyears(1)
ymd("2024-01-01") + years(1)

# Daylight Savings Time
-one_pm + ddays(1)
-one_pm + days(1)
+one_am + ddays(1)
+one_am + days(1)
```

Let's use periods to fix an oddity related to our flight dates.

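The duration/period contrast in this hunk can be cross-checked outside R. As a rough sketch using Python's stdlib `zoneinfo` (an illustration only, not lubridate itself): duration-style arithmetic adds absolute seconds via UTC, while period-style arithmetic keeps the wall clock.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
one_am = datetime(2026, 3, 8, 1, 0, tzinfo=ny)  # DST starts later this morning

# Duration-style (like ddays(1)): add exactly 24 hours of absolute time via UTC.
duration_result = (one_am.astimezone(timezone.utc) + timedelta(hours=24)).astimezone(ny)

# Period-style (like days(1)): aware datetime + timedelta does wall-clock
# arithmetic in Python, so the clock time is preserved across the DST gap.
period_result = one_am + timedelta(days=1)

print(duration_result)  # lands at 02:00 on March 9, like one_am + ddays(1)
print(period_result)    # lands at 01:00 on March 9, like one_am + days(1)
```

The two results differ by exactly one absolute hour, which is the hour that March 8 is missing.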
@@ -179,7 +179,7 @@ This brings us to another important way of revealing implicitly missing observat
You'll learn more about joins in @sec-joins, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.

`dplyr::anti_join(x, y)` is a particularly useful tool here because it selects only the rows in `x` that don't have a match in `y`.
-For example, we can use two `anti_join()`s reveal to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:
+For example, we can use two `anti_join()`s to reveal that we're missing information for four airports and 722 planes mentioned in `flights`:

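For readers who want the anti-join logic itself, here is a minimal language-neutral sketch in Python, with tiny made-up stand-in tables rather than the real `nycflights13` data: keep exactly the rows of `x` whose key has no match in `y`.

```python
# Hypothetical miniature stand-ins for flights and planes.
flights = [{"tailnum": "N101"}, {"tailnum": "N102"}, {"tailnum": "N999"}]
planes = [{"tailnum": "N101"}, {"tailnum": "N102"}]

def anti_join(x, y, key):
    """Return the rows of x whose key value never appears in y."""
    y_keys = {row[key] for row in y}
    return [row for row in x if row[key] not in y_keys]

missing_planes = anti_join(flights, planes, "tailnum")
print(missing_planes)  # only the tail number absent from planes survives
```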
```{r}
library(nycflights13)
```
regexps.qmd
@@ -252,8 +252,8 @@ These functions are naturally paired with `mutate()` when doing data cleaning, a
### Extract variables {#sec-extract-variables}

The last function we'll discuss uses regular expressions to extract data out of one column into one or more new columns: `separate_wider_regex()`.
-It's a peer of the `separate_wider_location()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
-These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.
+It's a peer of the `separate_wider_position()` and `separate_wider_delim()` functions that you learned about in @sec-string-columns.
+These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.

Let's create a simple dataset to show how it works.
Here we have some data derived from `babynames` where we have the name, gender, and age of a bunch of people in a rather weird format[^regexps-5]:
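The footnoted dataset isn't visible in this hunk, so as a stand-in, here is a sketch of the same extract-by-regex idea using Python named groups on an invented `name-gender_age` format (the format and field names are assumptions, not the book's):

```python
import re

strings = ["Amelia-F_43", "Noah-M_7"]  # invented format for illustration
pattern = re.compile(r"(?P<name>[A-Za-z]+)-(?P<gender>[FM])_(?P<age>[0-9]+)")

# Each named group becomes a "column", mirroring how separate_wider_regex()
# turns named sub-patterns into new data frame columns.
rows = [pattern.fullmatch(s).groupdict() for s in strings]
print(rows)
```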
@@ -377,9 +377,9 @@ str_view(fruit, "^a")
str_view(fruit, "a$")
```

-It's tempting to think that `$` should matches the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.
+It's tempting to think that `$` should match the start of a string, because that's how we write dollar amounts, but it's not what regular expressions want.

-To force a regular expression to only the full string, anchor it with both `^` and `$`:
+To force a regular expression to match only the full string, anchor it with both `^` and `$`:

```{r}
str_view(fruit, "apple")
@@ -387,7 +387,7 @@ str_view(fruit, "^apple$")
```

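The anchor semantics are the same in most regex engines; a quick cross-language sketch with Python's `re` (illustrative only, not the book's stringr code):

```python
import re

fruit = ["apple", "pineapple", "applesauce"]

# Unanchored: "apple" matches anywhere inside the string.
anywhere = [f for f in fruit if re.search(r"apple", f)]

# Anchored with ^ and $: only a string that is exactly "apple" matches.
exact = [f for f in fruit if re.search(r"^apple$", f)]

print(anywhere)  # all three contain "apple"
print(exact)     # only "apple" itself
```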
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
-This can be particularly when using RStudio's find and replace tool.
+This can be particularly useful when using RStudio's find and replace tool.
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarize`, `summary`, `rowsum` and so on:

```{r}
@@ -496,7 +496,7 @@ But unlike algebra you're unlikely to remember the precedence rules for regexes,

### Grouping and capturing

-As well overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.
+As well as overriding operator precedence, parentheses have another important effect: they create **capturing groups** that allow you to use sub-components of the match.

The first way to use a capturing group is to refer back to it within a match with a **back reference**: `\1` refers to the match contained in the first parenthesis, `\2` in the second parenthesis, and so on.
For example, the following pattern finds all fruits that have a repeated pair of letters:
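The pattern itself isn't shown in this hunk; the usual form is `(..)\1`, sketched here with Python's `re` and a few fruit names (the short list is an assumption, not stringr's full `fruit` vector):

```python
import re

fruit = ["banana", "coconut", "papaya", "apple", "kiwi"]

# (..) captures any two characters; \1 demands that exact pair again,
# immediately after, so "anan", "coco", and "papa" all match.
repeated_pair = [f for f in fruit if re.search(r"(..)\1", f)]
print(repeated_pair)
```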
@@ -594,7 +594,7 @@ This allows you to control the so called regex flags and match various types of fix

### Regex flags {#sec-flags}

-There are a number of settings that can use to control the details of the regexp.
+There are a number of settings that can be used to control the details of the regexp.
These settings are often called **flags** in other programming languages.
In stringr, you can use these by wrapping the pattern in a call to `regex()`.
The most useful flag is probably `ignore_case = TRUE` because it allows characters to match either their uppercase or lowercase forms:
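As a cross-language sketch, Python's `re.IGNORECASE` flag plays the role that `regex(pattern, ignore_case = TRUE)` does in stringr (the word list here is invented):

```python
import re

words = ["banana", "Banana", "BANANA"]

# Without the flag, only the exact-case spelling matches.
case_sensitive = [w for w in words if re.search(r"banana", w)]

# With re.IGNORECASE, each character matches either case.
case_insensitive = [w for w in words if re.search(r"banana", w, re.IGNORECASE)]
```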
@@ -680,7 +680,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
To put these ideas into practice we'll solve a few semi-authentic problems next.
We'll discuss three general techniques:

-1.  checking you work by creating simple positive and negative controls
+1.  checking your work by creating simple positive and negative controls
2.  combining regular expressions with Boolean algebra
3.  creating complex patterns using string manipulation

@@ -830,7 +830,7 @@ str_view(sentences, pattern)
```

In this example, `cols` only contains numbers and letters so you don't need to worry about metacharacters.
-But in general, whenever you create create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.
+But in general, whenever you create patterns from existing strings it's wise to run them through `str_escape()` to ensure they match literally.

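Python's `re.escape()` is the analogue of `str_escape()`; a minimal sketch (the column names are made up) of why escaping matters when patterns are built from data:

```python
import re

cols = ["total (kg)", "rate+tax"]  # invented names containing metacharacters

# Joining the escaped names with "|" builds a pattern that matches each name
# literally; without re.escape(), "(" and "+" would act as metacharacters.
pattern = "|".join(re.escape(c) for c in cols)

candidates = ["total (kg)", "totalXkg", "rate+tax"]
matched = [s for s in candidates if re.fullmatch(pattern, s)]
```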
### Exercises

@@ -862,10 +862,10 @@ There are three other particularly useful places where you might want to use a r
-   `matches(pattern)` will select all variables whose name matches the supplied pattern.
    It's a "tidyselect" function that you can use anywhere in any tidyverse function that selects variables (e.g. `select()`, `rename_with()` and `across()`).

--   `pivot_longer()'s` `names_pattern` argument takes a vector of regular expressions, just like `separate_with_regex()`.
+-   `pivot_longer()`'s `names_pattern` argument takes a vector of regular expressions, just like `separate_wider_regex()`.
    It's useful when extracting data out of variable names with a complex structure.

--   The `delim` argument in `separate_delim_longer()` and `separate_delim_wider()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
+-   The `delim` argument in `separate_longer_delim()` and `separate_wider_delim()` usually matches a fixed string, but you can use `regex()` to make it match a pattern.
    This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. `regex(", ?")`.

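The same pattern-as-delimiter idea, sketched with Python's `re.split()` (illustrative only):

```python
import re

# ", ?" splits on a comma optionally followed by a single space, so
# inconsistent spacing around commas is handled uniformly.
parts = re.split(r", ?", "apples, bananas,pears")
print(parts)
```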
### Base R