More minor page count tweaks & fixes
And re-convert with latest htmlbook
@@ -1,23 +1,14 @@
|
||||
<section data-type="chapter" id="chp-regexps">
|
||||
<h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<section id="regexps-introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
|
||||
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<section id="regexps-prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<div data-type="important"><div class="callout-body d-flex">
|
||||
<div class="callout-icon-container">
|
||||
<i class="callout-icon"/>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev version with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
|
||||
|
||||
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
@@ -46,11 +37,7 @@ Pattern basics</h1>
|
||||
#> [11] │ boysen<berry>
|
||||
#> [19] │ cloud<berry>
|
||||
#> [21] │ cran<berry>
|
||||
#> [29] │ elder<berry>
|
||||
#> [32] │ goji <berry>
|
||||
#> [33] │ goose<berry>
|
||||
#> [38] │ huckle<berry>
|
||||
#> ... and 4 more
|
||||
#> ... and 8 more
|
||||
|
||||
str_view(fruit, "BERRY")</pre>
|
||||
</div>
|
||||
@@ -70,8 +57,7 @@ str_view(fruit, "BERRY")</pre>
|
||||
#> [51] │ nect<arine>
|
||||
#> [62] │ pine<apple>
|
||||
#> [64] │ pomegr<anate>
|
||||
#> [70] │ r<aspbe>rry
|
||||
#> [73] │ sal<al be>rry</pre>
|
||||
#> ... and 2 more</pre>
|
||||
</div>
|
||||
<p><strong>Quantifiers</strong> control how many times a pattern can match:</p>
|
||||
<ul><li>
|
||||
@@ -123,11 +109,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
||||
#> [34] │ alth<ough>
|
||||
#> [37] │ am<ount>
|
||||
#> [46] │ app<oint>
|
||||
#> [47] │ appr<oach>
|
||||
#> [52] │ ar<ound>
|
||||
#> [61] │ <auth>ority
|
||||
#> [79] │ be<auty>
|
||||
#> ... and 62 more</pre>
|
||||
#> ... and 66 more</pre>
|
||||
</div>
|
||||
<p>(We’ll learn more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
|
||||
<p>You can use <strong>alternation</strong>, <code>|</code>, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
|
||||
@@ -144,11 +126,6 @@ str_view(fruit, "aa|ee|ii|oo|uu")
|
||||
#> [66] │ purple mangost<ee>n</pre>
|
||||
</div>
|
||||
<p>Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.</p>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="sec-stringr-regex-funs" data-type="sect1">
|
||||
@@ -286,7 +263,7 @@ str_remove_all(x, "[aeiou]")
|
||||
<section id="sec-extract-variables" data-type="sect2">
|
||||
<h2>
|
||||
Extract variables</h2>
|
||||
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_location()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.</p>
|
||||
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
|
||||
<p>Let’s create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
@@ -325,7 +302,7 @@ Extract variables</h2>
|
||||
<p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code>.</p>
|
||||
</section>
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<section id="regexps-exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li>
|
||||
@@ -398,8 +375,8 @@ str_view(fruit, "a$")
|
||||
#> [56] │ papay<a>
|
||||
#> [74] │ satsum<a></pre>
|
||||
</div>
|
||||
<p>It’s tempting to think that <code>$</code> should matches the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.</p>
|
||||
<p>To force a regular expression to only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
|
||||
<p>It’s tempting to think that <code>$</code> should match the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.</p>
|
||||
<p>To force a regular expression to match only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple")
|
||||
#> [1] │ <apple>
|
||||
@@ -407,7 +384,7 @@ str_view(fruit, "a$")
|
||||
str_view(fruit, "^apple$")
|
||||
#> [1] │ <apple></pre>
|
||||
</div>
|
||||
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly when using RStudio’s find and replace tool. For example, if to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
|
||||
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, if you want to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
|
||||
str_view(x, "sum")
|
||||
@@ -523,7 +500,7 @@ Operator precedence and parentheses</h2>
|
||||
<section id="grouping-and-capturing" data-type="sect2">
|
||||
<h2>
|
||||
Grouping and capturing</h2>
|
||||
<p>As well overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
|
||||
<p>As well as overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
|
||||
<p>The first way to use a capturing group is to refer back to it within a match with a <strong>back reference</strong>: <code>\1</code> refers to the match contained in the first parenthesis, <code>\2</code> in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "(..)\\1")
|
||||
@@ -548,17 +525,13 @@ Grouping and capturing</h2>
|
||||
<pre data-type="programlisting" data-code-language="r">sentences |>
|
||||
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
|
||||
str_view()
|
||||
#> [1] │ The canoe birch slid on the smooth planks.
|
||||
#> [2] │ Glue sheet the to the dark blue background.
|
||||
#> [3] │ It's to easy tell the depth of a well.
|
||||
#> [4] │ These a days chicken leg is a rare dish.
|
||||
#> [5] │ Rice often is served in round bowls.
|
||||
#> [6] │ The of juice lemons makes fine punch.
|
||||
#> [7] │ The was box thrown beside the parked truck.
|
||||
#> [8] │ The were hogs fed chopped corn and garbage.
|
||||
#> [9] │ Four of hours steady work faced us.
|
||||
#> [10] │ A size large in stockings is hard to sell.
|
||||
#> ... and 710 more</pre>
|
||||
#> [1] │ The canoe birch slid on the smooth planks.
|
||||
#> [2] │ Glue sheet the to the dark blue background.
|
||||
#> [3] │ It's to easy tell the depth of a well.
|
||||
#> [4] │ These a days chicken leg is a rare dish.
|
||||
#> [5] │ Rice often is served in round bowls.
|
||||
#> [6] │ The of juice lemons makes fine punch.
|
||||
#> ... and 714 more</pre>
|
||||
</div>
|
||||
<p>If you want to extract the matches for each group, you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
|
||||
<div class="cell">
|
||||
@@ -605,7 +578,7 @@ str_match(x, "gr(?:e|a)y")
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<section id="regexps-exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li>
|
||||
@@ -645,7 +618,7 @@ Pattern control</h1>
|
||||
<section id="sec-flags" data-type="sect2">
|
||||
<h2>
|
||||
Regex flags</h2>
|
||||
<p>There are a number of settings that can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
|
||||
<p>There are a number of settings that can be used to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">bananas <- c("banana", "Banana", "BANANA")
|
||||
str_view(bananas, "banana")
|
||||
@@ -737,7 +710,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
|
||||
<h1>
|
||||
Practice</h1>
|
||||
<p>To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:</p>
|
||||
<ol type="1"><li>checking you work by creating simple positive and negative controls</li>
|
||||
<ol type="1"><li>checking your work by creating simple positive and negative controls</li>
|
||||
<li>combining regular expressions with Boolean algebra</li>
|
||||
<li>creating complex patterns using string manipulation</li>
|
||||
</ol>
|
||||
@@ -753,11 +726,7 @@ Check your work</h2>
|
||||
#> [7] │ <The> box was thrown beside the parked truck.
|
||||
#> [8] │ <The> hogs were fed chopped corn and garbage.
|
||||
#> [11] │ <The> boy was there when the sun rose.
|
||||
#> [13] │ <The> source of the huge river is the clear spring.
|
||||
#> [18] │ <The> soft cushion broke the man's fall.
|
||||
#> [19] │ <The> salt breeze came across from the sea.
|
||||
#> [20] │ <The> girl at the booth sold fifty bonds.
|
||||
#> ... and 267 more</pre>
|
||||
#> ... and 271 more</pre>
|
||||
</div>
|
||||
<p>That’s because the pattern also matches sentences starting with words like <code>They</code> or <code>These</code>. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:</p>
|
||||
<div class="cell">
|
||||
@@ -768,26 +737,18 @@ Check your work</h2>
|
||||
#> [8] │ <The> hogs were fed chopped corn and garbage.
|
||||
#> [11] │ <The> boy was there when the sun rose.
|
||||
#> [13] │ <The> source of the huge river is the clear spring.
|
||||
#> [18] │ <The> soft cushion broke the man's fall.
|
||||
#> [19] │ <The> salt breeze came across from the sea.
|
||||
#> [20] │ <The> girl at the booth sold fifty bonds.
|
||||
#> [21] │ <The> small pup gnawed a hole in the sock.
|
||||
#> ... and 246 more</pre>
|
||||
#> ... and 250 more</pre>
|
||||
</div>
|
||||
<p>What about finding all sentences that begin with a pronoun?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^She|He|It|They\\b")
|
||||
#> [3] │ <It>'s easy to tell the depth of a well.
|
||||
#> [15] │ <He>lp the woman get back to her feet.
|
||||
#> [27] │ <He>r purse was full of useless trash.
|
||||
#> [29] │ <It> snowed, rained, and hailed the same morning.
|
||||
#> [63] │ <He> ran half way to the hardware store.
|
||||
#> [90] │ <He> lay prone and hardly moved a limb.
|
||||
#> [116] │ <He> ordered peach pie with ice cream.
|
||||
#> [118] │ <He>mp is a weed found in parts of the tropics.
|
||||
#> [127] │ <It> caught its hind paw in a rusty trap.
|
||||
#> [132] │ <He> said the same phrase thirty times.
|
||||
#> ... and 53 more</pre>
|
||||
#> [3] │ <It>'s easy to tell the depth of a well.
|
||||
#> [15] │ <He>lp the woman get back to her feet.
|
||||
#> [27] │ <He>r purse was full of useless trash.
|
||||
#> [29] │ <It> snowed, rained, and hailed the same morning.
|
||||
#> [63] │ <He> ran half way to the hardware store.
|
||||
#> [90] │ <He> lay prone and hardly moved a limb.
|
||||
#> ... and 57 more</pre>
|
||||
</div>
|
||||
<p>A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:</p>
|
||||
<div class="cell">
|
||||
@@ -798,11 +759,7 @@ Check your work</h2>
|
||||
#> [90] │ <He> lay prone and hardly moved a limb.
|
||||
#> [116] │ <He> ordered peach pie with ice cream.
|
||||
#> [127] │ <It> caught its hind paw in a rusty trap.
|
||||
#> [132] │ <He> said the same phrase thirty times.
|
||||
#> [153] │ <He> broke a new shoelace that day.
|
||||
#> [159] │ <She> sewed the torn coat quite neatly.
|
||||
#> [168] │ <He> knew the skill of the great young actress.
|
||||
#> ... and 47 more</pre>
|
||||
#> ... and 51 more</pre>
|
||||
</div>
|
||||
<p>You might wonder how you could spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:</p>
|
||||
<div class="cell">
|
||||
@@ -850,11 +807,7 @@ Boolean operations</h2>
|
||||
#> [62] │ <availab>le
|
||||
#> [66] │ <ba>by
|
||||
#> [67] │ <ba>ck
|
||||
#> [68] │ <ba>d
|
||||
#> [69] │ <ba>g
|
||||
#> [70] │ <bala>nce
|
||||
#> [71] │ <ba>ll
|
||||
#> ... and 20 more</pre>
|
||||
#> ... and 24 more</pre>
|
||||
</div>
|
||||
<p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
|
||||
<div class="cell">
|
||||
@@ -897,11 +850,7 @@ Creating a pattern with code</h2>
|
||||
#> [148] │ The spot on the blotter was made by <green> ink.
|
||||
#> [160] │ The sofa cushion is <red> and of light weight.
|
||||
#> [174] │ The sky that morning was clear and bright <blue>.
|
||||
#> [204] │ A <blue> crane is a tall wading bird.
|
||||
#> [217] │ It is hard to erase <blue> or <red> ink.
|
||||
#> [224] │ The lamp shone with a steady <green> flame.
|
||||
#> [247] │ The box is held by a bright <red> snapper.
|
||||
#> ... and 16 more</pre>
|
||||
#> ... and 20 more</pre>
|
||||
</div>
|
||||
<p>But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?</p>
|
||||
<div class="cell">
|
||||
@@ -915,34 +864,26 @@ Creating a pattern with code</h2>
|
||||
<p>We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(colors())
|
||||
#> [1] │ white
|
||||
#> [2] │ aliceblue
|
||||
#> [3] │ antiquewhite
|
||||
#> [4] │ antiquewhite1
|
||||
#> [5] │ antiquewhite2
|
||||
#> [6] │ antiquewhite3
|
||||
#> [7] │ antiquewhite4
|
||||
#> [8] │ aquamarine
|
||||
#> [9] │ aquamarine1
|
||||
#> [10] │ aquamarine2
|
||||
#> ... and 647 more</pre>
|
||||
#> [1] │ white
|
||||
#> [2] │ aliceblue
|
||||
#> [3] │ antiquewhite
|
||||
#> [4] │ antiquewhite1
|
||||
#> [5] │ antiquewhite2
|
||||
#> [6] │ antiquewhite3
|
||||
#> ... and 651 more</pre>
|
||||
</div>
|
||||
<p>But let’s first eliminate the numbered variants:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">cols <- colors()
|
||||
cols <- cols[!str_detect(cols, "\\d")]
|
||||
str_view(cols)
|
||||
#> [1] │ white
|
||||
#> [2] │ aliceblue
|
||||
#> [3] │ antiquewhite
|
||||
#> [4] │ aquamarine
|
||||
#> [5] │ azure
|
||||
#> [6] │ beige
|
||||
#> [7] │ bisque
|
||||
#> [8] │ black
|
||||
#> [9] │ blanchedalmond
|
||||
#> [10] │ blue
|
||||
#> ... and 133 more</pre>
|
||||
#> [1] │ white
|
||||
#> [2] │ aliceblue
|
||||
#> [3] │ antiquewhite
|
||||
#> [4] │ aquamarine
|
||||
#> [5] │ azure
|
||||
#> [6] │ beige
|
||||
#> ... and 137 more</pre>
|
||||
</div>
|
||||
<p>Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:</p>
|
||||
<div class="cell">
|
||||
@@ -954,16 +895,12 @@ str_view(sentences, pattern)
|
||||
#> [66] │ Cars and busses stalled in <snow> drifts.
|
||||
#> [92] │ A wisp of cloud hung in the <blue> air.
|
||||
#> [112] │ Leaves turn <brown> and <yellow> in the fall.
|
||||
#> [148] │ The spot on the blotter was made by <green> ink.
|
||||
#> [149] │ Mud was spattered on the front of his <white> shirt.
|
||||
#> [160] │ The sofa cushion is <red> and of light weight.
|
||||
#> [167] │ The office paint was a dull, sad <tan>.
|
||||
#> ... and 53 more</pre>
|
||||
#> ... and 57 more</pre>
|
||||
</div>
|
||||
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
|
||||
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
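<p>As a minimal sketch of why escaping matters, compare a pattern built from raw strings with one built after <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code>. The <code>targets</code> and <code>x</code> vectors here are made up for illustration and are not part of the chapter’s data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical strings we want to match literally; both contain metacharacters
targets <- c("a.b", "1+1")
x <- c("a.b", "axb", "1+1", "11")

# Without escaping, "." matches any character and "+" acts as a quantifier
raw_pattern <- str_flatten(targets, "|")
str_detect(x, raw_pattern)
#> [1]  TRUE  TRUE FALSE  TRUE

# With escaping, each target only matches itself
safe_pattern <- str_flatten(str_escape(targets), "|")
str_detect(x, safe_pattern)
#> [1]  TRUE FALSE  TRUE FALSE</pre>
</div>
<p>The unescaped pattern both over-matches (<code>"axb"</code>, <code>"11"</code>) and misses the literal <code>"1+1"</code>, which is the kind of silent failure that escaping is meant to prevent.</p>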
|
||||
</section>
|
||||
|
||||
<section id="exercises-3" data-type="sect2">
|
||||
<section id="regexps-exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
@@ -988,8 +925,8 @@ Regular expressions in other places</h1>
|
||||
tidyverse</h2>
|
||||
<p>There are three other particularly useful places where you might want to use regular expressions:</p>
|
||||
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>).</p></li>
|
||||
<li><p><code>pivot_longer()'s</code> <code>names_pattern</code> argument takes a vector of regular expressions, just like <code>separate_with_regex()</code>. It’s useful when extracting data out of variable names with a complex structure</p></li>
|
||||
<li><p>The <code>delim</code> argument in <code>separate_delim_longer()</code> and <code>separate_delim_wider()</code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
|
||||
<li><p><code>pivot_longer()</code>’s <code>names_pattern</code> argument takes a vector of regular expressions, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s useful when extracting data out of variable names with a complex structure.</p></li>
|
||||
<li><p>The <code>delim</code> argument in <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="base-r" data-type="sect2">
|
||||
@@ -1011,7 +948,7 @@ Base R</h2>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<section id="regexps-summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p>
|
||||
|
||||