Don't transform non-crossref links

Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions


@@ -44,7 +44,7 @@ library(babynames)</pre>
<section id="sec-reg-basics" data-type="sect1">
<h1>
Pattern basics</h1>
<p>We’ll use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> to learn how regex patterns work. We used <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> will show only the elements of the string vector that match, surrounding each match with <code>&lt;&gt;</code>, and, where possible, highlighting the match in blue.</p>
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code>&lt;&gt;</code>, and, where possible, highlighting the match in blue.</p>
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "berry")
@@ -167,12 +167,12 @@ Key functions</h1>
<section id="detect-matches" data-type="sect2">
<h2>
Detect matches</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_detect(c("a", "b", "c"), "[aeiou]")
#&gt; [1] TRUE FALSE FALSE</pre>
</div>
<p>Since <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p>
<p>Since <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
filter(str_detect(name, "x")) |&gt;
@@ -188,7 +188,7 @@ Detect matches</h2>
#&gt; 6 Alexa 123032
#&gt; # … with 968 more rows</pre>
</div>
<p>We can also use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> by pairing it with <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> or <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!</p>
<p>We can also use <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> by pairing it with <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> or <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
group_by(year) |&gt;
@@ -202,7 +202,7 @@ Detect matches</h2>
</figure>
</div>
</div>
<p>There are two functions that are closely related to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>, namely <code><a href="#chp-https://stringr.tidyverse.org/reference/str_subset" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_subset</a></code> which returns just the strings that contain a match and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_which" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_which</a></code> which returns the indexes of strings that have a match:</p>
<p>There are two functions that are closely related to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>, namely <code><a href="https://stringr.tidyverse.org/reference/str_subset.html">str_subset()</a></code> which returns just the strings that contain a match and <code><a href="https://stringr.tidyverse.org/reference/str_which.html">str_which()</a></code> which returns the indexes of strings that have a match:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_subset(c("a", "b", "c"), "[aeiou]")
#&gt; [1] "a"
@@ -214,7 +214,7 @@ str_which(c("a", "b", "c"), "[aeiou]")
<section id="count-matches" data-type="sect2">
<h2>
Count matches</h2>
<p>The next step up in complexity from <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> is <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p>
<p>The next step up in complexity from <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> is <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "banana", "pear")
str_count(x, "p")
@@ -227,7 +227,7 @@ str_count(x, "p")
str_view("abababa", "aba")
#&gt; [1] │ &lt;aba&gt;b&lt;aba&gt;</pre>
</div>
<p>It’s natural to use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. The following example uses <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code> with character classes to count the number of vowels and consonants in each name.</p>
<p>It’s natural to use <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. The following example uses <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with character classes to count the number of vowels and consonants in each name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
count(name) |&gt;
@@ -249,7 +249,7 @@ str_view("abababa", "aba")
<p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p>
<ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li>
<li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk more about this in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li>
<li>Use <code><a href="#chp-https://stringr.tidyverse.org/reference/case" data-type="xref">#chp-https://stringr.tidyverse.org/reference/case</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li>
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li>
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
<div class="cell">
@@ -276,25 +276,25 @@ str_view("abababa", "aba")
<section id="replace-values" data-type="sect2">
<h2>
Replace values</h2>
<p>As well as detecting and counting matches, we can also modify them with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>. <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> replaces the first match, and as the name suggests, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> replaces all matches.</p>
<p>As well as detecting and counting matches, we can also modify them with <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>. <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> replaces the first match, and as the name suggests, <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code> replaces all matches.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
#&gt; [1] "-ppl-" "p--r" "b-n-n-"</pre>
</div>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_remove" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_remove</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_remove" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_remove</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code>.</p>
<p><code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove_all()</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]")
#&gt; [1] "ppl" "pr" "bnn"</pre>
</div>
<p>These functions are naturally paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
<p>These functions are naturally paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
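<p>A minimal sketch of that layered clean-up, assuming a made-up <code>price</code> column (the data and the column name are hypothetical, not taken from the book’s datasets):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(stringr)

# Hypothetical messy values, invented for illustration
df &lt;- tibble(price = c("$1,200 ", " $35", "$4,000.50"))

df |&gt;
  mutate(
    price = price |&gt;
      str_remove_all("[$,]") |&gt;  # peel off currency symbols and thousands separators
      str_trim() |&gt;              # then drop leftover whitespace
      as.numeric()                # finally convert the cleaned string to a number
  )</pre>
</div>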
</section>
<section id="extract-variables" data-type="sect2">
<h2>
Extract variables</h2>
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>. It’s a peer of the <code>separate_wider_position()</code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_position()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
<p>Let’s create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
@@ -308,7 +308,7 @@ Extract variables</h2>
"&lt;Patricia&gt;-F_84",
)</pre>
</div>
<p>To extract this data using <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p>
<p>To extract this data using <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
separate_wider_regex(
@@ -330,7 +330,7 @@ Extract variables</h2>
#&gt; 6 Justin M 41
#&gt; # … with 1 more row</pre>
</div>
<p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>.</p>
<p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code>.</p>
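<p>As a rough sketch, assuming a small made-up data frame in which the last row lacks the <code>-number</code> part (the data and the column names <code>letter</code> and <code>number</code> are hypothetical):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)

# Hypothetical input: the third value does not fully match the patterns
df &lt;- tibble(x = c("a-1", "b-22", "c"))

df |&gt;
  separate_wider_regex(
    x,
    patterns = c(letter = "[a-z]", "-", number = "\\d+"),
    too_short = "debug"  # keep every row and add diagnostic columns instead of erroring
  )</pre>
</div>
<p>With <code>too_short = "debug"</code> the problematic row should stay in the output alongside extra diagnostic columns, so you can inspect which inputs failed to match before tightening the patterns.</p>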
</section>
<section id="exercises-1" data-type="sect2">
@@ -338,7 +338,7 @@ Extract variables</h2>
Exercises</h2>
<ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li>
<li><p>Replace all forward slashes in a string with backslashes.</p></li>
<li><p>Implement a simple version of <code><a href="#chp-https://stringr.tidyverse.org/reference/case" data-type="xref">#chp-https://stringr.tidyverse.org/reference/case</a></code> using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>.</p></li>
<li><p>Implement a simple version of <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> using <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>.</p></li>
<li><p>Create a regular expression that will match telephone numbers as commonly written in your country.</p></li>
</ol></section>
</section>
@@ -415,7 +415,7 @@ str_view(fruit, "a$")
str_view(fruit, "^apple$")
#&gt; [1] │ &lt;apple&gt;</pre>
</div>
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
@@ -495,7 +495,7 @@ str_view(x, "\\S+")
<section id="sec-quantifiers" data-type="sect2">
<h2>
Quantifiers</h2>
<p><strong>Quantifiers</strong> control how many times a pattern matches. In <a href="#sec-reg-basics" data-type="xref">#sec-reg-basics</a> you learned about <code>?</code> (0 or 1 matches), <code>+</code> (1 or more matches), and <code>*</code> (0 or more matches). For example, <code>colou?r</code> will match American or British spelling, <code>\d+</code> will match one or more digits, and <code>\s?</code> will optionally match a single item of whitespace. You can also specify the number of matches precisely with <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code>:</p>
<p><strong>Quantifiers</strong> control how many times a pattern matches. In <a href="#sec-reg-basics" data-type="xref">#sec-reg-basics</a> you learned about <code>?</code> (0 or 1 matches), <code>+</code> (1 or more matches), and <code>*</code> (0 or more matches). For example, <code>colou?r</code> will match American or British spelling, <code>\d+</code> will match one or more digits, and <code>\s?</code> will optionally match a single item of whitespace. You can also specify the number of matches precisely with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>:</p>
<ul><li>
<code>{n}</code> matches exactly n times.</li>
<li>
@@ -551,7 +551,7 @@ Grouping and capturing</h2>
#&gt; [699] │ &lt;require&gt;
#&gt; [739] │ &lt;sense&gt;</pre>
</div>
<p>You can also use back references in <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p>
<p>You can also use back references in <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sentences |&gt;
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |&gt;
@@ -568,7 +568,7 @@ Grouping and capturing</h2>
#&gt; [10] │ A size large in stockings is hard to sell.
#&gt; ... and 710 more</pre>
</div>
<p>If you want to extract the matches for each group you can use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_match" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_match</a></code>. But <code><a href="#chp-https://stringr.tidyverse.org/reference/str_match" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_match</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
<p>If you want to extract the matches for each group you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sentences |&gt;
str_match("the (\\w+) (\\w+)") |&gt;
@@ -598,7 +598,7 @@ Grouping and capturing</h2>
#&gt; 6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 714 more rows</pre>
</div>
<p>But then you’ve basically recreated your own version of <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>. Indeed, behind the scenes, <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p>
<p>But then you’ve basically recreated your own version of <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. Indeed, behind the scenes, <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p>
<p>Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with <code>(?:)</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("a gray cat", "a grey dog")
@@ -619,11 +619,11 @@ Exercises</h2>
<ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li>
<li><p>Explain why each of these patterns doesn’t match a <code>\</code>: <code>"\"</code>, <code>"\\"</code>, <code>"\\\"</code>.</p></li>
<li>
<p>Given the corpus of common words in <code><a href="#chp-https://stringr.tidyverse.org/reference/stringr-data" data-type="xref">#chp-https://stringr.tidyverse.org/reference/stringr-data</a></code>, create regular expressions that find all words that:</p>
<p>Given the corpus of common words in <code><a href="https://stringr.tidyverse.org/reference/stringr-data.html">stringr::words</a></code>, create regular expressions that find all words that:</p>
<ol type="a"><li>Start with “y”.</li>
<li>Don’t start with “y”.</li>
<li>End with “x”.</li>
<li>Are exactly three letters long. (Don’t cheat by using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code>!)</li>
<li>Are exactly three letters long. (Don’t cheat by using <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code>!)</li>
<li>Have seven letters or more.</li>
<li>Contain a vowel-consonant pair.</li>
<li>Contain at least two vowel-consonant pairs in a row.</li>
@@ -653,7 +653,7 @@ Pattern control</h1>
<section id="sec-flags" data-type="sect2">
<h2>
Regex flags</h2>
<p>There are a number of settings that you can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<p>There are a number of settings that you can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">bananas &lt;- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
@@ -719,19 +719,19 @@ str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
<section id="fixed-matches" data-type="sect2">
<h2>
Fixed matches</h2>
<p>You can opt-out of the regular expression rules by using <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>:</p>
<p>You can opt-out of the regular expression rules by using <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view(c("", "a", "."), fixed("."))
#&gt; [3] │ &lt;.&gt;</pre>
</div>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> also gives you the ability to ignore case:</p>
<p><code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code> also gives you the ability to ignore case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view("x X", "X")
#&gt; [1] │ x &lt;X&gt;
str_view("x X", fixed("X", ignore_case = TRUE))
#&gt; [1] │ &lt;x&gt; &lt;X&gt;</pre>
</div>
<p>If you’re working with non-English text, you will probably want <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> instead of <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p>
<p>If you’re working with non-English text, you will probably want <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">coll()</a></code> instead of <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#&gt; [1] │ i &lt;İ&gt; ı I
@@ -864,7 +864,7 @@ Boolean operations</h2>
#&gt; [71] │ &lt;ba&gt;ll
#&gt; ... and 20 more</pre>
</div>
<p>It’s simpler to combine the results of two calls to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>:</p>
<p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">words[str_detect(words, "a") &amp; str_detect(words, "b")]
#&gt; [1] "able" "about" "absolute" "available" "baby" "back"
@@ -879,7 +879,7 @@ Boolean operations</h2>
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]</pre>
</div>
<p>It’s much simpler to combine five calls to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>:</p>
<p>It’s much simpler to combine five calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">words[
str_detect(words, "a") &amp;
@@ -915,7 +915,7 @@ Creating a pattern with code</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rgb &lt;- c("red", "green", "blue")</pre>
</div>
<p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code>:</p>
<p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#&gt; [1] "\\b(red|green|blue)\\b"</pre>
@@ -968,21 +968,21 @@ str_view(sentences, pattern)
#&gt; [167] │ The office paint was a dull, sad &lt;tan&gt;.
#&gt; ... and 53 more</pre>
</div>
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="#chp-https://stringr.tidyverse.org/reference/str_escape" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_escape</a></code> to ensure they match literally.</p>
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
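<p>As a minimal sketch of why escaping matters, the following builds an alternation from a made-up vector whose elements contain metacharacters. The names <code>prices</code> and <code>pattern</code> and the test strings are hypothetical and not part of the chapter’s data, and the code assumes a stringr version that provides <code>str_escape()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# Hypothetical strings that contain regex metacharacters ($, ., +)
prices &lt;- c("$5", "a.b", "1+1")

# Escape each string first, then join the alternatives with |
pattern &lt;- str_c("(", str_flatten(str_escape(prices), "|"), ")")
pattern
#&gt; [1] "(\\$5|a\\.b|1\\+1)"

# "axb" is not matched because the . in "a.b" is now literal
str_detect(c("costs $5", "axb", "1+1=2"), pattern)
#&gt; [1]  TRUE FALSE  TRUE</pre>
</div>
<p>Without the call to <code>str_escape()</code>, the <code>.</code> in <code>"a.b"</code> would match any character, so <code>"axb"</code> would also be detected.</p>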
</section>
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> calls.</p>
<p>For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> calls.</p>
<ol type="a"><li>Find all <code>words</code> that start or end with <code>x</code>.</li>
<li>Find all <code>words</code> that start with a vowel and end with a consonant.</li>
<li>Are there any <code>words</code> that contain at least one of each different vowel?</li>
</ol></li>
<li><p>Construct patterns to find evidence for and against the rule “i before e except after c”.</p></li>
<li><p><code><a href="#chp-https://rdrr.io/r/grDevices/colors" data-type="xref">#chp-https://rdrr.io/r/grDevices/colors</a></code> contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).</p></li>
<li><p>Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the <code><a href="#chp-https://rdrr.io/r/utils/data" data-type="xref">#chp-https://rdrr.io/r/utils/data</a></code> function: <code>data(package = "datasets")$results[, "Item"]</code>. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.</p></li>
<li><p><code><a href="https://rdrr.io/r/grDevices/colors.html">colors()</a></code> contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).</p></li>
<li><p>Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the <code><a href="https://rdrr.io/r/utils/data.html">data()</a></code> function: <code>data(package = "datasets")$results[, "Item"]</code>. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.</p></li>
</ol></section>
</section>
@@ -995,9 +995,9 @@ Regular expressions in other places</h1>
<h2>
tidyverse</h2>
<p>There are three other particularly useful places where you might want to use regular expressions:</p>
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>).</p></li>
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>).</p></li>
<li><p><code>pivot_longer()</code>’s <code>names_pattern</code> argument takes a vector of regular expressions, just like <code>separate_with_regex()</code>. It’s useful when extracting data out of variable names with a complex structure.</p></li>
<li><p>The <code>delim</code> argument in <code>separate_delim_longer()</code> and <code>separate_delim_wider()</code> usually matches a fixed string, but you can use <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
<li><p>The <code>delim</code> argument in <code>separate_delim_longer()</code> and <code>separate_delim_wider()</code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
</ul></section>
<section id="base-r" data-type="sect2">
@@ -1014,7 +1014,7 @@ Base R</h2>
<pre data-type="programlisting" data-code-language="downlit">head(list.files(pattern = "\\.Rmd$"))
#&gt; character(0)</pre>
</div>
<p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="#chp-https://stringi.gagolewski" data-type="xref">#chp-https://stringi.gagolewski</a>, which is in turn built on top of the <a href="#chp-https://unicode-org.github.io/icu/userguide/strings/regexp" data-type="xref">#chp-https://unicode-org.github.io/icu/userguide/strings/regexp</a>, whereas base R functions use either the <a href="#chp-https://github.com/laurikari/tre" data-type="xref">#chp-https://github.com/laurikari/tre</a> or the <a href="#chp-https://www.pcre" data-type="xref">#chp-https://www.pcre</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p>
<p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="https://stringi.gagolewski.com">stringi package</a>, which is in turn built on top of the <a href="https://unicode-org.github.io/icu/userguide/strings/regexp.html">ICU engine</a>, whereas base R functions use either the <a href="https://github.com/laurikari/tre">TRE engine</a> or the <a href="https://www.pcre.org">PCRE engine</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p>
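<p>As a rough illustration, the sketch below runs the same patterns through stringr and through base R’s <code>grepl()</code>. The vector <code>x</code> and the result names are made up for this example. A plain pattern behaves the same everywhere, while the lookahead <code>(?=n)</code> is one instance of the <code>(?…)</code> syntax, so it is only shown here with stringr and with <code>perl = TRUE</code>, on the assumption that base R’s default engine may not accept it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# Hypothetical input, not from the chapter's data
x &lt;- c("apple", "banana", "cherry")

# A basic pattern gives the same answer in all three engines
icu_basic  &lt;- str_detect(x, "an")          # stringr
tre_basic  &lt;- grepl("an", x)               # base R, default engine
pcre_basic &lt;- grepl("an", x, perl = TRUE)  # base R, PCRE engine
icu_basic
#&gt; [1] FALSE  TRUE FALSE
identical(icu_basic, tre_basic) &amp;&amp; identical(icu_basic, pcre_basic)
#&gt; [1] TRUE

# A lookahead: "a" only when it is followed by "n"
icu_look  &lt;- str_detect(x, "a(?=n)")
pcre_look &lt;- grepl("a(?=n)", x, perl = TRUE)
identical(icu_look, pcre_look)
#&gt; [1] TRUE</pre>
</div>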
</section>
</section>
@@ -1023,7 +1023,7 @@ Base R</h2>
Summary</h1>
<p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p>
<p>In this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.</p>
<p>A good place to start is <code><a href="#chp-https://stringr.tidyverse.org/articles/regular-expressions" data-type="xref">#chp-https://stringr.tidyverse.org/articles/regular-expressions</a></code>: it documents the full set of syntax supported by stringr. Another useful reference is <a href="https://www.regular-expressions.info/tutorial.html">https://www.regular-expressions.info/</a>. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.</p>
<p>A good place to start is <code><a href="https://stringr.tidyverse.org/articles/regular-expressions.html">vignette("regular-expressions", package = "stringr")</a></code>: it documents the full set of syntax supported by stringr. Another useful reference is <a href="https://www.regular-expressions.info/tutorial.html">https://www.regular-expressions.info/</a>. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.</p>
<p>It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.</p>
<p>In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.</p>