Re-render book for O'Reilly
@@ -3,8 +3,8 @@
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. In this chapter we’ll focusing on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
|
||||
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with, and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish up with a survey of other places in the tidyverse and base R where you might use regexes.</p>
|
||||
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
|
||||
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
@@ -16,14 +16,14 @@ Prerequisites</h2>
|
||||
|
||||
</div>
|
||||
|
||||
<p>This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development. If you want to live life on the edge, you can get the dev versions with <code>devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr"))</code>.</p></div>
|
||||
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev version with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
|
||||
|
||||
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(babynames)</pre>
|
||||
</div>
|
||||
<p>Through this chapter we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
|
||||
<p>Throughout this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
|
||||
<ul><li>
|
||||
<code>fruit</code> contains the names of 80 fruits.</li>
|
||||
<li>
|
||||
@@ -36,7 +36,7 @@ library(babynames)</pre>
|
||||
<section id="sec-reg-basics" data-type="sect1">
|
||||
<h1>
|
||||
Pattern basics</h1>
|
||||
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code><></code>, and, where possible, highlighting the match in blue.</p>
|
||||
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code><></code>, and, where possible, highlighting the match in blue.</p>
|
||||
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "berry")
|
||||
@@ -75,11 +75,11 @@ str_view(fruit, "BERRY")</pre>
|
||||
</div>
|
||||
<p><strong>Quantifiers</strong> control how many times a pattern can match:</p>
|
||||
<ul><li>
|
||||
<code>?</code> makes a pattern optional (i.e. it matches 0 or 1 times)</li>
|
||||
<code>?</code> makes a pattern optional (i.e., it matches 0 or 1 times)</li>
|
||||
<li>
|
||||
<code>+</code> lets a pattern repeat (i.e. it matches at least once)</li>
|
||||
<code>+</code> lets a pattern repeat (i.e., it matches at least once)</li>
|
||||
<li>
|
||||
<code>*</code> lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).</li>
|
||||
<code>*</code> lets a pattern be optional or repeat (i.e., it matches any number of times, including 0).</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># ab? matches an "a", optionally followed by a "b".
|
||||
str_view(c("a", "ab", "abb"), "ab?")
|
||||
@@ -98,7 +98,7 @@ str_view(c("a", "ab", "abb"), "ab*")
|
||||
#> [2] │ <ab>
|
||||
#> [3] │ <abb></pre>
|
||||
</div>
|
||||
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set set of characters, e.g. <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
|
||||
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set of characters, e.g., <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][aeiou]")
|
||||
#> [79] │ b<eau>ty
|
||||
@@ -114,7 +114,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
||||
#> [830] │ su<pply>
|
||||
#> [836] │ <syst>em</pre>
|
||||
</div>
|
||||
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowel followed by two or more consonants:</p>
|
||||
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowels followed by two or more consonants:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
|
||||
#> [6] │ acc<ount>
|
||||
@@ -129,7 +129,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
||||
#> [79] │ be<auty>
|
||||
#> ... and 62 more</pre>
|
||||
</div>
|
||||
<p>(We’ll learn some more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
|
||||
<p>(We’ll learn more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
|
||||
<p>You can use <strong>alternation</strong>, <code>|</code>, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple|pear|banana")
|
||||
@@ -154,12 +154,12 @@ Exercises</h2>
|
||||
<section id="sec-stringr-regex-funs" data-type="sect1">
|
||||
<h1>
|
||||
Key functions</h1>
|
||||
<p>Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn about how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.</p>
|
||||
<p>Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.</p>
|
||||
|
||||
<section id="detect-matches" data-type="sect2">
|
||||
<h2>
|
||||
Detect matches</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matches an element of the character vector and <code>FALSE</code> otherwise:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_detect(c("a", "b", "c"), "[aeiou]")
|
||||
#> [1] TRUE FALSE FALSE</pre>
|
||||
@@ -184,12 +184,12 @@ Detect matches</h2>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
group_by(year) |>
|
||||
summarise(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(year, prop_x)) +
|
||||
summarize(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(x = year, y = prop_x)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-x-names"><p><img src="regexps_files/figure-html/fig-x-names-1.png" alt="A timeseries showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019." width="576"/></p>
|
||||
<figure id="fig-x-names"><p><img src="regexps_files/figure-html/fig-x-names-1.png" alt="A time series showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019." width="576"/></p>
|
||||
<figcaption>A time series showing the proportion of baby names that contain a lower case “x”.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
@@ -241,7 +241,7 @@ str_view("abababa", "aba")
|
||||
<p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p>
|
||||
<ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li>
|
||||
<li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk more about this in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li>
|
||||
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li>
|
||||
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>.</li>
|
||||
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
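<p>As a minimal sketch of the three fixes listed above, here they are applied to a small made-up vector rather than the babynames data (the vector <code>x</code> is hypothetical). Each approach should give the same counts:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical example names: "Aaban" has three vowels, "Eve" has two.
x <- c("Aaban", "Eve")

# 1. List the upper case vowels explicitly in the character class.
str_count(x, "[aeiouAEIOU]")
#> [1] 3 2

# 2. Wrap the pattern (not the string) in regex() and ignore case.
str_count(x, regex("[aeiou]", ignore_case = TRUE))
#> [1] 3 2

# 3. Lower-case the strings first, then use the simple pattern.
str_count(str_to_lower(x), "[aeiou]")
#> [1] 3 2</pre>
</div>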
|
||||
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
|
||||
<div class="cell">
|
||||
@@ -283,7 +283,7 @@ str_remove_all(x, "[aeiou]")
|
||||
<p>These functions are naturally paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
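<p>As a rough sketch of that layered cleaning, assuming a made-up data frame of inconsistently formatted prices (the names <code>prices</code>, <code>raw</code>, and <code>price</code> are hypothetical), each step in <code>mutate()</code> peels off one kind of formatting:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(stringr)

# Hypothetical messy input: currency symbols, thousands separators,
# a currency code, and stray whitespace.
prices <- tibble(raw = c("$1,200", "USD 9500", "1 500"))

cleaned <- prices |>
  mutate(
    price = str_remove_all(raw, "[$,]"),   # drop "$" and ","
    price = str_remove_all(price, "USD"),  # drop the currency code
    price = str_remove_all(price, "\\s")   # drop any remaining whitespace
  )

cleaned$price
#> [1] "1200" "9500" "1500"</pre>
</div>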
|
||||
</section>
|
||||
|
||||
<section id="extract-variables" data-type="sect2">
|
||||
<section id="sec-extract-variables" data-type="sect2">
|
||||
<h2>
|
||||
Extract variables</h2>
|
||||
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_location()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
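<p>A minimal sketch of how this might look, assuming tidyr 1.3.0 or later and a made-up data frame (the names <code>df</code>, <code>code</code>, <code>letter</code>, and <code>number</code> are illustrative only). Named pieces of the pattern become new columns, while unnamed pieces are matched and then dropped:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# Hypothetical input: a letter, a dash, then a number.
df <- tibble::tibble(code = c("x-1", "y-2"))

out <- df |>
  separate_wider_regex(
    code,
    patterns = c(
      letter = "[a-z]+",  # named: becomes the column `letter`
      "-",                # unnamed: matched but discarded
      number = "[0-9]+"   # named: becomes the column `number`
    )
  )

out$letter
#> [1] "x" "y"
out$number
#> [1] "1" "2"</pre>
</div>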
|
||||
@@ -407,12 +407,12 @@ str_view(fruit, "a$")
|
||||
str_view(fruit, "^apple$")
|
||||
#> [1] │ <apple></pre>
|
||||
</div>
|
||||
<p>You can also match the boundary between words (i.e., the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code>, and so on:</p>
|
||||
<p>You can also match the boundary between words (i.e., the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code>, and so on:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
|
||||
str_view(x, "sum")
|
||||
#> [1] │ <sum>mary(x)
|
||||
#> [2] │ <sum>marise(df)
|
||||
#> [2] │ <sum>marize(df)
|
||||
#> [3] │ row<sum>(x)
|
||||
#> [4] │ <sum>(x)
|
||||
str_view(x, "\\bsum\\b")
|
||||
@@ -621,7 +621,7 @@ Exercises</h2>
|
||||
<li>Contain at least two vowel-consonant pairs in a row.</li>
|
||||
<li>Only consist of repeated vowel-consonant pairs.</li>
|
||||
</ol></li>
|
||||
<li><p>Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try and make the shortest possible regex!</p></li>
|
||||
<li><p>Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try and make the shortest possible regex!</p></li>
|
||||
<li><p>Switch the first and last letters in <code>words</code>. Which of those strings are still <code>words</code>?</p></li>
|
||||
<li>
|
||||
<p>Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)</p>
|
||||
|
||||