More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

@@ -1,24 +1,15 @@
<section data-type="chapter" id="chp-strings">
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="strings-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you've used a bunch of strings without learning much about the details. Now it's time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.</p>
<p>We'll begin with the details of creating strings and character vectors. You'll then dive into creating strings from data, then the opposite: extracting strings from data. We'll then discuss tools that work with individual letters. The chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.</p>
<p>We'll keep working with strings in the next chapter, where you'll learn more about the power of regular expressions.</p>
<section id="prerequisites" data-type="sect2">
<section id="strings-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
<p>In this chapter, we'll use functions from the stringr package, which is part of the core tidyverse. We'll also use the babynames data since it provides some fun strings to manipulate.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
@@ -113,7 +104,7 @@ str_view(x)
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there's a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="strings-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -138,7 +129,7 @@ Creating many strings from data</h1>
<section id="str_c" data-type="sect2">
<h2>
<code>str_c()</code>
str_c()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> takes any number of vectors as arguments and returns a character vector:</p>
<div class="cell">
@@ -151,16 +142,14 @@ str_c("Hello ", c("John", "Susan"))
</div>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>, but is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> by obeying the usual tidyverse rules for recycling and propagating missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">set.seed(1410)
df &lt;- tibble(name = c(wakefield::name(3), NA))
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("Flora", "David", "Terra"))
df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; &lt;NA&gt;</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>If you want missing values to display in another way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace them. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
<div class="cell">
@@ -169,48 +158,45 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
#&gt; # A tibble: 4 × 3
#&gt; name greeting1 greeting2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena! Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento! Hi Sacramento!
#&gt; 3 Graylon Hi Graylon! Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi you! Hi!</pre>
#&gt; # A tibble: 3 × 3
#&gt; name greeting1 greeting2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Flora Hi Flora! Hi Flora!
#&gt; 2 David Hi David! Hi David!
#&gt; 3 Terra Hi Terra! Hi Terra!</pre>
</div>
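<p>The two placements only differ when a value is missing, so a minimal sketch with a hypothetical vector <code>name</code> (an illustration, not part of the chapter's data) that contains an <code>NA</code> may make the distinction easier to see. It assumes dplyr and stringr are installed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(stringr)

# Hypothetical input with one missing value
name &lt;- c("Flora", NA)

# coalesce() inside str_c(): only the missing name is replaced
str_c("Hi ", coalesce(name, "you"), "!")
#&gt; [1] "Hi Flora!" "Hi you!"

# coalesce() outside str_c(): the whole missing greeting is replaced
coalesce(str_c("Hi ", name, "!"), "Hi!")
#&gt; [1] "Hi Flora!" "Hi!"</pre>
</div>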
</section>
<section id="sec-glue" data-type="sect2">
<h2>
<code>str_glue()</code>
str_glue()
</h2>
<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you'll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you're not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it's outside of the quotes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi NA!</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Flora Hi Flora!
#&gt; 2 David Hi David!
#&gt; 3 Terra Hi Terra!</pre>
</div>
<p>As you can see, <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
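<p>A minimal sketch of that difference, using a hypothetical vector <code>x</code> that contains a missing value (an illustration only, assuming stringr is installed):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical input with one missing value
x &lt;- c("Flora", NA)

# str_c() propagates the missing value
str_c("Hi ", x, "!")
#&gt; [1] "Hi Flora!" NA

# str_glue() turns it into the literal string "NA"
str_glue("Hi {x}!")
#&gt; Hi Flora!
#&gt; Hi NA!</pre>
</div>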
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. You're on the right track if you guess you'll need to escape it somehow. The trick is that glue uses a slightly different escaping technique: instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena {Hi Ilena!}
#&gt; 2 Sacramento {Hi Sacramento!}
#&gt; 3 Graylon {Hi Graylon!}
#&gt; 4 &lt;NA&gt; {Hi NA!}</pre>
#&gt; # A tibble: 3 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Flora {Hi Flora!}
#&gt; 2 David {Hi David!}
#&gt; 3 Terra {Hi Terra!}</pre>
</div>
</section>
<section id="str_flatten" data-type="sect2">
<h2>
<code>str_flatten()</code>
str_flatten()
</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, i.e., something that always returns a single string? That's the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
<div class="cell">
@@ -244,7 +230,7 @@ df |&gt;
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="strings-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -598,7 +584,12 @@ Long strings</h2>
<li><p><code>str_wrap(x, 30)</code> wraps a string introducing new lines so that each line is at most 30 characters (it doesn't hyphenate, however, so any word longer than 30 characters will make a longer line)</p></li>
</ul><p>The following code shows these functions in action with a made-up string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
<pre data-type="programlisting" data-code-language="r">x &lt;- paste0(
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ",
"tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ",
"veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea",
"commodo consequat."
)
str_view(str_trunc(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,...
@@ -610,12 +601,12 @@ str_view(str_wrap(x, 30))
#&gt; │ magna aliqua. Ut enim ad
#&gt; │ minim veniam, quis nostrud
#&gt; │ exercitation ullamco laboris
#&gt; │ nisi ut aliquip ex ea commodo
#&gt; │ nisi ut aliquip ex eacommodo
#&gt; │ consequat.</pre>
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="strings-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
@@ -734,7 +725,7 @@ str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
</section>
</section>
<section id="summary" data-type="sect1">
<section id="strings-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it's time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.</p>