Don't transform non-crossref links
@@ -44,7 +44,7 @@ library(babynames)</pre>
|
||||
<section id="creating-a-string" data-type="sect1">
|
||||
<h1>
|
||||
Creating a string</h1>
|
||||
<p>We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two so in the interests of consistency the <a href="#character-vectors" data-type="xref">#character-vectors</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
|
||||
<p>We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two so in the interests of consistency the <a href="https://style.tidyverse.org/syntax.html#character-vectors">tidyverse style guide</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">string1 <- "This is a string"
|
||||
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
|
||||
@@ -68,7 +68,7 @@ single_quote <- '\'' # or "'"</pre>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">backslash <- "\\"</pre>
|
||||
</div>
|
||||
<p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code><span data-type="footnote">Or use the base R function <code><a href="#chp-https://rdrr.io/r/base/writeLines" data-type="xref">#chp-https://rdrr.io/r/base/writeLines</a></code>.</span>:</p>
|
||||
<p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code><span data-type="footnote">Or use the base R function <code><a href="https://rdrr.io/r/base/writeLines.html">writeLines()</a></code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(single_quote, double_quote, backslash)
|
||||
x
|
||||
@@ -92,7 +92,7 @@ str_view(tricky)
|
||||
#> [1] │ double_quote <- "\"" # or '"'
|
||||
#> │ single_quote <- '\'' # or "'"</pre>
|
||||
</div>
|
||||
<p>That’s a lot of backslashes! (This is sometimes called <a href="#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome" data-type="xref">#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
|
||||
<p>That’s a lot of backslashes! (This is sometimes called <a href="https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome">leaning toothpick syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tricky <- r"(double_quote <- "\"" # or '"'
|
||||
single_quote <- '\'' # or "'")"
|
||||
@@ -106,7 +106,7 @@ str_view(tricky)
|
||||
<section id="other-special-characters" data-type="sect2">
|
||||
<h2>
|
||||
Other special characters</h2>
|
||||
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="#chp-https://rdrr.io/r/base/Quotes" data-type="xref">#chp-https://rdrr.io/r/base/Quotes</a></code>.</p>
|
||||
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="https://rdrr.io/r/base/Quotes.html">?'"'</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
|
||||
x
|
||||
@@ -118,7 +118,7 @@ str_view(x)
|
||||
#> [3] │ µ
|
||||
#> [4] │ 😄</pre>
|
||||
</div>
|
||||
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.</p>
|
||||
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.</p>
|
||||
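<p>The difference between the printed representation and the raw contents can also be checked with base R alone. The following is a minimal sketch, not part of the original chapter; it uses only <code>print()</code>, <code>writeLines()</code>, and <code>nchar()</code>, and the vector <code>x</code> is a shortened version of the one above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x <- c("one\ntwo", "one\ttwo")

# print() shows the escapes, exactly as you would type them
print(x)
#> [1] "one\ntwo" "one\ttwo"

# writeLines() shows the raw contents: here, a real line break
writeLines(x[1])
#> one
#> two

# Each escape sequence counts as a single character in the string
nchar(x)
#> [1] 7 7</pre>
</div>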
</section>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
@@ -131,7 +131,7 @@ Exercises</h2>
|
||||
<li><p><code>\\\\\\</code></p></li>
|
||||
</ol></li>
|
||||
<li>
|
||||
<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
|
||||
<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "This\u00a0is\u00a0tricky"</pre>
|
||||
</div>
|
||||
@@ -142,13 +142,13 @@ Exercises</h2>
|
||||
<section id="creating-many-strings-from-data" data-type="sect1">
|
||||
<h1>
|
||||
Creating many strings from data</h1>
|
||||
<p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a <code>name</code> variable. We’ll show you how to do this with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> and how you can use them with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. That naturally raises the question of what string functions you might use with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, so we’ll finish this section with a discussion of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code> which is a summary function for strings.</p>
|
||||
<p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a <code>name</code> variable. We’ll show you how to do this with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> and how you can use them with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. That naturally raises the question of what string functions you might use with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, so we’ll finish this section with a discussion of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code> which is a summary function for strings.</p>
|
||||
|
||||
<section id="str_c" data-type="sect2">
|
||||
<h2>
|
||||
<code>str_c()</code>
|
||||
</h2>
|
||||
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code><span data-type="footnote"><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is very similar to the base <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code><span data-type="footnote"><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_c("x", "y")
|
||||
#> [1] "xy"
|
||||
@@ -157,7 +157,7 @@ str_c("x", "y", "z")
|
||||
str_c("Hello ", c("John", "Susan"))
|
||||
#> [1] "Hello John" "Hello Susan"</pre>
|
||||
</div>
|
||||
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is designed to be used with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> so it obeys the usual rules for recycling and missing values:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> so it obeys the usual rules for recycling and missing values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">set.seed(1410)
|
||||
df <- tibble(name = c(wakefield::name(3), NA))
|
||||
@@ -170,7 +170,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
||||
#> 3 Graylon Hi Graylon!
|
||||
#> 4 <NA> <NA></pre>
|
||||
</div>
|
||||
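<p>To make the missing-value behavior concrete, here is a minimal sketch (not from the original chapter) that contrasts <code>str_c()</code> with base <code>paste0()</code> on the same input. It assumes stringr is attached, as it is elsewhere in this chapter; the names are illustrative only:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# str_c() propagates the missing value
str_c("Hi ", c("John", NA))
#> [1] "Hi John" NA

# paste0() converts the missing value to the string "NA"
paste0("Hi ", c("John", NA))
#> [1] "Hi John" "Hi NA"</pre>
</div>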
<p>If you want missing values to display in some other way, use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>:</p>
|
||||
<p>If you want missing values to display in some other way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
mutate(
|
||||
@@ -191,7 +191,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
||||
<h2>
|
||||
<code>str_glue()</code>
|
||||
</h2>
|
||||
<p>If you are mixing many fixed and variable strings with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="#chp-https://glue.tidyverse" data-type="xref">#chp-https://glue.tidyverse</a> via <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="#chp-https://glue.tidyverse.org/reference/glue" data-type="xref">#chp-https://glue.tidyverse.org/reference/glue</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code> will be evaluated like it’s outside of the quotes:</p>
|
||||
<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> mutate(greeting = str_glue("Hi {name}!"))
|
||||
#> # A tibble: 4 × 2
|
||||
@@ -202,7 +202,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
||||
#> 3 Graylon Hi Graylon!
|
||||
#> 4 <NA> Hi NA!</pre>
|
||||
</div>
|
||||
<p>As you can see, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>.</p>
|
||||
<p>As you can see, <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
|
||||
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
|
||||
@@ -220,7 +220,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
||||
<h2>
|
||||
<code>str_flatten()</code>
|
||||
</h2>
|
||||
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code>glue()</code> work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code><span data-type="footnote">The base R equivalent is <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code>glue()</code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_flatten(c("x", "y", "z"))
|
||||
#> [1] "xyz"
|
||||
@@ -229,7 +229,7 @@ str_flatten(c("x", "y", "z"), ", ")
|
||||
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
|
||||
#> [1] "x, y, and z"</pre>
|
||||
</div>
|
||||
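<p>The footnote’s base R equivalent can be checked side by side. This is a minimal sketch rather than part of the original chapter, assuming stringr is attached; the vector <code>letters3</code> is a name introduced only for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

letters3 <- c("x", "y", "z")

# stringr: collapse a character vector into one string
str_flatten(letters3, ", ")
#> [1] "x, y, z"

# base R: paste() with the collapse argument gives the same result
paste(letters3, collapse = ", ")
#> [1] "x, y, z"

# Either way the result has length 1, regardless of the input length
length(str_flatten(letters3, ", "))
#> [1] 1</pre>
</div>
<p>Whichever function you use, a vector of any length is reduced to a single string.</p>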
<p>This makes it work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>:</p>
|
||||
<p>This makes it work well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
~ name, ~ fruit,
|
||||
@@ -256,14 +256,14 @@ df |>
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>Compare and contrast the results of <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> for the following inputs:</p>
|
||||
<p>Compare and contrast the results of <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code> with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> for the following inputs:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_c("hi ", NA)
|
||||
str_c(letters[1:2], letters[1:3])</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Convert the following expressions from <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> or vice versa:</p>
|
||||
<p>Convert the following expressions from <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> or vice versa:</p>
|
||||
<ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li>
|
||||
<li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li>
|
||||
<li><p><code>str_c("\\section{", title, "}")</code></p></li>
|
||||
@@ -290,7 +290,7 @@ Extracting data from strings</h1>
|
||||
<section id="separating-into-rows" data-type="sect2">
|
||||
<h2>
|
||||
Separating into rows</h2>
|
||||
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case requires <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim</a></code> to split based on a delimiter:</p>
|
||||
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case requires <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> to split based on a delimiter:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tibble(x = c("a,b,c", "d,e", "f"))
|
||||
df1 |>
|
||||
@@ -305,7 +305,7 @@ df1 |>
|
||||
#> 5 e
|
||||
#> 6 f</pre>
|
||||
</div>
|
||||
<p>It’s rarer to see <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
|
||||
<p>It’s rarer to see <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_position()</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df2 <- tibble(x = c("1211", "131", "21"))
|
||||
df2 |>
|
||||
@@ -326,7 +326,7 @@ df2 |>
|
||||
<section id="sec-string-columns" data-type="sect2">
|
||||
<h2>
|
||||
Separating into columns</h2>
|
||||
<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> we supply the delimiter and the names in two arguments:</p>
|
||||
<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> we supply the delimiter and the names in two arguments:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
|
||||
df3 |>
|
||||
@@ -357,7 +357,7 @@ df3 |>
|
||||
#> 2 b10 2011
|
||||
#> 3 e15 2015</pre>
|
||||
</div>
|
||||
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
|
||||
df4 |>
|
||||
@@ -377,7 +377,7 @@ df4 |>
|
||||
<section id="diagnosing-widening-problems" data-type="sect2">
|
||||
<h2>
|
||||
Diagnosing widening problems</h2>
|
||||
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code><span data-type="footnote">The same principles apply to <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code><span data-type="footnote">The same principles apply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
|
||||
|
||||
@@ -523,12 +523,12 @@ Letters</h1>
|
||||
<section id="length" data-type="sect2">
|
||||
<h2>
|
||||
Length</h2>
|
||||
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> tells you the number of letters in the string:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> tells you the number of letters in the string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_length(c("a", "R for data science", NA))
|
||||
#> [1] 1 18 NA</pre>
|
||||
</div>
|
||||
<p>You could use this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
|
||||
<p>You could use this with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
count(length = str_length(name), wt = n)
|
||||
@@ -573,12 +573,12 @@ str_sub(x, 1, 3)
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_sub(x, -3, -1)
|
||||
#> [1] "ple" "ana" "ear"</pre>
|
||||
</div>
|
||||
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
|
||||
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_sub("a", 1, 5)
|
||||
#> [1] "a"</pre>
|
||||
</div>
|
||||
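<p>The same truncating behavior applies element by element when the input is a vector of mixed lengths. A minimal sketch, assuming stringr is attached; the vector <code>short</code> is a name introduced only for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

short <- c("a", "ab", "abcd")

# Ask for the first three characters; shorter strings come back whole
str_sub(short, 1, 3)
#> [1] "a"   "ab"  "abc"

# The result never has more characters than the original string
str_length(str_sub(short, 1, 3))
#> [1] 1 2 3</pre>
</div>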
<p>We could use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to find the first and last letter of each name:</p>
|
||||
<p>We could use <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to find the first and last letter of each name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
mutate(
|
||||
@@ -626,7 +626,7 @@ str_view(str_wrap(x, 30))
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
Exercises</h2>
<ol type="1"><li>Use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
<ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
<li>Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?</li>
</ol></section>
</section>
@@ -639,7 +639,7 @@ Non-English text</h1>
<section id="encoding" data-type="sect2">
<h2>
Encoding</h2>
<p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="#chp-https://rdrr.io/r/base/rawConversion" data-type="xref">#chp-https://rdrr.io/r/base/rawConversion</a></code>:</p>
<p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="https://rdrr.io/r/base/rawConversion.html">charToRaw()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79</pre>
@@ -676,7 +676,7 @@ read_csv(x2, locale = locale(encoding = "Shift-JIS"))
#> <chr>
#> 1 こんにちは</pre>
</div>
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="#chp-https://readr.tidyverse.org/reference/encoding" data-type="xref">#chp-https://readr.tidyverse.org/reference/encoding</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="https://readr.tidyverse.org/reference/encoding.html">guess_encoding()</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">guess_encoding(x1)
#> # A tibble: 1 × 2
@@ -695,7 +695,7 @@ guess_encoding(x2)
<section id="letter-variations" data-type="sect2">
<h2>
Letter variations</h2>
<p>If you’re working with individual letters (e.g. with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code>) there’s an important challenge if you’re working with a language that has accents, because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p>
<p>If you’re working with individual letters (e.g. with <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code>) there’s an important challenge if you’re working with a language that has accents, because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u <- c("\u00fc", "u\u0308")
str_view(u)
@@ -709,7 +709,7 @@ str_view(u)
str_sub(u, 1, 1)
#> [1] "ü" "u"</pre>
</div>
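<p>One way to reconcile the two forms, offered as a minimal sketch rather than something this chapter prescribes, is Unicode normalization. The example assumes the stringi package is installed and uses its <code>stri_trans_nfc()</code> function, which this section does not otherwise cover; <code>u_nfc</code> is a name introduced only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# Two representations of ü: precomposed, and u plus a combining diaeresis
u <- c("\u00fc", "u\u0308")
str_length(u)
#> [1] 1 2

# Assumption: normalizing to NFC composes the second form into one code point
u_nfc <- stringi::stri_trans_nfc(u)
str_length(u_nfc)
#> [1] 1 1
str_sub(u_nfc, 1, 1)
#> [1] "ü" "ü"</pre>
</div>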
<p>Finally, note that <code>==</code> treats these strings as different even though they look identical; to compare them by appearance, stringr provides the handy <code><a href="#chp-https://stringr.tidyverse.org/reference/str_equal" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_equal</a></code> function:</p>
<p>Finally, note that <code>==</code> treats these strings as different even though they look identical; to compare them by appearance, stringr provides the handy <code><a href="https://stringr.tidyverse.org/reference/str_equal.html">str_equal()</a></code> function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u[[1]] == u[[2]]
#> [1] FALSE
@@ -722,7 +722,7 @@ str_equal(u[[1]], u[[2]])
<section id="locale-dependent-function" data-type="sect2">
<h2>
Locale-dependent function</h2>
<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes" data-type="xref">#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list" data-type="xref">#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list</a></code>.</p>
<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">Wikipedia</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="https://rdrr.io/pkg/stringi/man/stri_locale_list.html">stringi::stri_locale_list()</a></code>.</p>
<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to using English rules, by using the “en” locale, and requires you to specify the <code>locale</code> argument to override it. Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting.</p>
<p>The rules for changing case are not the same in every language. For example, Turkish has two i’s, with and without a dot, and it capitalizes them differently from English:</p>
<div class="cell">
@@ -738,7 +738,7 @@ str_to_upper(c("i", "ı"), locale = "tr")
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#> [1] "a" "c" "h" "ch" "z"</pre>
</div>
<p>This also comes up when sorting strings with <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, which is why it also has a <code>.locale</code> argument.</p>
<p>This also comes up when sorting strings with <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">dplyr::arrange()</a></code>, which is why it also has a <code>.locale</code> argument.</p>
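<p>As a minimal sketch of that last point, the Czech example above might be repeated on a data frame. This assumes dplyr 1.1.0 or later, where the argument is spelled <code>.locale</code>, and an installed stringi package, which dplyr needs for any locale other than its default <code>"C"</code>; <code>df</code> is a made-up tibble used only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical data: the same strings sorted with str_sort() above
df <- tibble(x = c("a", "c", "ch", "h", "z"))

# Default "C" locale: "ch" sorts right after "c"
df |> arrange(x) |> pull(x)
#> [1] "a"  "c"  "ch" "h"  "z"

# Czech rules: "ch" is its own letter and sorts after "h"
df |> arrange(x, .locale = "cs") |> pull(x)
#> [1] "a"  "c"  "h"  "ch" "z"</pre>
</div>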
</section>
</section>