Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions


@@ -3,12 +3,12 @@
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.</p>
<p>In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you’d need to explicitly double each element of <code>x</code> using some sort of for loop.</p>
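<p>As a minimal sketch of that contrast (the vector <code>x</code> and its values are made up for illustration), the vectorized expression and an explicit loop give the same result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 5, 10)

# Vectorized: R applies the multiplication to every element for you
doubled &lt;- 2 * x

# Explicit loop, as you might write it in many other languages
doubled_loop &lt;- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[[i]] &lt;- 2 * x[[i]]
}

identical(doubled, doubled_loop)
#&gt; [1] TRUE</pre>
</div>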
<p>This book has already given you a small but powerful number of tools that perform the same action for multiple “things”:</p>
<ul><li>
<code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> draw a plot for each subset.</li>
<li>
<code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> plus <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> computes summary statistics for each subset.</li>
<code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> plus <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> computes summary statistics for each subset.</li>
<li>
<code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> create new rows and columns for each element of a list-column.</li>
</ul><p>Now it’s time to learn some more general tools, often called <strong>functional programming</strong> tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.</p>
@@ -25,7 +25,7 @@ Prerequisites</h2>
<p>This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr"))</code>.</p></div>
<p>In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We’re going to use just a couple of purrr functions from in this chapter, but it’s a great package to explore as you improve your programming skills.</p>
<p>In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
@@ -46,7 +46,7 @@ Modifying multiple columns</h1>
</div>
<p>You could do it with copy-and-paste:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; summarise(
<pre data-type="programlisting" data-code-language="r">df |&gt; summarize(
n = n(),
a = median(a),
b = median(b),
@@ -58,9 +58,9 @@ Modifying multiple columns</h1>
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 10 -0.246 -0.287 -0.0567 0.144</pre>
</div>
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead, you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; summarise(
<pre data-type="programlisting" data-code-language="r">df |&gt; summarize(
n = n(),
across(a:d, median),
)
@@ -76,7 +76,7 @@ Modifying multiple columns</h1>
Selecting columns with <code>.cols</code>
</h2>
<p>The first argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">ends_with()</a></code> to select columns based on their name.</p>
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code>where()</code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
grp = sample(2, 10, replace = TRUE),
@@ -88,15 +88,15 @@ Selecting columns with<code>.cols</code>
df |&gt;
group_by(grp) |&gt;
summarise(across(everything(), median))
summarize(across(everything(), median))
#&gt; # A tibble: 2 × 5
#&gt; grp a b c d
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 -0.0935 -0.0163 0.363 0.364
#&gt; 2 2 0.312 -0.0576 0.208 0.565</pre>
</div>
<p>Note grouping columns (<code>grp</code> here) are not included in <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, because they’re automatically preserved by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>.</p>
<p><code>where()</code> allows you to select columns based on their type:</p>
<p>Note grouping columns (<code>grp</code> here) are not included in <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, because they’re automatically preserved by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
<p><code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code> allows you to select columns based on their type:</p>
<ul><li>
<code>where(is.numeric)</code> selects all numeric columns.</li>
<li>
@@ -116,33 +116,35 @@ df |&gt;
)
df_types |&gt;
summarise(across(where(is.numeric), mean))
summarize(across(where(is.numeric), mean))
#&gt; # A tibble: 1 × 2
#&gt; x1 x2
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 0.370
df_types |&gt;
summarise(across(where(is.character), str_flatten))
summarize(across(where(is.character), str_flatten))
#&gt; # A tibble: 1 × 2
#&gt; y1 y2
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 kjh bananaappleegg</pre>
</div>
<p>Just like other selectors, you can combine these with Boolean algebra. For example, <code>!where(is.numeric)</code> selects all non-numeric columns and <code>starts_with("a") &amp; where(is.logical)</code> selects all logical columns whose name starts with “a”.</p>
<p>Just like other selectors, you can combine these with Boolean algebra. For example, <code>!where(is.numeric)</code> selects all non-numeric columns, and <code>starts_with("a") &amp; where(is.logical)</code> selects all logical columns whose name starts with “a”.</p>
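<p>A minimal sketch of those two combined selectors, assuming a small made-up data frame (<code>df_sel</code> and its column names are hypothetical and exist only for this illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data, made up for this illustration
df_sel &lt;- tibble(
  a_flag = c(TRUE, FALSE, TRUE),
  a_num = c(1, 2, 3),
  b_flag = c(FALSE, FALSE, TRUE),
  label = c("x", "y", "z")
)

# Every column that is not numeric
not_numeric &lt;- df_sel |&gt; select(!where(is.numeric)) |&gt; names()
not_numeric
#&gt; [1] "a_flag" "b_flag" "label"

# Logical columns whose name starts with "a"
a_logical &lt;- df_sel |&gt; select(starts_with("a") &amp; where(is.logical)) |&gt; names()
a_logical
#&gt; [1] "a_flag"</pre>
</div>
<p>The same expressions should work unchanged as the <code>.cols</code> argument of <code>across()</code>, since it uses the same selection rules as <code>select()</code>.</p>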
</section>
<section id="calling-a-single-function" data-type="sect2">
<h2>
Calling a single function</h2>
<p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a function programming language.</p>
<p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a functional programming language.</p>
<p>It’s important to note that we’re passing this function to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, so <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can call it; we’re not calling it ourselves. That means the function name should never be followed by <code>()</code>. If you forget, you’ll get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(grp) |&gt;
summarise(across(everything(), median()))
#&gt; Error in vapply(.x, .f, .mold, ..., USE.NAMES = FALSE): values must be length 1,
#&gt; but FUN(X[[1]]) result is length 0</pre>
summarize(across(everything(), median()))
#&gt; Error in `summarize()`:
#&gt; In argument: `across(everything(), median())`.
#&gt; Caused by error in `is.factor()`:
#&gt; ! argument "x" is missing, with no default</pre>
</div>
<p>This error arises because you’re calling the function with no input, e.g.:</p>
<div class="cell">
@@ -154,7 +156,7 @@ Calling a single function</h2>
<section id="calling-multiple-functions" data-type="sect2">
<h2>
Calling multiple functions</h2>
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rnorm_na &lt;- function(n, n_na, mean = 0, sd = 1) {
sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
@@ -167,7 +169,7 @@ df_miss &lt;- tibble(
d = rnorm(5)
)
df_miss |&gt;
summarise(
summarize(
across(a:d, median),
n = n()
)
@@ -179,7 +181,7 @@ df_miss |&gt;
<p>It would be nice if we could pass along <code>na.rm = TRUE</code> to <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> to remove these missing values. To do so, instead of calling <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> directly, we need to create a new function that calls <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> with the desired arguments:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
summarise(
summarize(
across(a:d, function(x) median(x, na.rm = TRUE)),
n = n()
)
@@ -191,7 +193,7 @@ df_miss |&gt;
<p>This is a little verbose, so R comes with a handy shortcut: for this sort of throw away, or <strong>anonymous</strong><span data-type="footnote">Anonymous, because we never explicitly gave it a name with <code>&lt;-</code>. Another term programmers use for this is “lambda function”.</span>, function you can replace <code>function</code> with <code>\</code><span data-type="footnote">In older code you might see syntax that looks like <code>~ .x + 1</code>. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name <code>.x</code>. We now recommend the base syntax, <code>\(x) x + 1</code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
summarise(
summarize(
across(a:d, \(x) median(x, na.rm = TRUE)),
n = n()
)</pre>
@@ -199,7 +201,7 @@ df_miss |&gt;
<p>In either case, <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> effectively expands to the following code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
summarise(
summarize(
a = median(a, na.rm = TRUE),
b = median(b, na.rm = TRUE),
c = median(c, na.rm = TRUE),
@@ -207,10 +209,10 @@ df_miss |&gt;
n = n()
)</pre>
</div>
<p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p>
<p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values were removed. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
summarise(
summarize(
across(a:d, list(
median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x))
@@ -218,10 +220,10 @@ df_miss |&gt;
n = n()
)
#&gt; # A tibble: 1 × 9
#&gt; a_median a_n_miss b_median b_n_miss c_median c_n_miss d_med…¹ d_n_m…² n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
#&gt; # … with abbreviated variable names ¹d_median, ²d_n_miss</pre>
#&gt; a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0
#&gt; # … with 1 more variable: n &lt;int&gt;</pre>
</div>
<p>If you look carefully, you might intuit that the columns are named using a glue specification (<a href="#sec-glue" data-type="xref">#sec-glue</a>) like <code>{.col}_{.fn}</code> where <code>.col</code> is the name of the original column and <code>.fn</code> is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the <code>.names</code> argument to supply your own glue spec.</p>
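<p>One way to check that intuition is to spell the default out explicitly. This is a minimal sketch, assuming a small made-up data frame (<code>df_names</code> and the helper list <code>fns</code> are hypothetical names used only here):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data, made up for this illustration
df_names &lt;- tibble(a = c(1, NA, 3), b = c(4, 5, NA))

# A named list of functions; the list names become the {.fn} part
fns &lt;- list(
  median = \(x) median(x, na.rm = TRUE),
  n_miss = \(x) sum(is.na(x))
)

# Default naming vs. the same glue spec written out by hand
default_names &lt;- df_names |&gt; summarize(across(a:b, fns))
explicit_names &lt;- df_names |&gt;
  summarize(across(a:b, fns, .names = "{.col}_{.fn}"))

names(default_names)
#&gt; [1] "a_median" "a_n_miss" "b_median" "b_n_miss"
identical(names(default_names), names(explicit_names))
#&gt; [1] TRUE</pre>
</div>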
</section>
@@ -232,7 +234,7 @@ Column names</h2>
<p>The result of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is named according to the specification provided in the <code>.names</code> argument. We could specify our own if we wanted the name of the function to come first<span data-type="footnote">You can’t currently change the order of the columns, but you could reorder them after the fact using <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> or similar.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
summarise(
summarize(
across(
a:d,
list(
@@ -244,12 +246,12 @@ Column names</h2>
n = n(),
)
#&gt; # A tibble: 1 × 9
#&gt; median_a n_miss_a median_b n_miss_b median_c n_miss_c media…¹ n_mis…² n
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
#&gt; # … with abbreviated variable names ¹median_d, ²n_miss_d</pre>
#&gt; median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0
#&gt; # … with 1 more variable: n &lt;int&gt;</pre>
</div>
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default, the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt;
mutate(
@@ -284,7 +286,7 @@ Column names</h2>
<section id="filtering" data-type="sect2">
<h2>
Filtering</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&amp;</code>. It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&amp;</code>. It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_miss |&gt; filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
#&gt; # A tibble: 3 × 4
@@ -318,12 +320,6 @@ df_miss |&gt; filter(if_all(a:d, is.na))
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="https://twitter.com/_wurli/status/1571836746899283969">Jacob Scott</a> uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)
#&gt; Loading required package: timechange
#&gt;
#&gt; Attaching package: 'lubridate'
#&gt; The following objects are masked from 'package:base':
#&gt;
#&gt; date, intersect, setdiff, union
expand_dates &lt;- function(df) {
df |&gt;
@@ -347,16 +343,16 @@ df_date |&gt;
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in <a href="#sec-embracing" data-type="xref">#sec-embracing</a>. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summarise_means &lt;- function(df, summary_vars = where(is.numeric)) {
<pre data-type="programlisting" data-code-language="r">summarize_means &lt;- function(df, summary_vars = where(is.numeric)) {
df |&gt;
summarise(
summarize(
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
n = n()
)
}
diamonds |&gt;
group_by(clarity) |&gt;
summarise_means()
summarize_means()
#&gt; # A tibble: 8 × 9
#&gt; clarity carat depth table price x y z n
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
@@ -370,7 +366,7 @@ diamonds |&gt;
diamonds |&gt;
group_by(clarity) |&gt;
summarise_means(c(carat, x:z))
summarize_means(c(carat, x:z))
#&gt; # A tibble: 8 × 6
#&gt; clarity carat x y z n
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
@@ -391,7 +387,7 @@ Vs<code>pivot_longer()</code>
<p>Before we go on, it’s worth pointing out an interesting connection between <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
summarise(across(a:d, list(median = median, mean = mean)))
summarize(across(a:d, list(median = median, mean = mean)))
#&gt; # A tibble: 1 × 8
#&gt; a_median a_mean b_median b_mean c_median c_mean d_median d_mean
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
@@ -402,7 +398,7 @@ Vs<code>pivot_longer()</code>
<pre data-type="programlisting" data-code-language="r">long &lt;- df |&gt;
pivot_longer(a:d) |&gt;
group_by(name) |&gt;
summarise(
summarize(
median = median(value),
mean = mean(value)
)
@@ -464,7 +460,7 @@ df_long
df_long |&gt;
group_by(group) |&gt;
summarise(mean = weighted.mean(val, wts))
summarize(mean = weighted.mean(val, wts))
#&gt; # A tibble: 4 × 2
#&gt; group mean
#&gt; &lt;chr&gt; &lt;dbl&gt;
@@ -486,12 +482,12 @@ Exercises</h2>
<li><p>It is possible to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> where it’s equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>. Can you explain why?</p></li>
<li><p>Adjust <code>expand_dates()</code> to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?</p></li>
<li>
<p>Explain what each step of the pipeline in this function does. What special feature of <code>where()</code> are we taking advantage of?</p>
<p>Explain what each step of the pipeline in this function does. What special feature of <code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code> are we taking advantage of?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">show_missing &lt;- function(df, group_vars, summary_vars = everything()) {
df |&gt;
group_by(pick({{ group_vars }})) |&gt;
summarise(
summarize(
across({{ summary_vars }}, \(x) sum(is.na(x))),
.groups = "drop"
) |&gt;
@@ -522,7 +518,7 @@ data2022 &lt;- readxl::read_excel("data/y2022.xlsx")</pre>
<section id="listing-files-in-a-directory" data-type="sect2">
<h2>
Listing files in a directory</h2>
<p>As the name suggests, <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> lists the files in a directory. TO CONSIDER: why not use it via the more obvious name <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code>? You’ll almost always use three arguments:</p>
<p>As the name suggests, <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> lists the files in a directory. You’ll almost always use three arguments:</p>
<ul><li><p>The first argument, <code>path</code>, is the directory to look in.</p></li>
<li><p><code>pattern</code> is a regular expression used to filter the file names. The most common pattern is something like <code>[.]xlsx$</code> or <code>[.]csv$</code> to find all files with a specified extension.</p></li>
<li><p><code>full.names</code> determines whether or not the directory name should be included in the output. You almost always want this to be <code>TRUE</code>.</p></li>
@@ -608,7 +604,7 @@ files[[1]]
#&gt; 6 Australia Oceania 69.1 8691212 10040.
#&gt; # … with 136 more rows</pre>
</div>
<p>(This is another data structure that doesn’t display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> so you might want to load into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
<p>(This is another data structure that doesn’t display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> so you might want to load it into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
<p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine that list of data frames into a single data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">list_rbind(files)
@@ -623,13 +619,13 @@ files[[1]]
#&gt; 6 Australia Oceania 69.1 8691212 10040.
#&gt; # … with 1,698 more rows</pre>
</div>
<p>Or we could do both steps at once in pipeline:</p>
<p>Or we could do both steps at once in a pipeline:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt;
map(readxl::read_excel) |&gt;
list_rbind()</pre>
</div>
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it’s often useful to peek at the first few row of the data with <code>n_max = 1</code>:</p>
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it’s often useful to peek at the first few rows of the data with <code>n_max = 1</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt;
map(\(path) readxl::read_excel(path, n_max = 1)) |&gt;
@@ -651,7 +647,7 @@ files[[1]]
<section id="sec-data-in-the-path" data-type="sect2">
<h2>
Data in the path</h2>
<p>Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.</p>
<p>Sometimes the name of the file is data itself. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things:</p>
<p>First, we name the vector of paths. The easiest way to do this is with the <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> function, which can take a function. Here we use <code><a href="https://rdrr.io/r/base/basename.html">basename()</a></code> to extract just the file name from the full path:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt; set_names(basename)
@@ -752,15 +748,15 @@ Save your work</h2>
write_csv(gapminder, "gapminder.csv")</pre>
</div>
<p>Now when you come back to this problem in the future, you can read in a single csv file.</p>
<p>If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R.</code> The <code>0</code> in the file name suggests that this should be run before anything else.</p>
<p>If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R</code>. The <code>0</code> in the file name suggests that this should be run before anything else.</p>
<p>If your input data files change over time, you might consider learning a tool like <a href="https://docs.ropensci.org/targets/">targets</a> to set up your data cleaning code to automatically re-run whenever one of the input files is modified.</p>
</section>
<section id="many-simple-iterations" data-type="sect2">
<h2>
Many simple iterations</h2>
<p>Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic basic options: you can do one round of iteration with a complex function, or do a multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better by doing multiple simple iterations.</p>
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is write a function that takes a file and does all those steps then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
<p>Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience, most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.</p>
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">process_file &lt;- function(path) {
df &lt;- read_csv(path)
@@ -784,7 +780,7 @@ paths |&gt;
map(\(df) df |&gt; pivot_longer(jan:dec, names_to = "month")) |&gt;
list_rbind()</pre>
</div>
<p>We recommend this approach because it stops you getting fixated on getting the first file right because moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.</p>
<p>We recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.</p>
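<p>To make the multiple-simple-iterations idea concrete, here is a small self-contained sketch. The list <code>dfs</code> and its columns are made up; they stand in for data frames that have already been read from disk, and each step does one simple thing to every element:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for files that have already been read in
dfs &lt;- list(
  a = tibble(id = c(1, NA), jan = c(10, 20), feb = c(11, 21)),
  b = tibble(id = c(3, 4), jan = c(30, 40), feb = c(31, 41))
)

tidy &lt;- dfs |&gt;
  # Step 1: drop rows with a missing id in every data frame
  map(\(df) df |&gt; filter(!is.na(id))) |&gt;
  # Step 2: pivot the month columns in every data frame
  map(\(df) df |&gt; pivot_longer(jan:feb, names_to = "month")) |&gt;
  # Step 3: combine, recording which element each row came from
  list_rbind(names_to = "source")</pre>
</div>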
<p>In this particular example, there’s another optimization you could make, by binding all the data frames together earlier. Then you can rely on regular dplyr behavior:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt;
@@ -799,12 +795,12 @@ paths |&gt;
<section id="heterogeneous-data" data-type="sect2">
<h2>
Heterogeneous data</h2>
<p>Unfortunately sometimes it’s not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:</p>
<p>Unfortunately, sometimes it’s not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">files &lt;- paths |&gt;
map(readxl::read_excel) </pre>
</div>
<p>Then a very useful strategy is to capture the structure of the data frames to data so that you can explore it using your data science skills. One way to do so is with this handy <code>df_types</code> function that returns a tibble with one row for each column:</p>
<p>Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills. One way to do so is with this handy <code>df_types</code> function that returns a tibble with one row for each column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df_types &lt;- function(df) {
tibble(
@@ -837,7 +833,7 @@ df_types(nycflights13::flights)
#&gt; 6 dep_delay double 8255
#&gt; # … with 13 more rows</pre>
</div>
<p>You can then apply this function all of the files, and maybe do some pivoting to make it easy to see where there are differences. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:</p>
<p>You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">files |&gt;
map(df_types) |&gt;
@@ -855,7 +851,7 @@ df_types(nycflights13::flights)
#&gt; 6 1977.xlsx character character double double double
#&gt; # … with 6 more rows</pre>
</div>
<p>If the files have heterogeneous formats you might need to do more processing before you can successfully merge them. Unfortunately we’re now going to leave you to figure that out on your own, but you might want to read about <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> and <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code> allows you to selectively modify elements based on their names.</p>
<p>If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately, we’re now going to leave you to figure that out on your own, but you might want to read about <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> and <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code> allows you to selectively modify elements based on their names.</p>
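<p>As a starting point, here is a rough sketch of both functions. The list <code>files</code>, its element names, and its columns are invented for illustration, and the fixes themselves (converting <code>year</code> to a number, rescaling <code>pop</code>) are only examples of the kind of repair you might need:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Hypothetical list of data frames with inconsistent column types
files &lt;- list(
  "1952.xlsx" = data.frame(year = "1952", pop = 100),
  "1957.xlsx" = data.frame(year = 1957, pop = 200)
)

# map_if(): modify only the elements for which the predicate is TRUE
fixed &lt;- files |&gt;
  map_if(
    \(df) is.character(df$year),
    \(df) transform(df, year = as.numeric(year))
  )

# map_at(): modify only the elements selected by name
scaled &lt;- files |&gt;
  map_at("1957.xlsx", \(df) transform(df, pop = pop / 1000))</pre>
</div>
<p>Elements that don’t match the predicate or the name are returned unchanged, so the result is always a list of the same length as the input.</p>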
</section>
<section id="handling-failures" data-type="sect2">
@@ -870,7 +866,7 @@ Handling failures</h2>
data &lt;- files |&gt; list_rbind()</pre>
</div>
<p>This works particularly well here because <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code>, like many tidyverse functions, automatically ignores <code>NULL</code>s.</p>
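<p>You can see this behavior in a small sketch. Here <code>read_one()</code> is a hypothetical reader that fails on one of its inputs; it stands in for whatever function you use to read your files:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Hypothetical reader that fails on one input
read_one &lt;- function(x) {
  if (x == "bad") stop("cannot read ", x)
  data.frame(source = x, value = nchar(x))
}

inputs &lt;- c("a", "bad", "ccc")

# possibly() makes the failing call return NULL instead of throwing an error
results &lt;- map(inputs, possibly(\(x) read_one(x), otherwise = NULL))

# list_rbind() skips the NULL and binds the remaining data frames
combined &lt;- list_rbind(results)</pre>
</div>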
<p>Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed load and what do to about it. Start by getting the paths that failed:</p>
<p>Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">failed &lt;- map_vec(files, is.null)
paths[failed]
@@ -885,13 +881,13 @@ paths[failed]
Saving multiple outputs</h1>
<p>In the last section, you learned about <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>, which is useful for reading multiple files into a single object. In this section, we’ll now explore sort of the opposite problem: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:</p>
<ul><li>Saving multiple data frames into one database.</li>
<li>Saving multiple data frames into multiple csv files.</li>
<li>Saving multiple data frames into multiple <code>.csv</code> files.</li>
<li>Saving multiple plots to multiple <code>.png</code> files.</li>
</ul>
<section id="sec-save-database" data-type="sect2">
<h2>
Writing to a database</h2>
<p>Sometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do <code>map(files, read_csv)</code>. One approach to deal with this problem is to load your into a database so you can access just the bits you need with dbplyr.</p>
<p>Sometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do <code>map(files, read_csv)</code>. One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.</p>
<p>If you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s <code>duckdb_read_csv()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb())
@@ -914,7 +910,7 @@ template
#&gt; 6 Australia Oceania 69.1 8691212 10040. 1952
#&gt; # … with 136 more rows</pre>
</div>
<p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into database table:</p>
<p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into a database table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)</pre>
@@ -923,7 +919,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con |&gt; tbl("gapminder")
#&gt; # Source: table&lt;gapminder&gt; [0 x 6]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;,
#&gt; # pop &lt;dbl&gt;, gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre>
</div>
@@ -950,15 +946,15 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
tbl("gapminder") |&gt;
count(year)
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; year n
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1952 142
#&gt; 2 1987 142
#&gt; 3 1957 142
#&gt; 4 1992 142
#&gt; 5 1962 142
#&gt; 6 1997 142
#&gt; 2 1957 142
#&gt; 3 1962 142
#&gt; 4 1967 142
#&gt; 5 1972 142
#&gt; 6 1977 142
#&gt; # … with more rows</pre>
</div>
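<p>The same create-then-append pattern also works for data frames that are already in memory. The following is a minimal sketch, assuming the DBI and duckdb packages are installed; the table name <code>toy</code> and the data are made up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Hypothetical in-memory data frames standing in for per-year files
dfs &lt;- list(
  data.frame(country = "A", year = 1952, pop = 100),
  data.frame(country = "A", year = 1957, pop = 120)
)

con &lt;- DBI::dbConnect(duckdb::duckdb())

# Create an empty table that uses the first data frame as a template
DBI::dbCreateTable(con, "toy", dfs[[1]])

# walk() is used because we only care about the side effect of appending
walk(dfs, \(df) DBI::dbAppendTable(con, "toy", df))

n_rows &lt;- DBI::dbGetQuery(con, "SELECT count(*) AS n FROM toy")$n

DBI::dbDisconnect(con, shutdown = TRUE)</pre>
</div>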
</section>
@@ -997,7 +993,7 @@ by_clarity
#&gt; 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4
#&gt; # … with 735 more rows</pre>
</div>
<p>While we’re here, let’s create a column that gives the name of output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
<p>While we’re here, let’s create a column that gives the name of the output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">by_clarity &lt;- by_clarity |&gt;
mutate(path = str_glue("diamonds-{clarity}.csv"))
@@ -1034,7 +1030,7 @@ Saving plots</h2>
<p>We can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">carat_histogram &lt;- function(df) {
ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.1)
ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)
}
carat_histogram(by_clarity$data[[1]])</pre>
@@ -1078,8 +1074,8 @@ ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)</pre>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>
<p>If you know much about iteration in other languages you might be surprised that we didn’t discuss the <code>for</code> loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each columns or each group. And when you can’t, you can often use a functional programming tool like <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.</p>
<p>In this chapter, you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a superpower: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>
<p>If you know much about iteration in other languages, you might be surprised that we didn’t discuss the <code>for</code> loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can’t, you can often use a functional programming tool like <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.</p>
</section>