Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions

@@ -60,7 +60,7 @@ df |> mutate(
<section id="writing-a-function" data-type="sect2">
<h2>
Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> its a little easier to see the pattern because each repetition is now one line:</p>
<p>To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it's a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
@@ -73,8 +73,8 @@ Writing a function</h2>
</div>
<p>To turn this into a function you need three things:</p>
<ol type="1"><li><p>A <strong>name</strong>. Here we'll use <code>rescale01</code> because this function rescales a vector to lie between 0 and 1.</p></li>
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that have just one. Well call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
<li><p>The <strong>body</strong>. The body is the code that repeated across all the calls.</p></li>
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that we have just one. We'll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
<li><p>The <strong>body</strong>. The body is the code that's repeated across all the calls.</p></li>
</ol><p>Then you create a function by following the template:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">name &lt;- function(arguments) {
@@ -117,7 +117,7 @@ rescale01(c(1, 2, 3, NA, 5))
<section id="improving-our-function" data-type="sect2">
<h2>
Improving our function</h2>
<p>You might notice <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE)
@@ -136,6 +136,7 @@ rescale01(x)
rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
#&gt; [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#&gt; [8] 0.7777778 0.8888889 1.0000000 Inf</pre>
@@ -146,14 +147,14 @@ rescale01(x)
<section id="mutate-functions" data-type="sect2">
<h2>
Mutate functions</h2>
<p>Now youve got the basic idea of functions, lets take a look a whole bunch of examples. Well start by looking at “mutate” functions, functions that work well like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output the same length as the input.</p>
<p>Lets start with a simple variation of <code>rescale01()</code>. Maybe you want compute the Z-score, rescaling a vector to have to a mean of zero and a standard deviation of one:</p>
<p>Now you've got the basic idea of functions, let's take a look at a whole bunch of examples. We'll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output of the same length as the input.</p>
<p>Let's start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre>
</div>
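<p>As a quick sanity check, the sketch below applies <code>z_score()</code> to a small made-up vector (the values are hypothetical, chosen only for illustration) and confirms that the result has a mean of roughly zero, a standard deviation of roughly one, and the same length as the input:</p>

```r
# z_score() as defined in the text
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical input, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
z <- z_score(x)

mean(z, na.rm = TRUE)   # approximately 0, up to floating-point error
sd(z, na.rm = TRUE)     # approximately 1, up to floating-point error
length(z) == length(x)  # TRUE: the output is as long as the input
```

<p>The missing value stays missing in the output, which is what lets the function slot into <code>mutate()</code> without changing the number of rows.</p>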
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie in between a minimum or a maximum:</p>
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> and give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">clamp &lt;- function(x, min, max) {
case_when(
@@ -162,6 +163,7 @@ Mutate functions</h2>
.default = x
)
}
clamp(1:10, min = 3, max = 7)
#&gt; [1] 3 3 3 4 5 6 7 7 7 7</pre>
</div>
@@ -174,15 +176,17 @@ clamp(1:10, min = 3, max = 7)
.default = x
)
}
na_outside(1:10, min = 3, max = 7)
#&gt; [1] NA NA 3 4 5 6 7 NA NA NA</pre>
</div>
<p>Of course functions dont just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<p>Of course, functions don't just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">first_upper &lt;- function(x) {
str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))
x
}
first_upper("hello")
#&gt; [1] "Hello"</pre>
</div>
@@ -198,12 +202,13 @@ clean_number &lt;- function(x) {
as.numeric(x)
if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
#&gt; [1] 12300
clean_number("45%")
#&gt; [1] 0.45</pre>
</div>
<p>Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<p>Sometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">fix_na &lt;- function(x) {
if_else(x %in% c(997, 998, 999), NA, x)
@@ -237,14 +242,16 @@ Summary functions</h2>
<pre data-type="programlisting" data-code-language="r">commas &lt;- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
#&gt; [1] "cat, dog and pigeon"</pre>
</div>
<p>Or you might wrap up a simple computation, like for the coefficient of variation, which divides standard deviation by the mean:</p>
<p>Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cv &lt;- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
#&gt; [1] 0.5196276
cv(runif(100, min = 0, max = 500))
@@ -318,42 +325,62 @@ Data frame functions</h1>
<section id="indirection-and-tidy-evaluation" data-type="sect2">
<h2>
Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Lets illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let's illustrate the problem with a very simple function: <code>grouped_mean()</code>. The goal of this function is to compute the mean of <code>mean_var</code> grouped by <code>group_var</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">pull_unique &lt;- function(df, var) {
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
df |&gt;
distinct(var) |&gt;
pull(var)
group_by(group_var) |&gt;
summarize(mean(mean_var))
}</pre>
</div>
<p>If we try and use it, we get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; pull_unique(clarity)
#&gt; Error in `distinct()` at ]8;line = 38:col = 2;file:///Users/hadleywickham/Documents/dplyr/dplyr/R/pull.Rdplyr/R/pull.R:38:2]8;;:
#&gt; ! Must use existing variables.
#&gt;`var` not found in `.data`.</pre>
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; grouped_mean(cut, carat)
#&gt; Error in `group_by()`:
#&gt; ! Must group by variables found in `.data`.
#&gt; ✖ Column `group_var` is not found.</pre>
</div>
<p>To make the problem a bit more clear we can use a made up data frame:</p>
<p>To make the problem a bit clearer, we can use a made-up data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(var = "var", x = "x", y = "y")
df |&gt; pull_unique(x)
#&gt; [1] "var"
df |&gt; pull_unique(y)
#&gt; [1] "var"</pre>
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
mean_var = 1,
group_var = "g",
group = 1,
x = 10,
y = 100
)
df |&gt; grouped_mean(group, x)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1
df |&gt; grouped_mean(group, y)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1</pre>
</div>
<p>Regardless of how we call <code>pull_unique()</code> it always does <code>df |&gt; distinct(var) |&gt; pull(var)</code>, instead of <code>df |&gt; distinct(x) |&gt; pull(x)</code> or <code>df |&gt; distinct(y) |&gt; pull(y)</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; its obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> not to treat <code>var</code> as the name of a variable, but instead look inside <code>var</code> for the variable we actually want to use.</p>
<p>Regardless of how we call <code>grouped_mean()</code> it always does <code>df |&gt; group_by(group_var) |&gt; summarize(mean(mean_var))</code>, instead of <code>df |&gt; group_by(group) |&gt; summarize(mean(x))</code> or <code>df |&gt; group_by(group) |&gt; summarize(mean(y))</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> not to treat <code>group_var</code> and <code>mean_var</code> as the names of the variables, but instead to look inside them for the variables we actually want to use.</p>
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means wrapping it in braces so that (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what's happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
<p>So to make <code>grouped_mean()</code> work, we need to surround <code>group_var</code> and <code>mean_var</code> with <code>{{ }}</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">pull_unique &lt;- function(df, var) {
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
df |&gt;
distinct({{ var }}) |&gt;
pull({{ var }})
group_by({{ group_var }}) |&gt;
summarize(mean({{ mean_var }}))
}
diamonds |&gt; pull_unique(clarity)
#&gt; [1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
#&gt; Levels: I1 &lt; SI2 &lt; SI1 &lt; VS2 &lt; VS1 &lt; VVS2 &lt; VVS1 &lt; IF</pre>
diamonds |&gt; grouped_mean(cut, carat)
#&gt; # A tibble: 5 × 2
#&gt; cut `mean(carat)`
#&gt; &lt;ord&gt; &lt;dbl&gt;
#&gt; 1 Fair 1.05
#&gt; 2 Good 0.849
#&gt; 3 Very Good 0.806
#&gt; 4 Premium 0.892
#&gt; 5 Ideal 0.703</pre>
</div>
<p>Success!</p>
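<p>To see that embracing really fixes the indirection problem, it can help to rerun the made-up data frame from earlier through the embraced <code>grouped_mean()</code>. The following is a minimal sketch, assuming dplyr is installed; the names <code>res_x</code> and <code>res_y</code> are introduced here only for illustration:</p>

```r
library(dplyr)

# The embraced version of grouped_mean() from the text
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

# The same made-up data frame used to expose the problem
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

res_x <- df |> grouped_mean(group, x)
res_y <- df |> grouped_mean(group, y)

names(res_x)[1]  # "group": grouped by the column we asked for
res_x[[2]]       # 10: the mean of x, not of mean_var
res_y[[2]]       # 100: the mean of y, not of mean_var
```

<p>The two calls now give different answers, and neither one touches the decoy columns <code>group_var</code> and <code>mean_var</code>.</p>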
</section>
@@ -361,11 +388,11 @@ diamonds |&gt; pull_unique(clarity)
<section id="sec-embracing" data-type="sect2">
<h2>
When to embrace?</h2>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which corresponding to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
<p>In the following sections well explore the sorts of handy functions you might write once you understand embracing.</p>
<p>In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.</p>
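<p>The compute-versus-select distinction can be sketched with two tiny wrappers. Both helpers below (<code>keep_rows()</code> and <code>keep_cols()</code>) and the data frame are hypothetical, made up for illustration, and the code assumes dplyr is installed:</p>

```r
library(dplyr)

# Hypothetical helpers, not part of dplyr.
# filter() is data-masking, so `condition` can be any computation on columns.
keep_rows <- function(df, condition) {
  df |> filter({{ condition }})
}

# select() is tidy-selection, so `cols` can be any selection of columns.
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

df <- tibble(a = 1:3, b = 4:6, x = 7:9)

rows <- df |> keep_rows(x + 1 > 8)  # compute: keeps the rows where x + 1 > 8
cols <- df |> keep_cols(a:b)        # select: keeps columns a through b

nrow(rows)   # 2
names(cols)  # "a" "b"
```

<p>In both cases the argument is embraced in the same way; what differs is what the caller is allowed to write, which is decided by the dplyr verb that receives it.</p>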
</section>
<section id="common-use-cases" data-type="sect2">
@@ -374,7 +401,7 @@ Common use cases</h2>
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summary6 &lt;- function(data, var) {
data |&gt; summarise(
data |&gt; summarize(
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
@@ -384,14 +411,15 @@ Common use cases</h2>
.groups = "drop"
)
}
diamonds |&gt; summary6(carat)
#&gt; # A tibble: 1 × 6
#&gt; min mean median max n n_miss
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre>
</div>
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think its good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> you can used it on grouped data:</p>
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> in a helper, we think it's good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, you can use it on grouped data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
@@ -405,7 +433,7 @@ diamonds |&gt; summary6(carat)
#&gt; 4 Premium 0.2 0.892 0.86 4.01 13791 0
#&gt; 5 Ideal 0.2 0.703 0.54 3.5 21551 0</pre>
</div>
<p>Because the arguments to summarize are data-masking that also means that the <code>var</code> argument to <code>summary6()</code> is data-masking. That means you can also summarize computed variables:</p>
<p>Furthermore, since the arguments to summarize are data-masking, the <code>var</code> argument to <code>summary6()</code> is also data-masking. That means you can also summarize computed variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
@@ -419,8 +447,8 @@ diamonds |&gt; summary6(carat)
#&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
#&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
</div>
<p>To summarize multiple variables youll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where youll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<p>To summarize multiple variables, you'll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) {
@@ -428,6 +456,7 @@ count_prop &lt;- function(df, var, sort = FALSE) {
count({{ var }}, sort = sort) |&gt;
mutate(prop = n / sum(n))
}
diamonds |&gt; count_prop(clarity)
#&gt; # A tibble: 8 × 3
#&gt; clarity n prop
@@ -447,26 +476,36 @@ diamonds |&gt; count_prop(clarity)
df |&gt;
filter({{ condition }}) |&gt;
distinct({{ var }}) |&gt;
arrange({{ var }}) |&gt;
pull({{ var }})
arrange({{ var }})
}
# Find all the destinations in December
flights |&gt; unique_where(month == 12, dest)
#&gt; [1] "ABQ" "ALB" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA" "BOS" "BQN" "BTV"
#&gt; [13] "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHS" "CLE" "CLT" "CMH" "CVG" "DAY"
#&gt; [25] "DCA" "DEN" "DFW" "DSM" "DTW" "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN"
#&gt; [37] "HNL" "HOU" "IAD" "IAH" "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LGB" "MCI"
#&gt; [49] "MCO" "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "OAK" "OKC"
#&gt; [61] "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE" "PSP" "PVD" "PWM"
#&gt; [73] "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN" "SDF" "SEA" "SFO" "SJC"
#&gt; [85] "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT" "SYR" "TPA" "TUL" "TYS" "XNA"
#&gt; # A tibble: 96 × 1
#&gt; dest
#&gt; &lt;chr&gt;
#&gt; 1 ABQ
#&gt; 2 ALB
#&gt; 3 ATL
#&gt; 4 AUS
#&gt; 5 AVL
#&gt; 6 BDL
#&gt; # … with 90 more rows
# Which months did plane N14228 fly in?
flights |&gt; unique_where(tailnum == "N14228", month)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 12</pre>
#&gt; # A tibble: 11 × 1
#&gt; month
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3
#&gt; 4 4
#&gt; 5 5
#&gt; 6 6
#&gt; # … with 5 more rows</pre>
</div>
<p>Here we embrace <code>condition</code> because its passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because its passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
<p>Weve made all these examples take a data frame as the first argument, but if youre working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<p>Here we embrace <code>condition</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>.</p>
<p>We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_sub &lt;- function(rows, cols) {
flights |&gt;
@@ -476,43 +515,45 @@ flights |&gt; unique_where(tailnum == "N14228", month)
flights_sub(dest == "IAH", contains("time"))
#&gt; # A tibble: 7,198 × 8
#&gt; time_hour carrier flight dep_time sched…¹ arr_t…² sched…³ air_t…⁴
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233
#&gt; # … with 7,192 more rows, and abbreviated variable names ¹​sched_dep_time,
#&gt; # ²arr_time, ³sched_arr_time, ⁴air_time</pre>
#&gt; time_hour carrier flight dep_time sched_dep_time arr_time
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228
#&gt; # … with 7,192 more rows, and 2 more variables: sched_arr_time &lt;int&gt;,
#&gt; # air_time &lt;dbl&gt;</pre>
</div>
</section>
<section id="data-masking-vs-tidy-selection" data-type="sect2">
<section id="data-masking-vs.-tidy-selection" data-type="sect2">
<h2>
Data-masking vs tidy-selection</h2>
Data-masking vs. tidy-selection</h2>
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt;
group_by({{ group_vars }}) |&gt;
summarise(n_miss = sum(is.na({{ x_var }})))
summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
count_missing(c(year, month, day), dep_time)
#&gt; Error in `group_by()` at ]8;line = 127:col = 2;file:///Users/hadleywickham/Documents/dplyr/dplyr/R/summarise.Rdplyr/R/summarise.R:127:2]8;;:
#&gt; In argument: `..1 = c(year, month, day)`.
#&gt; Error in `group_by()`:
#&gt; In argument: `c(year, month, day)`.
#&gt; Caused by error:
#&gt; ! `..1` must be size 336776 or 1, not 1010328.</pre>
#&gt; ! `c(year, month, day)` must be size 336776 or 1, not 1010328.</pre>
</div>
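<p>The size reported in this error message is not arbitrary. Outside of a tidy-selection context, <code>c()</code> simply concatenates its inputs end to end, so three columns of 336776 rows each become one vector three times as long. A minimal base R sketch with made-up miniature columns (the values and <code>n</code> are hypothetical):</p>

```r
# Hypothetical miniature version of the problem: three columns of n rows each
n <- 4
year  <- rep(2013, n)
month <- rep(1, n)
day   <- 1:n

# c() joins the columns end to end instead of keeping them side by side
length(c(year, month, day))  # 3 * n = 12

# The same arithmetic with the 336776 rows reported in the error message
3 * 336776  # 1010328
```
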
<p>This doesnt work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> which allows you to use use tidy-selection inside data-masking functions:</p>
<p>This doesn't work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> function, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt;
group_by(pick({{ group_vars }})) |&gt;
summarise(n_miss = sum(is.na({{ x_var }})))
summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
count_missing(c(year, month, day), dep_time)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
@@ -542,6 +583,7 @@ count_wide &lt;- function(data, rows, cols) {
values_fill = 0
)
}
diamonds |&gt; count_wide(clarity, cut)
#&gt; # A tibble: 8 × 6
#&gt; clarity Fair Good `Very Good` Premium Ideal
@@ -572,9 +614,9 @@ diamonds |&gt; count_wide(c(clarity, color), cut)
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Using the datasets from nyclights13, write functions that:</p>
<p>Using the datasets from nycflights13, write a function that:</p>
<ol type="1"><li>
<p>Find all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<p>Finds all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe()</pre>
</div>
@@ -582,7 +624,7 @@ Exercises</h2>
<li>
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; group_by(dest) |&gt; summarise_severe()</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; group_by(dest) |&gt; summarize_severe()</pre>
</div>
</li>
<li>
@@ -592,19 +634,19 @@ Exercises</h2>
</div>
</li>
<li>
<p>Summarizes the weather to compute the minum, mean, and maximum, of a user supplied variable:</p>
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">weather |&gt; summarise_weather(temp)</pre>
<pre data-type="programlisting" data-code-language="r">weather |&gt; summarize_weather(temp)</pre>
</div>
</li>
<li>
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc) into a decimal time (i.e. hours + minutes / 60).</p>
<p>Converts the user-supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc.) into a decimal time (i.e. hours + (minutes / 60)).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; standardise_time(sched_dep_time)</pre>
</div>
</li>
</ol></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li>
<p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell">
@@ -621,21 +663,21 @@ Exercises</h2>
<section id="plot-functions" data-type="sect1">
<h1>
Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that youre making a lot of histograms:</p>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you're making a lot of histograms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
ggplot(aes(carat)) +
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.1)
diamonds |&gt;
ggplot(aes(carat)) +
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.05)</pre>
</div>
<p>Wouldnt it be nice if you could wrap this up into a histogram function? This is easy as once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function so that you need to embrace:</p>
<p>Wouldn't it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function and that you need to embrace:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
df |&gt;
ggplot(aes({{ var }})) +
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth)
}
@@ -644,7 +686,7 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
<p><img src="functions_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
histogram(carat, 0.1) +
@@ -660,10 +702,9 @@ More variables</h2>
<p>It's straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check &lt;- function(df, x, y) {
df |&gt;
ggplot(aes({{ x }}, {{ y }})) +
ggplot(aes(x = {{ x }}, y = {{ y }})) +
geom_point() +
geom_smooth(method = "loess", color = "red", se = FALSE) +
geom_smooth(method = "lm", color = "blue", se = FALSE)
@@ -683,13 +724,14 @@ starwars |&gt;
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot &lt;- function(df, x, y, z, bins = 20, fun = "mean") {
df |&gt;
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
stat_summary_hex(
aes(colour = after_scale(fill)), # make border same colour as fill
aes(color = after_scale(fill)), # make border same color as fill
bins = bins,
fun = fun,
)
}
diamonds |&gt; hex_plot(carat, price, depth)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" width="576"/></p>
@@ -708,17 +750,19 @@ Combining with dplyr</h2>
ggplot(aes(y = {{ var }})) +
geom_bar()
}
diamonds |&gt; sorted_bars(cut)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-50-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Or you could maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<p>We have to use a new operator here, <code>:=</code>, because we are generating the variable name based on user-supplied data. Variable names go on the left-hand side of <code>=</code>, but R's syntax doesn't allow anything to the left of <code>=</code> except for a single literal name. To work around this problem, we use the special operator <code>:=</code>, which tidy evaluation treats in exactly the same way as <code>=</code>.</p>
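<p>The same operator works in any data-masking verb, not only inside plotting helpers. As a minimal sketch, the hypothetical helper <code>double_in_place()</code> below (a name invented for illustration) overwrites whichever column the caller supplies; the embraced name on the left of <code>:=</code> picks the column to assign to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper: overwrite a user-chosen column with twice its value.
# `{{ var }} :=` is needed because `{{ var }} =` is not valid R syntax.
double_in_place &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := {{ var }} * 2)
}

# Only column `b` is changed; `a` is left untouched.
tibble(a = 1:3, b = 4:6) |&gt; double_in_place(b)</pre>
</div>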
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">conditional_bars &lt;- function(df, condition, var) {
df |&gt;
filter({{ condition }}) |&gt;
ggplot(aes({{ var }})) +
ggplot(aes(x = {{ var }})) +
geom_bar()
}
@@ -727,17 +771,16 @@ diamonds |&gt; conditional_bars(cut == "Good", clarity)</pre>
<p><img src="functions_files/figure-html/unnamed-chunk-51-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can also get creative and display data summaries in other way. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts &lt;- function(df, val, group) {
labs &lt;- df |&gt;
group_by({{group}}) |&gt;
summarize(breaks = max({{val}}))
group_by({{ group }}) |&gt;
summarize(breaks = max({{ val }}))
df |&gt;
ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
ggplot(aes(x = date, y = {{ val }}, group = {{ group }}, color = {{ group }})) +
geom_path() +
scale_y_continuous(
breaks = labs$breaks,
@@ -753,6 +796,7 @@ df &lt;- tibble(
dist4 = sort(rnorm(50, 15, 1)),
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)
df &lt;- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)</pre>
@@ -766,26 +810,26 @@ fancy_ts(df, value, dist_name)</pre>
<section id="faceting" data-type="sect2">
<h2>
Faceting</h2>
<p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<p>Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. So you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
foo &lt;- function(x) {
ggplot(mtcars, aes(mpg, disp)) +
ggplot(mtcars, aes(x = mpg, y = disp)) +
geom_point() +
facet_wrap(vars({{ x }}))
}
foo(cyl)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-53-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
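<p>The <code>vars()</code> approach is not limited to <code>facet_wrap()</code>: the <code>rows</code> and <code>cols</code> arguments of <code>facet_grid()</code> accept it too. A minimal sketch, assuming a hypothetical helper named <code>facet_scatter()</code> that takes one variable for each direction:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: scatterplot faceted by one variable in the rows
# and another in the columns. Each facet variable is embraced inside vars().
facet_scatter &lt;- function(df, x, y, rows, cols) {
  df |&gt;
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    facet_grid(rows = vars({{ rows }}), cols = vars({{ cols }}))
}

mtcars |&gt; facet_scatter(mpg, disp, cyl, am)</pre>
</div>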
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution <code>bill_length_mm</code> from palmerpenguins dataset.</p>
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
density &lt;- function(colour, facets, binwidth = 0.1) {
density &lt;- function(color, facets, binwidth = 0.1) {
diamonds |&gt;
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
@@ -812,18 +856,18 @@ Labeling</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
df |&gt;
ggplot(aes({{ var }})) +
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth)
}</pre>
</div>
<p>Wouldnt it be nice if we could label the output with the variable and the bin width that was used? To do so, were going to have to go under the covers of tidy evaluation and use a function from package we havent talked about before: rlang. rlang is a low-level package thats used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically insert the appropriate variable name:</p>
<p>Wouldn't it be nice if we could label the output with the variable and the bin width that was used? To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about yet: rlang. rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth) {
label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
df |&gt;
ggplot(aes({{ var }})) +
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth) +
labs(title = label)
}
@@ -833,17 +877,16 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
<p><img src="functions_files/figure-html/unnamed-chunk-56-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can use the same approach any other place that you might supply a string in a ggplot2 plot.</p>
<p>You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.</p>
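<p>As a minimal sketch of that idea, the variation below also builds the x-axis label with <code>rlang::englue()</code>. The name <code>labelled_histogram()</code> and the wording of both labels are assumptions made for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: a histogram whose title and x-axis label are both
# generated from the embraced variable and the bin width.
labelled_histogram &lt;- function(df, var, binwidth) {
  title &lt;- rlang::englue("A histogram of {{ var }} with binwidth {binwidth}")
  x_label &lt;- rlang::englue("{{ var }} (bins of width {binwidth})")

  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = title, x = x_label)
}

diamonds |&gt; labelled_histogram(carat, 0.1)</pre>
</div>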
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Build up a rich plotting function by incrementally implementing each of the steps below.
<p>Build up a rich plotting function by incrementally implementing each of the steps below:</p>
<ol type="1"><li><p>Draw a scatterplot given a dataset and <code>x</code> and <code>y</code> variables.</p></li>
<li><p>Add a line of best fit (i.e. a linear model with no standard errors).</p></li>
<li><p>Add a title.</p></li>
</ol></li>
</ol></section>
</section>
@@ -866,21 +909,20 @@ collapse_years()</pre>
<p>R also doesn't care about how you use white space in your functions, but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Missing extra two spaces
pull_unique &lt;- function(df, var) {
df |&gt;
distinct({{ var }}) |&gt;
pull({{ var }})
density &lt;- function(color, facets, binwidth = 0.1) {
diamonds |&gt;
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
# Pipe indented incorrectly
pull_unique &lt;- function(df, var) {
df |&gt;
distinct({{ var }}) |&gt;
pull({{ var }})
}
# Missing {} and all one line
pull_unique &lt;- function(df, var) df |&gt; distinct({{ var }}) |&gt; pull({{ var }})</pre>
density &lt;- function(color, facets, binwidth = 0.1) {
diamonds |&gt;
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}</pre>
</div>
<p>As you can see, we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
@@ -893,20 +935,21 @@ Exercises</h2>
<pre data-type="programlisting" data-code-language="r">f1 &lt;- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f3 &lt;- function(x, y) {
rep(y, length.out = length(x))
}</pre>
</div>
</li>
<li><p>Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc. would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frames, or creating a plot. Along the way your saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
<p>In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
<p>We have only shown you the bare minimum to get started with functions and there's much more to learn. A few places to learn more are:</p>
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="https://dplyr.tidyverse.org/articles/programming.html">programming with dplyr</a> and <a href="https://tidyr.tidyverse.org/articles/programming.html">programming with tidyr</a> and learn more about the theory in <a href="https://rlang.r-lib.org/reference/topic-data-mask.html">What is data-masking and why do I need {{?</a>.</li>
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="https://ggplot2-book.org/programming.html" class="uri">Programming with ggplot2</a> chapter of the ggplot2 book.</li>