Fix code language

Hadley Wickham
2022-11-18 11:26:25 -06:00
parent 69b4597f3b
commit 868a35ca71
29 changed files with 912 additions and 907 deletions

@@ -18,7 +18,7 @@ Introduction</h1>
Prerequisites</h2>
<p>We'll wrap up a variety of functions from around the tidyverse. We'll also use nycflights13 as a source of familiar data to use our functions with.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
@@ -29,7 +29,7 @@ library(nycflights13)</pre>
Vector functions</h1>
<p>We'll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
a = rnorm(5),
b = rnorm(5),
c = rnorm(5),
@@ -62,7 +62,7 @@ df |&gt; mutate(
Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it's a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) </pre>
@@ -77,26 +77,26 @@ Writing a function</h2>
<li><p>The <strong>body</strong>. The body is the code that's repeated across all the calls.</p></li>
</ol><p>Then you create a function by following the template:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">name &lt;- function(arguments) {
<pre data-type="programlisting" data-code-language="r">name &lt;- function(arguments) {
body
}</pre>
</div>
<p>For this case that leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}</pre>
</div>
<p>At this point you might test with a few simple inputs to make sure you've captured the logic correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01(c(-10, 0, 10))
<pre data-type="programlisting" data-code-language="r">rescale01(c(-10, 0, 10))
#&gt; [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
#&gt; [1] 0.00 0.25 0.50 NA 1.00</pre>
</div>
<p>Then you can rewrite the call to <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
@@ -119,20 +119,20 @@ rescale01(c(1, 2, 3, NA, 5))
Improving our function</h2>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work: instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once, we could compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}</pre>
</div>
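<p>A minimal sketch to confirm that the refactoring did not change behaviour. The names <code>rescale01_minmax()</code> and <code>rescale01_range()</code> are hypothetical, introduced here only so that both definitions can exist side by side; the input vector is made up:</p>

```r
# Hypothetical names so the two versions can be compared directly.
rescale01_minmax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
rescale01_range <- function(x) {
  rng <- range(x, na.rm = TRUE)  # rng[1] is the minimum, rng[2] the maximum
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1, 2, 3, NA, 5)
# Both versions should agree, including the position of the NA.
all.equal(rescale01_minmax(x), rescale01_range(x))
```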
<p>Or you might try this function on a vector that includes an infinite value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1:10, Inf)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1:10, Inf)
rescale01(x)
#&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
</div>
<p>That result is not particularly useful so we could ask <code><a href="https://rdrr.io/r/base/range.html">range()</a></code> to ignore infinite values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
@@ -149,13 +149,13 @@ Mutate functions</h2>
<p>Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples. We'll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output of the same length as the input.</p>
<p>Let's start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">z_score &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre>
</div>
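<p>A quick check of that claim, as a sketch with made-up input values: after applying <code>z_score()</code>, the non-missing values should have a mean of (approximately) zero and a standard deviation of one.</p>

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)  # made-up values for illustration
z <- z_score(x)

mean(z, na.rm = TRUE)  # zero, up to floating-point error
sd(z, na.rm = TRUE)    # one, up to floating-point error
```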
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">clamp &lt;- function(x, min, max) {
<pre data-type="programlisting" data-code-language="r">clamp &lt;- function(x, min, max) {
case_when(
x &lt; min ~ min,
x &gt; max ~ max,
@@ -167,7 +167,7 @@ clamp(1:10, min = 3, max = 7)
</div>
<p>Or maybe you'd rather mark those values as <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">na_outside &lt;- function(x, min, max) {
<pre data-type="programlisting" data-code-language="r">na_outside &lt;- function(x, min, max) {
case_when(
x &lt; min ~ NA,
x &gt; max ~ NA,
@@ -179,7 +179,7 @@ na_outside(1:10, min = 3, max = 7)
</div>
<p>Of course functions don't just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">first_upper &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">first_upper &lt;- function(x) {
str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))
x
}
@@ -188,7 +188,7 @@ first_upper("hello")
</div>
<p>Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/NVlabormarket/status/1571939851922198530
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number &lt;- function(x) {
is_pct &lt;- str_detect(x, "%")
num &lt;- x |&gt;
@@ -205,13 +205,13 @@ clean_number("45%")
</div>
<p>Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">fix_na &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">fix_na &lt;- function(x) {
if_else(x %in% c(997, 998, 999), NA, x)
}</pre>
</div>
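<p>For comparison, here is a sketch of the same idea in base R, which runs without dplyr. The name <code>fix_na_base()</code> and the input values are hypothetical, used only for illustration:</p>

```r
# Hypothetical base-R variant of fix_na(): replaces the sentinel codes
# 997, 998, and 999 with NA using logical subsetting.
fix_na_base <- function(x) {
  x[x %in% c(997, 998, 999)] <- NA
  x
}

fix_na_base(c(25, 997, 40, 999, 31))
#> [1] 25 NA 40 NA 31
```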
<p>We've focused on examples that take a single vector because we think they're the most common. But there's no reason that your function can't take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
# convert to radians
long1 &lt;- long1 * pi / 180
@@ -234,7 +234,7 @@ haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
Summary functions</h2>
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">commas &lt;- function(x) {
<pre data-type="programlisting" data-code-language="r">commas &lt;- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
@@ -242,7 +242,7 @@ commas(c("cat", "dog", "pigeon"))
</div>
<p>Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cv &lt;- function(x, na.rm = FALSE) {
<pre data-type="programlisting" data-code-language="r">cv &lt;- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
@@ -252,14 +252,14 @@ cv(runif(100, min = 0, max = 500))
</div>
<p>Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/gbganalyst/status/1571619641390252033
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/gbganalyst/status/1571619641390252033
n_missing &lt;- function(x) {
sum(is.na(x))
} </pre>
</div>
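<p>A short usage sketch with made-up inputs, showing that <code>n_missing()</code> returns a single number regardless of the vector's type:</p>

```r
n_missing <- function(x) {
  sum(is.na(x))  # is.na() gives TRUE/FALSE; sum() counts the TRUEs
}

n_missing(c(1, NA, 3, NA))
#> [1] 2
n_missing(c("a", "b"))
#> [1] 0
```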
<p>You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/neilgcurrie/status/1571607727255834625
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/neilgcurrie/status/1571607727255834625
mape &lt;- function(actual, predicted) {
sum(abs((actual - predicted) / actual)) / length(actual)
}</pre>
@@ -278,7 +278,7 @@ Exercises</h2>
<ol type="1"><li>
<p>Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mean(is.na(x))
<pre data-type="programlisting" data-code-language="r">mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
@@ -302,7 +302,7 @@ round(z / sum(z, na.rm = TRUE) * 100, 1)</pre>
<li>
<p>Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">is_directory &lt;- function(x) file.info(x)$isdir
<pre data-type="programlisting" data-code-language="r">is_directory &lt;- function(x) file.info(x)$isdir
is_readable &lt;- function(x) file.access(x, 4) == 0</pre>
</div>
</li>
@@ -320,7 +320,7 @@ Data frame functions</h1>
Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let's illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) {
<pre data-type="programlisting" data-code-language="r">pull_unique &lt;- function(df, var) {
df |&gt;
distinct(var) |&gt;
pull(var)
@@ -328,14 +328,14 @@ Indirection and tidy evaluation</h2>
</div>
<p>If we try and use it, we get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; pull_unique(clarity)
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; pull_unique(clarity)
#&gt; Error in `distinct()` at dplyr/R/pull.R:38:2:
#&gt; ! Must use existing variables.
#&gt; ✖ `var` not found in `.data`.</pre>
</div>
<p>To make the problem a bit clearer, we can use a made-up data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(var = "var", x = "x", y = "y")
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(var = "var", x = "x", y = "y")
df |&gt; pull_unique(x)
#&gt; [1] "var"
df |&gt; pull_unique(y)
@@ -346,7 +346,7 @@ df |&gt; pull_unique(y)
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what's happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) {
<pre data-type="programlisting" data-code-language="r">pull_unique &lt;- function(df, var) {
df |&gt;
distinct({{ var }}) |&gt;
pull({{ var }})
@@ -373,7 +373,7 @@ When to embrace?</h2>
Common use cases</h2>
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summary6 &lt;- function(data, var) {
<pre data-type="programlisting" data-code-language="r">summary6 &lt;- function(data, var) {
data |&gt; summarise(
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
@@ -393,7 +393,7 @@ diamonds |&gt; summary6(carat)
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think it's good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is that, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, you can use it on grouped data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summary6(carat)
#&gt; # A tibble: 5 × 7
@@ -407,7 +407,7 @@ diamonds |&gt; summary6(carat)
</div>
<p>Because the arguments to summarize are data-masking that also means that the <code>var</code> argument to <code>summary6()</code> is data-masking. That means you can also summarize computed variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summary6(log10(carat))
#&gt; # A tibble: 5 × 7
@@ -422,7 +422,7 @@ diamonds |&gt; summary6(carat)
<p>To summarize multiple variables you'll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/Diabb6/status/1571635146658402309
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) {
df |&gt;
count({{ var }}, sort = sort) |&gt;
@@ -443,7 +443,7 @@ diamonds |&gt; count_prop(clarity)
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>, and only <code>var</code> needs to be embraced because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> which uses data-masking for all variables in <code>...</code>.</p>
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unique_where &lt;- function(df, condition, var) {
<pre data-type="programlisting" data-code-language="r">unique_where &lt;- function(df, condition, var) {
df |&gt;
filter({{ condition }}) |&gt;
distinct({{ var }}) |&gt;
@@ -468,7 +468,7 @@ flights |&gt; unique_where(tailnum == "N14228", month)
<p>Here we embrace <code>condition</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
<p>We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_sub &lt;- function(rows, cols) {
<pre data-type="programlisting" data-code-language="r">flights_sub &lt;- function(rows, cols) {
flights |&gt;
filter({{ rows }}) |&gt;
select(time_hour, carrier, flight, {{ cols }})
@@ -494,7 +494,7 @@ flights_sub(dest == "IAH", contains("time"))
Data-masking vs tidy-selection</h2>
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) {
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt;
group_by({{ group_vars }}) |&gt;
summarise(n_miss = sum(is.na({{ x_var }})))
@@ -508,7 +508,7 @@ flights |&gt;
</div>
<p>This doesn't work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code>, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) {
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt;
group_by(pick({{ group_vars }})) |&gt;
summarise(n_miss = sum(is.na({{ x_var }})))
@@ -531,7 +531,7 @@ flights |&gt;
</div>
<p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in the <code>rows</code> and <code>columns</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/pollicipes/status/1571606508944719876
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/pollicipes/status/1571606508944719876
count_wide &lt;- function(data, rows, cols) {
data |&gt;
count(pick(c({{ rows }}, {{ cols }}))) |&gt;
@@ -576,31 +576,31 @@ Exercises</h2>
<ol type="1"><li>
<p>Find all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; filter_severe()</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe()</pre>
</div>
</li>
<li>
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; group_by(dest) |&gt; summarise_severe()</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; group_by(dest) |&gt; summarise_severe()</pre>
</div>
</li>
<li>
<p>Finds all flights that were cancelled or delayed by more than a user supplied number of hours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; filter_severe(hours = 2)</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe(hours = 2)</pre>
</div>
</li>
<li>
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user supplied variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">weather |&gt; summarise_weather(temp)</pre>
<pre data-type="programlisting" data-code-language="r">weather |&gt; summarise_weather(temp)</pre>
</div>
</li>
<li>
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc) into a decimal time (i.e. hours + minutes / 60).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; standardise_time(sched_dep_time)</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; standardise_time(sched_dep_time)</pre>
</div>
</li>
</ol></li>
@@ -608,7 +608,7 @@ Exercises</h2>
<li>
<p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_prop &lt;- function(df, var, sort = FALSE) {
<pre data-type="programlisting" data-code-language="r">count_prop &lt;- function(df, var, sort = FALSE) {
df |&gt;
count({{ var }}, sort = sort) |&gt;
mutate(prop = n / sum(n))
@@ -623,7 +623,7 @@ Exercises</h2>
Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you're making a lot of histograms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
ggplot(aes(carat)) +
geom_histogram(binwidth = 0.1)
@@ -633,7 +633,7 @@ diamonds |&gt;
</div>
<p>Wouldn't it be nice if you could wrap this up into a histogram function? This is easy once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function, so you need to embrace:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) {
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
df |&gt;
ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth)
@@ -646,7 +646,7 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
</div>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
histogram(carat, 0.1) +
labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
<div class="cell-output-display">
@@ -659,7 +659,7 @@ diamonds |&gt; histogram(carat, 0.1)</pre>
More variables</h2>
<p>It's straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/tyler_js_smith/status/1574377116988104704
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check &lt;- function(df, x, y) {
df |&gt;
@@ -680,7 +680,7 @@ starwars |&gt;
</div>
<p>Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/ppaxisa/status/1574398423175921665
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot &lt;- function(df, x, y, z, bins = 20, fun = "mean") {
df |&gt;
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
@@ -702,7 +702,7 @@ diamonds |&gt; hex_plot(carat, price, depth)</pre>
Combining with dplyr</h2>
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to do a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sorted_bars &lt;- function(df, var) {
<pre data-type="programlisting" data-code-language="r">sorted_bars &lt;- function(df, var) {
df |&gt;
mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |&gt;
ggplot(aes(y = {{ var }})) +
@@ -715,7 +715,7 @@ diamonds |&gt; sorted_bars(cut)</pre>
</div>
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">conditional_bars &lt;- function(df, condition, var) {
<pre data-type="programlisting" data-code-language="r">conditional_bars &lt;- function(df, condition, var) {
df |&gt;
filter({{ condition }}) |&gt;
ggplot(aes({{ var }})) +
@@ -729,7 +729,7 @@ diamonds |&gt; conditional_bars(cut == "Good", clarity)</pre>
</div>
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts &lt;- function(df, val, group) {
labs &lt;- df |&gt;
@@ -768,7 +768,7 @@ fancy_ts(df, value, dist_name)</pre>
Faceting</h2>
<p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/sharoz/status/1574376332821204999
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
foo &lt;- function(x) {
ggplot(mtcars, aes(mpg, disp)) +
@@ -782,7 +782,7 @@ foo(cyl)</pre>
</div>
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/yutannihilat_en/status/1574387230025875457
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
density &lt;- function(colour, facets, binwidth = 0.1) {
diamonds |&gt;
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
@@ -810,7 +810,7 @@ density(cut, clarity)</pre>
Labeling</h2>
<p>Remember the histogram function we showed you earlier?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) {
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
df |&gt;
ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth)
@@ -819,7 +819,7 @@ Labeling</h2>
<p>Wouldn't it be nice if we could label the output with the variable and the bin width that was used? To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about before: rlang. rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth) {
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth) {
label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
df |&gt;
@@ -853,7 +853,7 @@ Style</h1>
<p>R doesn't care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That's hard! But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.</p>
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="https://rdrr.io/r/stats/coef.html">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don't be afraid to rename a function if you figure out a better name later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Too short
<pre data-type="programlisting" data-code-language="r"># Too short
f()
# Not a verb, or descriptive
@@ -865,7 +865,7 @@ collapse_years()</pre>
</div>
<p>R also doesn't care about how you use white space in your functions, but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># missing extra two spaces
<pre data-type="programlisting" data-code-language="r"># missing extra two spaces
pull_unique &lt;- function(df, var) {
df |&gt;
distinct({{ var }}) |&gt;
@@ -890,7 +890,7 @@ Exercises</h2>
<ol type="1"><li>
<p>Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">f1 &lt;- function(string, prefix) {
<pre data-type="programlisting" data-code-language="r">f1 &lt;- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f3 &lt;- function(x, y) {