r4ds/functions.qmd

918 lines
31 KiB
Plaintext
Raw Normal View History

# Functions {#sec-functions}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
2022-09-19 05:18:45 +08:00
status("drafting")
```
2015-10-21 21:04:37 +08:00
## Introduction
2016-07-19 21:01:50 +08:00
One of the best ways to improve your reach as a data scientist is to write functions.
2022-09-22 05:22:34 +08:00
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
Writing a function has three big advantages over using copy-and-paste:
2016-03-03 22:25:43 +08:00
1. You can give a function an evocative name that makes your code easier to understand.
2016-03-03 22:25:43 +08:00
2. As requirements change, you only need to update code in one place, instead of many.
2016-08-10 04:49:26 +08:00
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
2016-08-10 04:49:26 +08:00
Writing good functions is a lifetime journey.
2022-08-10 00:43:12 +08:00
Even after using R for many years we still learn new techniques and better ways of approaching old problems.
The goal of this chapter is to get you started on your journey with functions with three useful types of functions:
2016-03-03 22:25:43 +08:00
2022-09-20 06:46:17 +08:00
- Vector functions take one or more vectors as input and return a vector as output.
- Data frame functions take a data frame as input and return a data frame as output.
- Plot functions that take a data frame as input and return a plot as output.
2022-09-19 05:18:45 +08:00
2022-09-22 05:22:34 +08:00
The chapter concludes with some advice on function style.
2016-02-11 21:58:53 +08:00
2022-09-29 22:20:45 +08:00
Many of the examples in this chapter were inspired by real data analysis code supplied by folks on twitter.
2022-09-29 23:50:47 +08:00
We've often simplified the code from the original so you might want to look at the original tweets which we list in the comments.
If you want just to see a huge variety of functions, check out the motivating tweets: https://twitter.com/hadleywickham/status/1574373127349575680, https://twitter.com/hadleywickham/status/1571603361350164486 A big thanks to everyone who contributed!
WI won't fully explain all of the functions that we use here, so you might need to do some reading of the documentation.
2022-09-29 22:20:45 +08:00
2016-07-19 21:01:50 +08:00
### Prerequisites
2022-09-21 06:03:26 +08:00
We'll wrap up a variety of functions from around the tidyverse.
2022-09-29 22:20:45 +08:00
We'll also use nycflights13 as a source of relatively familiar data to apply our functions to.
2022-09-21 06:03:26 +08:00
2022-09-09 00:32:10 +08:00
```{r}
2022-09-21 06:03:26 +08:00
#| message: false
2022-09-09 00:32:10 +08:00
library(tidyverse)
2022-09-29 22:20:45 +08:00
library(nycflights13)
```
This chapter also relies on a function that hasn't yet been implemented for dplyr but will be by the time the book is out:
```{r}
pick <- function(cols) {
across({{ cols }})
}
2022-09-09 00:32:10 +08:00
```
2016-07-19 21:01:50 +08:00
2022-09-19 05:18:45 +08:00
## Vector functions
2016-03-01 22:29:58 +08:00
2022-09-20 06:46:17 +08:00
We'll begin with vector functions: functions that take one or more vectors and return a vector result.
For example, take a look at this code.
What does it do?
2015-10-21 21:04:37 +08:00
```{r}
2022-09-22 05:22:34 +08:00
df <- tibble(
a = rnorm(5),
b = rnorm(5),
c = rnorm(5),
d = rnorm(5),
2015-10-21 21:04:37 +08:00
)
2022-09-09 00:32:10 +08:00
df |> mutate(
a = (a - min(a, na.rm = TRUE)) /
(max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(b, na.rm = TRUE)) /
(max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) /
(max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
d = (d - min(d, na.rm = TRUE)) /
2022-09-22 05:22:34 +08:00
(max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
2022-09-09 00:32:10 +08:00
)
2015-10-19 21:41:33 +08:00
```
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
2022-09-20 06:46:17 +08:00
When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an `a` to a `b`.
Preventing this type of mistake of is one very good reason to learn how to write functions.
2015-10-19 21:41:33 +08:00
2022-09-22 05:22:34 +08:00
### Writing a function
To write a function you need to first analyse your repeated to figure what parts of the repeated code is constant and what parts vary.
If we take the code above and pull it outside of `mutate()` it's a little easier to see the pattern because each repetition is now one line:
2015-10-21 21:04:37 +08:00
```{r}
#| eval: false
2022-09-20 06:46:17 +08:00
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
2015-10-21 21:04:37 +08:00
```
2022-09-29 23:50:47 +08:00
To make this a bit clearer we can replace the bit that varies with `█`:
2022-09-22 05:22:34 +08:00
```{r}
#| eval: false
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```
2022-09-29 23:50:47 +08:00
There's only one thing that varies which implies we're going to need a function with one argument.
2022-09-22 05:22:34 +08:00
To turn this into an actual function you need three things:
2015-10-21 21:04:37 +08:00
2022-09-29 22:20:45 +08:00
1. A **name**.
Here we might use `rescale01` because this function rescales a vector to lie between 0 and 1.
2022-09-22 05:22:34 +08:00
2. The **arguments**.
The arguments are things that vary across calls.
Here we have just one argument which we're going to call `x` because this is a conventional name for a numeric vector.
2022-09-22 05:22:34 +08:00
3. The **body**.
The body is the code that is the in all the calls.
2022-09-22 05:22:34 +08:00
Then you create a function by following the template:
```{r}
name <- function(arguments) {
body
}
```
For this case that leads to:
2015-10-21 21:04:37 +08:00
2022-09-20 06:46:17 +08:00
```{r}
rescale01 <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```
2022-09-22 05:22:34 +08:00
At this point you might test with a few simple inputs to make sure you've captured the logic correctly:
2016-03-07 22:32:47 +08:00
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
2022-09-22 05:22:34 +08:00
Then you can rewrite the call to `mutate()` as:
2015-10-21 21:04:37 +08:00
```{r}
2022-09-09 00:32:10 +08:00
df |> mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
2022-09-22 05:22:34 +08:00
d = rescale01(d),
2022-09-09 00:32:10 +08:00
)
2015-10-21 21:04:37 +08:00
```
2022-09-22 05:22:34 +08:00
(In @sec-iteration, you'll learn how to use `across()` to reduce the duplication even further so all you need is `df |> mutate(across(a:d, rescale))`).
2022-09-20 06:46:17 +08:00
2022-09-22 05:22:34 +08:00
### Improving our function
You might notice `rescale()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:
2022-09-09 00:32:10 +08:00
```{r}
2022-09-20 06:46:17 +08:00
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
2022-09-09 00:32:10 +08:00
```
2016-01-25 22:59:36 +08:00
2022-09-22 05:22:34 +08:00
Or you might try this function on a vector that includes an infinite value:
2016-03-12 03:11:41 +08:00
```{r}
x <- c(1:10, Inf)
rescale01(x)
```
2022-09-22 05:22:34 +08:00
That result is not particularly useful so we could ask `range()` to ignore infinite values:
2016-03-12 03:11:41 +08:00
```{r}
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```
2022-09-22 05:22:34 +08:00
These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.
2016-03-12 03:11:41 +08:00
2022-09-20 06:46:17 +08:00
### Mutate functions
2022-09-22 05:22:34 +08:00
Let's look at a few more vector functions before you get some practice writing your own.
We'll start by looking at a few useful functions that work well in functions like `mutate()` and `filter()` because they return an output the same length as the input.
2022-09-29 22:20:45 +08:00
The goal of these sections is to expose you to a bunch of different functions to get your creative juices flowing, and to give you plenty of examples to generalize the structure and utility of functions from.
2022-09-20 06:46:17 +08:00
2022-09-22 05:22:34 +08:00
For example, maybe instead of rescaling to min 0, max 1, you want to rescale to mean zero, standard deviation one:
2022-09-20 06:46:17 +08:00
2022-09-19 05:18:45 +08:00
```{r}
rescale_z <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
2022-09-20 06:46:17 +08:00
```
Sometimes your functions are highly specialised for one data analysis.
For example, you might have a bunch of variables that record missing values as 997, 998, or 999:
2022-09-19 05:18:45 +08:00
2022-09-20 06:46:17 +08:00
```{r}
2022-09-19 05:18:45 +08:00
fix_na <- function(x) {
2022-09-20 06:46:17 +08:00
if_else(x %in% c(997, 998, 999), NA, x)
2022-09-19 05:18:45 +08:00
}
2022-09-20 06:46:17 +08:00
```
2022-09-19 05:18:45 +08:00
2022-09-22 05:22:34 +08:00
Other cases, you might be wrapping up a simple a `case_when()` to give it a standard name.
For example, the `clamp()` function ensures all values of a vector lie in between a minimum or a maximum:
2022-09-20 06:46:17 +08:00
```{r}
clamp <- function(x, min, max) {
2022-09-19 05:18:45 +08:00
case_when(
x < min ~ min,
x > max ~ max,
.default = x
)
}
2022-09-22 05:22:34 +08:00
clamp(1:10, min = 3, max = 7)
2022-09-20 06:46:17 +08:00
```
2022-09-19 05:18:45 +08:00
2022-09-22 05:22:34 +08:00
Or maybe you'd rather mark those values as `NA`s:
```{r}
discard_outside <- function(x, min, max) {
case_when(
x < min ~ NA,
x > max ~ NA,
.default = x
)
}
discard_outside(1:10, min = 3, max = 7)
```
Of course functions don't just need to work with numeric variables.
You might want to extract out some repeated string manipulation.
Maybe you need to make the first character of each vector upper case:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-19 05:18:45 +08:00
first_upper <- function(x) {
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
}
2022-09-22 05:22:34 +08:00
first_upper("hello")
2022-09-20 06:46:17 +08:00
```
2022-09-29 22:20:45 +08:00
Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-29 22:20:45 +08:00
# https://twitter.com/NVlabormarket/status/1571939851922198530
2022-09-20 06:46:17 +08:00
clean_number <- function(x) {
is_pct <- str_detect(x, "%")
2022-09-22 05:22:34 +08:00
num <- x |>
2022-09-20 06:46:17 +08:00
str_remove_all("%") |>
2022-09-22 05:22:34 +08:00
str_remove_all(",") |>
str_remove_all(fixed("$")) |>
as.numeric(x)
2022-09-20 06:46:17 +08:00
if_else(is_pct, num / 100, num)
}
2022-09-22 05:22:34 +08:00
clean_number("$12,300")
clean_number("45%")
2022-09-19 05:18:45 +08:00
```
2022-09-29 22:20:45 +08:00
There's no reason that your function can't take multiple vector inputs.
For example, you might want to compute the distance between two locations on the globe using the haversine formula:
```{r}
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine <- function(long1, lat1, long2, lat2, round = 3) {
# convert to radians
long1 <- long1 * pi / 180
lat1 <- lat1 * pi / 180
long2 <- long2 * pi / 180
lat2 <- lat2 * pi / 180
R <- 6371 # Earth mean radius in km
a <- sin((lat2 - lat1) / 2)^2 +
cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
d <- R * 2 * asin(sqrt(a))
round(d, round)
}
```
2022-09-19 05:18:45 +08:00
### Summary functions
2022-09-20 06:46:17 +08:00
In other cases you want a function that returns a single value for use in `summary()`.
Sometimes this can just be a matter of setting a default argument:
```{r}
commas <- function(x) {
str_flatten(x, collapse = ", ")
}
2022-09-22 05:22:34 +08:00
commas(c("cat", "dog", "pigeon"))
2022-09-20 06:46:17 +08:00
```
2022-09-29 22:20:45 +08:00
Or performing some very simple computation, like computing the coefficient of variation, which standardizes the standard deviation by dividing it by the mean:
2022-09-20 06:46:17 +08:00
2022-09-19 05:18:45 +08:00
```{r}
cv <- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
2022-09-22 05:22:34 +08:00
cv(runif(100, min = 0, max = 50))
cv(runif(100, min = 0, max = 500))
2022-09-20 06:46:17 +08:00
```
2022-09-19 05:18:45 +08:00
2022-09-22 05:22:34 +08:00
Or maybe you just want to make a common pattern easier to remember by given it a memorable name:
2022-09-19 05:18:45 +08:00
2022-09-20 06:46:17 +08:00
```{r}
# https://twitter.com/gbganalyst/status/1571619641390252033
n_missing <- function(x) {
sum(is.na(x))
}
2022-09-19 05:18:45 +08:00
```
2022-09-22 05:22:34 +08:00
You can also write functions with multiple vector inputs.
For example, maybe you want to compute the mean absolute prediction error to help you comparing model predictions with actual values:
2016-01-25 22:59:36 +08:00
2022-09-22 05:22:34 +08:00
```{r}
# https://twitter.com/neilgcurrie/status/1571607727255834625
mape <- function(actual, predicted) {
sum(abs((actual - predicted) / actual)) / length(actual)
}
```
2022-09-22 05:22:34 +08:00
### Exercises
2016-08-10 04:49:26 +08:00
2022-09-22 05:22:34 +08:00
1. Practice turning the following code snippets into functions.
Think about what each function does.
What would you call it?
How many arguments does it need?
2016-02-13 06:05:25 +08:00
```{r}
#| eval: false
2016-02-13 06:05:25 +08:00
mean(is.na(x))
2022-09-22 05:22:34 +08:00
mean(is.na(y))
mean(is.na(z))
2016-02-13 06:05:25 +08:00
x / sum(x, na.rm = TRUE)
2022-09-22 05:22:34 +08:00
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
2022-09-22 05:22:34 +08:00
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
```
2022-09-22 05:22:34 +08:00
2. In the second variant of `rescale01()`, infinite values are left unchanged.
Can you rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1?
3. Given a vector of birthdates, write a function to compute the age in years.
4. Write your own functions to compute the variance and skewness of a numeric vector.
Variance is defined as $$
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
$$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
Skewness is defined as $$
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
$$
2022-09-29 22:20:45 +08:00
5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
2022-09-22 05:22:34 +08:00
6. Read the documentation to figure out what the following functions do.
Why are they useful even though they are so short?
```{r}
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
2016-02-13 06:05:25 +08:00
```
2022-09-20 06:46:17 +08:00
## Data frame functions
2022-09-22 05:22:34 +08:00
Vector functions are useful for pulling out code that's repeated within dplyr verbs.
In this section, you'll learn how to write "data frame" functions which pull out code that's repeated across multiple pipelines.
2022-09-29 22:20:45 +08:00
These functions work in the same way as dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and usually return a data frame.
2022-09-20 06:46:17 +08:00
2022-09-22 05:22:34 +08:00
### Indirection and tidy evaluation
2022-09-20 06:46:17 +08:00
2022-09-29 22:20:45 +08:00
When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
2022-09-21 06:03:26 +08:00
Let's illustrate the problem with a very simple function: `pull_unique()`.
The goal of this function is to `pull()` the unique (distinct) values of a variable:
2022-09-19 05:18:45 +08:00
```{r}
2022-09-21 06:03:26 +08:00
pull_unique <- function(df, var) {
df |>
distinct(var) |>
pull(var)
2022-09-19 05:18:45 +08:00
}
```
2022-09-21 06:03:26 +08:00
If we try and use it, we get an error:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-21 06:03:26 +08:00
#| error: true
diamonds |> pull_unique(clarity)
2022-09-20 06:46:17 +08:00
```
2022-09-21 06:03:26 +08:00
To make the problem a bit more clear we can use a made up data frame:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-21 06:03:26 +08:00
df <- tibble(var = "var", x = "x", y = "y")
df |> pull_unique(x)
df |> pull_unique(y)
2022-09-20 06:46:17 +08:00
```
2022-09-22 05:22:34 +08:00
Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
2022-09-21 06:03:26 +08:00
This is a problem of indirection, and it arises because dplyr allows you to refer to the names of variables inside your data frame without any special treatment, so called **tidy evaluation**.
2022-09-29 22:20:45 +08:00
Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
2022-09-22 05:22:34 +08:00
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead look inside `var` for the variable we actually want to use.
2022-09-21 06:03:26 +08:00
2022-09-22 05:22:34 +08:00
Tidy evaluation includes a solution to this problem called **embracing**.
2022-09-29 22:20:45 +08:00
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the a literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a function look inside of `var` rather than looking for a variable called `var`.
2022-09-21 06:03:26 +08:00
2022-09-29 22:20:45 +08:00
So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-21 06:03:26 +08:00
pull_unique <- function(df, var) {
2022-09-20 06:46:17 +08:00
df |>
2022-09-21 06:03:26 +08:00
distinct({{ var }}) |>
pull({{ var }})
2022-09-20 06:46:17 +08:00
}
2022-09-21 06:03:26 +08:00
diamonds |> pull_unique(clarity)
2022-09-20 06:46:17 +08:00
```
2022-09-21 06:03:26 +08:00
### When to embrace?
2022-09-19 05:18:45 +08:00
2022-09-29 22:20:45 +08:00
So the art of writing data frame functions is basically just figuring out which arguments need to be embraced.
2022-09-22 05:22:34 +08:00
Fortunately this is easy because you can look it up from the documentation 😄.
2022-09-21 06:03:26 +08:00
There are two terms to look for in the docs:
2022-09-19 05:18:45 +08:00
2022-09-21 06:03:26 +08:00
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` which do computation with variables.
2022-09-19 05:18:45 +08:00
2022-09-22 05:22:34 +08:00
- **Tidy-selection**: this is used for for functions like `select()`, `relocate()`, and `rename()` that select groups of variables.
When you start looking closely at the documentation, you'll notice that many dplyr functions use `…`.
This is a special shorthand syntax that matches any that aren't otherwise explicitly matched.
For example, `arrange()` uses data-masking for `…` and `select()` uses tidy-select for `…`.
2022-09-19 05:18:45 +08:00
2022-09-29 22:20:45 +08:00
Your intuition for many common functions should be pretty good --- think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
There are a few cases where it's harder to tell because you usually use them with single variable, which uses the same syntax for both data-masking or tidy-select.
For example, the arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
If you're ever confused, just look at the docs.
2022-09-22 05:22:34 +08:00
2022-09-21 06:03:26 +08:00
In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-select arguments
2022-09-20 06:46:17 +08:00
2022-09-29 22:20:45 +08:00
### Summary basics
2022-09-20 06:46:17 +08:00
2022-09-21 06:03:26 +08:00
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
2022-09-20 06:46:17 +08:00
```{r}
2022-09-21 06:03:26 +08:00
summary6 <- function(data, var) {
2022-09-22 22:01:02 +08:00
data |> summarise(
2022-09-21 06:03:26 +08:00
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
max = max({{ var }}, na.rm = TRUE),
n = n(),
2022-09-22 23:43:21 +08:00
n_miss = sum(is.na({{ var }})),
.groups = "drop"
2022-09-20 06:46:17 +08:00
)
}
2022-09-21 06:03:26 +08:00
diamonds |> summary6(carat)
```
2022-09-29 23:50:47 +08:00
(Whenever you wrap `summarise()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
2022-09-22 23:43:21 +08:00
2022-09-29 22:20:45 +08:00
The nice thing about this function is because it wraps `summarise()` you can used it on grouped data:
2022-09-21 06:03:26 +08:00
```{r}
diamonds |>
group_by(cut) |>
summary6(carat)
2022-09-20 06:46:17 +08:00
```
2022-09-21 06:03:26 +08:00
Because the arguments to summarize are data-masking that also means that the `var` argument to `summary6()` is data-masking.
That means you can also summarize computed variables:
```{r}
diamonds |>
group_by(cut) |>
summary6(log10(carat))
```
2022-09-29 22:20:45 +08:00
To summarize multiple variables you'll need wait until @sec-across, where you'll learn how to use `across()`.
### Count variations
2022-09-21 06:03:26 +08:00
2022-09-29 22:20:45 +08:00
Another popular helper function is a version of `count()` that also computes proportions:
2022-09-20 06:46:17 +08:00
```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
2022-09-21 06:03:26 +08:00
count_prop <- function(df, var, sort = FALSE) {
2022-09-20 06:46:17 +08:00
df |>
2022-09-21 06:03:26 +08:00
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
2022-09-20 06:46:17 +08:00
}
2022-09-21 06:03:26 +08:00
diamonds |> count_prop(clarity)
2022-09-20 06:46:17 +08:00
```
2022-09-29 22:20:45 +08:00
This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced.
2022-09-22 05:22:34 +08:00
`var` is passed to `count()` which uses data-masking for all variables in `…`.
2022-09-21 06:03:26 +08:00
2022-09-22 05:22:34 +08:00
Sometimes you want to select variables inside a function that uses data-masking.
2022-09-29 22:20:45 +08:00
For example, imagine you want to write `count_missing()` that counts the number of missing observations in rows.
2022-09-22 05:22:34 +08:00
You might try writing something like:
```{r}
#| error: true
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
summarise(n_miss = sum(is.na({{ x_var }})))
}
2022-09-29 22:20:45 +08:00
flights |>
2022-09-22 05:22:34 +08:00
count_missing(c(year, month, day), dep_time)
```
This doesn't work because `group_by()` uses data-masking not tidy-select.
2022-09-29 22:20:45 +08:00
We can work around that problem by using `pick()` which allows you to use use tidy-select inside data-masking functions:
2022-09-22 05:22:34 +08:00
```{r}
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
summarise(n_miss = sum(is.na({{ x_var }})))
}
2022-09-29 22:20:45 +08:00
flights |>
2022-09-22 05:22:34 +08:00
count_missing(c(year, month, day), dep_time)
```
2022-09-29 22:20:45 +08:00
Another useful helper that uses `pick()` is to make a 2d table of counts.
Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange:
2022-09-21 06:03:26 +08:00
```{r}
2022-09-29 22:20:45 +08:00
# https://twitter.com/pollicipes/status/1571606508944719876
2022-09-20 06:46:17 +08:00
count_wide <- function(data, rows, cols) {
data |>
2022-09-22 05:22:34 +08:00
count(pick(c({{ rows }}, {{ cols }}))) |>
pivot_wider(
names_from = {{ cols }},
values_from = n,
names_sort = TRUE,
values_fill = 0
)
2022-09-20 06:46:17 +08:00
}
2022-09-29 22:20:45 +08:00
diamonds |> count_wide(clarity, cut)
diamonds |> count_wide(c(clarity, color), cut)
```
We didn't discuss `pivot_wider()` above, but you can read the docs to discover that `names_from` uses the tidy-select style of tidy evaluation.
### Selecting rows and columns
Or maybe you want to find the sorted unique values of a variable for a subset of the data.
2022-09-29 23:50:47 +08:00
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply an condition:
2022-09-29 22:20:45 +08:00
```{r}
unique_where <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
distinct({{ var }}) |>
arrange({{ var }}) |>
pull({{ var }})
}
# Find all the destinations in December
flights |> unique_where(month == 12, dest)
# Which months did plane N14228 fly in?
flights |> unique_where(tailnum == "N14228", month)
```
Here we embrace `condition` because it's passed to `filter()` and `var` because its passed to `distinct()`, `arrange()`, and `pull()`.
2022-09-29 23:50:47 +08:00
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data frame, it can make sense to hard code it.
2022-09-29 22:20:45 +08:00
For example, this function always works with the flights dataset, make it easy to grab the subset that you want to work with.
It always includes `time_hour`, `carrier`, and `flight` since these are the primary key that allows you to identify a row.
```{r}
flights_sub <- function(rows, cols) {
flights |>
filter({{ rows }}) |>
select(time_hour, carrier, flight, {{ cols }})
}
flights_sub(dest == "IAH", contains("time"))
2022-09-20 06:46:17 +08:00
```
### Learning more
2022-09-29 22:20:45 +08:00
This section has introduced you to some of the power and flexibility of tidy evaluation with dplyr (and a dash of tidyr).
We've only used the smallest part of tidy evaluation, embracing, and it already gives you considerable power to reduce duplication in your data analyses.
You can learn more advanced techniques in `vignette("programming", package = "dplyr")`.
## Plot functions
2022-09-29 22:20:45 +08:00
Instead of returning a data frame, you might want to return a plot.
Fortunately you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
For example, imagine that you're making a lot of histograms:
```{r}
#| fig-show: hide
diamonds |>
ggplot(aes(carat)) +
geom_histogram(binwidth = 0.1)
diamonds |>
ggplot(aes(carat)) +
geom_histogram(binwidth = 0.05)
```
Wouldn't it be nice if you could wrap this up into a histogram function?
This is easy as once you know that `aes()` is a data-masking function so that you need to embrace:
```{r}
histogram <- function(df, var, binwidth = NULL) {
df |>
ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth)
}
diamonds |> histogram(carat, 0.1)
```
Note that `histogram()` returns a ggplot2 plot, so that you can still add on additional components if you want.
Just remember to switch from `|>` to `+`:
```{r}
diamonds |>
histogram(carat, 0.1) +
labs(x = "Size (in carats)", y = "Number of diamonds")
```
2022-09-29 22:20:45 +08:00
### More variables
It's straightforward to add more variables to the mix.
For example, maybe you want an easy way to eye ball whether or not a data set is linear by overlaying a smooth line and a straight line:
```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
2022-09-29 22:20:45 +08:00
linearity_check <- function(df, x, y) {
df |>
ggplot(aes({{ x }}, {{ y }})) +
geom_point() +
geom_smooth(method = "loess", color = "red", se = FALSE) +
2022-09-29 22:20:45 +08:00
geom_smooth(method = "lm", color = "blue", se = FALSE)
}
2022-09-29 22:20:45 +08:00
starwars |>
filter(mass < 1000) |>
linearity_check(mass, height)
```
2022-09-29 22:20:45 +08:00
Or you want to wrap up an alternative for a scatterplot that uses colour to display a third variable, for very large datasets where overplotting is a problem:
```{r}
# https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
df |>
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
stat_summary_hex(
aes(colour = after_scale(fill)),
bins = bins,
fun = fun,
)
}
diamonds |> hex_plot(carat, price, depth)
```
### Combining with dplyr
Some of the most useful helpers combine a dash of dplyr with ggplot2.
For example, if you might want to do a bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
2022-09-29 23:50:47 +08:00
And we're drawing the vertical bars, so you need to reverse the usual order to get the highest values at the top:
```{r}
sorted_bars <- function(df, var) {
df |>
mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
ggplot(aes(y = {{ var }})) +
geom_bar()
}
diamonds |> sorted_bars(cut)
```
2022-09-29 22:20:45 +08:00
You can also get creative and display data summaries in other way:
```{r}
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts <- function(df, val, group) {
labs <- df |>
group_by({{group}}) |>
summarize(breaks = max({{val}}))
df |>
ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
geom_path() +
scale_y_continuous(
breaks = labs$breaks,
labels = scales::label_comma(),
minor_breaks = NULL,
guide = guide_axis(position = "right")
)
}
df <- tibble(
dist1 = sort(rnorm(50, 5, 2)),
dist2 = sort(rnorm(50, 8, 3)),
dist4 = sort(rnorm(50, 15, 1)),
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)
```
2022-09-26 21:55:42 +08:00
Next we'll discuss two more complicated cases: facetting and automatic labelling.
### Facetting
2022-09-29 22:20:45 +08:00
Unfortunately programming with facetting is a special challenge, because facetting was implemented before we understood what tidy evaluation was and how it should work.
Unlike `aes()`, it wasn't straightforward to backport to tidy evalution, so you have to learn a new syntax.
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
```{r}
2022-09-26 21:55:42 +08:00
# https://twitter.com/sharoz/status/1574376332821204999
2022-09-26 21:55:42 +08:00
# Facetting is fiddly - have to use special vars syntax.
foo <- function(x) {
ggplot(mtcars) +
aes(x = mpg, y = disp) +
geom_point() +
facet_wrap(vars({{ x }}))
}
```
2022-09-29 23:50:47 +08:00
We've written these functions so that you can supply any data frame, but there are also advantages to hardcoding a data frame, if you're using it repeatedly:
```{r}
2022-09-29 22:20:45 +08:00
# https://twitter.com/yutannihilat_en/status/1574387230025875457
density <- function(fill, ...) {
palmerpenguins::penguins |>
ggplot(aes(bill_length_mm, fill = {{ fill }})) +
geom_density(alpha = 0.5) +
facet_wrap(vars(...))
}
density()
density(species)
density(island, sex)
```
2022-09-29 23:50:47 +08:00
Also note that we hardcoded the `x` variable but allowed the fill to vary.
2022-09-26 21:55:42 +08:00
```{r}
bars <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
ggplot(aes({{ var }})) +
geom_bar() +
scale_x_discrete(guide = guide_axis(angle = 45))
}
diamonds |> bars(cut == "Good", clarity)
```
### Labelling
It'd be nice to label this plot automatically.
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we have talked about before: rlang.
rlang is the package that implements tidy evaluation, and is used by all the other packages in the tidyverse.
rlang provides a helpful function called `englue()` to solve just this problem.
It uses a syntax inspired by glue but combined with embracing:
2022-09-29 22:20:45 +08:00
```{r}
# https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
df |>
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
stat_summary_hex(
aes(colour = after_scale(fill)),
bins = bins,
fun = fun,
) +
labs(colour = rlang::englue("{{z}}"))
}
diamonds |> hex_plot(carat, price, depth)
```
```{r}
histogram <- function(df, var, binwidth = NULL) {
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
df |>
ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth) +
labs(title = label)
}
diamonds |> histogram(carat, 0.1)
```
(Note that if you omit the `binwidth` the function fails with a weird error. That appears to be a bug in `englue()`: https://github.com/r-lib/rlang/issues/1492.
Hopefully it'll be fixed soon!)
You can use the same approach any other place that you might supply a string in a ggplot2 plot.
2022-09-29 22:20:45 +08:00
### Learning more
It's hard to create general purpose plotting functions because you need to consider many different situations, and we haven't given you the programming skills to handle them all.
Fortunately, in most cases it's relatively simple to extract repeated plotting code into a function.
So, for now, strive to keep your functions simple, focussing on concrete repetition, not solve imaginary future problems.
You can also learn other techniques in <https://ggplot2-book.org/programming.html>.
2022-09-19 05:18:45 +08:00
## Style
2016-01-25 22:59:36 +08:00
It's important to remember that functions are not just for the computer, but are also for humans.
R doesn't care what your function is called, or what comments it contains, but these are important for human readers.
This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is important.
Ideally, the name of your function will be short, but clearly evoke what the function does.
That's hard!
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`).
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
2016-03-08 22:15:54 +08:00
```{r}
#| eval: false
2016-03-08 22:15:54 +08:00
# Too short
f()
# Not a verb, or descriptive
my_awesome_function()
2016-03-08 22:15:54 +08:00
# Long, but clear
impute_missing()
collapse_years()
```
2016-03-03 22:25:43 +08:00
2022-09-22 05:22:34 +08:00
In terms of white space, continue to follow the rules from @sec-workflow-style.
Additionally, `function` should always be followed by squiggly brackets (`{}`), and the contents should be indented by an additional two spaces.
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
2016-03-08 22:15:54 +08:00
```{r}
2022-09-22 05:22:34 +08:00
# missing extra two spaces
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
}
2022-09-22 05:22:34 +08:00
# Pipe indented incorrectly
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
}
2022-09-22 05:22:34 +08:00
# Missing {} and all one line
pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
2016-01-25 22:59:36 +08:00
```
2022-09-22 05:22:34 +08:00
As you can see from the example we recommend putting extra spaces inside of `{{ }}`.
This makes it super obvious that something unusual is happening.
2022-09-22 05:22:34 +08:00
Learn more at <https://style.tidyverse.org/functions.html>
2016-03-03 00:55:14 +08:00
2022-09-19 05:18:45 +08:00
### Exercises
2016-03-03 22:25:43 +08:00
2022-09-22 05:22:34 +08:00
1. Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
2022-09-19 05:18:45 +08:00
```{r}
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
```
2022-09-19 05:18:45 +08:00
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
2022-09-22 05:22:34 +08:00
3. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`.
2022-09-19 05:18:45 +08:00
Make a case for the opposite.
2016-03-03 22:25:43 +08:00
2022-09-22 05:22:34 +08:00
## Summary
2016-03-08 22:15:54 +08:00
In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frames, or creating a plot.
Writing functions to create data frames and plots using the tidyverse required you to learn a little about tidy evaluation.
Tidy evaluation is really important, because its what allows you to write `diamonds |> filter(x == y)` and `filter()` knows to use `x` and `y` from the diamonds dataset.
The downside of tidy evaluation is that you need to learn a new technique for programming: embracing.
Embracing, e.g. `{{ x }}`, tells the tidy-evaluation using function to look inside the argument `x`, rather than using the literal variable `x`.
You can figure out when you need to use embracing by looking in the documentation for the terms for the two major styles of tidyselect: "data masking" and "tidy select".
In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
These are immediately useful by themselves, but are a necessary foundation for the following chapter on iteration that provides some amazingly powerful tools.