929 lines
32 KiB
Plaintext
929 lines
32 KiB
Plaintext
# Functions {#sec-functions}
|
|
|
|
```{r}
|
|
#| results: "asis"
|
|
#| echo: false
|
|
source("_common.R")
|
|
status("drafting")
|
|
```
|
|
|
|
## Introduction
|
|
|
|
One of the best ways to improve your reach as a data scientist is to write functions.
|
|
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
|
|
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
|
|
|
|
Writing a function has three big advantages over using copy-and-paste:
|
|
|
|
1. You can give a function an evocative name that makes your code easier to understand.
|
|
|
|
2. As requirements change, you only need to update code in one place, instead of many.
|
|
|
|
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
|
|
|
|
Writing good functions is a lifetime journey.
|
|
Even after using R for many years we still learn new techniques and better ways of approaching old problems.
|
|
The goal of this chapter is to get you started on your journey with functions with three useful types of functions:
|
|
|
|
- Vector functions take one or more vectors as input and return a vector as output.
|
|
- Data frame functions take a data frame as input and return a data frame as output.
|
|
- Plot functions that take a data frame as input and return a plot as output.
|
|
|
|
The chapter concludes with some advice on function style.
|
|
|
|
Many of the examples in this chapter were inspired by real data analysis code supplied by folks on twitter.
|
|
We've often simplified the code from the original so you might want to look at the original tweets which we list in the comments.
|
|
If you want just to see a huge variety of functions, check out the motivating tweets: https://twitter.com/hadleywickham/status/1574373127349575680, https://twitter.com/hadleywickham/status/1571603361350164486 A big thanks to everyone who contributed!
|
|
WI won't fully explain all of the functions that we use here, so you might need to do some reading of the documentation.
|
|
|
|
### Prerequisites
|
|
|
|
We'll wrap up a variety of functions from around the tidyverse.
|
|
We'll also use nycflights13 as a source of relatively familiar data to apply our functions to.
|
|
|
|
```{r}
|
|
#| message: false
|
|
library(tidyverse)
|
|
library(nycflights13)
|
|
```
|
|
|
|
This chapter also relies on a function that hasn't yet been implemented for dplyr but will be by the time the book is out:
|
|
|
|
```{r}
|
|
pick <- function(cols) {
|
|
across({{ cols }})
|
|
}
|
|
```
|
|
|
|
## Vector functions
|
|
|
|
We'll begin with vector functions: functions that take one or more vectors and return a vector result.
|
|
|
|
For example, take a look at this code.
|
|
What does it do?
|
|
|
|
```{r}
|
|
df <- tibble(
|
|
a = rnorm(5),
|
|
b = rnorm(5),
|
|
c = rnorm(5),
|
|
d = rnorm(5),
|
|
)
|
|
|
|
df |> mutate(
|
|
a = (a - min(a, na.rm = TRUE)) /
|
|
(max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
|
|
b = (b - min(b, na.rm = TRUE)) /
|
|
(max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
|
|
c = (c - min(c, na.rm = TRUE)) /
|
|
(max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
|
|
d = (d - min(d, na.rm = TRUE)) /
|
|
(max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
|
|
)
|
|
```
|
|
|
|
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
|
|
But did you spot the mistake?
|
|
When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an `a` to a `b`.
|
|
Preventing this type of mistake of is one very good reason to learn how to write functions.
|
|
|
|
### Writing a function
|
|
|
|
To write a function you need to first analyse your repeated to figure what parts of the repeated code is constant and what parts vary.
|
|
If we take the code above and pull it outside of `mutate()` it's a little easier to see the pattern because each repetition is now one line:
|
|
|
|
```{r}
|
|
#| eval: false
|
|
|
|
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
|
|
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
|
|
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
|
|
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
|
|
```
|
|
|
|
To make this a bit clearer we can replace the bit that varies with `█`:
|
|
|
|
```{r}
|
|
#| eval: false
|
|
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
|
|
```
|
|
|
|
There's only one thing that varies which implies we're going to need a function with one argument.
|
|
|
|
To turn this into an actual function you need three things:
|
|
|
|
1. A **name**.
|
|
Here we might use `rescale01` because this function rescales a vector to lie between 0 and 1.
|
|
|
|
2. The **arguments**.
|
|
The arguments are things that vary across calls.
|
|
Here we have just one argument which we're going to call `x` because this is a conventional name for a numeric vector.
|
|
|
|
3. The **body**.
|
|
The body is the code that is the in all the calls.
|
|
|
|
Then you create a function by following the template:
|
|
|
|
```{r}
|
|
name <- function(arguments) {
|
|
body
|
|
}
|
|
```
|
|
|
|
For this case that leads to:
|
|
|
|
```{r}
|
|
rescale01 <- function(x) {
|
|
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
|
|
}
|
|
```
|
|
|
|
At this point you might test with a few simple inputs to make sure you've captured the logic correctly:
|
|
|
|
```{r}
|
|
rescale01(c(-10, 0, 10))
|
|
rescale01(c(1, 2, 3, NA, 5))
|
|
```
|
|
|
|
Then you can rewrite the call to `mutate()` as:
|
|
|
|
```{r}
|
|
df |> mutate(
|
|
a = rescale01(a),
|
|
b = rescale01(b),
|
|
c = rescale01(c),
|
|
d = rescale01(d),
|
|
)
|
|
```
|
|
|
|
(In @sec-iteration, you'll learn how to use `across()` to reduce the duplication even further so all you need is `df |> mutate(across(a:d, rescale))`).
|
|
|
|
### Improving our function
|
|
|
|
You might notice `rescale()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:
|
|
|
|
```{r}
|
|
rescale01 <- function(x) {
|
|
rng <- range(x, na.rm = TRUE)
|
|
(x - rng[1]) / (rng[2] - rng[1])
|
|
}
|
|
```
|
|
|
|
Or you might try this function on a vector that includes an infinite value:
|
|
|
|
```{r}
|
|
x <- c(1:10, Inf)
|
|
rescale01(x)
|
|
```
|
|
|
|
That result is not particularly useful so we could ask `range()` to ignore infinite values:
|
|
|
|
```{r}
|
|
rescale01 <- function(x) {
|
|
rng <- range(x, na.rm = TRUE, finite = TRUE)
|
|
(x - rng[1]) / (rng[2] - rng[1])
|
|
}
|
|
rescale01(x)
|
|
```
|
|
|
|
These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.
|
|
|
|
### Mutate functions
|
|
|
|
Let's look at a few more vector functions before you get some practice writing your own.
|
|
We'll start by looking at a few useful functions that work well in functions like `mutate()` and `filter()` because they return an output the same length as the input.
|
|
The goal of these sections is to expose you to a bunch of different functions to get your creative juices flowing, and to give you plenty of examples to generalize the structure and utility of functions from.
|
|
|
|
For example, maybe instead of rescaling to min 0, max 1, you want to rescale to mean zero, standard deviation one:
|
|
|
|
```{r}
|
|
rescale_z <- function(x) {
|
|
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
|
|
}
|
|
```
|
|
|
|
Sometimes your functions are highly specialised for one data analysis.
|
|
For example, you might have a bunch of variables that record missing values as 997, 998, or 999:
|
|
|
|
```{r}
|
|
fix_na <- function(x) {
|
|
if_else(x %in% c(997, 998, 999), NA, x)
|
|
}
|
|
```
|
|
|
|
Other cases, you might be wrapping up a simple a `case_when()` to give it a standard name.
|
|
For example, the `clamp()` function ensures all values of a vector lie in between a minimum or a maximum:
|
|
|
|
```{r}
|
|
clamp <- function(x, min, max) {
|
|
case_when(
|
|
x < min ~ min,
|
|
x > max ~ max,
|
|
.default = x
|
|
)
|
|
}
|
|
clamp(1:10, min = 3, max = 7)
|
|
```
|
|
|
|
Or maybe you'd rather mark those values as `NA`s:
|
|
|
|
```{r}
|
|
discard_outside <- function(x, min, max) {
|
|
case_when(
|
|
x < min ~ NA,
|
|
x > max ~ NA,
|
|
.default = x
|
|
)
|
|
}
|
|
discard_outside(1:10, min = 3, max = 7)
|
|
```
|
|
|
|
Of course functions don't just need to work with numeric variables.
|
|
You might want to extract out some repeated string manipulation.
|
|
Maybe you need to make the first character of each vector upper case:
|
|
|
|
```{r}
|
|
first_upper <- function(x) {
|
|
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
|
|
x
|
|
}
|
|
first_upper("hello")
|
|
```
|
|
|
|
Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:
|
|
|
|
```{r}
|
|
# https://twitter.com/NVlabormarket/status/1571939851922198530
|
|
clean_number <- function(x) {
|
|
is_pct <- str_detect(x, "%")
|
|
num <- x |>
|
|
str_remove_all("%") |>
|
|
str_remove_all(",") |>
|
|
str_remove_all(fixed("$")) |>
|
|
as.numeric(x)
|
|
if_else(is_pct, num / 100, num)
|
|
}
|
|
clean_number("$12,300")
|
|
clean_number("45%")
|
|
```
|
|
|
|
There's no reason that your function can't take multiple vector inputs.
|
|
For example, you might want to compute the distance between two locations on the globe using the haversine formula:
|
|
|
|
```{r}
|
|
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
|
|
haversine <- function(long1, lat1, long2, lat2, round = 3) {
|
|
# convert to radians
|
|
long1 <- long1 * pi / 180
|
|
lat1 <- lat1 * pi / 180
|
|
long2 <- long2 * pi / 180
|
|
lat2 <- lat2 * pi / 180
|
|
|
|
R <- 6371 # Earth mean radius in km
|
|
a <- sin((lat2 - lat1) / 2)^2 +
|
|
cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
|
|
d <- R * 2 * asin(sqrt(a))
|
|
|
|
round(d, round)
|
|
}
|
|
```
|
|
|
|
### Summary functions
|
|
|
|
In other cases you want a function that returns a single value for use in `summary()`.
|
|
Sometimes this can just be a matter of setting a default argument:
|
|
|
|
```{r}
|
|
commas <- function(x) {
|
|
str_flatten(x, collapse = ", ")
|
|
}
|
|
commas(c("cat", "dog", "pigeon"))
|
|
```
|
|
|
|
Or performing some very simple computation, like computing the coefficient of variation, which standardizes the standard deviation by dividing it by the mean:
|
|
|
|
```{r}
|
|
cv <- function(x, na.rm = FALSE) {
|
|
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
|
|
}
|
|
cv(runif(100, min = 0, max = 50))
|
|
cv(runif(100, min = 0, max = 500))
|
|
```
|
|
|
|
Or maybe you just want to make a common pattern easier to remember by given it a memorable name:
|
|
|
|
```{r}
|
|
# https://twitter.com/gbganalyst/status/1571619641390252033
|
|
n_missing <- function(x) {
|
|
sum(is.na(x))
|
|
}
|
|
```
|
|
|
|
You can also write functions with multiple vector inputs.
|
|
For example, maybe you want to compute the mean absolute prediction error to help you comparing model predictions with actual values:
|
|
|
|
```{r}
|
|
# https://twitter.com/neilgcurrie/status/1571607727255834625
|
|
mape <- function(actual, predicted) {
|
|
sum(abs((actual - predicted) / actual)) / length(actual)
|
|
}
|
|
```
|
|
|
|
### Exercises
|
|
|
|
1. Practice turning the following code snippets into functions.
|
|
Think about what each function does.
|
|
What would you call it?
|
|
How many arguments does it need?
|
|
|
|
```{r}
|
|
#| eval: false
|
|
|
|
mean(is.na(x))
|
|
mean(is.na(y))
|
|
mean(is.na(z))
|
|
|
|
x / sum(x, na.rm = TRUE)
|
|
y / sum(y, na.rm = TRUE)
|
|
z / sum(z, na.rm = TRUE)
|
|
|
|
round(x / sum(x, na.rm = TRUE) * 100, 1)
|
|
round(y / sum(y, na.rm = TRUE) * 100, 1)
|
|
round(z / sum(z, na.rm = TRUE) * 100, 1)
|
|
```
|
|
|
|
2. In the second variant of `rescale01()`, infinite values are left unchanged.
|
|
Can you rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1?
|
|
|
|
3. Given a vector of birthdates, write a function to compute the age in years.
|
|
|
|
4. Write your own functions to compute the variance and skewness of a numeric vector.
|
|
Variance is defined as $$
|
|
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
|
|
$$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
|
|
Skewness is defined as $$
|
|
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
|
|
$$
|
|
|
|
5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
|
|
|
|
6. Read the documentation to figure out what the following functions do.
|
|
Why are they useful even though they are so short?
|
|
|
|
```{r}
|
|
is_directory <- function(x) file.info(x)$isdir
|
|
is_readable <- function(x) file.access(x, 4) == 0
|
|
```
|
|
|
|
## Data frame functions
|
|
|
|
Vector functions are useful for pulling out code that's repeated within dplyr verbs.
|
|
In this section, you'll learn how to write "data frame" functions which pull out code that's repeated across multiple pipelines.
|
|
These functions work in the same way as dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and usually return a data frame.
|
|
|
|
### Indirection and tidy evaluation
|
|
|
|
When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
|
|
Let's illustrate the problem with a very simple function: `pull_unique()`.
|
|
The goal of this function is to `pull()` the unique (distinct) values of a variable:
|
|
|
|
```{r}
|
|
pull_unique <- function(df, var) {
|
|
df |>
|
|
distinct(var) |>
|
|
pull(var)
|
|
}
|
|
```
|
|
|
|
If we try and use it, we get an error:
|
|
|
|
```{r}
|
|
#| error: true
|
|
diamonds |> pull_unique(clarity)
|
|
```
|
|
|
|
To make the problem a bit more clear we can use a made up data frame:
|
|
|
|
```{r}
|
|
df <- tibble(var = "var", x = "x", y = "y")
|
|
df |> pull_unique(x)
|
|
df |> pull_unique(y)
|
|
```
|
|
|
|
Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
|
|
This is a problem of indirection, and it arises because dplyr allows you to refer to the names of variables inside your data frame without any special treatment, so called **tidy evaluation**.
|
|
|
|
Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
|
|
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
|
|
Here we need some way tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead look inside `var` for the variable we actually want to use.
|
|
|
|
Tidy evaluation includes a solution to this problem called **embracing**.
|
|
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
|
|
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the a literal variable name.
|
|
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a function look inside of `var` rather than looking for a variable called `var`.
|
|
|
|
So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
|
|
|
|
```{r}
|
|
pull_unique <- function(df, var) {
|
|
df |>
|
|
distinct({{ var }}) |>
|
|
pull({{ var }})
|
|
}
|
|
diamonds |> pull_unique(clarity)
|
|
```
|
|
|
|
### When to embrace?
|
|
|
|
So the art of writing data frame functions is basically just figuring out which arguments need to be embraced.
|
|
Fortunately this is easy because you can look it up from the documentation 😄.
|
|
There are two terms to look for in the docs:
|
|
|
|
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` which do computation with variables.
|
|
|
|
- **Tidy-selection**: this is used for for functions like `select()`, `relocate()`, and `rename()` that select groups of variables.
|
|
|
|
When you start looking closely at the documentation, you'll notice that many dplyr functions use `…`.
|
|
This is a special shorthand syntax that matches any that aren't otherwise explicitly matched.
|
|
For example, `arrange()` uses data-masking for `…` and `select()` uses tidy-select for `…`.
|
|
|
|
Your intuition for many common functions should be pretty good --- think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
|
|
There are a few cases where it's harder to tell because you usually use them with single variable, which uses the same syntax for both data-masking or tidy-select.
|
|
For example, the arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
|
|
If you're ever confused, just look at the docs.
|
|
|
|
In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-select arguments
|
|
|
|
### Summary basics
|
|
|
|
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
|
|
|
|
```{r}
|
|
summary6 <- function(data, var) {
|
|
data |> summarise(
|
|
min = min({{ var }}, na.rm = TRUE),
|
|
mean = mean({{ var }}, na.rm = TRUE),
|
|
median = median({{ var }}, na.rm = TRUE),
|
|
max = max({{ var }}, na.rm = TRUE),
|
|
n = n(),
|
|
n_miss = sum(is.na({{ var }})),
|
|
.groups = "drop"
|
|
)
|
|
}
|
|
diamonds |> summary6(carat)
|
|
```
|
|
|
|
(Whenever you wrap `summarise()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
|
|
|
|
The nice thing about this function is because it wraps `summarise()` you can used it on grouped data:
|
|
|
|
```{r}
|
|
diamonds |>
|
|
group_by(cut) |>
|
|
summary6(carat)
|
|
```
|
|
|
|
Because the arguments to summarize are data-masking that also means that the `var` argument to `summary6()` is data-masking.
|
|
That means you can also summarize computed variables:
|
|
|
|
```{r}
|
|
diamonds |>
|
|
group_by(cut) |>
|
|
summary6(log10(carat))
|
|
```
|
|
|
|
To summarize multiple variables you'll need wait until @sec-across, where you'll learn how to use `across()`.
|
|
|
|
### Count variations
|
|
|
|
Another popular helper function is a version of `count()` that also computes proportions:
|
|
|
|
```{r}
|
|
# https://twitter.com/Diabb6/status/1571635146658402309
|
|
count_prop <- function(df, var, sort = FALSE) {
|
|
df |>
|
|
count({{ var }}, sort = sort) |>
|
|
mutate(prop = n / sum(n))
|
|
}
|
|
diamonds |> count_prop(clarity)
|
|
```
|
|
|
|
This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced.
|
|
`var` is passed to `count()` which uses data-masking for all variables in `…`.
|
|
|
|
Sometimes you want to select variables inside a function that uses data-masking.
|
|
For example, imagine you want to write `count_missing()` that counts the number of missing observations in rows.
|
|
You might try writing something like:
|
|
|
|
```{r}
|
|
#| error: true
|
|
count_missing <- function(df, group_vars, x_var) {
|
|
df |>
|
|
group_by({{ group_vars }}) |>
|
|
summarise(n_miss = sum(is.na({{ x_var }})))
|
|
}
|
|
flights |>
|
|
count_missing(c(year, month, day), dep_time)
|
|
```
|
|
|
|
This doesn't work because `group_by()` uses data-masking not tidy-select.
|
|
We can work around that problem by using `pick()` which allows you to use use tidy-select inside data-masking functions:
|
|
|
|
```{r}
|
|
count_missing <- function(df, group_vars, x_var) {
|
|
df |>
|
|
group_by(pick({{ group_vars }})) |>
|
|
summarise(n_miss = sum(is.na({{ x_var }})))
|
|
}
|
|
flights |>
|
|
count_missing(c(year, month, day), dep_time)
|
|
```
|
|
|
|
Another useful helper that uses `pick()` is to make a 2d table of counts.
|
|
Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange:
|
|
|
|
```{r}
|
|
# https://twitter.com/pollicipes/status/1571606508944719876
|
|
count_wide <- function(data, rows, cols) {
|
|
data |>
|
|
count(pick(c({{ rows }}, {{ cols }}))) |>
|
|
pivot_wider(
|
|
names_from = {{ cols }},
|
|
values_from = n,
|
|
names_sort = TRUE,
|
|
values_fill = 0
|
|
)
|
|
}
|
|
diamonds |> count_wide(clarity, cut)
|
|
diamonds |> count_wide(c(clarity, color), cut)
|
|
```
|
|
|
|
We didn't discuss `pivot_wider()` above, but you can read the docs to discover that `names_from` uses the tidy-select style of tidy evaluation.
|
|
|
|
### Selecting rows and columns
|
|
|
|
Or maybe you want to find the sorted unique values of a variable for a subset of the data.
|
|
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply an condition:
|
|
|
|
```{r}
|
|
unique_where <- function(df, condition, var) {
|
|
df |>
|
|
filter({{ condition }}) |>
|
|
distinct({{ var }}) |>
|
|
arrange({{ var }}) |>
|
|
pull({{ var }})
|
|
}
|
|
|
|
# Find all the destinations in December
|
|
flights |> unique_where(month == 12, dest)
|
|
# Which months did plane N14228 fly in?
|
|
flights |> unique_where(tailnum == "N14228", month)
|
|
```
|
|
|
|
Here we embrace `condition` because it's passed to `filter()` and `var` because its passed to `distinct()`, `arrange()`, and `pull()`.
|
|
|
|
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data frame, it can make sense to hard code it.
|
|
For example, this function always works with the flights dataset, make it easy to grab the subset that you want to work with.
|
|
It always includes `time_hour`, `carrier`, and `flight` since these are the primary key that allows you to identify a row.
|
|
|
|
```{r}
|
|
flights_sub <- function(rows, cols) {
|
|
flights |>
|
|
filter({{ rows }}) |>
|
|
select(time_hour, carrier, flight, {{ cols }})
|
|
}
|
|
|
|
flights_sub(dest == "IAH", contains("time"))
|
|
```
|
|
|
|
### Learning more
|
|
|
|
This section has introduced you to some of the power and flexibility of tidy evaluation with dplyr (and a dash of tidyr).
|
|
We've only used the smallest part of tidy evaluation, embracing, and it already gives you considerable power to reduce duplication in your data analyses.
|
|
You can learn more advanced techniques in `vignette("programming", package = "dplyr")`.
|
|
|
|
## Plot functions
|
|
|
|
Instead of returning a data frame, you might want to return a plot.
|
|
Fortunately you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
|
|
For example, imagine that you're making a lot of histograms:
|
|
|
|
```{r}
|
|
#| fig-show: hide
|
|
diamonds |>
|
|
ggplot(aes(carat)) +
|
|
geom_histogram(binwidth = 0.1)
|
|
|
|
diamonds |>
|
|
ggplot(aes(carat)) +
|
|
geom_histogram(binwidth = 0.05)
|
|
```
|
|
|
|
Wouldn't it be nice if you could wrap this up into a histogram function?
|
|
This is easy as once you know that `aes()` is a data-masking function so that you need to embrace:
|
|
|
|
```{r}
|
|
histogram <- function(df, var, binwidth = NULL) {
|
|
df |>
|
|
ggplot(aes({{ var }})) +
|
|
geom_histogram(binwidth = binwidth)
|
|
}
|
|
|
|
diamonds |> histogram(carat, 0.1)
|
|
```
|
|
|
|
Note that `histogram()` returns a ggplot2 plot, so that you can still add on additional components if you want.
|
|
Just remember to switch from `|>` to `+`:
|
|
|
|
```{r}
|
|
diamonds |>
|
|
histogram(carat, 0.1) +
|
|
labs(x = "Size (in carats)", y = "Number of diamonds")
|
|
```
|
|
|
|
### More variables
|
|
|
|
It's straightforward to add more variables to the mix.
|
|
For example, maybe you want an easy way to eye ball whether or not a data set is linear by overlaying a smooth line and a straight line:
|
|
|
|
```{r}
|
|
# https://twitter.com/tyler_js_smith/status/1574377116988104704
|
|
|
|
linearity_check <- function(df, x, y) {
|
|
df |>
|
|
ggplot(aes({{ x }}, {{ y }})) +
|
|
geom_point() +
|
|
geom_smooth(method = "loess", color = "red", se = FALSE) +
|
|
geom_smooth(method = "lm", color = "blue", se = FALSE)
|
|
}
|
|
|
|
starwars |>
|
|
filter(mass < 1000) |>
|
|
linearity_check(mass, height)
|
|
```
|
|
|
|
Or you want to wrap up an alternative for a scatterplot that uses colour to display a third variable, for very large datasets where overplotting is a problem:
|
|
|
|
```{r}
|
|
# https://twitter.com/ppaxisa/status/1574398423175921665
|
|
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
|
df |>
|
|
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
|
|
stat_summary_hex(
|
|
aes(colour = after_scale(fill)),
|
|
bins = bins,
|
|
fun = fun,
|
|
)
|
|
}
|
|
diamonds |> hex_plot(carat, price, depth)
|
|
```
|
|
|
|
### Combining with dplyr
|
|
|
|
Some of the most useful helpers combine a dash of dplyr with ggplot2.
|
|
For example, if you might want to do a bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
|
|
And we're drawing the vertical bars, so you need to reverse the usual order to get the highest values at the top:
|
|
|
|
```{r}
|
|
sorted_bars <- function(df, var) {
|
|
df |>
|
|
mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
|
|
ggplot(aes(y = {{ var }})) +
|
|
geom_bar()
|
|
}
|
|
diamonds |> sorted_bars(cut)
|
|
```
|
|
|
|
You can also get creative and display data summaries in other way:
|
|
|
|
```{r}
|
|
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
|
|
|
|
fancy_ts <- function(df, val, group) {
|
|
labs <- df |>
|
|
group_by({{group}}) |>
|
|
summarize(breaks = max({{val}}))
|
|
|
|
df |>
|
|
ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
|
|
geom_path() +
|
|
scale_y_continuous(
|
|
breaks = labs$breaks,
|
|
labels = scales::label_comma(),
|
|
minor_breaks = NULL,
|
|
guide = guide_axis(position = "right")
|
|
)
|
|
}
|
|
|
|
df <- tibble(
|
|
dist1 = sort(rnorm(50, 5, 2)),
|
|
dist2 = sort(rnorm(50, 8, 3)),
|
|
dist4 = sort(rnorm(50, 15, 1)),
|
|
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
|
|
)
|
|
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
|
|
|
|
fancy_ts(df, value, dist_name)
|
|
|
|
```
|
|
|
|
Next we'll discuss two more complicated cases: facetting and automatic labelling.
|
|
|
|
### Facetting
|
|
|
|
Unfortunately programming with facetting is a special challenge, because facetting was implemented before we understood what tidy evaluation was and how it should work.
|
|
Unlike `aes()`, it wasn't straightforward to backport to tidy evalution, so you have to learn a new syntax.
|
|
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
|
|
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
|
|
|
|
```{r}
|
|
# https://twitter.com/sharoz/status/1574376332821204999
|
|
|
|
# Facetting is fiddly - have to use special vars syntax.
|
|
foo <- function(x) {
|
|
ggplot(mtcars) +
|
|
aes(x = mpg, y = disp) +
|
|
geom_point() +
|
|
facet_wrap(vars({{ x }}))
|
|
}
|
|
```
|
|
|
|
We've written these functions so that you can supply any data frame, but there are also advantages to hardcoding a data frame, if you're using it repeatedly:
|
|
|
|
```{r}
|
|
# https://twitter.com/yutannihilat_en/status/1574387230025875457
|
|
density <- function(fill, ...) {
|
|
palmerpenguins::penguins |>
|
|
ggplot(aes(bill_length_mm, fill = {{ fill }})) +
|
|
geom_density(alpha = 0.5) +
|
|
facet_wrap(vars(...))
|
|
}
|
|
|
|
density()
|
|
density(species)
|
|
density(island, sex)
|
|
```
|
|
|
|
Also note that we hardcoded the `x` variable but allowed the fill to vary.
|
|
|
|
```{r}
|
|
bars <- function(df, condition, var) {
|
|
df |>
|
|
filter({{ condition }}) |>
|
|
ggplot(aes({{ var }})) +
|
|
geom_bar() +
|
|
scale_x_discrete(guide = guide_axis(angle = 45))
|
|
}
|
|
|
|
diamonds |> bars(cut == "Good", clarity)
|
|
```
|
|
|
|
### Labelling
|
|
|
|
It'd be nice to label this plot automatically.
|
|
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we have talked about before: rlang.
|
|
rlang is the package that implements tidy evaluation, and is used by all the other packages in the tidyverse.
|
|
rlang provides a helpful function called `englue()` to solve just this problem.
|
|
It uses a syntax inspired by glue but combined with embracing:
|
|
|
|
```{r}
|
|
# https://twitter.com/ppaxisa/status/1574398423175921665
|
|
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
|
df |>
|
|
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
|
|
stat_summary_hex(
|
|
aes(colour = after_scale(fill)),
|
|
bins = bins,
|
|
fun = fun,
|
|
) +
|
|
labs(colour = rlang::englue("{{z}}"))
|
|
}
|
|
diamonds |> hex_plot(carat, price, depth)
|
|
```
|
|
|
|
```{r}
|
|
histogram <- function(df, var, binwidth = NULL) {
|
|
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
|
|
|
|
df |>
|
|
ggplot(aes({{ var }})) +
|
|
geom_histogram(binwidth = binwidth) +
|
|
labs(title = label)
|
|
}
|
|
|
|
diamonds |> histogram(carat, 0.1)
|
|
```
|
|
|
|
(Note that if you omit the `binwidth` the function fails with a weird error. That appears to be a bug in `englue()`: https://github.com/r-lib/rlang/issues/1492.
|
|
Hopefully it'll be fixed soon!)
|
|
|
|
You can use the same approach any other place that you might supply a string in a ggplot2 plot.
|
|
|
|
### Learning more
|
|
|
|
It's hard to create general purpose plotting functions because you need to consider many different situations, and we haven't given you the programming skills to handle them all.
|
|
Fortunately, in most cases it's relatively simple to extract repeated plotting code into a function.
|
|
So, for now, strive to keep your functions simple, focussing on concrete repetition, not solve imaginary future problems.
|
|
|
|
You can also learn other techniques in <https://ggplot2-book.org/programming.html>.
|
|
|
|
## RStudio
|
|
|
|
Once you start writing functions, there are two RStudio shortcuts that are useful.
|
|
|
|
- If you put your cursor on the name of a function that you've written, `F2` will take you to its defintion.
|
|
|
|
- Press `Ctrl + .` to open the fuzzy file and function finder.
|
|
You can type the first few letters in your function name and it'll appear in the dropdown.
|
|
You can also navigate to files, Quarto sections, and more, making it a very hand navigation tool.
|
|
|
|
## Style
|
|
|
|
It's important to remember that functions are not just for the computer, but are also for humans.
|
|
R doesn't care what your function is called, or what comments it contains, but these are important for human readers.
|
|
This section discusses some things that you should bear in mind when writing functions that humans can understand.
|
|
|
|
The name of a function is important.
|
|
Ideally, the name of your function will be short, but clearly evoke what the function does.
|
|
That's hard!
|
|
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
|
|
|
|
Generally, function names should be verbs, and arguments should be nouns.
|
|
There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`).
|
|
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
|
|
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
|
|
|
|
```{r}
|
|
#| eval: false
|
|
|
|
# Too short
|
|
f()
|
|
|
|
# Not a verb, or descriptive
|
|
my_awesome_function()
|
|
|
|
# Long, but clear
|
|
impute_missing()
|
|
collapse_years()
|
|
```
|
|
|
|
In terms of white space, continue to follow the rules from @sec-workflow-style.
|
|
Additionally, `function` should always be followed by squiggly brackets (`{}`), and the contents should be indented by an additional two spaces.
|
|
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
|
|
|
|
```{r}
|
|
# missing extra two spaces
|
|
pull_unique <- function(df, var) {
|
|
df |>
|
|
distinct({{ var }}) |>
|
|
pull({{ var }})
|
|
}
|
|
|
|
# Pipe indented incorrectly
|
|
pull_unique <- function(df, var) {
|
|
df |>
|
|
distinct({{ var }}) |>
|
|
pull({{ var }})
|
|
}
|
|
|
|
# Missing {} and all one line
|
|
pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
|
|
```
|
|
|
|
As you can see from the example we recommend putting extra spaces inside of `{{ }}`.
|
|
This makes it super obvious that something unusual is happening.
|
|
|
|
Learn more at <https://style.tidyverse.org/functions.html>
|
|
|
|
### Exercises
|
|
|
|
1. Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
|
|
|
|
```{r}
|
|
f1 <- function(string, prefix) {
|
|
substr(string, 1, nchar(prefix)) == prefix
|
|
}
|
|
f3 <- function(x, y) {
|
|
rep(y, length.out = length(x))
|
|
}
|
|
```
|
|
|
|
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
|
|
|
|
3. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`.
|
|
Make a case for the opposite.
|
|
|
|
## Summary
|
|
|
|
In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frames, or creating a plot.
|
|
|
|
Writing functions to create data frames and plots using the tidyverse required you to learn a little about tidy evaluation.
|
|
Tidy evaluation is really important, because its what allows you to write `diamonds |> filter(x == y)` and `filter()` knows to use `x` and `y` from the diamonds dataset.
|
|
The downside of tidy evaluation is that you need to learn a new technique for programming: embracing.
|
|
Embracing, e.g. `{{ x }}`, tells the tidy-evaluation using function to look inside the argument `x`, rather than using the literal variable `x`.
|
|
You can figure out when you need to use embracing by looking in the documentation for the terms for the two major styles of tidyselect: "data masking" and "tidy select".
|
|
|
|
In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
|
|
These are immediately useful by themselves, but are a necessary foundation for the following chapter on iteration that provides some amazingly powerful tools.
|
|
|