Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
(In @sec-iteration, you'll learn how to use `across()` to reduce the duplication even further so all you need is `df |> mutate(across(a:d, rescale))`).
You might notice that the `rescale()` function does some unnecessary work: instead of computing `min()` twice and `max()` once, we could compute both the minimum and maximum in one step with `range()`:
These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.
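As a minimal sketch of that rewrite, assume `rescale()` maps a numeric vector onto the interval [0, 1]. Its original definition isn't shown in this section, so the body and the `na.rm = TRUE` choice below are assumptions:

```{r}
# Rescale a numeric vector to [0, 1].
# range() returns c(min, max), so both values come from a single call.
rescale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale(c(1, 2, 3, NA, 5))
```

Because every caller goes through `rescale()`, this change takes effect everywhere the function is used.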
Let's look at a few more vector functions before you get some practice writing your own.
We'll start by looking at a few useful functions that work well in functions like `mutate()` and `filter()` because they return an output the same length as the input.
Or maybe, like [NV Labor Analysis](https://twitter.com/NVlabormarket/status/1571939851922198530), you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:
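A minimal sketch of such a helper, written in base R so it runs without extra packages. The name `clean_number()` is hypothetical, and dividing percentages by 100 is an assumption that goes beyond the description above:

```{r}
# Hypothetical helper: strip "%", "," and "$", then convert to a number.
# Values that carried a percent sign are divided by 100 (an assumption).
clean_number <- function(x) {
  is_pct <- grepl("%", x, fixed = TRUE)
  num <- as.numeric(gsub("[%,$]", "", x))
  ifelse(is_pct, num / 100, num)
}

clean_number(c("$12,300", "45%", "1,000"))
```

Because the output is the same length as the input, a helper like this can be used directly inside `mutate()`.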
Or some very simple computation, for example to compute the coefficient of variation, which standardizes the standard deviation by dividing it by the mean:
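The coefficient of variation described above could be sketched like this; the function name `cv()` and the `na.rm` argument are illustrative choices:

```{r}
# Coefficient of variation: standard deviation divided by the mean.
# na.rm is passed to both sd() and mean() so missing values are
# handled consistently.
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(c(1, 2, 3))
cv(c(1, 2, 3, NA), na.rm = TRUE)
```

Unlike the functions above, this one collapses a vector to a single value, so it fits naturally inside `summarise()`.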
Vector functions are useful for pulling out code that's repeated within dplyr verbs.
In this section, you'll learn how to write "data frame" functions which pull out code that's repeated across multiple pipelines.
These functions work in the same way as dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and usually return a data frame.
Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to let you refer to the names of variables inside your data frame without any special treatment.
Tidy evaluation is great 95% of the time because it makes our data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead to look inside `var` for the variable we actually want to use.
Tidy evaluation includes a solution to this problem called **embracing**.
By wrapping a variable in `{{ }}` (embracing it), we tell dplyr to use the value stored inside the argument, not the argument's own name as a literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- it's going to make the function look inside of `var` rather than looking for a variable called `var`.
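Applied to the `pull_unique()` example, a sketch of the embraced version might look like this. The small tibble is made up for illustration, and includes a column literally named `var` to show that it is no longer picked up by accident:

```{r}
library(dplyr)

# Embracing makes distinct() and pull() use the column the caller
# supplied, rather than looking for a column called `var`.
pull_unique <- function(df, var) {
  df |>
    distinct({{ var }}) |>
    pull({{ var }})
}

# Illustrative data with a decoy column named `var`.
df <- tibble(
  var = c("decoy", "decoy", "decoy"),
  x = c(1, 1, 2),
  y = c("p", "q", "q")
)

df |> pull_unique(x)
df |> pull_unique(y)
```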
Your intuition for many common functions should be pretty good --- think about whether it's ok to compute `x + 1` or select multiple variables with `a:x`.
There are some cases that are harder to guess because you usually use them with a single variable, which uses the same syntax for both data-masking and tidy-selection:
- The `names_from` argument to `pivot_wider()` uses tidy-selection because you can take the names from multiple variables with `names_from = c(x, y, z)`.
(Whenever you wrap `summarise()` in a helper, I think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
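A minimal sketch of that practice, using a hypothetical helper called `summary3()` and the built-in `mtcars` data:

```{r}
library(dplyr)

# Hypothetical summary helper. `.groups = "drop"` suppresses the
# regrouping message and returns an ungrouped data frame.
summary3 <- function(data, var) {
  data |>
    summarise(
      min = min({{ var }}, na.rm = TRUE),
      mean = mean({{ var }}, na.rm = TRUE),
      max = max({{ var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

mtcars |>
  group_by(cyl, am) |>
  summary3(mpg)
```

Without `.groups = "drop"`, the result here would still be grouped by `cyl`, which can surprise later steps in a pipeline.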
To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn how to use `across()` to repeat the same computation with multiple variables.
Once you have the basics under your belt, you can learn more about the full range of tidy evaluation possibilities by reading `vignette("programming", package = "dplyr")`.
## Plot functions
You can also use the techniques described above with ggplot2, because `aes()` is a data-masking function.
For example, imagine that you're making a lot of histograms:
```{r}
#| fig-show: hide
diamonds |>
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.1)

diamonds |>
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.05)
```
Wouldn't it be nice if you could wrap this up into a histogram function?
This is easy once you know that `aes()` is a data-masking function, which means you need to embrace:
```{r}
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth)
}

diamonds |> histogram(carat, 0.1)
```
Note that `histogram()` returns a ggplot2 plot, so that you can still add on additional components if you want.
Just remember to switch from `|>` to `+`:
```{r}
diamonds |>
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")
```
Next we'll discuss two more complicated cases: facetting and automatic labelling.
### Facetting
Unfortunately facetting is a special challenge, mostly because it was implemented well before we understood what tidy evaluation was and how it should work.
And unlike `aes()`, it wasn't straightforward to backport to tidy evaluation, so you have to use a different syntax than usual.
Instead of writing `~ x`, you write `vars(x)` and instead of `~ x + y` you write `vars(x, y)`.
The only advantage of this syntax is that `vars()` is data masking so you can embrace within it.
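As a sketch of how this looks in practice, the hypothetical `facet_scatter()` below takes any data frame plus three embraced variables, and is tried out on the built-in `mtcars` data:

```{r}
library(ggplot2)

# Hypothetical helper: scatterplot of y against x, facetted by `facet`.
# vars() is data-masking, so the facetting variable can be embraced.
facet_scatter <- function(df, x, y, facet) {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    facet_wrap(vars({{ facet }}))
}

mtcars |> facet_scatter(mpg, disp, cyl)
```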
I've written these functions so that you can supply any data frame, but there are also advantages to hardcoding a data frame, if you're using it repeatedly:
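One way this could look, sketched with ggplot2's built-in `diamonds` data; the function name `diamonds_density()` and the choice of `carat` are illustrative assumptions:

```{r}
library(ggplot2)

# Hypothetical helper with the data frame hardcoded.
# The caller only chooses the fill variable and any facetting variables,
# which are forwarded to vars() through `...`.
diamonds_density <- function(fill, ...) {
  diamonds |>
    ggplot(aes(x = carat, fill = {{ fill }})) +
    geom_density(alpha = 0.5) +
    facet_wrap(vars(...))
}

diamonds_density(cut)
diamonds_density(cut, clarity)
```

Hardcoding the data shortens each call, at the cost of a function that only works for one dataset.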
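### Labelling
Remember the `histogram()` function from earlier? To label its output automatically with the variable and the bin width that were used, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about yet: rlang.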
rlang is the package that implements tidy evaluation, and is used by all the other packages in the tidyverse.
rlang provides a helpful function called `englue()` to solve just this problem.
It uses a syntax inspired by glue but combined with embracing:
```{r}
histogram <- function(df, var, binwidth = NULL) {
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
```
(Note that if you omit `binwidth`, the function fails with an obscure error. That appears to be a bug in `englue()`: <https://github.com/r-lib/rlang/issues/1492>.
Hopefully it'll be fixed soon!)
You can use the same approach in any other place where you might supply a string in a ggplot2 plot.
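For instance, here is a sketch that builds both a title and a caption with `englue()`. The `labelled_scatter()` function is hypothetical, and the labels are created before the plot so that `englue()` is called directly inside the function body:

```{r}
library(ggplot2)

# Hypothetical helper: {{ }} inserts the variable names the caller
# supplied, while { } inserts ordinary values, as in glue.
labelled_scatter <- function(df, x, y) {
  title <- rlang::englue("{{ y }} versus {{ x }}")
  n_rows <- nrow(df)
  caption <- rlang::englue("Based on {n_rows} rows")

  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    labs(title = title, caption = caption)
}

mtcars |> labelled_scatter(wt, mpg)
```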
### Advice
It's hard to create general purpose plotting functions because you need to consider many different situations, and we haven't given you the programming skills to handle them all.
Fortunately, in most cases it's relatively simple to extract repeated plotting code into a function.
So, for now, strive to keep your functions simple, focussing on concrete repetition rather than solving imaginary future problems.
You can also learn other techniques in <https://ggplot2-book.org/programming.html>.
Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`).
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Writing functions to create data frames and plots using the tidyverse required you to learn a little about tidy evaluation.
Tidy evaluation is really important because it's what allows you to write `diamonds |> filter(x == y)` and have `filter()` know to use `x` and `y` from the diamonds dataset.
The downside of tidy evaluation is that you need to learn a new technique for programming: embracing.
Embracing, e.g. `{{ x }}`, tells a function that uses tidy evaluation to look inside the argument `x`, rather than using the literal variable `x`.
You can figure out when you need to embrace by looking in the documentation for the names of the two major styles of tidy evaluation: "data masking" and "tidy select".
In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which provides some amazingly powerful tools.