Update functions.qmd (#1178)
parent
0743cbd3aa
commit
e5e66de3cf
116
functions.qmd
|
@ -75,7 +75,7 @@ Preventing this type of mistake of is one very good reason to learn how to write
|
|||
### Writing a function
|
||||
|
||||
To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary.
|
||||
If we take the code above and pull it outside of `mutate()` it's a little easier to see the pattern because each repetition is now one line:
|
||||
If we take the code above and pull it outside of `mutate()`, it's a little easier to see the pattern because each repetition is now one line:
|
||||
|
||||
```{r}
|
||||
#| eval: false
|
||||
|
@ -99,11 +99,11 @@ To turn this into a function you need three things:
|
|||
Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.
|
||||
|
||||
2. The **arguments**.
|
||||
The arguments are things that vary across calls and our analysis above tells us that have just one.
|
||||
The arguments are things that vary across calls and our analysis above tells us that we have just one.
|
||||
We'll call it `x` because this is the conventional name for a numeric vector.
|
||||
|
||||
3. The **body**.
|
||||
The body is the code that repeated across all the calls.
|
||||
The body is the code that's repeated across all the calls.
|
||||
|
||||
Then you create a function by following the template:
|
||||
|
||||
|
@ -143,7 +143,7 @@ df |> mutate(
|
|||
|
||||
### Improving our function
|
||||
|
||||
You might notice `rescale01()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:
|
||||
You might notice that the `rescale01()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:
|
||||
|
||||
```{r}
|
||||
rescale01 <- function(x) {
|
||||
|
@ -166,6 +166,7 @@ rescale01 <- function(x) {
|
|||
rng <- range(x, na.rm = TRUE, finite = TRUE)
|
||||
(x - rng[1]) / (rng[2] - rng[1])
|
||||
}
|
||||
|
||||
rescale01(x)
|
||||
```
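
As a quick check of what `finite = TRUE` buys us, the sketch below reruns the `range()`-based `rescale01()` on a made-up vector that contains an infinite value. The input vector is purely illustrative and is not part of the original example.

```{r}
# Same definition as above, repeated so the chunk runs on its own
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative input: the infinite value is ignored when computing the range,
# so the finite values are still rescaled to lie between 0 and 1
rescale01(c(1, 5, 10, Inf))
```

Without `finite = TRUE`, the maximum would be `Inf` and every finite element would be rescaled to 0.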
|
||||
|
||||
|
@ -173,11 +174,11 @@ These changes illustrate an important benefit of functions: because we've moved
|
|||
|
||||
### Mutate functions
|
||||
|
||||
Now you've got the basic idea of functions, lets take a look a whole bunch of examples.
|
||||
We'll start by looking at "mutate" functions, functions that work well like `mutate()` and `filter()` because they return an output the same length as the input.
|
||||
Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples.
|
||||
We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input.
|
||||
|
||||
Lets start with a simple variation of `rescale01()`.
|
||||
Maybe you want compute the Z-score, rescaling a vector to have to a mean of zero and a standard deviation of one:
|
||||
Let's start with a simple variation of `rescale01()`.
|
||||
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:
|
||||
|
||||
```{r}
|
||||
z_score <- function(x) {
|
||||
|
@ -185,7 +186,7 @@ z_score <- function(x) {
|
|||
}
|
||||
```
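
Here is a quick sanity check of the idea. The body of `z_score()` below is our assumption, based on the description above (subtract the mean, divide by the standard deviation), and the input vector is made up for illustration.

```{r}
# Assumed definition: subtract the mean, then divide by the standard deviation
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z <- z_score(c(2, 4, 6, 8, NA))
mean(z, na.rm = TRUE) # should be (numerically) zero
sd(z, na.rm = TRUE)   # should be one
```

Because the output has the same length as the input, missing values stay in place, which is what makes this kind of function safe to use inside `mutate()`.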
|
||||
|
||||
Or maybe you want to wrap up a straightforward `case_when()` in order to give it a useful name.
|
||||
Or maybe you want to wrap up a straightforward `case_when()` and give it a useful name.
|
||||
For example, this `clamp()` function ensures all values of a vector lie between a minimum and a maximum:
|
||||
|
||||
```{r}
|
||||
|
@ -196,6 +197,7 @@ clamp <- function(x, min, max) {
|
|||
.default = x
|
||||
)
|
||||
}
|
||||
|
||||
clamp(1:10, min = 3, max = 7)
|
||||
```
|
||||
|
||||
|
@ -209,11 +211,12 @@ na_outside <- function(x, min, max) {
|
|||
.default = x
|
||||
)
|
||||
}
|
||||
|
||||
na_outside(1:10, min = 3, max = 7)
|
||||
```
|
||||
|
||||
Of course functions don't just need to work with numeric variables.
|
||||
You might want to extract out some repeated string manipulation.
|
||||
You might want to do some repeated string manipulation.
|
||||
Maybe you need to make the first character upper case:
|
||||
|
||||
```{r}
|
||||
|
@ -221,6 +224,7 @@ first_upper <- function(x) {
|
|||
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
|
||||
x
|
||||
}
|
||||
|
||||
first_upper("hello")
|
||||
```
|
||||
|
||||
|
@ -237,11 +241,12 @@ clean_number <- function(x) {
|
|||
as.numeric(x)
|
||||
if_else(is_pct, num / 100, num)
|
||||
}
|
||||
|
||||
clean_number("$12,300")
|
||||
clean_number("45%")
|
||||
```
|
||||
|
||||
Sometimes your functions will be highly specialized for one data analysis.
|
||||
Sometimes your functions will be highly specialized for one data analysis step.
|
||||
For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:
|
||||
|
||||
```{r}
|
||||
|
@ -282,15 +287,17 @@ Sometimes this can just be a matter of setting a default argument or two:
|
|||
commas <- function(x) {
|
||||
str_flatten(x, collapse = ", ", last = " and ")
|
||||
}
|
||||
|
||||
commas(c("cat", "dog", "pigeon"))
|
||||
```
|
||||
|
||||
Or you might wrap up a simple computation, like for the coefficient of variation, which divides standard deviation by the mean:
|
||||
Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:
|
||||
|
||||
```{r}
|
||||
cv <- function(x, na.rm = FALSE) {
|
||||
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
|
||||
}
|
||||
|
||||
cv(runif(100, min = 0, max = 50))
|
||||
cv(runif(100, min = 0, max = 500))
|
||||
```
|
||||
|
@ -402,7 +409,7 @@ If we try and use it, we get an error:
|
|||
diamonds |> grouped_mean(cut, carat)
|
||||
```
|
||||
|
||||
To make the problem a bit more clear we can use a made up data frame:
|
||||
To make the problem a bit clearer, we can use a made-up data frame:
|
||||
|
||||
```{r}
|
||||
df <- tibble(
|
||||
|
@ -412,6 +419,7 @@ df <- tibble(
|
|||
x = 10,
|
||||
y = 100
|
||||
)
|
||||
|
||||
df |> grouped_mean(group, x)
|
||||
df |> grouped_mean(group, y)
|
||||
```
|
||||
|
@ -428,7 +436,7 @@ Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var
|
|||
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
|
||||
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
|
||||
|
||||
So to make grouped_mean`()` work we need to replace surround `group_var` and `mean_var()` with `{{ }}`:
|
||||
So to make `grouped_mean()` work, we need to surround `group_var` and `mean_var` with `{{ }}`:
|
||||
|
||||
```{r}
|
||||
grouped_mean <- function(df, group_var, mean_var) {
|
||||
|
@ -445,16 +453,16 @@ Success!
|
|||
### When to embrace? {#sec-embracing}
|
||||
|
||||
So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
|
||||
Fortunately this is easy because you can look it up from the documentation 😄.
|
||||
There are two terms to look for in the docs which corresponding to the two most common sub-types of tidy evaluation:
|
||||
Fortunately, this is easy because you can look it up from the documentation 😄.
|
||||
There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:
|
||||
|
||||
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.
|
||||
|
||||
- **Tidy-selection**: this is used for for functions like `select()`, `relocate()`, and `rename()` that select variables.
|
||||
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.
|
||||
|
||||
Your intuition about which arguments use tidy evaluation should be good for many common functions --- just think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
|
||||
|
||||
In the following sections we'll explore the sorts of handy functions you might write once you understand embracing.
|
||||
In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.
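
To make the compute-versus-select distinction concrete, here is a minimal sketch with one helper for each kind of tidy evaluation. The helper names `mean_of()` and `keep_cols()` are hypothetical and exist only for this illustration. It uses the built-in `mtcars` data.

```{r}
library(dplyr)

# Hypothetical helper wrapping a data-masking verb: `var` can be any computation
mean_of <- function(df, var) {
  df |> summarize(mean = mean({{ var }}, na.rm = TRUE))
}

# Hypothetical helper wrapping a tidy-selection verb: `cols` can be any selection
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

mtcars |> mean_of(mpg / wt)   # compute with variables
mtcars |> keep_cols(mpg:disp) # select a range of variables
```

In both cases the embraced argument inherits the behavior of the verb it is passed to, which is why checking the documentation of the wrapped function is enough.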
|
||||
|
||||
### Common use cases
|
||||
|
||||
|
@ -472,12 +480,13 @@ summary6 <- function(data, var) {
|
|||
.groups = "drop"
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> summary6(carat)
|
||||
```
|
||||
|
||||
(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
|
||||
|
||||
The nice thing about this function is because it wraps `summarize()` you can used it on grouped data:
|
||||
The nice thing about this function is that, because it wraps `summarize()`, you can use it on grouped data:
|
||||
|
||||
```{r}
|
||||
diamonds |>
|
||||
|
@ -485,7 +494,7 @@ diamonds |>
|
|||
summary6(carat)
|
||||
```
|
||||
|
||||
Because the arguments to summarize are data-masking that also means that the `var` argument to `summary6()` is data-masking.
|
||||
Furthermore, since the arguments to `summarize()` are data-masking, the `var` argument to `summary6()` is also data-masking.
|
||||
That means you can also summarize computed variables:
|
||||
|
||||
```{r}
|
||||
|
@ -494,7 +503,7 @@ diamonds |>
|
|||
summary6(log10(carat))
|
||||
```
|
||||
|
||||
To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.
|
||||
To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn how to use `across()`.
|
||||
|
||||
Another popular `summarize()` helper function is a version of `count()` that also computes proportions:
|
||||
|
||||
|
@ -505,6 +514,7 @@ count_prop <- function(df, var, sort = FALSE) {
|
|||
count({{ var }}, sort = sort) |>
|
||||
mutate(prop = n / sum(n))
|
||||
}
|
||||
|
||||
diamonds |> count_prop(clarity)
|
||||
```
|
||||
|
||||
|
@ -527,9 +537,9 @@ flights |> unique_where(month == 12, dest)
|
|||
flights |> unique_where(tailnum == "N14228", month)
|
||||
```
|
||||
|
||||
Here we embrace `condition` because it's passed to `filter()` and `var` because its passed to `distinct()` and `arrange()`.
|
||||
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
|
||||
|
||||
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
|
||||
We've written all these examples to take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
|
||||
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
|
||||
|
||||
```{r}
|
||||
|
@ -555,12 +565,13 @@ count_missing <- function(df, group_vars, x_var) {
|
|||
group_by({{ group_vars }}) |>
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
```
|
||||
|
||||
This doesn't work because `group_by()` uses data-masking, not tidy-selection.
|
||||
We can work around that problem by using the handy `pick()` which allows you to use use tidy-selection inside data-masking functions:
|
||||
We can work around that problem by using the handy `pick()` function, which allows you to use tidy-selection inside data-masking functions:
|
||||
|
||||
```{r}
|
||||
count_missing <- function(df, group_vars, x_var) {
|
||||
|
@ -568,6 +579,7 @@ count_missing <- function(df, group_vars, x_var) {
|
|||
group_by(pick({{ group_vars }})) |>
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
```
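
If you don't have nycflights13 at hand, the same pattern can be tried on a tiny made-up tibble. This is only a sketch: the `toy` data is invented for illustration, and we add `.groups = "drop"` (not in the version above) so the result comes back ungrouped. It assumes a version of dplyr that provides `pick()`.

```{r}
library(dplyr)

count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(n_miss = sum(is.na({{ x_var }})), .groups = "drop")
}

# Made-up data: two grouping columns and one column with missing values
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c(1, 1, 1, 2),
  x  = c(NA, NA, 3, NA)
)

toy |> count_missing(c(g1, g2), x)
```

The result has one row per combination of `g1` and `g2`, which confirms that the tidy-selection `c(g1, g2)` was turned into grouping variables.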
|
||||
|
@ -587,6 +599,7 @@ count_wide <- function(data, rows, cols) {
|
|||
values_fill = 0
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> count_wide(clarity, cut)
|
||||
diamonds |> count_wide(c(clarity, color), cut)
|
||||
```
|
||||
|
@ -595,9 +608,9 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
|
||||
### Exercises
|
||||
|
||||
1. Using the datasets from nycflights13, write functions that:
|
||||
1. Using the datasets from nycflights13, write a function that:
|
||||
|
||||
1. Find all flights that were cancelled (i.e. `is.na(arr_time)`) or delayed by more than an hour.
|
||||
1. Finds all flights that were cancelled (i.e. `is.na(arr_time)`) or delayed by more than an hour.
|
||||
|
||||
```{r}
|
||||
#| eval: false
|
||||
|
@ -632,7 +645,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
weather |> standardise_time(sched_dep_time)
|
||||
```
|
||||
|
||||
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
|
||||
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
|
||||
|
||||
3. Generalize the following function so that you can supply any number of variables to count.
|
||||
|
||||
|
@ -647,7 +660,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
## Plot functions
|
||||
|
||||
Instead of returning a data frame, you might want to return a plot.
|
||||
Fortunately you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
|
||||
Fortunately, you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
|
||||
For example, imagine that you're making a lot of histograms:
|
||||
|
||||
```{r}
|
||||
|
@ -662,7 +675,7 @@ diamonds |>
|
|||
```
|
||||
|
||||
Wouldn't it be nice if you could wrap this up into a histogram function?
|
||||
This is easy as once you know that `aes()` is a data-masking function so that you need to embrace:
|
||||
This is easy as pie once you know that `aes()` is a data-masking function, so its arguments need to be embraced:
|
||||
|
||||
```{r}
|
||||
histogram <- function(df, var, binwidth = NULL) {
|
||||
|
@ -674,7 +687,7 @@ histogram <- function(df, var, binwidth = NULL) {
|
|||
diamonds |> histogram(carat, 0.1)
|
||||
```
|
||||
|
||||
Note that `histogram()` returns a ggplot2 plot, so that you can still add on additional components if you want.
|
||||
Note that `histogram()` returns a ggplot2 plot, meaning you can still add on additional components if you want.
|
||||
Just remember to switch from `|>` to `+`:
|
||||
|
||||
```{r}
|
||||
|
@ -690,7 +703,6 @@ For example, maybe you want an easy way to eyeball whether or not a data set is
|
|||
|
||||
```{r}
|
||||
# https://twitter.com/tyler_js_smith/status/1574377116988104704
|
||||
|
||||
linearity_check <- function(df, x, y) {
|
||||
df |>
|
||||
ggplot(aes({{ x }}, {{ y }})) +
|
||||
|
@ -717,6 +729,7 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
|||
fun = fun,
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> hex_plot(carat, price, depth)
|
||||
```
|
||||
|
||||
|
@ -724,7 +737,7 @@ diamonds |> hex_plot(carat, price, depth)
|
|||
|
||||
Some of the most useful helpers combine a dash of dplyr with ggplot2.
|
||||
For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
|
||||
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
|
||||
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top (also note the `:=` operator, which lets you inject names with glue syntax on its left-hand side; run ?\`:=\` for more details):
|
||||
|
||||
```{r}
|
||||
sorted_bars <- function(df, var) {
|
||||
|
@ -733,10 +746,11 @@ sorted_bars <- function(df, var) {
|
|||
ggplot(aes(y = {{ var }})) +
|
||||
geom_bar()
|
||||
}
|
||||
|
||||
diamonds |> sorted_bars(cut)
|
||||
```
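
The `:=` operator is easier to see in isolation. The sketch below uses a hypothetical helper, `double_col()`, and a made-up tibble. It is needed because `{{ var }} = ...` is not valid R syntax, so `:=` stands in for `=` when the name on the left-hand side comes from an embraced argument.

```{r}
library(dplyr)

# Hypothetical helper: overwrite whichever column the caller names
double_col <- function(df, var) {
  df |> mutate({{ var }} := {{ var }} * 2)
}

tibble(a = 1:3, b = 4:6) |> double_col(b)
```

Only the column passed as `var` is modified; the other columns are left untouched.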
|
||||
|
||||
Or you could maybe you want to make it easy to draw a bar plot just for a subset of the data:
|
||||
Or maybe you want to make it easy to draw a bar plot just for a subset of the data:
|
||||
|
||||
```{r}
|
||||
conditional_bars <- function(df, condition, var) {
|
||||
|
@ -749,20 +763,19 @@ conditional_bars <- function(df, condition, var) {
|
|||
diamonds |> conditional_bars(cut == "Good", clarity)
|
||||
```
|
||||
|
||||
You can also get creative and display data summaries in other way.
|
||||
You can also get creative and display data summaries in other ways.
|
||||
For example, this code uses the axis labels to display the highest value.
|
||||
As you learn more about ggplot2, the power of your functions will continue to increase.
|
||||
|
||||
```{r}
|
||||
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
|
||||
|
||||
fancy_ts <- function(df, val, group) {
|
||||
labs <- df |>
|
||||
group_by({{group}}) |>
|
||||
summarize(breaks = max({{val}}))
|
||||
group_by({{ group }}) |>
|
||||
summarize(breaks = max({{ val }}))
|
||||
|
||||
df |>
|
||||
ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
|
||||
ggplot(aes(date, {{ val }}, group = {{ group }}, color = {{ group }})) +
|
||||
geom_path() +
|
||||
scale_y_continuous(
|
||||
breaks = labs$breaks,
|
||||
|
@ -778,6 +791,7 @@ df <- tibble(
|
|||
dist4 = sort(rnorm(50, 15, 1)),
|
||||
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
|
||||
)
|
||||
|
||||
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
|
||||
|
||||
fancy_ts(df, value, dist_name)
|
||||
|
@ -787,19 +801,19 @@ Next we'll discuss two more complicated cases: faceting and automatic labeling.
|
|||
|
||||
### Faceting
|
||||
|
||||
Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work.
|
||||
so you have to learn a new syntax.
|
||||
Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work.
|
||||
So you have to learn a new syntax.
|
||||
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
|
||||
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
|
||||
|
||||
```{r}
|
||||
# https://twitter.com/sharoz/status/1574376332821204999
|
||||
|
||||
foo <- function(x) {
|
||||
ggplot(mtcars, aes(mpg, disp)) +
|
||||
geom_point() +
|
||||
facet_wrap(vars({{ x }}))
|
||||
}
|
||||
|
||||
foo(cyl)
|
||||
```
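
The same idea extends to `facet_grid()`, whose `rows` and `cols` arguments also accept `vars()`. The helper below, `facet_scatter()`, is a hypothetical name used only for this sketch.

```{r}
library(ggplot2)

# Hypothetical helper: facet a scatterplot of mtcars by a row and a column variable
facet_scatter <- function(rows, cols) {
  ggplot(mtcars, aes(mpg, disp)) +
    geom_point() +
    facet_grid(rows = vars({{ rows }}), cols = vars({{ cols }}))
}

facet_scatter(cyl, am)
```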
|
||||
|
||||
|
@ -833,12 +847,12 @@ histogram <- function(df, var, binwidth = NULL) {
|
|||
```
|
||||
|
||||
Wouldn't it be nice if we could label the output with the variable and the bin width that was used?
|
||||
To do so, we're going to have to go under the covers of tidy evaluation and use a function from package we haven't talked about before: rlang.
|
||||
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about yet: rlang.
|
||||
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).
|
||||
|
||||
To solve the labeling problem we can use `rlang::englue()`.
|
||||
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
|
||||
But it also understands `{{ }}`, which automatically insert the appropriate variable name:
|
||||
But it also understands `{{ }}`, which automatically inserts the appropriate variable name:
|
||||
|
||||
```{r}
|
||||
histogram <- function(df, var, binwidth) {
|
||||
|
@ -853,16 +867,17 @@ histogram <- function(df, var, binwidth) {
|
|||
diamonds |> histogram(carat, 0.1)
|
||||
```
|
||||
|
||||
You can use the same approach any other place that you might supply a string in a ggplot2 plot.
|
||||
You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.
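
To see what `rlang::englue()` produces on its own, here is a small sketch outside of any plot. The helper name `label_for()` is hypothetical, and the sketch assumes the glue package is installed, since `englue()` relies on it.

```{r}
library(rlang)

# Hypothetical helper: build a label from an embraced variable and a plain value
label_for <- function(var, binwidth) {
  englue("A histogram of {{ var }} with binwidth {binwidth}")
}

label_for(carat, 0.1)
```

Note that `carat` is never evaluated here: `{{ var }}` is replaced by the name of the variable that was supplied, while `{binwidth}` is replaced by the value of the argument.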
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Build up a rich plotting function by incrementally implementing each of the steps below.
|
||||
1. Draw a scatterplot given dataset and `x` and `y` variables.
|
||||
Build up a rich plotting function by incrementally implementing each of the steps below:
|
||||
|
||||
2. Add a line of best fit (i.e. a linear model with no standard errors).
|
||||
1. Draw a scatterplot given a dataset and `x` and `y` variables.
|
||||
|
||||
3. Add a title.
|
||||
2. Add a line of best fit (i.e. a linear model with no standard errors).
|
||||
|
||||
3. Add a title.
|
||||
|
||||
## Style
|
||||
|
||||
|
@ -923,6 +938,7 @@ This makes it very obvious that something unusual is happening.
|
|||
f1 <- function(string, prefix) {
|
||||
substr(string, 1, nchar(prefix)) == prefix
|
||||
}
|
||||
|
||||
f3 <- function(x, y) {
|
||||
rep(y, length.out = length(x))
|
||||
}
|
||||
|
@ -935,8 +951,8 @@ This makes it very obvious that something unusual is happening.
|
|||
|
||||
## Summary
|
||||
|
||||
In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frames, or creating a plot.
|
||||
Along the way your saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.
|
||||
In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
|
||||
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.
|
||||
|
||||
We have only shown you the bare minimum to get started with functions and there's much more to learn.
|
||||
A few places to learn more are:
|
||||
|
|