Update functions.qmd (#1178)

mcsnowface, PhD 2022-12-08 13:38:35 -07:00 committed by GitHub
parent 0743cbd3aa
commit e5e66de3cf
1 changed file with 66 additions and 50 deletions


@ -75,7 +75,7 @@ Preventing this type of mistake is one very good reason to learn how to write
### Writing a function
To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary.
If we take the code above and pull it outside of `mutate()`, it's a little easier to see the pattern because each repetition is now one line:
```{r}
#| eval: false
@ -99,11 +99,11 @@ To turn this into a function you need three things:
Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.
2. The **arguments**.
The arguments are things that vary across calls, and our analysis above tells us that we have just one.
We'll call it `x` because this is the conventional name for a numeric vector.
3. The **body**.
The body is the code that's repeated across all the calls.
Then you create a function by following the template:
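The template code block itself isn't visible in the hunks above; as a sketch, plugging the name, arguments, and body identified above into that template gives something like:

```{r}
# name: rescale01, argument: x, body: the rescaling expression
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(-10, 0, 10))  # returns 0.0 0.5 1.0
```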
@ -143,7 +143,7 @@ df |> mutate(
### Improving our function
You might notice that the `rescale01()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once, we could compute both the minimum and maximum in one step with `range()`:
```{r}
rescale01 <- function(x) {
@ -166,6 +166,7 @@ rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```
@ -173,11 +174,11 @@ These changes illustrate an important benefit of functions: because we've moved
### Mutate functions
Now you've got the basic idea of functions, let's take a look at a whole bunch of examples.
We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input.
Let's start with a simple variation of `rescale01()`.
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:
```{r}
z_score <- function(x) {
@ -185,7 +186,7 @@ z_score <- function(x) {
}
```
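The body of `z_score()` falls outside the visible hunk; a minimal sketch, assuming the standard Z-score formula (subtract the mean, divide by the standard deviation):

```{r}
z_score <- function(x) {
  # center on the mean, then scale by the standard deviation
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z_score(c(1, 2, 3, 4, 5))
```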
Or maybe you want to wrap up a straightforward `case_when()` and give it a useful name.
For example, this `clamp()` function ensures all values of a vector lie between a minimum and a maximum:
```{r}
@ -196,6 +197,7 @@ clamp <- function(x, min, max) {
.default = x
)
}
clamp(1:10, min = 3, max = 7)
```
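The `case_when()` conditions inside `clamp()` are also cut off above; a sketch of the complete helper might look like this:

```{r}
library(dplyr)

clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,    # values below the floor become the floor
    x > max ~ max,    # values above the ceiling become the ceiling
    .default = x      # everything else is left alone
  )
}

clamp(1:10, min = 3, max = 7)
```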
@ -209,11 +211,12 @@ na_outside <- function(x, min, max) {
.default = x
)
}
na_outside(1:10, min = 3, max = 7)
```
Of course, functions don't just need to work with numeric variables.
You might want to do some repeated string manipulation.
Maybe you need to make the first character upper case:
```{r}
@ -221,6 +224,7 @@ first_upper <- function(x) {
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
}
first_upper("hello")
```
@ -237,11 +241,12 @@ clean_number <- function(x) {
as.numeric(x)
if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
clean_number("45%")
```
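Most of `clean_number()` sits outside the visible hunks; one way to write it, sketched with stringr and not necessarily the book's exact formulation:

```{r}
library(tidyverse)

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")       # remember which values were percentages
  num <- x |>
    str_remove_all("%") |>           # strip the formatting characters...
    str_remove_all(",") |>
    str_remove_all(fixed("$")) |>
    as.numeric()                     # ...then parse what's left
  if_else(is_pct, num / 100, num)    # convert percentages to proportions
}

clean_number("$12,300")  # 12300
clean_number("45%")      # 0.45
```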
Sometimes your functions will be highly specialized for one data analysis step.
For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:
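The code block that follows is reduced to its opening fence by the hunk boundary; a sketch of such a helper (the name `fix_na()` is illustrative):

```{r}
library(dplyr)

# treat the sentinel codes 997, 998, and 999 as missing
# (passing a bare NA to if_else() requires dplyr >= 1.1.0)
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}

fix_na(c(1, 997, 5, 999))
```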
```{r}
@ -282,15 +287,17 @@ Sometimes this can just be a matter of setting a default argument or two:
commas <- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
```
Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:
```{r}
cv <- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
cv(runif(100, min = 0, max = 500))
```
@ -402,7 +409,7 @@ If we try and use it, we get an error:
diamonds |> grouped_mean(cut, carat)
```
To make the problem a bit more clear, we can use a made-up data frame:
```{r}
df <- tibble(
@ -412,6 +419,7 @@ df <- tibble(
x = 10,
y = 100
)
df |> grouped_mean(group, x)
df |> grouped_mean(group, y)
```
@ -428,7 +436,7 @@ Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
So to make `grouped_mean()` work, we need to surround `group_var` and `mean_var` with `{{ }}`:
```{r}
grouped_mean <- function(df, group_var, mean_var) {
@ -445,16 +453,16 @@ Success!
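The body of the corrected `grouped_mean()` is truncated above; a sketch of the embraced version the prose describes:

```{r}
library(tidyverse)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>    # use the variable stored in group_var
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE))
}

diamonds |> grouped_mean(cut, carat)
```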
### When to embrace? {#sec-embracing}
So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
Fortunately, this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.
Your intuition about which arguments use tidy evaluation should be good for many common functions --- just think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.
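To make the two sub-types concrete, here is a small illustrative pair (hypothetical helpers, not taken from the chapter): one embraces an argument passed to a data-masking function, the other an argument passed to a tidy-selection function:

```{r}
library(tidyverse)

# data-masking: filter() computes with the variable, so we embrace it
big_values <- function(df, var, threshold) {
  df |> filter({{ var }} > threshold)
}

# tidy-selection: select() picks variables, and embracing works there too
keep_vars <- function(df, vars) {
  df |> select({{ vars }})
}

diamonds |> big_values(carat, 3)
diamonds |> keep_vars(c(carat, price))
```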
### Common use cases
@ -472,12 +480,13 @@ summary6 <- function(data, var) {
.groups = "drop"
)
}
diamonds |> summary6(carat)
```
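Only the tail of `summary6()` is visible above; a sketch of a full six-number summary in the same spirit (the exact set of statistics is an assumption):

```{r}
library(tidyverse)

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"    # always return an ungrouped result
  )
}

diamonds |> summary6(carat)
```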
(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
The nice thing about this function is that, because it wraps `summarize()`, you can use it on grouped data:
```{r}
diamonds |>
@ -485,7 +494,7 @@ diamonds |>
summary6(carat)
```
Furthermore, since the arguments to `summarize()` are data-masking, the `var` argument to `summary6()` is also data-masking.
That means you can also summarize computed variables:
```{r}
@ -494,7 +503,7 @@ diamonds |>
summary6(log10(carat))
```
To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn how to use `across()`.
Another popular `summarize()` helper function is a version of `count()` that also computes proportions:
@ -505,6 +514,7 @@ count_prop <- function(df, var, sort = FALSE) {
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
diamonds |> count_prop(clarity)
```
@ -527,9 +537,9 @@ flights |> unique_where(month == 12, dest)
flights |> unique_where(tailnum == "N14228", month)
```
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
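The definition of `unique_where()` falls outside the hunks shown; given that description, a sketch might be:

```{r}
library(tidyverse)
library(nycflights13)

unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>    # condition is data-masked, so embrace it
    distinct({{ var }}) |>        # var is data-masked here...
    arrange({{ var }})            # ...and here
}

flights |> unique_where(month == 12, dest)
```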
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
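The function that follows is cut down to its opening fence; a sketch of what it might look like (the name `subset_flights()` is an assumption):

```{r}
library(tidyverse)
library(nycflights13)

subset_flights <- function(rows, cols) {
  flights |>
    filter({{ rows }}) |>
    # always keep the compound primary key
    select(time_hour, carrier, flight, {{ cols }})
}

subset_flights(dest == "IAH", contains("time"))
```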
```{r}
@ -555,12 +565,13 @@ count_missing <- function(df, group_vars, x_var) {
group_by({{ group_vars }}) |>
summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
```
This doesn't work because `group_by()` uses data-masking, not tidy-selection.
We can work around that problem by using the handy `pick()` function, which allows you to use tidy-selection inside data-masking functions:
```{r}
count_missing <- function(df, group_vars, x_var) {
@ -568,6 +579,7 @@ count_missing <- function(df, group_vars, x_var) {
group_by(pick({{ group_vars }})) |>
summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
```
@ -587,6 +599,7 @@ count_wide <- function(data, rows, cols) {
values_fill = 0
)
}
diamonds |> count_wide(clarity, cut)
diamonds |> count_wide(c(clarity, color), cut)
```
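Most of `count_wide()` falls outside the hunk; a sketch combining `pick()` with `pivot_wider()`, in line with the surrounding examples:

```{r}
library(tidyverse)

count_wide <- function(data, rows, cols) {
  data |>
    count(pick(c({{ rows }}, {{ cols }}))) |>
    pivot_wider(
      names_from = {{ cols }},
      values_from = n,
      names_sort = TRUE,
      values_fill = 0
    )
}

diamonds |> count_wide(c(clarity, color), cut)
```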
@ -595,9 +608,9 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
### Exercises
1. Using the datasets from nycflights13, write a function that:

    1. Finds all flights that were cancelled (i.e. `is.na(arr_time)`) or delayed by more than an hour.
```{r}
#| eval: false
@ -632,7 +645,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
weather |> standardise_time(sched_dep_time)
```
2. For each of the following functions, list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
3. Generalize the following function so that you can supply any number of variables to count.
@ -647,7 +660,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
## Plot functions
Instead of returning a data frame, you might want to return a plot.
Fortunately, you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
For example, imagine that you're making a lot of histograms:
```{r}
@ -662,7 +675,7 @@ diamonds |>
```
Wouldn't it be nice if you could wrap this up into a histogram function?
This is easy as pie once you know that `aes()` is a data-masking function and you need to embrace:
```{r}
histogram <- function(df, var, binwidth = NULL) {
@ -674,7 +687,7 @@ histogram <- function(df, var, binwidth = NULL) {
diamonds |> histogram(carat, 0.1)
```
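The body of `histogram()` is again truncated; a minimal sketch, embracing `var` inside `aes()`:

```{r}
library(tidyverse)

histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) +    # aes() is data-masking, so embrace var
    geom_histogram(binwidth = binwidth)
}

diamonds |> histogram(carat, 0.1)
```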
Note that `histogram()` returns a ggplot2 plot, meaning you can still add on additional components if you want.
Just remember to switch from `|>` to `+`:
```{r}
@ -690,7 +703,6 @@ For example, maybe you want an easy way to eyeball whether or not a data set is
```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check <- function(df, x, y) {
df |>
ggplot(aes({{ x }}, {{ y }})) +
@ -717,6 +729,7 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
fun = fun,
)
}
diamonds |> hex_plot(carat, price, depth)
```
@ -724,7 +737,7 @@ diamonds |> hex_plot(carat, price, depth)
Some of the most useful helpers combine a dash of dplyr with ggplot2.
For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top (also note the `:=` operator, which allows you to inject names with glue syntax on the left-hand side of `:=`; type ?\`:=\` for more details):
```{r}
sorted_bars <- function(df, var) {
@ -733,10 +746,11 @@ sorted_bars <- function(df, var) {
ggplot(aes(y = {{ var }})) +
geom_bar()
}
diamonds |> sorted_bars(cut)
```
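The `mutate()` step of `sorted_bars()`, the one that actually uses `:=`, sits outside the visible hunk; a sketch:

```{r}
library(tidyverse)

sorted_bars <- function(df, var) {
  df |>
    # := lets us use an embraced variable on the left-hand side of the assignment
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(cut)
```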
Or maybe you want to make it easy to draw a bar plot just for a subset of the data:
```{r}
conditional_bars <- function(df, condition, var) {
@ -749,20 +763,19 @@ conditional_bars <- function(df, condition, var) {
diamonds |> conditional_bars(cut == "Good", clarity)
```
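Similarly, the body of `conditional_bars()` is truncated; a sketch that filters with the embraced condition and then draws the bars:

```{r}
library(tidyverse)

conditional_bars <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>    # condition is data-masked
    ggplot(aes(x = {{ var }})) +
    geom_bar()
}

diamonds |> conditional_bars(cut == "Good", clarity)
```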
You can also get creative and display data summaries in other ways.
For example, this code uses the axis labels to display the highest value.
As you learn more about ggplot2, the power of your functions will continue to increase.
```{r}
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts <- function(df, val, group) {
labs <- df |>
group_by({{ group }}) |>
summarize(breaks = max({{ val }}))
df |>
ggplot(aes(date, {{ val }}, group = {{ group }}, color = {{ group }})) +
geom_path() +
scale_y_continuous(
breaks = labs$breaks,
@ -778,6 +791,7 @@ df <- tibble(
dist4 = sort(rnorm(50, 15, 1)),
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)
@ -787,19 +801,19 @@ Next we'll discuss two more complicated cases: faceting and automatic labeling.
### Faceting
Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work.
So you have to learn a new syntax.
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
```{r}
# https://twitter.com/sharoz/status/1574376332821204999
foo <- function(x) {
ggplot(mtcars, aes(mpg, disp)) +
geom_point() +
facet_wrap(vars({{ x }}))
}
foo(cyl)
```
@ -833,12 +847,12 @@ histogram <- function(df, var, binwidth = NULL) {
```
Wouldn't it be nice if we could label the output with the variable and the bin width that was used?
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about before: rlang.
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).
To solve the labeling problem we can use `rlang::englue()`.
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
But it also understands `{{ }}`, which automatically inserts the appropriate variable name:
```{r}
histogram <- function(df, var, binwidth) {
@ -853,16 +867,17 @@ histogram <- function(df, var, binwidth) {
diamonds |> histogram(carat, 0.1)
```
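The labelling code is cut from the block above; a sketch of how `rlang::englue()` can build the title (the exact label text is illustrative):

```{r}
library(tidyverse)

histogram <- function(df, var, binwidth) {
  # {{ var }} inserts the variable name, { binwidth } inserts the value
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
```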
You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.
### Exercises
Build up a rich plotting function by incrementally implementing each of the steps below:

1. Draw a scatterplot given a dataset and `x` and `y` variables.

2. Add a line of best fit (i.e. a linear model with no standard errors).

3. Add a title.
## Style
@ -923,6 +938,7 @@ This makes it very obvious that something unusual is happening.
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
@ -935,8 +951,8 @@ This makes it very obvious that something unusual is happening.
## Summary
In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.
We have only shown you the bare minimum to get started with functions and there's much more to learn.
A few places to learn more are: