More work on programming

parent 3e167168e7 · commit 765d1c8191 · functions.qmd (377 changed lines)
@ -11,8 +11,6 @@ status("drafting")
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
-You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).

Writing a function has three big advantages over using copy-and-paste:

1.  You can give a function an evocative name that makes your code easier to understand.
@ -21,9 +19,8 @@ Writing a function has three big advantages over using copy-and-paste:
3.  You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

Writing good functions is a lifetime journey.
Even after using R for many years we still learn new techniques and better ways of approaching old problems.
-The goal of this chapter is to get you started on your journey with functions with three useful types of functions:
+A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
+The goal of this chapter is to get you started on your journey with three useful types of functions:

- Vector functions take one or more vectors as input and return a vector as output.
- Data frame functions take a data frame as input and return a data frame as output.
@ -31,15 +28,14 @@ The goal of this chapter is to get you started on your journey with functions wi
The chapter concludes with some advice on function style.

-Many of the examples in this chapter were inspired by real data analysis code supplied by folks on twitter.
-We've often simplified the code from the original so you might want to look at the original tweets which we list in the comments.
-If you want just to see a huge variety of functions, check out the motivating tweets: https://twitter.com/hadleywickham/status/1574373127349575680, https://twitter.com/hadleywickham/status/1571603361350164486 A big thanks to everyone who contributed!
+We won't fully explain all of the functions that we use here, so you might need to do some reading of the documentation.
+This chapter includes many examples to help you generalize the patterns that you see.
+Many of the examples were inspired by real data analysis code supplied by folks on twitter; follow the links in the comments to see the original inspiration.
+And if you want to see even more examples, check out the motivating tweets for [general functions](https://twitter.com/hadleywickham/status/1571603361350164486) and [plotting functions](https://twitter.com/hadleywickham/status/1574373127349575680).

### Prerequisites

We'll wrap up a variety of functions from around the tidyverse.
-We'll also use nycflights13 as a source of relatively familiar data to apply our functions to.
+We'll also use nycflights13 as a source of familiar data to use our functions with.

```{r}
#| message: false
@ -58,7 +54,6 @@ pick <- function(cols) {
## Vector functions

We'll begin with vector functions: functions that take one or more vectors and return a vector result.

For example, take a look at this code.
What does it do?
@ -89,7 +84,7 @@ Preventing this type of mistake of is one very good reason to learn how to write
### Writing a function

-To write a function you need to first analyse your repeated to figure what parts of the repeated code is constant and what parts vary.
+To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary.
If we take the code above and pull it outside of `mutate()` it's a little easier to see the pattern because each repetition is now one line:

```{r}
@ -108,19 +103,17 @@ To make this a bit clearer we can replace the bit that varies with `█`:
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```

There's only one thing that varies which implies we're going to need a function with one argument.

-To turn this into an actual function you need three things:
+To turn this into a function you need three things:

1.  A **name**.
-    Here we might use `rescale01` because this function rescales a vector to lie between 0 and 1.
+    Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.

2.  The **arguments**.
-    The arguments are things that vary across calls.
-    Here we have just one argument which we're going to call `x` because this is a conventional name for a numeric vector.
+    The arguments are things that vary across calls, and our analysis above tells us that we have just one.
+    We'll call it `x` because this is the conventional name for a numeric vector.

3.  The **body**.
-    The body is the code that is the in all the calls.
+    The body is the code that is repeated across all the calls.

Then you create a function by following the template:
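The template code itself falls in the gap between hunks. Filled in for this example, it would look something like this (a sketch assembled from the name, argument, and body described above):

```{r}
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(0, 5, 10))
#> [1] 0.0 0.5 1.0
```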
@ -190,29 +183,20 @@ These changes illustrate an important benefit of functions: because we've moved
### Mutate functions

-Let's look at a few more vector functions before you get some practice writing your own.
-We'll start by looking at a few useful functions that work well in functions like `mutate()` and `filter()` because they return an output the same length as the input.
-The goal of these sections is to expose you to a bunch of different functions to get your creative juices flowing, and to give you plenty of examples to generalize the structure and utility of functions from.
+Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples.
+We'll start by looking at "mutate" functions, functions that work well inside `mutate()` and `filter()` because they return an output the same length as the input.

-For example, maybe instead of rescaling to min 0, max 1, you want to rescale to mean zero, standard deviation one:
+Let's start with a simple variation of `rescale01()`.
+Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:

```{r}
-rescale_z <- function(x) {
+z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```
-Sometimes your functions are highly specialised for one data analysis.
-For example, you might have a bunch of variables that record missing values as 997, 998, or 999:
-
-```{r}
-fix_na <- function(x) {
-  if_else(x %in% c(997, 998, 999), NA, x)
-}
-```
-
-Other cases, you might be wrapping up a simple a `case_when()` to give it a standard name.
-For example, the `clamp()` function ensures all values of a vector lie in between a minimum or a maximum:
+Or maybe you want to wrap up a straightforward `case_when()` in order to give it a useful name.
+For example, this `clamp()` function ensures all values of a vector lie between a minimum and a maximum:
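The body of `clamp()` is cut off at the hunk boundary below. Since the text describes it as a wrapped-up `case_when()`, a plausible completion is:

```{r}
clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)
```

An equivalent one-liner would be `pmin(pmax(x, min), max)`.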

```{r}
clamp <- function(x, min, max) {
@ -228,19 +212,19 @@ clamp(1:10, min = 3, max = 7)
Or maybe you'd rather mark those values as `NA`s:

```{r}
-discard_outside <- function(x, min, max) {
+na_outside <- function(x, min, max) {
  case_when(
    x < min ~ NA,
    x > max ~ NA,
    .default = x
  )
}
-discard_outside(1:10, min = 3, max = 7)
+na_outside(1:10, min = 3, max = 7)
```

Of course functions don't just need to work with numeric variables.
You might want to extract out some repeated string manipulation.
-Maybe you need to make the first character of each vector upper case:
+Maybe you need to make the first character upper case:
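The body of `first_upper()` is truncated in the block below. A self-contained sketch of a plausible implementation using stringr (an assumption, not necessarily the original code):

```{r}
first_upper <- function(x) {
  # replace the first character of each string with its upper-case form
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper("hello")
#> [1] "Hello"
```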

```{r}
first_upper <- function(x) {
@ -267,8 +251,19 @@ clean_number("$12,300")
clean_number("45%")
```

-There's no reason that your function can't take multiple vector inputs.
-For example, you might want to compute the distance between two locations on the globe using the haversine formula:
+Sometimes your functions will be highly specialized for one data analysis.
+For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:
+
+```{r}
+fix_na <- function(x) {
+  if_else(x %in% c(997, 998, 999), NA, x)
+}
+```
+
+We've focused on examples that take a single vector because we think they're the most common.
+But there's no reason that your function can't take multiple vector inputs.
+For example, you might want to compute the distance between two locations on the globe using the haversine formula.
+This requires four vectors:
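The implementation itself sits in the gap between hunks. Here is a sketch of the standard haversine computation; the earth-radius constant and degree-to-radian conversion are assumptions based on the usual formula, and only the signature (with its `round = 3` default) appears in the diff:

```{r}
haversine <- function(long1, lat1, long2, lat2, round = 3) {
  # convert degrees to radians
  rad <- pi / 180
  dlong <- (long2 - long1) * rad
  dlat <- (lat2 - lat1) * rad

  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlong / 2)^2
  d <- 6371 * 2 * asin(sqrt(a)) # 6371 km is the earth's mean radius

  round(d, round)
}
```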

```{r}
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
@ -290,17 +285,17 @@ haversine <- function(long1, lat1, long2, lat2, round = 3) {
### Summary functions

-In other cases you want a function that returns a single value for use in `summary()`.
-Sometimes this can just be a matter of setting a default argument:
+Another important family of vector functions is summary functions, functions that return a single value for use in `summarize()`.
+Sometimes this can just be a matter of setting a default argument or two:

```{r}
commas <- function(x) {
-  str_flatten(x, collapse = ", ")
+  str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
```

-Or performing some very simple computation, like computing the coefficient of variation, which standardizes the standard deviation by dividing it by the mean:
+Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:
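The body of `cv()` is truncated below; a minimal sketch consistent with the description (standard deviation divided by the mean, with an `na.rm` pass-through):

```{r}
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(c(10, 10, 10))
#> [1] 0
```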

```{r}
cv <- function(x, na.rm = FALSE) {
@ -320,7 +315,7 @@ n_missing <- function(x) {
```

You can also write functions with multiple vector inputs.
-For example, maybe you want to compute the mean absolute prediction error to help you comparing model predictions with actual values:
+For example, maybe you want to compute the mean absolute prediction error to help you compare model predictions with actual values:
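The `mape()` definition below is cut off at the hunk boundary. One common formulation, as a sketch (the original may differ, e.g. in exactly how it scales the error):

```{r}
mape <- function(actual, predicted) {
  # mean of the absolute errors, relative to the actual values
  sum(abs((actual - predicted) / actual)) / length(actual)
}
```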

```{r}
# https://twitter.com/neilgcurrie/status/1571607727255834625
@ -329,6 +324,17 @@ mape <- function(actual, predicted) {
}
```

+::: callout-note
+## RStudio
+
+Once you start writing functions, there are two RStudio shortcuts that are super useful:
+
+- To find the definition of a function that you've written, place the cursor on the name of the function and press `F2`.
+
+- To quickly jump to a function, press `Ctrl + .` to open the fuzzy file and function finder and type the first few letters of your function name.
+  You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
+:::

### Exercises

1.  Practice turning the following code snippets into functions.
@ -377,9 +383,13 @@ mape <- function(actual, predicted) {
## Data frame functions

-Vector functions are useful for pulling out code that's repeated within dplyr verbs.
-In this section, you'll learn how to write "data frame" functions which pull out code that's repeated across multiple pipelines.
-These functions work in the same way as dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and usually return a data frame.
+Vector functions are useful for pulling out code that's repeated within a dplyr verb.
+But you'll often also repeat the verbs themselves, particularly within a large pipeline.
+When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function.
+Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.

To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, `{{ }}`.
With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.

### Indirection and tidy evaluation
@ -411,7 +421,7 @@ df |> pull_unique(y)
```

Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
-This is a problem of indirection, and it arises because dplyr allows you to refer to the names of variables inside your data frame without any special treatment, so called **tidy evaluation**.
+This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.

Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
@ -420,7 +430,7 @@ Here we need some way tell `distinct()` and `pull()` not to treat `var` as the n
Tidy evaluation includes a solution to this problem called **embracing**.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
-One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a function look inside of `var` rather than looking for a variable called `var`.
+One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.

So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
@ -433,28 +443,23 @@ pull_unique <- function(df, var) {
diamonds |> pull_unique(clarity)
```

Success!

### When to embrace?

-So the art of writing data frame functions is basically just figuring out which arguments need to be embraced.
+So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
Fortunately this is easy because you can look it up from the documentation 😄.
-There are two terms to look for in the docs:
+There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:

-- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` which do computation with variables.
+- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` that compute with variables.

-- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select groups of variables.
+- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.

When you start looking closely at the documentation, you'll notice that many dplyr functions use `…`.
This is a special shorthand syntax that matches any arguments that aren't otherwise explicitly matched.
For example, `arrange()` uses data-masking for `…` and `select()` uses tidy-select for `…`.
-Your intuition about which arguments use tidy evaluation should be good for many common functions --- just think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
+Your intuition for many common functions should be pretty good --- think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).
There are a few cases where it's harder to tell because you usually use them with a single variable, which uses the same syntax for both data-masking and tidy-selection.
For example, the arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
If you're ever confused, just look at the docs.
-In the following sections we'll explore the sorts of handy functions you might write once you understand embracing.
+In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-selection arguments.

-### Summary basics
+### Common use cases

If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
@ -494,9 +499,7 @@ diamonds |>

To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.

-### Count variations
-
-Another popular helper function is a version of `count()` that also computes proportions:
+Another popular `summarise()` helper function is a version of `count()` that also computes proportions:
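The body of `count_prop()` is truncated below. A sketch of such a helper, in which only `var` is embraced (`sort` is an ordinary value, not a variable name):

```{r}
count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}
```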

```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
@ -508,59 +511,7 @@ count_prop <- function(df, var, sort = FALSE) {
diamonds |> count_prop(clarity)
```

-This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced.
-`var` is passed to `count()` which uses data-masking for all variables in `…`.
-
-Sometimes you want to select variables inside a function that uses data-masking.
-For example, imagine you want to write `count_missing()` that counts the number of missing observations in rows.
-You might try writing something like:
-
-```{r}
-#| error: true
-count_missing <- function(df, group_vars, x_var) {
-  df |>
-    group_by({{ group_vars }}) |>
-    summarise(n_miss = sum(is.na({{ x_var }})))
-}
-flights |>
-  count_missing(c(year, month, day), dep_time)
-```
-
-This doesn't work because `group_by()` uses data-masking not tidy-select.
-We can work around that problem by using `pick()` which allows you to use tidy-select inside data-masking functions:
-
-```{r}
-count_missing <- function(df, group_vars, x_var) {
-  df |>
-    group_by(pick({{ group_vars }})) |>
-    summarise(n_miss = sum(is.na({{ x_var }})))
-}
-flights |>
-  count_missing(c(year, month, day), dep_time)
-```
-
-Another useful helper that uses `pick()` is to make a 2d table of counts.
-Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange:
-
-```{r}
-# https://twitter.com/pollicipes/status/1571606508944719876
-count_wide <- function(data, rows, cols) {
-  data |>
-    count(pick(c({{ rows }}, {{ cols }}))) |>
-    pivot_wider(
-      names_from = {{ cols }},
-      values_from = n,
-      names_sort = TRUE,
-      values_fill = 0
-    )
-}
-diamonds |> count_wide(clarity, cut)
-diamonds |> count_wide(c(clarity, color), cut)
-```
-
-We didn't discuss `pivot_wider()` above, but you can read the docs to discover that `names_from` uses the tidy-select style of tidy evaluation.
-
-### Selecting rows and columns
+This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in `…`.

Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
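The definition of `unique_where()` falls in the gap between hunks. A sketch consistent with the surrounding description, where `condition` is passed to `filter()` and `var` to `distinct()`, `arrange()`, and `pull()`:

```{r}
unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }}) |>
    pull({{ var }})
}
```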
@ -582,9 +533,8 @@ flights |> unique_where(tailnum == "N14228", month)

Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()`, `arrange()`, and `pull()`.

-We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data frame, it can make sense to hard code it.
-For example, this function always works with the flights dataset, make it easy to grab the subset that you want to work with.
-It always includes `time_hour`, `carrier`, and `flight` since these are the primary key that allows you to identify a row.
+We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
+For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight`, since they form the compound primary key that allows you to identify a row.
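The diff truncates `flights_sub()` below. A sketch based on the description: it always filters `flights` and always selects the primary-key columns alongside whatever `cols` picks out:

```{r}
flights_sub <- function(rows, cols) {
  flights |>
    filter({{ rows }}) |>
    select(time_hour, carrier, flight, {{ cols }})
}
```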

```{r}
flights_sub <- function(rows, cols) {
@ -596,11 +546,60 @@ flights_sub <- function(rows, cols) {
flights_sub(dest == "IAH", contains("time"))
```

+### Data-masking vs tidy-selection
+
+Sometimes you want to select variables inside a function that uses data-masking.
+For example, imagine you want to write `count_missing()` that counts the number of missing observations in rows.
+You might try writing something like:
+
+```{r}
+#| error: true
+count_missing <- function(df, group_vars, x_var) {
+  df |>
+    group_by({{ group_vars }}) |>
+    summarise(n_miss = sum(is.na({{ x_var }})))
+}
+flights |>
+  count_missing(c(year, month, day), dep_time)
+```
+
+This doesn't work because `group_by()` uses data-masking, not tidy-selection.
+We can work around that problem by using the handy `pick()` which allows you to use tidy-selection inside data-masking functions:
+
+```{r}
+count_missing <- function(df, group_vars, x_var) {
+  df |>
+    group_by(pick({{ group_vars }})) |>
+    summarise(n_miss = sum(is.na({{ x_var }})))
+}
+flights |>
+  count_missing(c(year, month, day), dep_time)
+```
+
+Another convenient use of `pick()` is to make a 2d table of counts.
+Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange into a grid:
+
+```{r}
+# https://twitter.com/pollicipes/status/1571606508944719876
+count_wide <- function(data, rows, cols) {
+  data |>
+    count(pick(c({{ rows }}, {{ cols }}))) |>
+    pivot_wider(
+      names_from = {{ cols }},
+      values_from = n,
+      names_sort = TRUE,
+      values_fill = 0
+    )
+}
+diamonds |> count_wide(clarity, cut)
+diamonds |> count_wide(c(clarity, color), cut)
+```
+
+While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the `pivot_wider()` docs you can see that `names_from` uses tidy-selection.
+
+### Learning more
+
+This section has introduced you to some of the power and flexibility of tidy evaluation with dplyr (and a dash of tidyr).
+We've only used the smallest part of tidy evaluation, embracing, and it already gives you considerable power to reduce duplication in your data analyses.
+You can learn more advanced techniques in `vignette("programming", package = "dplyr")`.

### Exercises

## Plot functions
@ -644,7 +643,7 @@ diamonds |>
### More variables

It's straightforward to add more variables to the mix.
-For example, maybe you want an easy way to eye ball whether or not a data set is linear by overlaying a smooth line and a straight line:
+For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:

```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
@ -662,7 +661,7 @@ starwars |>
  linearity_check(mass, height)
```

-Or you want to wrap up an alternative for a scatterplot that uses colour to display a third variable, for very large datasets where overplotting is a problem:
+Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:

```{r}
# https://twitter.com/ppaxisa/status/1574398423175921665
@ -670,7 +669,7 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
  df |>
    ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
    stat_summary_hex(
-      aes(colour = after_scale(fill)),
+      aes(colour = after_scale(fill)), # make border same colour as fill
      bins = bins,
      fun = fun,
    )
@ -681,8 +680,8 @@ diamonds |> hex_plot(carat, price, depth)
### Combining with dplyr

Some of the most useful helpers combine a dash of dplyr with ggplot2.
-For example, if you might want to do a bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
-And we're drawing the vertical bars, so you need to reverse the usual order to get the highest values at the top:
+For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
+Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
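The body of `sorted_bars()` is truncated below. A sketch combining `fct_infreq()` and `fct_rev()` as the text describes; note the `:=` ("walrus") operator, which is needed because we're assigning to an embraced variable name:

```{r}
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(cut)
```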

```{r}
sorted_bars <- function(df, var) {
@ -694,7 +693,22 @@ sorted_bars <- function(df, var) {
diamonds |> sorted_bars(cut)
```

-You can also get creative and display data summaries in other way:
+Or maybe you want to make it easy to draw a bar plot just for a subset of the data:
+
+```{r}
+conditional_bars <- function(df, condition, var) {
+  df |>
+    filter({{ condition }}) |>
+    ggplot(aes({{ var }})) +
+    geom_bar()
+}
+
+diamonds |> conditional_bars(cut == "Good", clarity)
+```
+
+You can also get creative and display data summaries in other ways.
+For example, this code uses the axis labels to display the highest value.
+As you learn more about ggplot2, the power of your functions will continue to increase.

```{r}
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
@ -724,15 +738,14 @@ df <- tibble(
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")

fancy_ts(df, value, dist_name)
```

-Next we'll discuss two more complicated cases: facetting and automatic labelling.
+Next we'll discuss two more complicated cases: faceting and automatic labeling.

-### Facetting
+### Faceting

-Unfortunately programming with facetting is a special challenge, because facetting was implemented before we understood what tidy evaluation was and how it should work.
-Unlike `aes()`, it wasn't straightforward to backport to tidy evalution, so you have to learn a new syntax.
+Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax.
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:
@ -746,17 +759,19 @@ foo <- function(x) {
    geom_point() +
    facet_wrap(vars({{ x }}))
}
foo(cyl)
```

-We've written these functions so that you can supply any data frame, but there are also advantages to hardcoding a data frame, if you're using it repeatedly:
+As with data frame functions, it can also be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable.
+The following function makes it particularly easy to interactively explore the conditional distribution of `bill_length_mm` from the palmerpenguins dataset.

```{r}
# https://twitter.com/yutannihilat_en/status/1574387230025875457
-density <- function(fill, ...) {
+density <- function(fill, facets) {
  palmerpenguins::penguins |>
    ggplot(aes(bill_length_mm, fill = {{ fill }})) +
    geom_density(alpha = 0.5) +
-    facet_wrap(vars(...))
+    facet_wrap(vars({{ facets }}))
}

density()
|
@ -766,43 +781,28 @@ density(island, sex)
|
|||
|
||||
Also note that we hardcoded the `x` variable but allowed the fill to vary.
|
||||
|
||||
```{r}
|
||||
bars <- function(df, condition, var) {
|
||||
df |>
|
||||
filter({{ condition }}) |>
|
||||
ggplot(aes({{ var }})) +
|
||||
geom_bar() +
|
||||
scale_x_discrete(guide = guide_axis(angle = 45))
|
||||
}
|
||||
|
||||
diamonds |> bars(cut == "Good", clarity)
|
||||
```
|
||||
|
||||
### Labelling
|
||||
|
||||
It'd be nice to label this plot automatically.
|
||||
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we have talked about before: rlang.
|
||||
rlang is the package that implements tidy evaluation, and is used by all the other packages in the tidyverse.
|
||||
rlang provides a helpful function called `englue()` to solve just this problem.
|
||||
It uses a syntax inspired by glue but combined with embracing:
|
||||
|
||||
```{r}
|
||||
# https://twitter.com/ppaxisa/status/1574398423175921665
|
||||
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
||||
df |>
|
||||
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
|
||||
stat_summary_hex(
|
||||
aes(colour = after_scale(fill)),
|
||||
bins = bins,
|
||||
fun = fun,
|
||||
) +
|
||||
labs(colour = rlang::englue("{{z}}"))
|
||||
}
|
||||
diamonds |> hex_plot(carat, price, depth)
|
||||
```
|
||||
Remember the histogram function we showed you earlier?
|
||||
|
||||
```{r}
|
||||
histogram <- function(df, var, binwidth = NULL) {
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
geom_histogram(binwidth = binwidth)
|
||||
}
|
||||
```

Wouldn't it be nice if we could label the output with the variable and the binwidth that was used?
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a new package: rlang.
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (and provides many other useful tools).

To solve the labelling problem we can use `rlang::englue()`.
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
But unlike `str_glue()`, it also understands `{{ }}`, which automatically inserts the appropriate variable name.
```{r}
histogram <- function(df, var, binwidth) {
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}
```

@ -819,23 +819,7 @@ Hopefully it'll be fixed soon!)
You can use the same approach in any other place where you might supply a string in a ggplot2 plot.
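For instance, a minimal sketch of that idea (not from the book; `scatter` and its arguments are hypothetical names for illustration) uses `englue()` to build a plot title:

```{r}
library(ggplot2)

# englue() defuses the arguments of the surrounding function, so
# {{x}} and {{y}} become the variable names the caller supplied.
scatter <- function(df, x, y) {
  title <- rlang::englue("{{x}} versus {{y}}")
  ggplot(df, aes({{ x }}, {{ y }})) +
    geom_point() +
    labs(title = title)
}

mtcars |> scatter(wt, mpg) # titled "wt versus mpg"
```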

### Learning more

It's hard to create general purpose plotting functions because you need to consider many different situations, and we haven't given you the programming skills to handle them all.
Fortunately, in most cases it's relatively simple to extract repeated plotting code into a function.
So, for now, strive to keep your functions simple, focusing on concrete repetition rather than solving imaginary future problems.

You can also learn other techniques in <https://ggplot2-book.org/programming.html>.

## RStudio

Once you start writing functions, there are two RStudio shortcuts that are useful:

-   If you put your cursor on the name of a function that you've written, `F2` will take you to its definition.

-   Press `Ctrl + .` to open the fuzzy file and function finder.
    You can type the first few letters of your function name and it'll appear in the dropdown.
    You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
### Exercises

## Style

@ -916,13 +900,16 @@ Learn more at <https://style.tidyverse.org/functions.html>
## Summary

In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.

You also learned a little about tidy evaluation so you could wrap functions from dplyr, tidyr, and ggplot2.
Tidy evaluation is a key component of the tidyverse because it's what allows you to write `diamonds |> filter(x == y)` and have `filter()` know to use `x` and `y` from the diamonds dataset.
The downside of tidy evaluation is that you need to learn a new technique for programming: embracing.
Embracing, e.g. `{{ x }}`, tells a function that uses tidy evaluation to look inside the argument `x`, rather than using the literal variable `x`.
You can figure out when you need to use embracing by looking in the documentation for the terms for the two major styles of tidy evaluation: "data masking" and "tidy select".
Embracing already gives you considerable power to reduce duplication in your data analyses, but there are many more advanced techniques available, which you can learn more about in `vignette("programming", package = "dplyr")` and `vignette("programming", package = "tidyr")`.

Here we've focused on very simple plotting functions, the sort of functions that you might naturally extract from repeated code in your analyses.
As you get better at programming and learn more about ggplot2, you'll be able to create richer functions with greater flexibility.
The next place you might stop on your journey is the [Programming with ggplot2](https://ggplot2-book.org/programming.html){.uri} chapter of the ggplot2 book, where you'll learn other ways to reduce duplication in your plotting code.

In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
These are immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which provides some amazingly powerful tools.
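To recap embracing in runnable form, here's a minimal sketch (assuming dplyr is loaded; `count_distinct` is a hypothetical name, not a function from the chapter):

```{r}
library(dplyr)

# {{ var }} tells summarise() to use the column the caller passed as `var`,
# rather than looking for a column literally named "var".
count_distinct <- function(df, var) {
  df |> summarise(n = n_distinct({{ var }}))
}

mtcars |> count_distinct(cyl) # cyl has three distinct values: 4, 6, 8
```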
18 program.qmd
@ -12,6 +12,7 @@ Programming is a cross-cutting skill needed for all data science work: you must
```{r}
#| label: fig-ds-program
#| echo: false
#| out.width: ~
#| fig-cap: >
#|   Programming is the water in which all other components of the data
#|   science process swim.

@ -19,7 +20,6 @@ Programming is a cross-cutting skill needed for all data science work: you must

#| Our model of the data science process with program (import, tidy,
#| transform, visualize, model, and communicate, i.e. everything)
#| highlighted in blue.
#| out.width: NULL

knitr::include_graphics("diagrams/data-science/program.png", dpi = 270)
```

@ -47,25 +47,13 @@ In the following three chapters, you'll learn skills that will allow you to both
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.

2.  As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in @sec-vectors.
    You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.

3.  Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
    You need tools for **iteration** that let you do similar things again and again.
    These tools include for loops and functional programming, which you'll learn about in @sec-iteration.

A common theme throughout these chapters is the idea of reducing duplication in your code.
Reducing code duplication has three main benefits:

1.  It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.

2.  It's easier to respond to changes in requirements.
    As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

3.  You're likely to have fewer bugs because each line of code is used in more places.

One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them out into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is iteration, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
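As a small, hedged preview of those iteration tools (using dplyr's `across()`, covered properly later), the same operation can be applied to several columns at once instead of being copied and pasted:

```{r}
library(dplyr)

# Copy-and-paste version would repeat median() once per column:
# summarise(mtcars, mpg = median(mpg), disp = median(disp), hp = median(hp))

# Iteration version: name the operation once, apply it to each column.
mtcars |>
  summarise(across(c(mpg, disp, hp), median))
```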

## Learning more

The goal of these chapters is to teach you the minimum about programming that you need to practice data science.