# Functions

One of the best ways to improve as a data scientist in R is to write functions. Functions let you automate common tasks instead of copying and pasting code. Writing good functions is a lifetime journey: you won't learn everything here, but you'll hopefully start walking in the right direction.

## When should you write a function?

Whenever you've copied and pasted code more than twice, take a look at it and see whether you can extract the common components into a function. For example, take a look at this code. What does it do?
```{r}
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? I made an error when copying and pasting the code for `df$b`: I forgot to change an `a` to a `b`. Extracting repeated code into a function is a good idea because it makes your code easier to understand (you can name the operation) and it prevents this class of error.

To write a function you need to first analyse the operation. How many inputs does it have?
```{r, eval = FALSE}
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
This code only has one input: `df$a`. To make that clearer, it's a good idea to rewrite the code using a temporary variable with a general name. Here the code only takes one input, so I'll call it `x`:
```{r}
x <- 1:10
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
There is some duplication in this code: I'm computing `min()` twice and `max()` once. I could instead get both values in one step with `range()`:
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
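A quick way to confirm that a simplification like this hasn't changed the result is to compute both versions and compare them. The sketch below does that on a small vector whose values (including the `NA`) are made up purely for illustration:

```{r}
# Illustrative input (an assumption, not data from the text); includes a missing value
x <- c(3, 7, NA, 12)

# Original version: calls min() twice and max() once
original <- (x - min(x, na.rm = TRUE)) /
  (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

# Simplified version: range() returns c(min, max) from a single call
rng <- range(x, na.rm = TRUE)
simplified <- (x - rng[1]) / (rng[2] - rng[1])

# all.equal() compares numeric vectors using a small tolerance
all.equal(original, simplified)
```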
Now that I've simplified the code and checked that it still works, I can turn it into a function:
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```
Always make sure your code works on a simple test case before creating the function!
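It's worth trying a few more inputs than the first one that comes to mind. The checks below are only a sketch: the inputs are illustrative choices, and `rescale01()` is repeated so that the chunk runs on its own.

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(-10, 0, 10))       # negative values
rescale01(c(1, 2, 3, NA, 5))   # a missing value stays missing
rescale01(c(5, 5, 5))          # constant input: 0 / 0 gives NaN
```

The last case is a reminder that a simple test can reveal behaviour you might want to handle explicitly, here a column where every value is the same.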
Note the process that I followed here: I constructed the `function` last. It's much easier to start with code that works on a sample input and then turn it into a function than the other way around. You're more likely to get to your final destination if you take small steps and check your work after each step.

Now we can use `rescale01()` to simplify our original example:
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
This makes it clearer what we're doing, and avoids one class of copy-and-paste errors. However, we still have quite a bit of duplication: we're doing the same thing to multiple columns. We'll learn how to handle that in the for loop section. But first, let's talk a bit more about functions.

### Practice

1.  Practice turning the following code snippets into functions. Think about
    what each function does. What would you call it? How many arguments does it
    need? Can you rewrite it to be more expressive or less duplicative?

    ```{r, eval = FALSE}
    mean(is.na(x))

    x / sum(x, na.rm = TRUE)

    sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

    mean((x - mean(x))^3) / mean((x - mean(x))^2)^(3/2)
    ```

1.  Implement a `fizzbuzz` function. It takes a single number as input. If
    the number is divisible by three, return "fizz". If it's divisible by
    five, return "buzz". If it's divisible by three and five, return "fizzbuzz".
    Otherwise, return the number.
## Function components

There are three attributes that define what a function does:

1.  The __arguments__ of a function are its possible inputs.
    Sometimes these are called _formal_ arguments to distinguish them from
    the actual arguments that a function is called with. For example, the
    formal arguments of `mean()` (strictly, of its default method) are `x`,
    `trim` and `na.rm`, but a given call might only use some of these
    arguments.

1.  The __body__ of a function is the code that it runs each time.
    The last statement evaluated in the function body is what it returns.
    The return value is not a property of the function because it changes
    depending on the input values.

1.  The function __environment__ controls how it looks up values from names
    (i.e. how it goes from the name `x`, to its value, `10`). The set of
    rules that governs this behaviour is called scoping.
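Base R provides an accessor for each of these three components: `formals()`, `body()` and `environment()`. As a small sketch, here they are applied to a throwaway function (`add_offset()` is a hypothetical name used only for illustration):

```{r}
# Hypothetical example function, not from any package
add_offset <- function(x, offset = 1) {
  x + offset
}

names(formals(add_offset))   # the formal arguments
body(add_offset)             # the code that runs on each call
environment(add_offset)      # where names not defined in the body are looked up
```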
### Arguments

You can choose to supply default values to your arguments for common options. This saves you from repeating yourself in every call.
```{r}
foo <- function(x = 1, y = TRUE, z = 10:1) {
}
```
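To see defaults in action, here is a small sketch. `round_mean()` is a hypothetical helper, not a function from any package: `x` has no default, while `digits` and `na.rm` have defaults that a caller can override by name.

```{r}
# Hypothetical helper for illustration
round_mean <- function(x, digits = 1, na.rm = TRUE) {
  round(mean(x, na.rm = na.rm), digits)
}

round_mean(c(1, 2, NA))                  # uses both defaults
round_mean(c(1, 2, NA), na.rm = FALSE)   # overrides one default by name
```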
Whenever you have a mix of arguments with and without defaults, those without defaults should come first.

Default values can depend on other arguments, but don't overuse this technique as it's possible to create code that is very difficult to understand. What does this function do?
```{r}
bar <- function(x = y + 1, y = x - 1) {
  x * y
}
```
There's a special argument that's used quite commonly: `...`. It captures any other arguments not otherwise matched. It's useful because you can then pass those `...` on to another function, which makes it a handy catch-all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
```{r}
library(ggplot2)  # provides geom_smooth() and alpha()

geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
                    size = 2, ...) {
  geom_smooth(formula = formula, se = FALSE, method = "lm", colour = colour,
    size = size, ...)
}
```
This allows you to use any other arguments of `geom_smooth()`, even those that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
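The same pattern works with any function, not just ggplot2 layers. As a sketch in base R, the hypothetical wrapper `mean_no_na()` fixes `na.rm = TRUE` and forwards everything else to `mean()` through `...`:

```{r}
# Hypothetical wrapper for illustration
mean_no_na <- function(x, ...) {
  mean(x, na.rm = TRUE, ...)
}

vals <- c(1, 2, 3, NA, 100)
mean_no_na(vals)                # plain mean of the non-missing values
mean_no_na(vals, trim = 0.25)   # `trim` is not an argument of the wrapper; it is passed on to mean()
```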
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never evaluated:
```{r}
g <- function(a, b, c) {
  a + b
}
g(1, 2, stop("Not used!"))
```
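The flip side is that an argument is evaluated as soon as the body does use it. A sketch, with a hypothetical function `describe_sign()` whose second argument is only touched on one branch:

```{r}
# Hypothetical function for illustration
describe_sign <- function(a, fallback) {
  if (a > 0) "positive" else fallback
}

# `fallback` is never touched on this branch, so stop() never runs
describe_sign(1, stop("Not used!"))

# Here the body needs `fallback`, so the error is raised;
# tryCatch() captures it and returns its message
tryCatch(
  describe_sign(-1, stop("Now it is used!")),
  error = function(e) conditionMessage(e)
)
```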
You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.

### Body

The body of the function does the actual work. The value returned by the function is the value of the last statement it evaluates. Unlike in many other languages, every statement in R returns a value. An `if` statement returns the value from the branch that was chosen:
```{r}
greeting <- function(time = lubridate::now()) {
  hour <- lubridate::hour(time)

  if (hour < 12) {
    "Good morning"
  } else if (hour < 18) {
    "Good afternoon"
  } else {
    "Good evening"
  }
}
greeting()
```
That also means you can assign the result of an `if` statement to a variable:
```{r}
y <- 10
x <- if (y < 20) "Too low" else "Too high"
```
You can explicitly return early from a function with `return()`. I think it's best to reserve `return()` for signalling that you can return early with a simpler solution. For example, you might write an `if` statement like this:
```{r, eval = FALSE}
f <- function() {
  if (x) {
    # Do
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}
```
But if the first block is very long, by the time you get to the `else`, you've forgotten what the condition was. One way to rewrite it is to use an early return for the simple case:
```{r, eval = FALSE}
f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}
```
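As a concrete sketch of the same shape, here is a hypothetical `summarise_vec()` that deals with the trivial case (no values) first and then gets on with the main work:

```{r}
# Hypothetical function for illustration
summarise_vec <- function(x) {
  # Simple case first: nothing to summarise
  if (length(x) == 0) {
    return("no values")
  }

  # Main case: drop missing values, then describe what is left
  x <- x[!is.na(x)]
  paste0("n = ", length(x), ", mean = ", mean(x))
}

summarise_vec(numeric(0))
summarise_vec(c(2, 4, NA))
```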
This tends to make the code easier to understand, because you don't need to hold as much context in your head while reading it.

#### Invisible values

Some functions return "invisible" values. These are not printed out by default but can be saved to a variable:
```{r}
f <- function() {
  invisible(42)
}
f()
x <- f()
x
```
You can also force printing by surrounding the call in parentheses:
```{r}
(f())
```
Invisible values are mostly used when your function is called primarily for its side-effects (e.g. printing, plotting, or saving a file). It's nice to be able to pipe such functions together, so returning the main input value is useful. This allows you to do things like:
```{r, eval = FALSE}
library(readr)
library(magrittr)  # provides %>%

mtcars %>%
  write_csv("mtcars.csv") %>%
  write_tsv("mtcars.tsv")
```
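You can follow the same convention in your own side-effect functions. In this sketch, `show_dims()` is a hypothetical helper that prints a one-line summary and then returns its input invisibly, so the data can be passed straight on to the next step:

```{r}
# Hypothetical helper for illustration
show_dims <- function(df) {
  cat("Rows:", nrow(df), "Columns:", ncol(df), "\n")
  invisible(df)
}

show_dims(mtcars)             # prints the summary, but not the data frame
result <- show_dims(mtcars)   # the input is still returned
identical(result, mtcars)
```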
### Environment

The environment of a function controls how R finds the value associated with a name. For example, take this function:
```{r}
f <- function(x) {
  x + y
}
```
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to determine the value associated with a name. Since `y` is not defined inside the function, R will look in the environment where the function was defined:
```{r}
y <- 100
f(10)
y <- 1000
f(10)
```
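Note that the lookup happens where the function was *defined*, not where it is *called*. A sketch with hypothetical names: `scale_by()` is defined at the top level, so it sees the top-level `multiplier` even when it is called from inside another function that has its own `multiplier`:

```{r}
# Hypothetical names for illustration
multiplier <- 2

scale_by <- function(x) {
  x * multiplier   # `multiplier` is not defined here, so R looks where scale_by() was defined
}

caller <- function() {
  multiplier <- 100   # local to caller(); scale_by() never sees it
  scale_by(10)
}

caller()
```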
This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn't cause too many problems (especially if you regularly restart R to get to a clean slate). The advantage of this behaviour is that, from a language standpoint, it allows R to be very consistent: every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.

This allows you to do devious things like:
```{r}
`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))
rm(`+`)
```
This is a common phenomenon in R. R gives you a lot of control. You can do many things that are not possible in other programming languages. You can do things that are extremely ill-advised 99% of the time (like overriding how addition works!), but this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make good use of this flexibility is beyond the scope of this book, but you can read about it in "Advanced R".

#### Exercises

1.  What happens if you call `bar()`? What does the error message mean?

1.  What happens if you try to override the method in `geom_lm()` created
    above (e.g. `geom_lm(method = "glm")`)? Why?
## Making functions with magrittr

Another way to write functions is using magrittr. You've already seen how to execute a pipeline on a specific dataset:
```{r}
library(dplyr)

mtcars %>%
  filter(mpg > 5) %>%
  group_by(cyl) %>%
  summarise(n = n())
```
But you can also create a generic pipeline that you can apply to any object:
```{r}
my_fun <- . %>%
  filter(mpg > 5) %>%
  group_by(cyl) %>%
  summarise(n = n())

my_fun
my_fun(mtcars)
```
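A pipeline that starts with `.` is just a function of one argument. As a sketch (the name `root_abs_sum` and the steps are made up for illustration), the two definitions below do the same thing:

```{r}
library(magrittr)

# A pipeline with no data on the left becomes a reusable function
root_abs_sum <- . %>% abs() %>% sum() %>% sqrt()

# The same steps written as an ordinary function
root_abs_sum2 <- function(x) {
  sqrt(sum(abs(x)))
}

root_abs_sum(c(-9, 16))
root_abs_sum2(c(-9, 16))
```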
The key is to use `.` as the initial input into the pipe. This is a great way to create a quick-and-dirty function if you've already written a pipeline and now want to reuse it in many places.

## Non-standard evaluation

One challenge with writing functions is that many of the functions you've used in this book use non-standard evaluation to minimise typing. This makes these functions great for interactive use, but it does make it more challenging to program with them, because you need to use more advanced techniques. For example, imagine you'd written the following duplicated code across a handful of data analysis projects:
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(mpg, na.rm = TRUE), n = n()) %>%
  filter(n > 10) %>%
  arrange(desc(mean))

ggplot2::diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price, na.rm = TRUE), n = n()) %>%
  filter(n > 10) %>%
  arrange(desc(mean))

nycflights13::planes %>%
  group_by(model) %>%
  summarise(mean = mean(year, na.rm = TRUE), n = n()) %>%
  filter(n > 100) %>%
  arrange(desc(mean))
```
You'd like to be able to write a function whose arguments are the data frame, the grouping variable, the variable to average and the minimum group size, so that you could rewrite the above code as:
```{r, eval = FALSE}
mtcars %>%
  mean_by(cyl, mpg, n = 10)

ggplot2::diamonds %>%
  mean_by(cut, price, n = 10)

nycflights13::planes %>%
  mean_by(model, year, n = 100)
```
Unfortunately the obvious approach doesn't work:
```{r}
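mean_by <- function(data, group_var, mean_var, n = 10) {
  min_n <- n  # copy the threshold: summarise() below creates a column also called `n`
  data %>%
    group_by(group_var) %>%  # looks for a column literally called `group_var`
    summarise(mean = mean(mean_var, na.rm = TRUE), n = n()) %>%
    filter(n > min_n) %>%
    arrange(desc(mean))
}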
```
This fails because it tells dplyr to group by `group_var` and compute the mean of `mean_var`, neither of which exists in the data frame. Writing reusable functions for ggplot2 poses a similar problem because `aes(group_var, mean_var)` would look for variables called `group_var` and `mean_var`. It's really only been in the last couple of months that I fully understood this problem, so there aren't currently any great (or general) solutions. However, now that I've understood the problem I think there will be some systematic solutions in the near future.
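For readers on a newer toolchain, one possible approach is sketched below. It rests on an assumption that goes beyond this chapter: dplyr 1.0.0 or later, where wrapping an argument in `{{ }}` tells dplyr to use the column the caller named rather than a column called `group_var`. The name `mean_by2()` and the argument `min_n` are hypothetical, chosen so the threshold doesn't clash with the `n` column.

```{r}
library(dplyr)  # assumes dplyr >= 1.0.0 for the {{ }} operator

# Hypothetical rewrite of mean_by(); a sketch, not a definitive solution
mean_by2 <- function(data, group_var, mean_var, min_n = 10) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean({{ mean_var }}, na.rm = TRUE), n = n()) %>%
    filter(n > min_n) %>%
    arrange(desc(mean))
}

mtcars %>%
  mean_by2(cyl, mpg, min_n = 10)
```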
### Exercises
1.  Follow <http://nicercode.github.io/intro/writing-functions.html> to
    write your own functions to compute the variance and skew of a vector.

1.  Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo)
    to "Little Bunny Foo Foo". There's a lot of duplication in this song.
    Extend the initial piping example to recreate the complete song, using
    functions to reduce duplication.