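One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste: you can give the function an evocative name that makes your code easier to understand; as requirements change, you only need to update code in one place instead of many; and you eliminate the chance of making incidental mistakes when you copy and paste.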
Writing good functions is a lifetime journey. Even after using R for many years we still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to master every esoteric detail of functions but to get you started with some pragmatic advice that you can start using right away.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?
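A sketch of the kind of code in question, assuming a data frame `df` with four numeric columns `a` to `d` (the data frame, its random contents, and the seed are made up for illustration):

```{r}
set.seed(1)  # assumed seed, only so the illustration is reproducible
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```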
You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? I made an error when copying-and-pasting the code for `df$b`: I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
This code only has one input: `df$a`. (It's a little surprising that `TRUE` is not an input: you can explore why in the exercise below.) To make the single input more clear, it's a good idea to rewrite the code using temporary variables with a general name. Here this function only takes one vector of input, so I'll call it `x`:
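A minimal sketch of that rewrite. The vector below is made-up data standing in for `df$a`:

```{r}
x <- c(2, 4, 6, 10)  # illustrative stand-in for df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```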
Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing. Now that I've simplified the code, and checked that it still works, I can turn it into a function:
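One way this could look: compute the range once, store it in a named variable, and then wrap the calculation in a function. The input vector is made up for illustration; the names `rescale01` and `rng` match those used later in this chapter.

```{r}
x <- c(2, 4, 6, 10)  # illustrative input

# Compute the minimum and maximum in one step instead of three
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])

# The same calculation wrapped in a function
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```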
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input. It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
At this point it's a good idea to check your function with a few different inputs:
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
As you write more and more functions you'll eventually want to convert these informal, interactive tests into formal, automated tests. That process is called unit testing. Unfortunately, it's beyond the scope of this book, but you can learn about it in <http://r-pkgs.had.co.nz/tests.html>.
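With a function like `rescale01()` available, the original column-by-column code can be simplified. A sketch, using a small made-up data frame `df`:

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical data, for illustration only
df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 40),
  c = c(-1, 0, 1),
  d = c(5, NA, 15)
)

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```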
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [data-structures].
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
```{r}
x <- c(1:10, Inf)
rescale01(x)
```
Because we've extracted the code into a function, we only need to make the fix in one place:
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```
This is an important part of the "do not repeat yourself" (or DRY) principle. The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is important. Ideally the name of your function will be short, but clearly evoke what the function does. However, it's hard to come up with concise names, and autocomplete makes it easy to type long names, so it's better to err on the side of clear descriptions, rather than short names.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine". Use your best judgement and don't be afraid to rename a function if you later figure out a better name.
If your function name is composed of multiple words, I recommend using "snake\_case", where each word is lower case and separated by an underscore. camelCase is a popular alternative, but be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
Where possible, avoid overriding existing functions and variables. It's impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.
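A small illustration of the confusion this can cause. The redefinition is confined to a `local()` block so that it does not leak into the rest of the session:

```{r}
local({
  mean <- function(x) sum(x)  # masks base::mean() inside this block only
  mean(c(1, 2, 3))            # returns the sum, 6, not the mean
})

mean(c(1, 2, 3))              # outside the block, base::mean() still returns 2
```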
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks. RStudio even provides a keyboard shortcut for this: Cmd/Ctrl + Shift + R.
Here's a simple function that uses an if statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
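A sketch of such a function. The name `has_name` and the exact checks are assumptions consistent with the description above:

```{r}
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    # No names attribute at all: nothing is named
    rep(FALSE, length(x))
  } else {
    # A name counts only if it is neither missing nor empty
    !is.na(nms) & nms != ""
  }
}

has_name(c(1, 2, 3))
has_name(c(a = 1, 2, b = 3))
```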
This function takes advantage of the standard return rule: a function returns the last value that it computed. Here that is either one of the two branches of the `if` statement.
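The `condition` must evaluate to either `TRUE` or `FALSE`. If it's a vector with more than one element, you'll get an error in R 4.2.0 and later (earlier versions gave a warning and used only the first element); if it's an `NA`, you'll get an error. Watch out for these messages in your own code: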
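A sketch that triggers both problems without stopping the script. `check_condition()` is a hypothetical helper written for this illustration; it catches the error or warning and returns its message as a string:

```{r}
check_condition <- function(cond) {
  tryCatch(
    {
      if (cond) "ran the TRUE branch" else "ran the FALSE branch"
    },
    error = function(e) paste("error:", conditionMessage(e)),
    warning = function(w) paste("warning:", conditionMessage(w))
  )
}

check_condition(TRUE)
check_condition(c(TRUE, FALSE))  # condition has length 2
check_condition(NA)              # condition is missing
```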
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`. You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values (that's why you use them in `filter()`). If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
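A brief sketch of both ideas, using made-up values for `x` and `y`:

```{r}
x <- NULL
# `&&` stops at the first FALSE, so `x > 0` is never evaluated for NULL input
if (!is.null(x) && x > 0) {
  "positive"
} else {
  "missing or not positive"
}

y <- c(3, -1, 7)
# `y < 0` has length 3, so collapse it to a single value before using `if`
if (any(y < 0)) {
  "at least one negative value"
}
```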
Be careful when testing for equality. `==` is vectorised, which means that it's easy to get more than one output. Either check the length is already 1, collapse with `all()` or `any()`, or use the non-vectorised `identical()`. `identical()` is very strict: it always returns either a single `TRUE` or a single `FALSE`, and doesn't coerce types. This means that you need to be careful when comparing integers and doubles:
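A short sketch of the difference, with literal values chosen for illustration:

```{r}
identical(0L, 0)   # FALSE: an integer is not identical to a double
identical(0L, 0L)  # TRUE
0L == 0            # TRUE: `==` coerces before comparing

x <- c(1, 2, 3)
all(x == c(1, 2, 3))  # collapse the vectorised comparison to a single TRUE
```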
If you end up with a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name. Note that neither `if` nor `switch()` is vectorised: they work with a single value at a time.
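A sketch of `switch()` selecting code by name. The function `calc` and its operation names are hypothetical; the final unnamed argument acts as the fallback when no name matches:

```{r}
calc <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!", call. = FALSE)
  )
}

calc(2, 3, "plus")
calc(2, 3, "times")
```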
Squiggly brackets are optional (for both `if` and `function`), but highly recommended. When coupled with good style (described below), this makes it easier to see the hierarchy in your code. You can easily see how the code is nested by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
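A sketch of that layout. `describe_sign()` is a hypothetical function used only to show where the braces and indentation go:

```{r}
describe_sign <- function(y) {
  if (y < 0) {
    "negative"
  } else if (y == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_sign(-3)
```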
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:
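Two base R functions illustrate the split; the input values are arbitrary:

```{r}
# In log(), `x` is the data and `base` is a detail of the computation
log(8, base = 2)

# In mean(), `x` is the data; `trim` and `na.rm` are details
mean(c(1, 2, 3, NA), na.rm = TRUE)
```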
Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values. You specify a default value in the same way you call a function with a named argument:
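A sketch of a function with one data argument and one detail argument. The function `mean_ci` and its `conf` argument are hypothetical, and the interval uses a normal approximation:

```{r}
# Confidence interval around the mean, using a normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

set.seed(42)  # assumed seed, for a reproducible illustration
x <- runif(100)
mean_ci(x)               # uses the default, conf = 0.95
mean_ci(x, conf = 0.99)  # overrides the default
```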
The default value should almost always be the most common value. There are a few exceptions to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
When you call a function, typically you can omit the names for the data arguments (because they are used so commonly). If you override the default value of a detail argument, you should use the full name:
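For instance, with `mean()` the data argument is passed by position and the detail argument by its full name:

```{r}
mean(c(1:10, NA), na.rm = TRUE)
```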
You can refer to an argument by its unique prefix (e.g. `mean(x, n = TRUE)`), but this is generally best avoided given the possibilities for confusion.
When you call a function, place spaces around `=`, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.
The names of the arguments are also important. R doesn't care, but the readers of your code (including future you!) will find your code easier to understand. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It's worth memorising these:
* `x`, `y`, `z`: vectors.
* `w`: a vector of weights.
* `df`: a data frame.
* `i`, `j`: numeric indices (typically rows and columns).
* `n`: length, or number of rows.
* `p`: number of columns.
Otherwise, consider matching names of arguments in existing R functions. For example, always use `na.rm` to determine if missing values should be removed.
As you start to write more functions, you'll eventually get to the point where you don't remember exactly how your function works. At this point it's easy to call your function with invalid inputs. To avoid this problem, it's often useful to make constraints explicit. For example, imagine you've written some functions for computing weighted summary statistics:
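A sketch of what such a family might look like. `wt_mean` is the name used in the calls that follow; `wt_var` and `wt_sd` are hypothetical companions added for illustration:

```{r}
wt_mean <- function(x, w) {
  sum(w * x) / sum(w)
}
wt_var <- function(x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu)^2) / sum(w)
}
wt_sd <- function(x, w) {
  sqrt(wt_var(x, w))
}
```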
What happens if `x` and `w` are not the same length?
```{r}
wt_mean(1:6, 1:3)
```
In this case, because of R's recycling rules, we don't get an error.
It's good practice to check important preconditions, and throw an error (with `stop()`), if they are not true:
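```{r}
wt_mean <- function(x, w) {
  # Fail early rather than let R silently recycle the shorter vector
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  # A weighted mean divides by the total weight
  sum(w * x) / sum(w)
}
```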
Be careful not to take this too far. There's a tradeoff between how much time you spend making your function robust, versus how long you spend writing it. For example, if you also added a `na.rm` argument, I probably wouldn't check it carefully:
```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  if (!is.logical(na.rm)) {
    stop("`na.rm` must be logical")
  }
  if (length(na.rm) != 1) {
    stop("`na.rm` must be length 1")
  }
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }

  if (na.rm) {
    # Drop every position where either the value or its weight is missing
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  # A weighted mean divides by the total weight
  sum(w * x) / sum(w)
}
```
This is a lot of extra work for little additional gain.
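One possible compromise, offered as a sketch rather than a recommendation from this chapter, is base R's `stopifnot()`. It checks that each argument is `TRUE` and otherwise raises a generic error, so you state what should hold instead of writing each message by hand:

```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))

  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

wt_mean(c(1, 2, NA), c(1, 1, 1), na.rm = TRUE)
```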
Many functions in R take an arbitrary number of inputs:
```{r}
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
str_c("a", "b", "c", "d", "e", "f")
```
How do these functions work? They rely on a special argument: `...` (pronounced dot-dot-dot). This special argument captures any number of arguments that aren't otherwise matched.
It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around `paste()`:
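A sketch of two such helpers. The names `commas` and `rule` and the `pad` argument are hypothetical; both simply pass `...` through to `paste()` or `paste0()`:

```{r}
commas <- function(...) paste(..., collapse = ", ")
commas(letters[1:10])

rule <- function(..., pad = "-") {
  title <- paste0(...)
  # Fill the rest of the console width with the padding character
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", strrep(pad, width), "\n", sep = "")
}
rule("Important output")
```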
Here `...` lets me forward on any arguments that I don't want to deal with to `paste()`. It's a very convenient technique. But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
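For instance, `sum()` accepts `...`, so a misspelled `na.rm` is silently treated as one more value to add:

```{r}
x <- c(1, 2)
sum(x, na.mr = TRUE)  # typo: `TRUE` is summed as 1, giving 4 instead of 3
```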
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never evaluated. This is an important property of R as a programming language, but is generally not important for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.
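A tiny sketch of lazy evaluation. `double_first()` is a hypothetical function that never touches its second argument, so the `stop()` call passed in is never run:

```{r}
double_first <- function(x, y) {
  x * 2
}

double_first(10, stop("this argument is never evaluated"))
```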
Figuring out what your function should return is usually straightforward: it's why you created the function in the first place! There are two things you should consider when returning a value: Does returning early make your function easier to read? And can you make your function pipeable?
The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. A common reason to do this is because the inputs are empty:
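A sketch of an early return for empty inputs. `safe_ratio()` and its body are hypothetical stand-ins for a more complicated function:

```{r}
safe_ratio <- function(x, y) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)
  }

  # The more involved calculation would go here
  sum(x) / sum(y)
}

safe_ratio(numeric(0), 1:3)
safe_ratio(c(2, 4), c(1, 2))
```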
Another reason is because you have an `if` statement with one complex block and one simple block. For example, you might write an if statement like this:
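A sketch of that shape. `summarise_values()` is a hypothetical function; imagine the first block running to many more lines:

```{r}
summarise_values <- function(x) {
  if (length(x) > 0) {
    # Imagine this block being many lines long
    centre <- mean(x)
    spread <- sd(x)
    c(centre = centre, spread = spread)
  } else {
    # The simple case
    c(centre = NA_real_, spread = NA_real_)
  }
}

summarise_values(c(1, 2, 3))
```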
But if the first block is very long, by the time you get to the else, you've forgotten what's going on. One way to rewrite it is to use an early return for the simple case:
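A sketch of the early-return form. `summarise_values()` is a hypothetical function: the simple case is handled first, and the long block follows without extra nesting:

```{r}
summarise_values <- function(x) {
  if (length(x) == 0) {
    return(c(centre = NA_real_, spread = NA_real_))
  }

  # Imagine this part being many lines long
  centre <- mean(x)
  spread <- sd(x)
  c(centre = centre, spread = spread)
}

summarise_values(numeric(0))
```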
In __transformation__ functions, there's a clear "primary" object that is passed in as the first argument, and a modified version is returned by the function. For example, the key objects for dplyr and tidyr are data frames. If you can identify what the object type is for your domain, you'll find that your functions just work in a pipe.
__Side-effect__ functions, however, are primarily called to perform an action, like drawing a plot or saving a file, not transforming an object. These functions should "invisibly" return the first argument, so they're not printed by default, but can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame:
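A sketch of such a function. The name `show_missings` and the example data frame are assumptions made for illustration:

```{r}
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")

  # Return the input invisibly so the call can sit in the middle of a pipeline
  invisible(df)
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))  # hypothetical data
x <- show_missings(df)  # prints the count; nothing else is printed
dim(x)                  # the data frame itself was returned
```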
The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work. The environment of a function controls how R finds the value associated with a name. For example, take this function:
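A minimal sketch of such a function: it uses `y`, which is neither an argument nor defined in the body.

```{r}
f <- function(x) {
  x + y
}
```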
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called _lexical scoping_ to find the value associated with a name. Since `y` is not defined inside the function, R will look in the _environment_ where the function was defined:
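A sketch of that lookup in action. The values assigned to `y` are arbitrary:

```{r}
f <- function(x) {
  x + y
}

y <- 100
f(10)   # 110: `y` is found where `f` was defined

y <- 1000
f(10)   # 1010: the lookup happens each time the function runs
```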
This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn't cause too many problems (especially if you regularly restart R to get to a clean slate).
The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`. This allows you to do devious things like:
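A deterministic sketch of this kind of mischief. The override below is a made-up example, confined to a `local()` block so that ordinary addition is left intact everywhere else:

```{r}
local({
  `+` <- function(x, y) x - y  # hypothetical override: "addition" now subtracts
  1 + 2                        # -1 inside this block
})

1 + 2                          # 3: outside the block, `+` is unchanged
```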
This is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make best use of this flexibility is beyond the scope of this book, but you can read about it in "[Advanced R](http://adv-r.had.co.nz)".