Code is a means of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative, and even if you're not working with other people you'll definitely be working with future-you.
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
To me, this is what mastering R as a programming language is all about: making it easier to express yourself, so that over time your code becomes clearer and easier to write. In this chapter, you'll learn some of the most important skills, but to learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands-On Programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if
you've programmed in other languages and you want to learn what makes R
special, different, and particularly well suited to data analysis.
You get better very slowly if you don't consciously practice, so this chapter brings together a number of ideas that we mention elsewhere into one focussed chapter on code as communication.
* If you're working with functions that don't have a data-frame-based API
(i.e. you pass them individual vectors, not a data frame and expressions
to be evaluated in the context of that data frame), you might find `%$%`
useful. It "explodes" out the variables in a data frame so that you can
refer to them explicitly. This is useful when working with many functions
in base R:
```{r}
mtcars %$%
  cor(disp, mpg)
```
## When not to use the pipe
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Generally, you should reach for another tool when:
* Your pipes get longer than five or six lines. It's a good idea to create
intermediate objects with meaningful names. That helps with debugging,
because it's easier to figure out when things went wrong. It also helps
you understand the problem, because a good name can be very evocative of
its purpose.
* You have multiple inputs or outputs.
* Instead of creating a linear pipeline where you're primarily transforming
one object, you're starting to create a directed graph with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them does not often yield clear code.
* For assignment. magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```
with
```R
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a
special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
I think it also gives you a better mental model of how assignment works
in R. The above code does not modify `mtcars` in place: it instead creates a
modified copy and then binds the name `mtcars` to that copy.
As you become a better R programmer, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.
Two main tools for reducing duplication are functions and for-loops. You tend to use for-loops less often in R than in other programming languages because R is a functional programming language. That means that you can extract out common patterns of for loops and put them in a function.
### Extracting out a function
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
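Here's a sketch of the kind of code I mean (the data frame is made up purely for illustration):

```{r}
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```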
You might be able to puzzle out that this rescales each column to 0--1. Did you spot the mistake? I made an error when updating the code for `df$b`, and I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it helps make your code more understandable (because you can name the operation), and it prevents you from making this sort of update error.
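A function that captures the repeated operation might look like this (a minimal sketch, consistent with how `rescale01()` is used below):

```{r}
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```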
Now we can use that to simplify our original example:
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
This makes it more clear what we're doing, and avoids one class of copy-and-paste errors. However, we still have quite a bit of duplication: we're doing the same thing to each column.
### Common looping patterns
Before we tackle the problem of rescaling each column, let's start with a simpler case. Imagine we want to summarise each column with its median. One way to do that is to use a for loop. Every for loop has three main components: an output to accumulate the results, a sequence to loop over, and a body that does the actual work.
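Putting those pieces together gives something like the sketch below (the name `col_median()` is chosen to mirror the `col_min()` and `col_max()` functions that follow):

```{r}
col_median <- function(df) {
  out <- vector("numeric", ncol(df))  # 1. the output
  for (i in 1:ncol(df)) {             # 2. the sequence
    out[i] <- median(df[[i]])         # 3. the body
  }
  out
}
col_median(df)
```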
Now imagine that you also want to compute the interquartile range of each column. How would you change the function? What if you also wanted to calculate the min and max?
```{r}
col_min <- function(df) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- min(df[[i]])
  }
  out
}

col_max <- function(df) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- max(df[[i]])
  }
  out
}
```
I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. If you look at these functions, you'll notice that they are very similar: the only difference is the function that gets called.
I mentioned earlier that R is a functional programming language. Practically, what this means is that you can not only pass vectors and data frames to functions, but you can also pass other functions. So you can generalise these `col_*` functions by adding an additional argument:
```{r}
col_summary <- function(df, fun) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

col_summary(df, median)
col_summary(df, min)
```
We can take this one step further and use another cool feature of R functions: "`...`". "`...`" just takes any additional arguments and allows you to pass them on to another function:
```{r}
col_summary <- function(df, fun, ...) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- fun(df[[i]], ...)
  }
  out
}

col_summary(df, median, na.rm = TRUE)
```
If you've used R for a bit, the behaviour of this function might seem familiar: it looks like the `lapply()` or `sapply()` functions. Indeed, all of the apply functions in R abstract over common looping patterns.
There are two main differences between `lapply()` and `col_summary()`:
* `lapply()` returns a list. This allows it to work with any R function, not
just those that return numeric output.
* `lapply()` is written in C, not R. This gives some very minor performance
improvements.
As you learn more about R, you'll learn more functions that allow you to abstract over common patterns of for loops.
### Exercises
1. Adapt `col_summary()` so that it only applies to numeric inputs.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
1. How do `sapply()` and `vapply()` differ from `col_summary()`?
(This is an advanced topic. You shouldn't worry too much about it when you first start writing functions. Instead you should focus on getting a function that works right for the easiest 80% of the problem. Then in time, you'll learn how to get to 99% with minimal extra effort. The defaults in this book should steer you in the right direction: we avoid teaching you functions with major surprises.)
There is one principle that tends to lend itself both to easily readable code and to code that works well even when generalised to handle new situations that you didn't previously think about.
You want to use functions whose behaviour can be understood with as little context as possible. The less code you need to read to predict the likely outcome of a function call, the easier it is to understand the code. Such code is also less likely to fail in unexpected ways in new situations.
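Consider a line like this (all of the names are invented purely for illustration):

```{r, eval = FALSE}
baz <- foo(bar, qux)
```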
You can glean a little from the context: `foo()` is a function that takes (at least) two arguments, and it returns a result we store in `baz`. But apart from that, you have no idea. To understand what this function does, you need to read much more of the context. This is an extreme example.
Function and variable names are important because they hint at (or at least jog your memory of) what the code does. The advantage of using built-in functions is that you can use them in many places so that you're more likely to remember what they do.
The other side of this problem is using functions that rarely surprise you: functions that have consistent behaviour regardless of their inputs. These functions are useful because they act as bottlenecks: it doesn't matter what goes into them because you always know what comes out.
* What sort of column will `data.frame(x = "a")` create? You
can't be sure: it will contain either characters or factors depending on
the value of the global option `stringsAsFactors`. If you want to be
certain, use `data.frame(x = "a", stringsAsFactors = FALSE)` or
`data_frame(x = "a")`, as the sketch below shows.
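A minimal illustration (the result of the first call depends on your R version and global options; `data_frame()` comes from dplyr and is superseded by `tibble()` in newer versions):

```{r}
class(data.frame(x = "a")$x)                            # "factor" or "character", depending on options
class(data.frame(x = "a", stringsAsFactors = FALSE)$x)  # always "character"
class(dplyr::data_frame(x = "a")$x)                     # always "character"
```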
Avoiding functions that behave unpredictably helps you to write code that you can understand (because you don't need to be intimately familiar with the specifics of the call). This book teaches you functions that follow this principle as much as possible. When you have a choice between two functions with similar behaviour, pick the one that needs the least context to understand. This often means being more explicit, which means writing more code. That means your functions will take longer to write, but they will be easier to read and more robust to varying inputs.
The transition from interactive analysis to programming in R can be very frustrating because it forces you to confront differences that you previously swept under the carpet. You need to learn which functions behave consistently and which ones change their behaviour depending on their inputs or on global state.
If consistent behaviour is so advantageous for programming, why do any functions behave differently? Because R is not just a programming language, it's also an environment for interactive data analysis. Some things make sense for interactive use (where you quickly check the output and guessing what you want is OK) but don't make sense for programming (where you want errors to arise as quickly as possible).
It's a continuum, not two discrete endpoints. It's not possible to write code where every single line is understandable in isolation. Even if you could, it wouldn't be desirable. Relying on a little context is useful. You just don't want to go overboard.
You'll learn more about these ideas in the data structures chapter.
### Type
Compare `sapply()`, `vapply()`, and the `purrr::map_*()` functions. (This wouldn't be a problem if R's functions declared their return types.)
```{r, eval = FALSE}
sapply(df, class) # you need to know the details of df to predict output
map_chr(df, class) # you know it returns a character vector the same length as df no matter what
```
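To see why this matters, here's a small sketch of how `sapply()`'s return type shifts with its input:

```{r}
sapply(mtcars, mean)     # a named numeric vector
sapply(mtcars, range)    # a matrix, because each result has length 2
sapply(mtcars[0], mean)  # an empty list, because there are no columns to iterate over
```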
This doesn't make `sapply()` bad and `map_chr()` good. `sapply()` is nice because you can use it interactively without having to think about what `f` will return. 95% of the time it will do the right thing, and if it doesn't you can quickly fix it. `map_chr()` is more important when you're programming, because a clear error message is more valuable when an operation is buried deep inside a tree of function calls. At this point it's also worth thinking more about `[.data.frame`, which can return either a data frame or a vector depending on its inputs.
You'll learn more about this type of function, and about alternatives that are more predictable, in the purrr chapter.
Another kind of type-stability is illustrated by the dplyr verbs. `filter()`, `mutate()`, `summarise()`, etc. don't always return exactly the same type, but they always return something that behaves like a data frame, and that has the same class as their first argument.
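For example (a small sketch, assuming dplyr is installed):

```{r}
library(dplyr)
class(filter(mtcars, cyl == 4))             # data frame in, data frame out
class(filter(as_tibble(mtcars), cyl == 4))  # tibble in, tibble out
```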
### Variable lookup
You've learned a number of functions that implement special lookup rules:
```{r, eval = FALSE}
ggplot(mpg, aes(displ, cty)) + geom_point()
filter(mpg, displ > 10)
```
This is so-called "non-standard evaluation", because the usual lookup rules don't apply. In both cases above neither `displ` nor `cty` is present in the global environment. Instead both ggplot2 and dplyr look for them first in the data frame. This is great for interactive use, but can cause problems inside a function, because they'll fall back to the global environment if the variable isn't found in the data.
Under R's standard scoping rules, a name used inside a function is looked up first among the function's arguments and local variables, and then in the environments that enclose the function, ending with the global environment and the attached packages. Non-standard evaluation adds the data frame to the front of that search path.
For example, take this function:
```{r}
big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}
```
There are two ways in which this function can fail:
1. `df$x` might not exist. There are two potential failure modes:
```{r, error = TRUE}
big_x(mtcars, 10)
x <- 1
big_x(mtcars, 10)
```
The second failure mode is particularly pernicious because it doesn't
throw an error, but instead silently returns an incorrect result
because it finds `x` in a parent environment. It's unlikely to happen,
but I think it's worth weighting heavily in your analysis of potential
failure modes because it's a failure that will be extremely time consuming
to track down, as you need to read a lot of context.
1. `df$threshold` might exist! In that case `filter()` will use the column rather than the function argument, and will again silently return an incorrect result, as the example below shows.
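For instance (a hypothetical sketch; `df2` is just an illustrative name):

```{r}
df2 <- data.frame(x = 1:10, threshold = 100)
big_x(df2, 5)  # zero rows: `threshold` is taken from the data frame, not the argument
```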
How can you avoid this problem? Currently, you need to do this:
```{r}
big_x <- function(df, threshold) {
  if (!"x" %in% names(df))
    stop("`df` must contain variable called `x`.", call. = FALSE)
  if ("threshold" %in% names(df))
    stop("`df` must not contain variable called `threshold`.", call. = FALSE)

  dplyr::filter(df, x > threshold)
}
```
dplyr currently has no way to force a name to be interpreted as referring to either the data frame or the calling environment, and that's really why I think you should be cautious about using NSE inside functions. In a future version you should be able to do:
```{r}
big_x <- function(df, threshold) {
  dplyr::filter(df, .this$x > .parent$threshold)
}
```
Another option is to implement it yourself using base subsetting:
```{r}
big_x <- function(df, threshold) {
  i <- df$x > threshold
  df[!is.na(i) & i, , drop = FALSE]
}
```
The challenge is remembering that `filter()` also drops missing values (hence the `!is.na(i)` check), and that you need `drop = FALSE` or the function will return a vector when `df` has only one column.
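A quick sketch of both gotchas with base `[` (the one-column data frame is just for illustration):

```{r}
df1 <- data.frame(x = c(1, NA, 5))
df1[df1$x > 2, , drop = FALSE]    # the NA produces a row of NAs
df1[!is.na(df1$x) & df1$x > 2, ]  # without drop = FALSE, one column collapses to a vector
```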
### Purity
Functions are easiest to reason about if they have two properties:
1. Their output only depends on their inputs.
1. They don't affect the outside world except through their return value.
There are lots of important functions that aren't pure:
1. Random number generation.
1. I/O
1. Current time etc.
1. Plotting
But it makes sense to separate functions into those that are called primarily for their side-effects and those that are called primarily for their return value. In other words, if you see `f(x, y, z)` you know it's called for its side-effect, and if you see `a <- g(x, y, z)` you know it's called for its return value and is unlikely to affect the state of the world otherwise.
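A small sketch of the distinction (the function names are invented):

```{r}
# Called for its return value: the output depends only on the inputs, no side-effects
rescale <- function(x, from, to) {
  (x - from) / (to - from)
}

# Called for its side-effect: writes a file, and (invisibly) returns its input
save_csv <- function(df, path) {
  write.csv(df, path, row.names = FALSE)
  invisible(df)
}
```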
The biggest breakers of this rule in base R are the functions that create data frames. Most of these functions have a `stringsAsFactors` argument that defaults to `getOption("stringsAsFactors")`. This means that a global option affects the operation of a very large number of functions, and you need to be aware that, depending on external state, a function might produce either a character vector or a factor. In this book, we steer you away from that problem by recommending functions like `readr::read_csv()` and `dplyr::data_frame()` that don't rely on this option. But be aware of it! Generally, if a function is affected by a global option, you should avoid setting it.
Generally, if you want to use options in your own functions, I recommend using them for controlling default displays, not data types. For example, dplyr has some options that let you control the default number of rows and columns that are printed out. This is a good use of an option because it's something that people frequently want control over, but it doesn't affect the computation of a result, just its interactive display.
Base R's `options(digits = )` is similar: it controls how many significant digits are printed, but doesn't change the underlying values.
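A quick demonstration:

```{r}
x <- pi
print(x)             # 3.141593 with the default digits = 7
options(digits = 3)
print(x)             # 3.14: only the display changes
options(digits = 7)  # restore the default
x == pi              # TRUE: the stored value is unchanged
```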
#### Exercises
1. Look at the `encoding` argument to `file()`, `url()`, `gzfile()` etc.
What is the default value? Why should you avoid setting that default with a global option?