754 lines
30 KiB
Plaintext
754 lines
30 KiB
Plaintext
# Expressing yourself in code
|
|
|
|
```{r, include = FALSE}
|
|
source("common.R")
|
|
knitr::opts_chunk$set(
|
|
cache = TRUE
|
|
)
|
|
library(dplyr)
|
|
diamonds <- ggplot2::diamonds
|
|
```
|
|
|
|
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
|
|
|
|
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you move in this direction:
|
|
|
|
1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
|
|
and how it gives you a new tool for rewriting your code. You'll also learn
|
|
about when not to use the pipe!
|
|
|
|
1. Repeating yourself in code is dangerous because it can easily lead to
|
|
errors and inconsistencies. We'll talk about how to write __functions__
|
|
in order to remove duplication in your logic.
|
|
|
|
1. Another important tool for removing duplication is the __for loop__ which
|
|
allows you to repeat the same action again and again and again. You tend to
|
|
use for-loops less often in R than in other programming languages because R
|
|
is a functional programming language which means that you can extract out
|
|
common patterns of for loops and put them in a function. We'll come back to
|
|
that idea in XYZ.
|
|
|
|
Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
|
|
|
|
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely you'll first attempt will be clear.)
|
|
|
|
## Piping
|
|
|
|
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect at all what the code does; behind the scenes it is run in exactly the same way. What the pipe does is change how the code is written and hence how it is read. It tends to transform to a more imperative form (do this, do that, do that other thing, ...) so that it's easier to read.
|
|
|
|
### Piping alternatives
|
|
|
|
To explore how you can write the same code in many different ways, let's use code to tell a story about a little bunny named foo foo:
|
|
|
|
> Little bunny Foo Foo
|
|
> Went hopping through the forest
|
|
> Scooping up the field mice
|
|
> And bopping them on the head
|
|
|
|
We'll start by defining an object to represent little bunny Foo Foo:
|
|
|
|
```{r, eval = FALSE}
|
|
foo_foo <- little_bunny()
|
|
```
|
|
|
|
And then we'll use a function for each key verb `hop()`, `scoop()`, and `bop()`. Using this object and these verbs, there are a number of ways we could retell the story in code:
|
|
|
|
* Save each intermediate step as a new object
|
|
* Rewrite the original object multiple times
|
|
* Compose functions
|
|
* Use the pipe
|
|
|
|
Below we work through each approach, showing you the code and talking about the advantages and disadvantages.
|
|
|
|
#### Intermediate steps
|
|
|
|
The simplest and most robust approach to sequencing multiple function calls is to save each intermediary as a new object:
|
|
|
|
```{r, eval = FALSE}
|
|
foo_foo_1 <- hop(foo_foo, through = forest)
|
|
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
|
|
foo_foo_3 <- bop(foo_foo_2, on = head)
|
|
```
|
|
|
|
The main downside of this form is that it forces you to name each intermediate element. If there are natural names, this form feels natural, and you should use it. But if you're giving then arbitrary unique names, like this example, I don't think it's that useful. Whenever I write code like this, I invariably write the wrong number somewhere and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
|
|
|
|
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory. First, in R, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: it will reuse the shared columns in a pipeline of data frame transformations. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
|
|
|
|
```{r}
|
|
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)
|
|
|
|
library(pryr)
|
|
object_size(diamonds)
|
|
object_size(diamonds2)
|
|
object_size(diamonds, diamonds2)
|
|
```
|
|
|
|
`pryr::object_size()` gives the memory occupied by all of its arguments. The results seem counterintuitive at first:
|
|
|
|
* `diamonds` takes up 3.46 MB,
|
|
* `diamonds2` takes up 3.89 MB,
|
|
* `diamonds` and `diamonds2` together take up 3.89 MB!
|
|
|
|
How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchanged, but the collective size will increase:
|
|
|
|
```{r}
|
|
diamonds$carat[1] <- NA
|
|
object_size(diamonds)
|
|
object_size(diamonds2)
|
|
object_size(diamonds, diamonds2)
|
|
```
|
|
|
|
(Note that we use `pryr::object_size()` here, not the built-in `object.size()`, because it doesn't have quite enough smarts.)
|
|
|
|
#### Overwrite the original
|
|
|
|
One way to eliminate the intermediate objects is to just overwrite the same object again and again:
|
|
|
|
```{r, eval = FALSE}
|
|
foo_foo <- hop(foo_foo, through = forest)
|
|
foo_foo <- scoop(foo_foo, up = field_mice)
|
|
foo_foo <- bop(foo_foo, on = head)
|
|
```
|
|
|
|
This is less typing (and less thinking), so you're less likely to make mistakes. However, there are two problems:
|
|
|
|
1. It will make debugging painful: if you make a mistake you'll need to start
|
|
again from scratch.
|
|
|
|
1. The repetition of the object being transformed (we've written `foo_foo` six
|
|
times!) obscures what's changing on each line.
|
|
|
|
#### Function composition
|
|
|
|
Another approach is to abandon assignment altogether and just string the function calls together:
|
|
|
|
```{r, eval = FALSE}
|
|
bop(
|
|
scoop(
|
|
hop(foo_foo, through = forest),
|
|
up = field_mice
|
|
),
|
|
on = head
|
|
)
|
|
```
|
|
|
|
Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (sometimes called the
|
|
[dagwood sandwhich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
|
|
|
|
#### Use the pipe
|
|
|
|
Finally, we can use the pipe:
|
|
|
|
```{r, eval = FALSE}
|
|
foo_foo %>%
|
|
hop(through = forest) %>%
|
|
scoop(up = field_mouse) %>%
|
|
bop(on = head)
|
|
```
|
|
|
|
This is my favourite form. The downside is that you need to understand what the pipe does, but once you've mastered that idea task, you can read this series of function compositions like it's a set of imperative actions. Foo foo, hops, then scoops, then bops.
|
|
|
|
Behind the scenes magrittr converts this to:
|
|
|
|
```{r, eval = FALSE}
|
|
. <- hop(foo_foo, through = forest)
|
|
. <- scoop(., up = field_mice)
|
|
bop(., on = head)
|
|
```
|
|
|
|
It's useful to know this because if an error is thrown in the middle of the pipe, you'll need to be able to interpret the `traceback()`.
|
|
|
|
### Other tools from magrittr
|
|
|
|
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of packages you work in this book automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
|
|
|
|
```{r}
|
|
library(magrittr)
|
|
```
|
|
|
|
* When working with more complex pipes, it's some times useful to call a
|
|
function for its side-effects. Maybe you want to print out the current
|
|
object, or plot it, or save it to disk. Many times, such functions don't
|
|
return anything, effectively terminating the pipe.
|
|
|
|
To work around this problem, you can use the "tee" pipe. `%T>%` works like
|
|
`%>%` except instead it returns the LHS instead of the RHS. It's called
|
|
"tee" because it's like a literal T-shaped pipe.
|
|
|
|
```{r}
|
|
rnorm(100) %>%
|
|
matrix(ncol = 2) %>%
|
|
plot() %>%
|
|
str()
|
|
|
|
rnorm(100) %>%
|
|
matrix(ncol = 2) %T>%
|
|
plot() %>%
|
|
str()
|
|
```
|
|
|
|
* If you're working with functions that don't have a dataframe based API
|
|
(i.e. you pass them individual vectors, not a data frame and expressions
|
|
to be evaluated in the context of that data frame), you might find `%$%`
|
|
useful. It "explodes" out the variables in a data frame so that you can
|
|
refer to them explicitly. This is useful when working with many functions
|
|
in base R:
|
|
|
|
```{r}
|
|
mtcars %$%
|
|
cor(disp, mpg)
|
|
```
|
|
|
|
* For assignment magrittr provides the `%<>%` operator which allows you to
|
|
replace code like:
|
|
|
|
```R
|
|
mtcars <- mtcars %>% transform(cyl = cyl * 2)
|
|
```
|
|
|
|
with
|
|
|
|
```R
|
|
mtcars %<>% transform(cyl = cyl * 2)
|
|
```
|
|
|
|
I'm not a fan of this operator because I think assignment is such a
|
|
special operation that it should always be clear when it's occurring.
|
|
In my opinion, a little bit of duplication (i.e. repeating the
|
|
name of the object twice), is fine in return for making assignment
|
|
more explicit.
|
|
|
|
### When not to use the pipe
|
|
|
|
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
|
|
|
|
* Your pipes get longer than five or six lines. In that case, create
|
|
intermediate objects with meaningful names. That will make debugging easier,
|
|
because you can more easily check the intermediate results. It also helps
|
|
when reading the code, because the variable names can help describe the
|
|
intent of the code.
|
|
|
|
* You have multiple inputs or outputs. If there is not one primary object
|
|
being transformed, write code the regular ways.
|
|
|
|
* You are starting to think about a directed graph with a complex
|
|
dependency structure. Pipes are fundamentally linear and expressing
|
|
complex relationships with them typically does not yield clear code.
|
|
|
|
### Pipes in production
|
|
|
|
When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output it's a really good idea to include some assertions that verify the data looks like expected. One great way to do this is the ensurer package, written by Stefan Milton Bache (the author of magrittr).
|
|
|
|
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
|
|
|
|
## Functions
|
|
|
|
One of the best ways to grow in your capabilities as a user of R for data science is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get to start walking in the right direction.
|
|
|
|
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
|
|
|
|
```{r}
|
|
df <- data.frame(
|
|
a = rnorm(10),
|
|
b = rnorm(10),
|
|
c = rnorm(10),
|
|
d = rnorm(10)
|
|
)
|
|
|
|
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
|
|
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
|
|
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
|
|
(max(df$a, na.rm = TRUE) - min(df$b, na.rm = TRUE))
|
|
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
|
|
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
|
|
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
|
|
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
|
|
```
|
|
|
|
You might be able to puzzle out that this rescales each column to 0--1. But did you spot the mistake? I made an error when copying-and-pasting the code for `df$b`, and I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it helps make your code more understandable (because you can name the operation), and it prevents you from making this class of errors.
|
|
|
|
To write a function you need to first analyse the operation. How many inputs does it have?
|
|
|
|
```{r, eval = FALSE}
|
|
(df$a - min(df$a, na.rm = TRUE)) /
|
|
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
|
|
```
|
|
|
|
It's often a good idea to rewrite the code using some temporary values. Here this function only takes one input, so I'll call it `x`:
|
|
|
|
```{r}
|
|
x <- 1:10
|
|
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
|
|
```
|
|
|
|
There is some duplication in this code: I'm computing the `min()` and `max()` multiple times, and I could instead do that in one step:
|
|
|
|
```{r}
|
|
rng <- range(x, na.rm = TRUE)
|
|
(x - rng[1]) / (rng[2] - rng[1])
|
|
```
|
|
|
|
Now that I've simplified the code, and made sure it works, I can turn it into a function:
|
|
|
|
```{r}
|
|
rescale01 <- function(x) {
|
|
rng <- range(x, na.rm = TRUE)
|
|
(x - rng[1]) / (rng[2] - rng[1])
|
|
}
|
|
rescale01(c(0, 5, 10))
|
|
```
|
|
|
|
Always make sure your code works on a simple test case before creating the function!
|
|
|
|
Note the process that I followed here: I constructed the `function` last. It's much easier to start with code that works on a sample input and then turn it into a function rather than the other way around. You're more likely to get to your final destination if you take small steps and check your work after each step.
|
|
|
|
Now we can use that to simplify our original example:
|
|
|
|
```{r}
|
|
df$a <- rescale01(df$a)
|
|
df$b <- rescale01(df$b)
|
|
df$c <- rescale01(df$c)
|
|
df$d <- rescale01(df$d)
|
|
```
|
|
|
|
This makes it more clear what we're doing, and avoids one class of copy-and-paste errors. However, we still have quite a bit of duplication: we're still doing the same thing to multiple columns. We'll learn how to handle that in the for loop section. But first, lets talk a bit more about functions.
|
|
|
|
### Practice
|
|
|
|
1. Practice turning the following code snippets into functions. Think about
|
|
what each function does. What would you call it? How many arguments does it
|
|
need? Can you rewrite it to be more expressive or less duplicative?
|
|
|
|
```{r, eval = FALSE}
|
|
mean(is.na(x))
|
|
|
|
x / sum(x, na.rm = TRUE)
|
|
|
|
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
|
|
|
|
mean((x - mean(x))^3) / mean((x - mean(x))^2)^(3/2)
|
|
```
|
|
|
|
1. Implement a `fizzbuzz` function. It take a single number as input. If
|
|
the number is divisible by three, return "fizz". If it's divisible by
|
|
five return "buzz". If it's divisible by three and five, return "fizzbuzz".
|
|
Otherwise, return the number.
|
|
|
|
### Function components
|
|
|
|
There are three attributes that define what a function does:
|
|
|
|
1. The __arguments__ of a function are its possible inputs.
|
|
Sometimes these are called _formal_ arguments to distinguish them from
|
|
the actual arguments that a function is called with. For example, the
|
|
formal argument of mean are `x`, `trim` and `na.rm`, but a given call
|
|
might only use some of these arguments.
|
|
|
|
1. The __body__ of a function is the code that it runs each time.
|
|
The last statement evaluated in the function body is what it returns.
|
|
The return value is not a property of the function because it changes
|
|
depending on the input values.
|
|
|
|
1. The function __environment__ controls how it looks up values from names
|
|
(i.e. how it goes from the name `x`, to its value, `10`). The set of
|
|
rules that governs this behaviour is called scoping.
|
|
|
|
#### Arguments
|
|
|
|
You can choose to supply default values to your arguments for common options. This is useful so that you don't need to repeat yourself all the time.
|
|
|
|
```{r}
|
|
foo <- function(x = 1, y = TRUE, z = 10:1) {
|
|
|
|
}
|
|
```
|
|
|
|
Whenever you have a mix of arguments with and without defaults, those without defaults should come first.
|
|
|
|
Default values can depend on other arguments but don't overuse this technique as it's possible to create code that is very difficult to understand. What does this function do?
|
|
|
|
```{r}
|
|
bar <- function(x = y + 1, y = x - 1) {
|
|
x * y
|
|
}
|
|
```
|
|
|
|
There's a special argument that's used quite commonly: `...`. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another argument. This is a useful catch-all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
|
|
|
|
```{r}
|
|
geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
|
|
size = 2, ...) {
|
|
geom_smooth(formula = formula, se = FALSE, method = "lm", colour = colour,
|
|
size = size, ...)
|
|
}
|
|
```
|
|
|
|
This allows you to use any other arguments of `geom_smooth()`, even those that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
|
|
|
|
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called:
|
|
|
|
```{r}
|
|
g <- function(a, b, c) {
|
|
a + b
|
|
}
|
|
g(1, 2, stop("Not used!"))
|
|
```
|
|
|
|
You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
|
|
|
|
#### Body
|
|
|
|
The body of the function does the actual work. The value returned by the function is the last statement it evaluates. Unlike other languages all statements in R return a value. An `if` statement returns the value from the branch that was chosen:
|
|
|
|
```{r}
|
|
greeting <- function(time = lubridate::now()) {
|
|
hour <- lubridate::hour(time)
|
|
|
|
if (hour < 12) {
|
|
"Good morning"
|
|
} else if (hour < 18) {
|
|
"Good afternoon"
|
|
} else {
|
|
"Good evening"
|
|
}
|
|
}
|
|
greeting()
|
|
```
|
|
|
|
That also means you can assign the result of an `if` statement to a variable:
|
|
|
|
```{r}
|
|
y <- 10
|
|
x <- if (y < 20) "Too low" else "Too high"
|
|
```
|
|
|
|
You can explicitly return early from a function with `return()`. I think it's best to save the use of `return()` to signal when you're returning early for some special reason.
|
|
|
|
It's sometimes useful when you want to write code like this:
|
|
|
|
```{r, eval = FALSE}
|
|
f <- function() {
|
|
if (x) {
|
|
# Do
|
|
# something
|
|
# that
|
|
# takes
|
|
# many
|
|
# lines
|
|
# to
|
|
# express
|
|
} else {
|
|
# return something short
|
|
}
|
|
}
|
|
```
|
|
|
|
Because you can rewrite it as:
|
|
|
|
```{r, eval = FALSE}
|
|
|
|
f <- function() {
|
|
if (!x) {
|
|
return(something_short)
|
|
}
|
|
|
|
# Do
|
|
# something
|
|
# that
|
|
# takes
|
|
# many
|
|
# lines
|
|
# to
|
|
# express
|
|
}
|
|
```
|
|
|
|
Some functions return "invisible" values. These are not printed out by default but can be saved to a variable:
|
|
|
|
```{r}
|
|
f <- function() {
|
|
invisible(42)
|
|
}
|
|
|
|
f()
|
|
|
|
x <- f()
|
|
x
|
|
```
|
|
|
|
You can also force printing by surrounding the call in parentheses:
|
|
|
|
```{r}
|
|
(f())
|
|
```
|
|
|
|
Invisible values are mostly used when your function is called primarily for its side-effects (e.g. printing, plotting, or saving a file). It's nice to be able pipe such functions together, so returning the main input value is useful. This allows you to do things like:
|
|
|
|
```{r, eval = FALSE}
|
|
library(readr)
|
|
|
|
mtcars %>%
|
|
write_csv("mtcars.csv") %>%
|
|
write_tsv("mtcars.tsv")
|
|
```
|
|
|
|
#### Environment
|
|
|
|
The environment of a function controls how R finds the value associated with a name. For example, take this function:
|
|
|
|
```{r}
|
|
f <- function(x) {
|
|
x + y
|
|
}
|
|
```
|
|
|
|
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to determine the value associated with a name. Since `y` is not defined inside the function, R will look where the function was defined:
|
|
|
|
```{r}
|
|
y <- 100
|
|
f(10)
|
|
|
|
y <- 1000
|
|
f(10)
|
|
```
|
|
|
|
This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn't cause too many problems (especially if you regularly restart R to get to a clean slate). The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
|
|
|
|
This allows you to do devious things like:
|
|
|
|
```{r}
|
|
`+` <- function(x, y) {
|
|
if (runif(1) < 0.1) {
|
|
sum(x, y)
|
|
} else {
|
|
sum(x, y) * 1.1
|
|
}
|
|
}
|
|
table(replicate(1000, 1 + 2))
|
|
rm(`+`)
|
|
```
|
|
|
|
This is a common phenomenon in R. R gives you a lot of control. You can do many things that are not possible in other programming languages. You can things that 99% of the time extremely ill-advised (like overriding how addition works!), but this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make good use of this flexibility is beyond the scope of this book, but you can read about in "Advanced R".
|
|
|
|
#### Exercises
|
|
|
|
1. What happens if you call `bar()`? What does the error message mean?
|
|
|
|
1. What happens if you try to override the method in `geom_lm()` created
|
|
above (e.g. `geom_lm(method = "glm")`? Why?
|
|
|
|
### Making functions with magrittr
|
|
|
|
Another way to write functions is using magrittr. You've already seen how to execute a pipeline on a specific dataset:
|
|
|
|
```{r}
|
|
library(dplyr)
|
|
mtcars %>%
|
|
filter(mpg > 5) %>%
|
|
group_by(cyl) %>%
|
|
summarise(n = n())
|
|
```
|
|
|
|
But you can also create a generic pipeline that you can apply to any object:
|
|
|
|
```{r}
|
|
my_fun <- . %>%
|
|
filter(mpg > 5) %>%
|
|
group_by(cyl) %>%
|
|
summarise(n = n())
|
|
my_fun
|
|
|
|
my_fun(mtcars)
|
|
```
|
|
|
|
The key is to use `.` as the initial input in to the pipe. This is a great way to create a quick and dirty function if you've already made one pipe and now want to re-apply it in many places.
|
|
|
|
### Non-standard evaluation
|
|
|
|
One challenge with writing functions is that many of the functions you've used in this book use non-standard evaluation to minimise typing. This makes these functions great for interactive use, but it does make it more challenging to program with them, because you need to use more advanced techniques. For example, imagine you'd written the following duplicated code across a handful of data analysis projects:
|
|
|
|
```{r}
|
|
mtcars %>%
|
|
group_by(cyl) %>%
|
|
summarise(mean = mean(mpg, na.rm = TRUE), n = n()) %>%
|
|
filter(n > 10) %>%
|
|
arrange(desc(mean))
|
|
|
|
ggplot2::diamonds %>%
|
|
group_by(cut) %>%
|
|
summarise(mean = mean(price, na.rm = TRUE), n = n()) %>%
|
|
filter(n > 10) %>%
|
|
arrange(desc(mean))
|
|
|
|
nycflights13::planes %>%
|
|
group_by(model) %>%
|
|
summarise(mean = mean(year, na.rm = TRUE), n = n()) %>%
|
|
filter(n > 100) %>%
|
|
arrange(desc(mean))
|
|
```
|
|
|
|
You'd like to be able to write a function with arguments data frame, group and variable so you could rewrite the above code as:
|
|
|
|
```{r, eval = FALSE}
|
|
mtcars %>%
|
|
mean_by(cyl, mpg, n = 10)
|
|
|
|
ggplot2::diamonds %>%
|
|
mean_by(cut, price, n = 10)
|
|
|
|
nycflights13::planes %>%
|
|
mean_by(model, year, n = 100)
|
|
```
|
|
|
|
Unfortunately the obvious approach doesn't work:
|
|
|
|
```{r}
|
|
mean_by <- function(data, group_var, mean_var, n = 10) {
|
|
data %>%
|
|
group_by(group_var) %>%
|
|
summarise(mean = mean(mean_var, na.rm = TRUE), n = n()) %>%
|
|
filter(n > 100) %>%
|
|
arrange(desc(mean))
|
|
}
|
|
```
|
|
|
|
This fails because it tells dplyr to group by `group_var` and compute the mean of `mean_var` neither of which exist in the data frame. Writing reusable functions for ggplot2 poses a similar problem because `aes(group_var, mean_var)` would look for variables called `group_var` and `mean_var`. It's really only been in the last couple of months that I fully understood this problem, so there aren't currently any great (or general) solutions. However, now that I've understood the problem I think there will be some systematic solutions in the near future.
|
|
|
|
### Exercises
|
|
|
|
1. Follow <http://nicercode.github.io/intro/writing-functions.html> to
|
|
write your own functions to compute the variance and skew of a vector.
|
|
|
|
1. Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo)
|
|
to "Little Bunny Foo". There's a lot of duplication in this song.
|
|
Extend the initial piping example to recreate the complete song, using
|
|
functions to reduce duplication.
|
|
|
|
## For loops
|
|
|
|
Before we tackle the problem of rescaling each column, lets start with a simpler case. Imagine we want to summarise each column with its median. One way to do that is to use a for loop. Every for loop has three main components:
|
|
|
|
```{r}
|
|
results <- vector("numeric", ncol(df))
|
|
for (i in seq_along(df)) {
|
|
results[[i]] <- median(df[[i]])
|
|
}
|
|
results
|
|
```
|
|
|
|
There are three parts to a for loop:
|
|
|
|
1. The __results__: `results <- vector("integer", length(x))`.
|
|
This creates an integer vector the same length as the input. It's important
|
|
to enough space for all the results up front, otherwise you have to grow the
|
|
results vector at each iteration, which is very slow for large loops.
|
|
|
|
1. The __sequence__: `i in seq_along(df)`. This determines what to loop over:
|
|
each run of the for loop will assign `i` to a different value from
|
|
`seq_along(df)`, shorthand for `1:length(df)`. It's useful to think of `i`
|
|
as a pronoun.
|
|
|
|
1. The __body__: `results[i] <- median(df[[i]])`. This code is run repeatedly,
|
|
each time with a different value in `i`. The first iteration will run
|
|
`results[1] <- median(df[[2]])`, the second `results[2] <- median(df[[2]])`,
|
|
and so on.
|
|
|
|
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(l)`. There's one important difference in behaviour. If you have a zero-length vector, `seq_along()` does the right thing:
|
|
|
|
```{r}
|
|
y <- numeric(0)
|
|
seq_along(y)
|
|
1:length(y)
|
|
```
|
|
|
|
Lets go back to our original motivation:
|
|
|
|
```{r}
|
|
df$a <- rescale01(df$a)
|
|
df$b <- rescale01(df$b)
|
|
df$c <- rescale01(df$c)
|
|
df$d <- rescale01(df$d)
|
|
```
|
|
|
|
In this case the output is already present: we're modifying an existing object.
|
|
|
|
Think about a data frame as a list of columns (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
|
|
|
|
That makes our for loop quite simple:
|
|
|
|
```{r, eval = FALSE}
|
|
for (i in seq_along(df)) {
|
|
df[[i]] <- rescale01(df[[i]])
|
|
}
|
|
```
|
|
|
|
For loops are not as important in R as they are in other languages as rather than writing your own for loops, you'll typically use prewritten functions that wrap up common for-loop patterns. You'll learn about those in the next chapter. These functions are important because they wrap up the book-keeping code related to the for loop, focussing purely on what's happening. For example the two for-loops we wrote above can be rewritten as:
|
|
|
|
```{r, eval = FALSE}
|
|
library(purrr)
|
|
|
|
map_dbl(df, median)
|
|
df[] <- map(df, rescale01)
|
|
```
|
|
|
|
The focus is now on the function doing the modification, rather than the apparatus of the for-loop.
|
|
|
|
### Looping patterns
|
|
|
|
There are three basic ways to loop over a vector:
|
|
|
|
1. Loop over the elements: `for (x in xs)`. Most useful for side-effects,
|
|
but it's difficult to save the output efficiently.
|
|
|
|
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
|
|
form if you want to know the element (`xs[[i]]`) and its position.
|
|
|
|
1. Loop over the names: `for (nm in names(xs))`. Gives you both the name
|
|
and the position. This is useful if you want to use the name in a
|
|
plot title or a file name.
|
|
|
|
The most general form uses `seq_along(xs)`, because from the position you can access both the name and the value:
|
|
|
|
```{r, eval = FALSE}
|
|
for (i in seq_along(x)) {
|
|
name <- names(x)[[i]]
|
|
value <- x[[i]]
|
|
}
|
|
```
|
|
|
|
### Exercises
|
|
|
|
1. Convert the song "99 bottles of beer on the wall" to a function. Generalise
|
|
to any number of any vessel containing any liquid on any surface.
|
|
|
|
1. Convert the nursey rhyme "ten in the bed" to a function. Generalise it
|
|
to any number of people in any sleeping structure.
|
|
|
|
1. It's common to see for loops that don't preallocate the output and instead
|
|
increase the length of a vector at each step:
|
|
|
|
```{r}
|
|
results <- vector("integer", 0)
|
|
for (i in seq_along(x)) {
|
|
results <- c(results, lengths(x[[i]]))
|
|
}
|
|
results
|
|
```
|
|
|
|
How does this affect performance?
|
|
|
|
## Learning more
|
|
|
|
As you become a better R programmer, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.
|
|
|
|
To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
|
|
|
|
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
|
|
by Garrett Grolemund. This is an introduction to R as a programming language
|
|
and is a great place to start if R is your first programming language.
|
|
|
|
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
|
|
details of R the programming language. This is a great place to start if
|
|
you've programmed in other languages and you want to learn what makes R
|
|
special, different, and particularly well suited to data analysis.
|