Pull out iteration and pipes chapters

hadley
2016-03-01 08:29:58 -06:00
parent 68a398a59e
commit f3db1315db
8 changed files with 891 additions and 871 deletions

1
.gitignore vendored

@@ -4,3 +4,4 @@
*_files
_main.rds
_book
_main.html


@@ -14,7 +14,9 @@ rmd_files: [
"strings.Rmd",
"datetimes.Rmd",
"program.Rmd",
"pipes.Rmd",
"functions.Rmd",
"iteration.Rmd",
"data-structures.Rmd",
"lists.Rmd",
"robust-code.Rmd",

Binary file not shown.


@@ -1,289 +1,9 @@
# Expressing yourself in code
```{r, include = FALSE}
library(dplyr)
diamonds <- ggplot2::diamonds
```
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you move in this direction:
1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
and how it gives you a new tool for rewriting your code. You'll also learn
about when not to use the pipe!
1. Repeating yourself in code is dangerous because it can easily lead to
errors and inconsistencies. We'll talk about how to write __functions__
in order to remove duplication in your logic.
1. Another important tool for removing duplication is the __for loop__ which
allows you to repeat the same action again and again and again. You tend to
use for-loops less often in R than in other programming languages because R
is a functional programming language which means that you can extract out
common patterns of for loops and put them in a function. We'll come back to
that idea in XYZ.
Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions, the more likely your first attempt will be clear.)
## Piping
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect at all what the code does; behind the scenes it is run in (almost) exactly the same way. What the pipe does is change how the code is written and hence how it is read. It tends to transform to a more imperative form (do this, do that, do that other thing, ...) so that it's easier to read.
### Piping alternatives
To explore how you can write the same code in many different ways, let's use code to tell a story about a little bunny named Foo Foo:
> Little bunny Foo Foo
> Went hopping through the forest
> Scooping up the field mice
> And bopping them on the head
We'll start by defining an object to represent little bunny Foo Foo:
```{r, eval = FALSE}
foo_foo <- little_bunny()
```
And then we'll use a function for each key verb: `hop()`, `scoop()`, and `bop()`. Using this object and these verbs, there are a number of ways we could retell the story in code:
* Save each intermediate step as a new object
* Rewrite the original object multiple times
* Compose functions
* Use the pipe
Below we work through each approach, showing you the code and talking about the advantages and disadvantages.
#### Intermediate steps
The simplest and most robust approach to sequencing multiple function calls is to save each intermediary as a new object:
```{r, eval = FALSE}
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
The main downside of this form is that it forces you to name each intermediate element. If there are natural names, this form feels natural, and you should use it. But if you're giving them arbitrary unique names, like in this example, I don't think it's that useful. Whenever I write code like this, I invariably write the wrong number somewhere and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory. First, in R, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: it will reuse the shared columns in a pipeline of data frame transformations. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
```{r}
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)
library(pryr)
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
```
`pryr::object_size()` gives the memory occupied by all of its arguments. The results seem counterintuitive at first:
* `diamonds` takes up 3.46 MB,
* `diamonds2` takes up 3.89 MB,
* `diamonds` and `diamonds2` together take up 3.89 MB!
How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchanged, but the collective size will increase:
```{r}
diamonds$carat[1] <- NA
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
```
(Note that we use `pryr::object_size()` here, not the built-in `object.size()`. The built-in function only takes a single object, so it can't show how memory is shared across several objects.)
#### Overwrite the original
One way to eliminate the intermediate objects is to just overwrite the same object again and again:
```{r, eval = FALSE}
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```
This is less typing (and less thinking), so you're less likely to make mistakes. However, there are two problems:
1. It will make debugging painful: if you make a mistake you'll need to start
again from scratch.
1. The repetition of the object being transformed (we've written `foo_foo` six
times!) obscures what's changing on each line.
#### Function composition
Another approach is to abandon assignment altogether and just string the function calls together:
```{r, eval = FALSE}
bop(
scoop(
hop(foo_foo, through = forest),
up = field_mice
),
on = head
)
```
Here the disadvantage is that you have to read from the inside out, from right to left, and that the arguments end up spread far apart (sometimes called the
[Dagwood sandwich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
#### Use the pipe
Finally, we can use the pipe:
```{r, eval = FALSE}
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
```
This is my favourite form. The downside is that you need to understand what the pipe does, but once you've mastered that idea, you can read this series of function compositions like it's a set of imperative actions: Foo Foo hops, then scoops, then bops.
Behind the scenes magrittr converts this to:
```{r, eval = FALSE}
. <- hop(foo_foo, through = forest)
. <- scoop(., up = field_mice)
bop(., on = head)
```
It's useful to know this because if an error is thrown in the middle of the pipe, you'll need to be able to interpret the `traceback()`.
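To see that these forms really do the same work, here is a minimal runnable sketch. The definitions of `little_bunny()`, `hop()`, `scoop()`, and `bop()` are hypothetical stand-ins (the text never defines them): each verb simply appends a description of the action to a character vector, and the `through`, `up`, and `on` arguments are plain strings rather than objects.

```r
library(magrittr)

# Hypothetical stand-ins for the verbs in the story
little_bunny <- function() character()
hop   <- function(x, through) c(x, paste("hopped through the", through))
scoop <- function(x, up)      c(x, paste("scooped up the", up))
bop   <- function(x, on)      c(x, paste("bopped them on the", on))

foo_foo <- little_bunny()

# Intermediate steps
step_1 <- hop(foo_foo, through = "forest")
step_2 <- scoop(step_1, up = "field mice")
by_steps <- bop(step_2, on = "head")

# Function composition
by_nesting <- bop(
  scoop(
    hop(foo_foo, through = "forest"),
    up = "field mice"
  ),
  on = "head"
)

# The pipe
by_pipe <- foo_foo %>%
  hop(through = "forest") %>%
  scoop(up = "field mice") %>%
  bop(on = "head")

# All three forms produce the same result
identical(by_steps, by_nesting)
identical(by_steps, by_pipe)
```

Only the way the code reads changes between the three versions; the values computed are the same.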
### Other tools from magrittr
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of the packages you work with in this book automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
```{r}
library(magrittr)
```
* When working with more complex pipes, it's sometimes useful to call a
function for its side-effects. Maybe you want to print out the current
object, or plot it, or save it to disk. Many times, such functions don't
return anything, effectively terminating the pipe.
To work around this problem, you can use the "tee" pipe. `%T>%` works like
`%>%` except that it returns the left-hand side instead of the right-hand
side. It's called "tee" because it's like a literal T-shaped pipe.
```{r}
rnorm(100) %>%
matrix(ncol = 2) %>%
plot() %>%
str()
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot() %>%
str()
```
* If you're working with functions that don't have a data frame based API
(i.e. you pass them individual vectors, not a data frame and expressions
to be evaluated in the context of that data frame), you might find `%$%`
useful. It "explodes" out the variables in a data frame so that you can
refer to them explicitly. This is useful when working with many functions
in base R:
```{r}
mtcars %$%
cor(disp, mpg)
```
* For assignment magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```
with
```R
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a
special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
### When not to use the pipe
I also made a slight simplification when I said that `x %>% f(y)` is exactly the same as `f(x, y)`. That's not quite true, which you'll see particularly for two classes of functions:
1. Functions that use the current environment. For example, `assign()`
will create a new variable with the given name in the current environment:
```{r}
assign("x", 10)
x
"x" %>% assign(100)
x
```
The use of assign with the pipe does not work because it assigns it to
a temporary environment used by `%>%`. If you do want to use assign with the
pipe, you can be explicit about the environment:
```{r}
env <- environment()
"x" %>% assign(100, envir = env)
x
```
Other functions with this problem are `get()` and `load()`.
1. Functions that affect how their arguments are computed. In R, arguments
are lazy, which means they are only computed when the function uses them,
not prior to calling the function. This means that the function can affect
the global environment in various ways. The pipe forces computation of
each element in series so you can't rely on this behaviour.
```{r, error = TRUE}
tryCatch(stop("!"), error = function(e) "An error")
stop("!") %>%
tryCatch(error = function(e) "An error")
```
There is a relatively wide class of functions with this behaviour, including
`try()`, `suppressMessages()`, `suppressWarnings()`, any function from the
withr package, ...
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
* Your pipes get longer than five or six lines. In that case, create
intermediate objects with meaningful names. That will make debugging easier,
because you can more easily check the intermediate results. It also helps
when reading the code, because the variable names can help describe the
intent of the code.
* You have multiple inputs or outputs. If there is not one primary object
being transformed, write code the regular way.
* You are starting to think about a directed graph with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them typically does not yield clear code.
### Pipes in production
When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output, it's a really good idea to include some assertions that verify the data looks as expected. One great way to do this is the ensurer package, written by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
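If you just want to see the idea without learning a new package, one lightweight option is a small helper that validates its input and then returns it unchanged, so the pipe can carry on. This is only a sketch: the helper name `check_has_rows()`, its `min_rows` argument, and the checks it performs are illustrative assumptions, not part of ensurer or magrittr.

```r
library(magrittr)

# Hypothetical helper: stop with an error if the data looks wrong,
# otherwise hand the data back untouched so the pipe keeps flowing.
check_has_rows <- function(df, min_rows = 1) {
  stopifnot(is.data.frame(df), nrow(df) >= min_rows)
  df
}

heavy_cars <- mtcars %>%
  subset(wt > 5) %>%
  check_has_rows() %>%
  transform(kpl = mpg * 0.425)  # approximate miles/gallon to km/litre

nrow(heavy_cars)

# An empty data frame fails the check instead of flowing silently downstream
failed <- try(check_has_rows(mtcars[0, ]), silent = TRUE)
inherits(failed, "try-error")
```

Because the helper returns its input, it can be dropped between any two steps of an existing pipe without changing the result when the data is valid.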
## Functions
# Functions
One of the best ways to grow in your skills as a data scientist in R is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get to start walking in the right direction.
## When should you write a function?
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
```{r}
@@ -375,7 +95,7 @@ This makes it more clear what we're doing, and avoids one class of copy-and-past
five return "buzz". If it's divisible by three and five, return "fizzbuzz".
Otherwise, return the number.
### Function components
## Function components
There are three attributes that define what a function does:
@@ -394,7 +114,7 @@ There are three attributes that define what a function does:
(i.e. how it goes from the name `x`, to its value, `10`). The set of
rules that governs this behaviour is called scoping.
#### Arguments
### Arguments
You can choose to supply default values to your arguments for common options. This is useful so that you don't need to repeat yourself all the time.
@@ -437,7 +157,7 @@ g(1, 2, stop("Not used!"))
You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>
#### Body
### Body
The body of the function does the actual work. The value returned by the function is the last statement it evaluates. Unlike other languages all statements in R return a value. An `if` statement returns the value from the branch that was chosen:
@@ -535,7 +255,7 @@ mtcars %>%
write_tsv("mtcars.tsv")
```
#### Environment
### Environment
The environment of a function controls how R finds the value associated with a name. For example, take this function:
@@ -580,7 +300,7 @@ This is a common phenomenon in R. R gives you a lot of control. You can do many
1. What happens if you try to override the method in `geom_lm()` created
above (e.g. `geom_lm(method = "glm")`? Why?
### Making functions with magrittr
## Making functions with magrittr
Another way to write functions is using magrittr. You've already seen how to execute a pipeline on a specific dataset:
@@ -606,7 +326,7 @@ my_fun(mtcars)
The key is to use `.` as the initial input in to the pipe. This is a great way to create a quick and dirty function if you've already made one pipe and now want to re-apply it in many places.
### Non-standard evaluation
## Non-standard evaluation
One challenge with writing functions is that many of the functions you've used in this book use non-standard evaluation to minimise typing. This makes these functions great for interactive use, but it does make it more challenging to program with them, because you need to use more advanced techniques. For example, imagine you'd written the following duplicated code across a handful of data analysis projects:
@@ -666,131 +386,3 @@ This fails because it tells dplyr to group by `group_var` and compute the mean o
to "Little Bunny Foo". There's a lot of duplication in this song.
Extend the initial piping example to recreate the complete song, using
functions to reduce duplication.
## For loops
Before we tackle the problem of rescaling each column, let's start with a simpler case. Imagine we want to summarise each column with its median. One way to do that is to use a for loop. Every for loop has three main components:
```{r}
results <- vector("numeric", ncol(df))
for (i in seq_along(df)) {
results[[i]] <- median(df[[i]])
}
results
```
There are three parts to a for loop:
1. The __results__: `results <- vector("numeric", ncol(df))`.
This creates a numeric vector with one element per column of the input. It's
important to allocate enough space for all the results up front, otherwise you
have to grow the results vector at each iteration, which is very slow for large loops.
1. The __sequence__: `i in seq_along(df)`. This determines what to loop over:
each run of the for loop will assign `i` to a different value from
`seq_along(df)`, shorthand for `1:length(df)`. It's useful to think of `i`
as a pronoun.
1. The __body__: `results[[i]] <- median(df[[i]])`. This code is run repeatedly,
each time with a different value in `i`. The first iteration will run
`results[[1]] <- median(df[[1]])`, the second `results[[2]] <- median(df[[2]])`,
and so on.
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(l)`. There's one important difference in behaviour. If you have a zero-length vector, `seq_along()` does the right thing:
```{r}
y <- numeric(0)
seq_along(y)
1:length(y)
```
Let's go back to our original motivation:
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
In this case the output is already present: we're modifying an existing object.
Think about a data frame as a list of columns (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
That makes our for loop quite simple:
```{r, eval = FALSE}
for (i in seq_along(df)) {
df[[i]] <- rescale01(df[[i]])
}
```
For loops are not as important in R as they are in other languages because, rather than writing your own for loops, you'll typically use prewritten functions that wrap up common for-loop patterns. You'll learn about those in the next chapter. These functions are important because they wrap up the book-keeping code related to the for loop, letting you focus purely on what's happening. For example, the two for loops we wrote above can be rewritten as:
```{r, eval = FALSE}
library(purrr)
map_dbl(df, median)
df[] <- map(df, rescale01)
```
The focus is now on the function doing the modification, rather than the apparatus of the for-loop.
### Looping patterns
There are three basic ways to loop over a vector:
1. Loop over the elements: `for (x in xs)`. Most useful for side-effects,
but it's difficult to save the output efficiently.
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
form if you want to know the element (`xs[[i]]`) and its position.
1. Loop over the names: `for (nm in names(xs))`. Gives you the name, which
you can use to access the value with `xs[[nm]]`. This is useful if you want
to use the name in a plot title or a file name.
The most general form uses `seq_along(xs)`, because from the position you can access both the name and the value:
```{r, eval = FALSE}
for (i in seq_along(xs)) {
name <- names(xs)[[i]]
value <- xs[[i]]
}
```
### Exercises
1. Convert the song "99 bottles of beer on the wall" to a function. Generalise
to any number of any vessel containing any liquid on any surface.
1. Convert the nursery rhyme "ten in the bed" to a function. Generalise it
to any number of people in any sleeping structure.
1. It's common to see for loops that don't preallocate the output and instead
increase the length of a vector at each step:
```{r}
results <- vector("integer", 0)
for (i in seq_along(x)) {
results <- c(results, lengths(x[[i]]))
}
results
```
How does this affect performance?
## Learning more
As you become a better R programmer, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.
To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if
you've programmed in other languages and you want to learn what makes R
special, different, and particularly well suited to data analysis.

585
iteration.Rmd Normal file

@@ -0,0 +1,585 @@
# Iteration
```{r setup, include=FALSE}
library(purrr)
```
## For loops
Before we tackle the problem of rescaling each column, let's start with a simpler case. Imagine we want to summarise each column with its median. One way to do that is to use a for loop. Every for loop has three main components:
```{r}
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
results <- vector("numeric", ncol(df))
for (i in seq_along(df)) {
results[[i]] <- median(df[[i]])
}
results
```
There are three parts to a for loop:
1. The __results__: `results <- vector("numeric", ncol(df))`.
This creates a numeric vector with one element per column of the input. It's
important to allocate enough space for all the results up front, otherwise you
have to grow the results vector at each iteration, which is very slow for large loops.
1. The __sequence__: `i in seq_along(df)`. This determines what to loop over:
each run of the for loop will assign `i` to a different value from
`seq_along(df)`, shorthand for `1:length(df)`. It's useful to think of `i`
as a pronoun.
1. The __body__: `results[[i]] <- median(df[[i]])`. This code is run repeatedly,
each time with a different value in `i`. The first iteration will run
`results[[1]] <- median(df[[1]])`, the second `results[[2]] <- median(df[[2]])`,
and so on.
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(l)`. There's one important difference in behaviour. If you have a zero-length vector, `seq_along()` does the right thing:
```{r}
y <- numeric(0)
seq_along(y)
1:length(y)
```
Let's go back to our original motivation:
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
In this case the output is already present: we're modifying an existing object.
Think about a data frame as a list of columns (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
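A quick check of these claims, using a small made-up data frame (the column names and values are purely illustrative):

```r
# A tiny data frame just for illustration
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

is.list(df)   # a data frame is built on top of a list
length(df)    # the number of columns, not the number of rows
df[["b"]]     # `[[` extracts a single column as a plain vector
df["b"]       # `[` keeps the data frame structure
```

The difference between `[[` and `[` matters inside a loop: functions like `median()` expect the vector that `[[` returns, not a one-column data frame.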
That makes our for loop quite simple:
```{r, eval = FALSE}
for (i in seq_along(df)) {
df[[i]] <- rescale01(df[[i]])
}
```
For loops are not as important in R as they are in other languages because, rather than writing your own for loops, you'll typically use prewritten functions that wrap up common for-loop patterns. You'll learn about those later in this chapter. These functions are important because they wrap up the book-keeping code related to the for loop, letting you focus purely on what's happening. For example, the two for loops we wrote above can be rewritten as:
```{r, eval = FALSE}
library(purrr)
map_dbl(df, median)
df[] <- map(df, rescale01)
```
The focus is now on the function doing the modification, rather than the apparatus of the for-loop.
### Looping patterns
There are three basic ways to loop over a vector:
1. Loop over the elements: `for (x in xs)`. Most useful for side-effects,
but it's difficult to save the output efficiently.
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
form if you want to know the element (`xs[[i]]`) and its position.
1. Loop over the names: `for (nm in names(xs))`. Gives you the name, which
you can use to access the value with `xs[[nm]]`. This is useful if you want
to use the name in a plot title or a file name.
The most general form uses `seq_along(xs)`, because from the position you can access both the name and the value:
```{r, eval = FALSE}
for (i in seq_along(xs)) {
name <- names(xs)[[i]]
value <- xs[[i]]
}
```
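The three patterns can be compared side by side in a small runnable sketch. The named vector `xs` and the output names `doubled` and `labels` are made up for the example.

```r
xs <- c(a = 1.5, b = 2.5, c = 4)

# 1. Over the elements: handy for side effects such as printing
for (x in xs) {
  print(x)
}

# 2. Over the indices: the position lets you fill a preallocated vector
doubled <- vector("double", length(xs))
for (i in seq_along(xs)) {
  doubled[[i]] <- xs[[i]] * 2
}

# 3. Over the names: the name is available for labelling,
#    and the value is looked up with xs[[nm]]
labels <- setNames(character(length(xs)), names(xs))
for (nm in names(xs)) {
  labels[[nm]] <- paste0(nm, " = ", xs[[nm]])
}

doubled
labels
```

Note that the second and third loops both create the output before the loop starts, following the preallocation advice above.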
### Exercises
1. Convert the song "99 bottles of beer on the wall" to a function. Generalise
to any number of any vessel containing any liquid on any surface.
1. Convert the nursery rhyme "ten in the bed" to a function. Generalise it
to any number of people in any sleeping structure.
1. It's common to see for loops that don't preallocate the output and instead
increase the length of a vector at each step:
```{r, eval = FALSE}
results <- vector("integer", 0)
for (i in seq_along(x)) {
results <- c(results, lengths(x[[i]]))
}
results
```
How does this affect performance?
## For loops vs functionals
Imagine you have a data frame and you want to compute the mean of each column. You might write code like this:
```{r}
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- mean(df[[i]])
}
results
```
(Here we're taking advantage of the fact that a data frame is a list of the individual columns, so `length()` and `seq_along()` are useful.)
You realise that you're going to want to compute the means of every column pretty frequently, so you extract it out into a function:
```{r}
col_mean <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- mean(df[[i]])
}
results
}
```
But then you think it'd also be helpful to be able to compute the median or the standard deviation:
```{r}
col_median <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- median(df[[i]])
}
results
}
col_sd <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- sd(df[[i]])
}
results
}
```
I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. Most of the code is for-loop boilerplate and it's hard to see the one piece (`mean()`, `median()`, `sd()`) that differs.
What would you do if you saw a set of functions like this:
```{r}
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```
Hopefully, you'd notice that there's a lot of duplication, and extract it out into an additional argument:
```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations. We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()`, by adding an argument that contains the function to apply to each column:
```{r}
col_summary <- function(df, fun) {
out <- vector("numeric", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, median)
col_summary(df, min)
```
The idea of using a function as an argument to another function is extremely powerful. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the purrr package which provides a set of functions that eliminate the need for for-loops for many common scenarios.
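The function you pass in does not have to be an existing named function. As a small sketch, here is `col_summary()` again (repeated so the block runs on its own) called with functions defined inline; the data frame values are made up for the example.

```r
# Same definition as above, repeated so this block is self-contained
col_summary <- function(df, fun) {
  out <- vector("numeric", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

df <- data.frame(a = c(1, 2, 6), b = c(10, 20, 60))

# Any function that returns a single number will do,
# including one written on the spot
col_summary(df, function(x) max(x) - min(x))
col_summary(df, function(x) mean(x, trim = 0.1))
```

The loop itself never changes; only the piece of behaviour supplied through `fun` does.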
### Exercises
1. Read the documentation for `apply()`. In the 2d case, what two for loops
does it generalise?
1. Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
## The map functions
This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output, so there are seven variations based on what sort of result you want:
* `map()` returns a list.
* `map_lgl()` returns a logical vector.
* `map_int()` returns an integer vector.
* `map_dbl()` returns a double vector.
* `map_chr()` returns a character vector.
* `map_df()` returns a data frame.
* `walk()` returns nothing. Walk is a little different to the others because
it's called exclusively for its side effects, so it's described in more detail
later in [walk](#walk).
Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific available, using `map()` only as a fallback when there is no specialised equivalent available.
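To make the link between the function name and the output type concrete, here is a brief sketch on a small made-up data frame (the column names `n` and `label` are illustrative only):

```r
library(purrr)

df <- data.frame(
  n = c(1.5, 2.5),
  label = c("a", "b"),
  stringsAsFactors = FALSE
)

map_lgl(df, is.numeric)  # logical vector: is each column numeric?
map_chr(df, class)       # character vector: the class of each column
map_int(df, length)      # integer vector: the length of each column
map(df, unique)          # list: the general fallback
```

In each case the result has one element per column, and only the suffix of the map function decides what kind of vector comes back.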
We can use these functions to perform the same computations as the previous for loops:
```{r}
map_int(df, length)
map_dbl(df, mean)
map_dbl(df, median)
```
Compared to using a for loop, the focus is on the operation being performed (i.e. `length()`, `mean()`, or `median()`), not the book-keeping required to loop over every element and store the results.
There are a few differences between `map_*()` and `col_summary()`:
* All purrr functions are implemented in C. This means you can't easily
understand their code, but it makes them a little faster.
* The second argument, `.f`, the function to apply, can be a formula, a
character vector, or an integer vector. You'll learn about those handy
shortcuts in the next section.
* Any arguments after `.f` will be passed on to it each time it's called:
```{r}
map_dbl(df, mean, trim = 0.5)
```
* The map functions also preserve names:
```{r}
z <- list(x = 1:3, y = 4:5)
map_int(z, length)
```
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
```
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
```
Here I've used `.` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
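As a small illustration of the two-argument case, the sketch below uses `map2_dbl()`, a purrr function that this section has not introduced: it walks over two vectors in parallel, and inside the formula `.x` and `.y` stand for the paired elements. The input vectors are made up for the example.

```r
library(purrr)

# One argument: `.x` (or `.`) is the current element
map_dbl(c(1, 4, 9), ~ sqrt(.x))

# Two arguments: `.x` comes from the first vector, `.y` from the second
map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)
```

The formula is shorthand for an anonymous function, so `~ .x + .y` here plays the same role as `function(x, y) x + y`.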
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
```
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
```{r}
models %>%
map(summary) %>%
map_dbl("r.squared")
```
You can also use a numeric vector to select elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```
### Base R
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr.
* The base `sapply()` is a wrapper around `lapply()` that automatically tries
to simplify the results. This is useful for interactive work but is
problematic in a function because you never know what sort of output
you'll get:
```{r}
x1 <- list(
c(0.27, 0.37, 0.57, 0.91, 0.20),
c(0.90, 0.94, 0.66, 0.63, 0.06),
c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
c(0.50, 0.72, 0.99, 0.38, 0.78),
c(0.93, 0.21, 0.65, 0.13, 0.27),
c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
str(sapply(x1, threshold))
str(sapply(x2, threshold))
```
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. The only problem with `vapply()` is that
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`.
One advantage of `vapply()` over the map functions is that it can also
produce matrices - the map functions only ever produce vectors.
* `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
but under the hood is much more efficient.
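To illustrate the point about matrices, here is a small sketch using a made-up data frame called `scores` (not part of the examples above). When the template passed to `vapply()` has length two, the result is a matrix with one column per input element, whereas `map()` keeps the results in a list:

```{r}
library(purrr)

# A hypothetical data frame, used only for this illustration
scores <- data.frame(a = c(1, 5, 9), b = c(2, 4, 6))

# vapply() with a length-2 template returns a 2-row matrix
ranges <- vapply(scores, range, numeric(2))
ranges

# map() returns a list with one length-2 vector per column
range_list <- map(scores, range)
str(range_list)
```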
### Exercises
1. How can you determine which columns in a data frame are factors?
(Hint: data frames are lists.)
1. What happens when you use the map functions on vectors that aren't lists?
What does `map(1:5, runif)` do? Why?
1. What does `map(-2:2, rnorm, n = 5)` do? Why?
1. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
anonymous function.
## Dealing with failure
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.
(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
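A short sketch of why `try()` is harder to work with: the type of the returned value depends on whether the call succeeded, so you have to check the class every time. The object names `ok` and `bad` are just for illustration:

```{r}
# On success, try() returns the ordinary result
ok <- try(log(10), silent = TRUE)
class(ok)

# On failure, it returns an object of class "try-error" instead
bad <- try(log("a"), silent = TRUE)
class(bad)

# So every caller has to test which of the two cases it received
inherits(bad, "try-error")
```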
Let's illustrate this with a simple example: `log()`:
```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the results. That's easy to get with `transpose()`.
```{r}
y <- y %>% transpose()
str(y)
```
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of `y` that are ok:
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```
Purrr provides two other useful adverbs:
* Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
because you give it a default value to return when there is an error.
```{r}
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
```
* `quietly()` performs a similar role to `safely()`, but instead of capturing
errors, it captures printed output, messages, and warnings:
```{r}
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
```
### Exercises
1. Challenge: read all the csv files in this directory. Which ones failed
and why?
```{r, eval = FALSE}
files <- dir("data", pattern = "\\.csv$")
files %>%
set_names(., basename(.)) %>%
map_df(safely(readr::read_csv), .id = "filename")
```
## Parallel maps
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
```{r}
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 10)
```
`map2()` generates this series of function calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-map2.png")
```
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Like `map()`, `map2()` is just a wrapper around a for loop:
```{r}
map2 <- function(x, y, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], y[[i]], ...)
}
out
}
```
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
```{r}
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
```
That looks like:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```
However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.
```{r}
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
```
That generates longer, but safer, calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-named.png")
```
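Like `map2()`, `pmap()` can be thought of as a wrapper around a for loop. The following is a rough sketch under that assumption, not purrr's actual implementation; `pmap_sketch()` is a hypothetical helper that assumes every element of `l` has the same length:

```{r}
# Hypothetical helper: loop over positions and call f with the
# i-th element of every input, matching arguments by name if present
pmap_sketch <- function(l, f, ...) {
  n <- length(l[[1]])
  out <- vector("list", n)
  for (i in seq_len(n)) {
    # lapply() keeps the names of l, so named inputs become named arguments
    args_i <- lapply(l, function(col) col[[i]])
    out[[i]] <- do.call(f, c(args_i, list(...)))
  }
  out
}

res <- pmap_sketch(
  list(x = list(1, 2), y = list(10, 20)),
  function(x, y) x + y
)
str(res)
```

Because the arguments are passed by name, reordering the inputs in the list does not change the calls, which is the same reason named arguments are safer with `pmap()`.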
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- dplyr::data_frame(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.
### Invoking different functions
There's one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
list(min = -1, max = 1),
list(sd = 5),
list(lambda = 10)
)
```
To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f, param, n = 5) %>% str()
```
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-invoke.png")
```
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
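If it helps to see the idea in base R terms, the same pattern can be approximated with `Map()` and `do.call()`. This is only a sketch of the concept, not how `invoke_map()` is implemented; the name `out` is illustrative:

```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)

# For each function name and its argument list, add the shared
# argument n = 5 and call the function with do.call()
out <- Map(
  function(fun, args) do.call(fun, c(args, list(n = 5))),
  f, param
)
str(out)
```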
You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- dplyr::frame_data(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
samples = invoke_map(f, params, n = 10)
)
```
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r}
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- paste0(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
`walk()`, `walk2()` and `pwalk()` all invisibly return `.x`, the first argument. This makes them suitable for use in the middle of pipelines.
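As a small sketch of using `walk()` in the middle of a pipeline (assuming purrr is loaded, which also makes `%>%` available), the list is printed element by element and then passed on unchanged to the next step:

```{r}
library(purrr)

x <- list(1, "a", 3)

# walk() prints each element, then hands x on to length()
n <- x %>%
  walk(print) %>%
  length()
n
```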

454
lists.Rmd

@@ -164,244 +164,6 @@ knitr::include_graphics("images/pepper-3.jpg")
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## For loops vs functionals
Imagine you have a data frame and you want to compute the mean of each column. You might write code like this:
```{r}
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- mean(df[[i]])
}
results
```
(Here we're taking advantage of the fact that a data frame is a list of the individual columns, so `length()` and `seq_along()` are useful.)
You realise that you're going to want to compute the means of every column pretty frequently, so you extract it out into a function:
```{r}
col_mean <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- mean(df[[i]])
}
results
}
```
But then you think it'd also be helpful to be able to compute the median or the standard deviation:
```{r}
col_median <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- median(df[[i]])
}
results
}
col_sd <- function(df) {
results <- numeric(length(df))
for (i in seq_along(df)) {
results[i] <- sd(df[[i]])
}
results
}
```
I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. Most of the code is for-loop boilerplate and it's hard to see the one piece (`mean()`, `median()`, `sd()`) that differs.
What would you do if you saw a set of functions like this:
```{r}
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```
Hopefully, you'd notice that there's a lot of duplication, and extract it out into an additional argument:
```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations. We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()`, by adding an argument that contains the function to apply to each column:
```{r}
col_summary <- function(df, fun) {
out <- vector("numeric", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, median)
col_summary(df, min)
```
The idea of using a function as an argument to another function is extremely powerful. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the purrr package which provides a set of functions that eliminate the need for for-loops for many common scenarios.
### Exercises
1. Read the documentation for `apply()`. In the 2d case, what two for loops
does it generalise?
1. Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
## The map functions
This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output so there are six variations based on what sort of result you want:
* `map()` returns a list.
* `map_lgl()` returns a logical vector.
* `map_int()` returns an integer vector.
* `map_dbl()` returns a double vector.
* `map_chr()` returns a character vector.
* `map_df()` returns a data frame.
* `walk()` returns nothing. Walk is a little different to the others because
it's called exclusively for its side effects, so it's described in more detail
later in [walk](#walk).
Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific available, using `map()` only as a fallback when there is no specialised equivalent available.
We can use these functions to perform the same computations as the previous for loops:
```{r}
map_int(x, length)
map_dbl(x, mean)
map_dbl(x, median)
```
Compared to using a for loop, focus is on the operation being performed (i.e. `length()`, `mean()`, or `median()`), not the book-keeping required to loop over every element and store the results.
There are a few differences between `map_*()` and `col_summary()`:
* All purrr functions are implemented in C. This means you can't easily
understand their code, but it makes them a little faster.
* The second argument, `.f`, the function to apply, can be a formula, a
character vector, or an integer vector. You'll learn about those handy
shortcuts in the next section.
* Any arguments after `.f` will be passed on to it each time it's called:
```{r}
map_dbl(x, mean, trim = 0.5)
```
* The map functions also preserve names:
```{r}
z <- list(x = 1:3, y = 4:5)
map_int(z, length)
```
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
```
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
```
Here I've used `.` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
```
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
```{r}
models %>%
map(summary) %>%
map_dbl("r.squared")
```
You can also use a numeric vector to select elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```
### Base R
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr.
* The base `sapply()` is a wrapper around `lapply()` that automatically tries
to simplify the results. This is useful for interactive work but is
problematic in a function because you never know what sort of output
you'll get:
```{r}
x1 <- list(
c(0.27, 0.37, 0.57, 0.91, 0.20),
c(0.90, 0.94, 0.66, 0.63, 0.06),
c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
c(0.50, 0.72, 0.99, 0.38, 0.78),
c(0.93, 0.21, 0.65, 0.13, 0.27),
c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
str(sapply(x1, threshold))
str(sapply(x2, threshold))
```
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. The only problem with `vapply()` is that
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`.
One advantage of `vapply()` over the map functions is that it can also
produce matrices - the map functions only ever produce vectors.
* `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
but under the hood is much more efficient.
### Exercises
1. How can you determine which columns in a data frame are factors?
(Hint: data frames are lists.)
1. What happens when you use the map functions on vectors that aren't lists?
What does `map(1:5, runif)` do? Why?
1. What does `map(-2:2, rnorm, n = 5)` do? Why?
1. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
anonymous function.
## Handling hierarchy {#hierarchy}
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
@@ -469,7 +231,7 @@ Graphically, that sequence of operations looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
````
```
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
@@ -514,220 +276,6 @@ df %>% transpose() %>% str()
### Exercises
## Dealing with failure
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.
(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
Let's illustrate this with a simple example: `log()`:
```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the results. That's easy to get with `transpose()`.
```{r}
y <- y %>% transpose()
str(y)
```
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of `y` that are ok:
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```
Purrr provides two other useful adverbs:
* Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
because you give it a default value to return when there is an error.
```{r}
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
```
* `quietly()` performs a similar role to `safely()`, but instead of capturing
errors, it captures printed output, messages, and warnings:
```{r}
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
```
### Exercises
1. Challenge: read all the csv files in this directory. Which ones failed
and why?
```{r, eval = FALSE}
files <- dir("data", pattern = "\\.csv$")
files %>%
set_names(., basename(.)) %>%
map_df(safely(readr::read_csv), .id = "filename")
```
## Parallel maps
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
```{r}
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 10)
```
`map2()` generates this series of function calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-map2.png")
```
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Like `map()`, `map2()` is just a wrapper around a for loop:
```{r}
map2 <- function(x, y, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], y[[i]], ...)
}
out
}
```
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
```{r}
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
```
That looks like:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```
However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.
```{r}
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
```
That generates longer, but safer, calls:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-named.png")
```
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- dplyr::data_frame(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.
### Invoking different functions
There's one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
list(min = -1, max = 1),
list(sd = 5),
list(lambda = 10)
)
```
To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f, param, n = 5) %>% str()
```
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-invoke.png")
```
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- dplyr::frame_data(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
samples = invoke_map(f, params, n = 10)
)
```
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r}
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- paste0(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
`walk()`, `walk2()` and `pwalk()` all invisibly return `.x`, the first argument. This makes them suitable for use in the middle of pipelines.
## Predicates
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:

256
pipes.Rmd Normal file

@@ -0,0 +1,256 @@
# Pipes
```{r, include = FALSE}
library(dplyr)
diamonds <- ggplot2::diamonds
```
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect at all what the code does; behind the scenes it is run in (almost) exactly the same way. What the pipe does is change how the code is written and hence how it is read. It tends to transform to a more imperative form (do this, do that, do that other thing, ...) so that it's easier to read.
### Piping alternatives
To explore how you can write the same code in many different ways, let's use code to tell a story about a little bunny named Foo Foo:
> Little bunny Foo Foo
> Went hopping through the forest
> Scooping up the field mice
> And bopping them on the head
We'll start by defining an object to represent little bunny Foo Foo:
```{r, eval = FALSE}
foo_foo <- little_bunny()
```
And then we'll use a function for each key verb: `hop()`, `scoop()`, and `bop()`. Using this object and these verbs, there are a number of ways we could retell the story in code:
* Save each intermediate step as a new object
* Rewrite the original object multiple times
* Compose functions
* Use the pipe
Below we work through each approach, showing you the code and talking about the advantages and disadvantages.
#### Intermediate steps
The simplest and most robust approach to sequencing multiple function calls is to save each intermediary as a new object:
```{r, eval = FALSE}
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
The main downside of this form is that it forces you to name each intermediate element. If there are natural names, this form feels natural, and you should use it. But if you're giving them arbitrary unique names, like in this example, I don't think it's that useful. Whenever I write code like this, I invariably write the wrong number somewhere and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory. First, in R, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: it will reuse the shared columns in a pipeline of data frame transformations. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
```{r}
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)
library(pryr)
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
```
`pryr::object_size()` gives the memory occupied by all of its arguments. The results seem counterintuitive at first:
* `diamonds` takes up 3.46 MB,
* `diamonds2` takes up 3.89 MB,
* `diamonds` and `diamonds2` together take up 3.89 MB!
How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchanged, but the collective size will increase:
```{r}
diamonds$carat[1] <- NA
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
```
(Note that we use `pryr::object_size()` here, not the built-in `object.size()`, because `object.size()` only takes a single object, so it can't compute how data is shared across multiple objects.)
#### Overwrite the original
One way to eliminate the intermediate objects is to just overwrite the same object again and again:
```{r, eval = FALSE}
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```
This is less typing (and less thinking), so you're less likely to make mistakes. However, there are two problems:
1. It will make debugging painful: if you make a mistake you'll need to start
again from scratch.
1. The repetition of the object being transformed (we've written `foo_foo` six
times!) obscures what's changing on each line.
#### Function composition
Another approach is to abandon assignment altogether and just string the function calls together:
```{r, eval = FALSE}
bop(
scoop(
hop(foo_foo, through = forest),
up = field_mice
),
on = head
)
```
Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (sometimes called the
[Dagwood sandwich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
#### Use the pipe
Finally, we can use the pipe:
```{r, eval = FALSE}
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
```
This is my favourite form. The downside is that you need to understand what the pipe does, but once you've mastered that idea, you can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops.
Behind the scenes magrittr converts this to:
```{r, eval = FALSE}
. <- hop(foo_foo, through = forest)
. <- scoop(., up = field_mice)
bop(., on = head)
```
It's useful to know this because if an error is thrown in the middle of the pipe, you'll need to be able to interpret the `traceback()`.
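Since the Foo Foo code isn't runnable, here is a small sketch with real functions (assuming magrittr is installed) that checks the nested form and the piped form give the same answer. The names `nested` and `piped` are only for illustration:

```{r}
library(magrittr)

x <- c(1, 4, 9, 16)

# Function composition: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The pipe: read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```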
### Other tools from magrittr
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of the packages you work with in this book automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
```{r}
library(magrittr)
```
* When working with more complex pipes, it's sometimes useful to call a
function for its side-effects. Maybe you want to print out the current
object, or plot it, or save it to disk. Many times, such functions don't
return anything, effectively terminating the pipe.
To work around this problem, you can use the "tee" pipe. `%T>%` works like
`%>%` except that it returns the left-hand side instead of the right-hand
side. It's called "tee" because it's like a literal T-shaped pipe.
```{r}
rnorm(100) %>%
matrix(ncol = 2) %>%
plot() %>%
str()
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot() %>%
str()
```
* If you're working with functions that don't have a dataframe based API
(i.e. you pass them individual vectors, not a data frame and expressions
to be evaluated in the context of that data frame), you might find `%$%`
useful. It "explodes" out the variables in a data frame so that you can
refer to them explicitly. This is useful when working with many functions
in base R:
```{r}
mtcars %$%
cor(disp, mpg)
```
* For assignment magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```
with
```R
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a
special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
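For readers coming from base R, one way to think about `%$%` is as a pipe-friendly counterpart to `with()`. The sketch below (assuming magrittr is loaded) checks that the two spellings agree for the correlation example; `from_pipe` and `from_with` are illustrative names:

```{r}
library(magrittr)

# Expose the columns of mtcars to the right-hand side
from_pipe <- mtcars %$% cor(disp, mpg)

# The base R spelling of the same idea
from_with <- with(mtcars, cor(disp, mpg))

all.equal(from_pipe, from_with)
```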
### When not to use the pipe
I also made a slight simplification when I said that `x %>% f(y)` is exactly the same as `f(x, y)`. That's not quite true, which you'll see particularly for two classes of functions:
1. Functions that use the current environment. For example, `assign()`
will create a new variable with the given name in the current environment:
```{r}
assign("x", 10)
x
"x" %>% assign(100)
x
```
The use of `assign()` with the pipe does not work because it assigns the
value in a temporary environment used by `%>%`. If you do want to use
`assign()` with the pipe, you can be explicit about the environment:
```{r}
env <- environment()
"x" %>% assign(100, envir = env)
x
```
Other functions with this problem are `get()` and `load()`.
1. Functions that affect how their arguments are computed. In R, arguments
are lazy, which means they are only computed when the function uses them,
not prior to calling the function. This means that the function can affect
the global environment in various ways. The pipe forces computation of
each element in series, so you can't rely on this behaviour.
```{r, error = TRUE}
tryCatch(stop("!"), error = function(e) "An error")
stop("!") %>%
  tryCatch(error = function(e) "An error")
```
There is a relatively wide class of functions with this behaviour, including
`try()`, `suppressMessages()`, `suppressWarnings()`, and any function from the
withr package.
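Lazy evaluation itself can be seen without the pipe. This minimal base R sketch uses two hypothetical functions, `ignore()` and `use()` (names invented for illustration), to show that an argument is only computed when the function actually touches it:

```R
# Hypothetical function that never touches its argument
ignore <- function(x) {
  "x was never evaluated"
}

# Hypothetical function that forces its argument before doing anything else
use <- function(x) {
  force(x)
  "x was evaluated"
}

# No error: stop() is never run because x is never used
ignored <- ignore(stop("!"))

# The error is raised as soon as x is forced; try() captures it
forced <- try(use(stop("!")), silent = TRUE)
inherits(forced, "try-error")
```

Functions like `try()` and `tryCatch()` depend on this: they need to set up their error handling before the expression they are given is evaluated.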
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
* Your pipes get longer than five or six lines. In that case, create
intermediate objects with meaningful names. That will make debugging easier,
because you can more easily check the intermediate results. It also helps
when reading the code, because the variable names can help describe the
intent of the code.
* You have multiple inputs or outputs. If there is not one primary object
being transformed, write code the regular way.
* You are starting to think about a directed graph with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them typically does not yield clear code.
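To make the first point concrete, here is a small sketch (short for the sake of space) that writes the same computation once as a single pipe and once with named intermediate objects. The column name `hp_per_wt` and the object names are made up for illustration:

```R
library(magrittr)

# One pipe: compact, but the intermediate results are hard to inspect
result_piped <- mtcars %>%
  subset(wt > 3) %>%
  transform(hp_per_wt = hp / wt) %>%
  aggregate(hp_per_wt ~ cyl, data = ., FUN = mean)

# Named intermediates: each step can be checked on its own
heavy_cars <- subset(mtcars, wt > 3)
heavy_ratio <- transform(heavy_cars, hp_per_wt = hp / wt)
mean_ratio_by_cyl <- aggregate(hp_per_wt ~ cyl, data = heavy_ratio, FUN = mean)
```

Both versions should give the same result. The second is longer, but names like `heavy_cars` document the intent and give you something to print when debugging.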
### Pipes in production
When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output, it's a really good idea to include some assertions that verify the data looks as expected. One great way to do this is the ensurer package, written by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
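The general idea of an assertion inside a pipe can be sketched without ensurer. The helper below, `check_that()`, is hypothetical and is not ensurer's API: it stops with a message if a check fails and otherwise returns its input unchanged so the pipe can continue.

```R
library(magrittr)

# Hypothetical helper (not from ensurer): fail loudly if the check does not
# hold, otherwise pass the input through untouched
check_that <- function(x, check, msg) {
  if (!isTRUE(check(x))) {
    stop(msg, call. = FALSE)
  }
  x
}

efficient <- mtcars %>%
  check_that(function(d) nrow(d) > 0, "input has no rows") %>%
  check_that(function(d) !anyNA(d$mpg), "mpg contains missing values") %>%
  subset(mpg > 25)
```

Because each check returns the data it was given, checks can be placed between any two steps of a pipe without changing the result.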

@@ -1,3 +1,39 @@
# Programming
Computer-human communication matters.
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this part of the book, you'll learn three important skills that help you move in this direction:
1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
and how it gives you a new tool for rewriting your code. You'll also learn
about when not to use the pipe!
1. Repeating yourself in code is dangerous because it can easily lead to
errors and inconsistencies. We'll talk about how to write __functions__
in order to remove duplication in your logic.
1. Another important tool for removing duplication is the __for loop__ which
allows you to repeat the same action again and again and again. You tend to
use for-loops less often in R than in other programming languages because R
is a functional programming language which means that you can extract out
common patterns of for loops and put them in a function. We'll come back to
that idea in XYZ.
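As a small taste of that last idea, here is a sketch in base R. The data frame `df` and the function `col_summary()` are invented for illustration: the for loop computes the mean of each column, and the function captures the same pattern so the summary can be swapped out.

```R
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# A for loop: allocate the output, then fill it in one column at a time
out <- numeric(ncol(df))
for (i in seq_along(df)) {
  out[i] <- mean(df[[i]])
}

# The same pattern extracted into a function that takes the summary
# function as an argument
col_summary <- function(df, f) {
  vapply(df, f, numeric(1))
}

means <- col_summary(df, mean)
medians <- col_summary(df, median)
```

The loop and `col_summary(df, mean)` compute the same values, but the function makes it obvious that only the summary changes between `means` and `medians`.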
Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions, the more likely your first attempt will be clear.)
## Learning more
As you become a better R programmer, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.
To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if
you've programmed in other languages and you want to learn what makes R
special, different, and particularly well suited to data analysis.