---
layout: default
title: Lists
---
# Lists

```{r setup-lists, include=FALSE}
library(purrr)
source("common.R")
```

In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because, unlike vectors, a list can contain other lists.

If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solve a similar problem, but purrr is more consistent and easier to learn.

The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:

1.  How can you solve the problem for a single element of the list? Once
    you've solved that problem, purrr takes care of generalising your
    solution to every element in the list.

1.  If you're solving a complex problem, how can you break it down into
    bite-sized pieces that allow you to advance one small step towards a
    solution? With purrr, you get lots of small pieces that you can
    compose together with the pipe.

This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.

In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you learn in this chapter will be invaluable.

<!--
## Warm ups

* What does this for loop do?
* How is a data frame like a list?
* What does `mean()` mean? What does `mean` mean?
* How do you get help about the $ function? How do you normally write
  `[[`(mtcars, 1) ?
* Argument order
-->

## List basics

You create a list with `list()`:

```{r}
x <- list(1, 2, 3)
str(x)

x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```

Unlike atomic vectors, lists can contain a mix of objects:

```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```

Lists can even contain other lists!

```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```

`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.

### Visualising lists

To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:

```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```

I draw them as follows:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-structure.png")
```

* Lists are rounded rectangles that contain their children.

* I draw each child a little darker than its parent to make it easier to see
  the hierarchy.

* The orientation of the children (i.e. rows or columns) isn't important,
  so I pick a row or column orientation to either save space or illustrate
  an important property in the example.

### Subsetting

There are three ways to subset a list, which I'll illustrate with `a`:

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```

*   `[` extracts a sub-list. The result will always be a list.

    ```{r}
    str(a[1:2])
    str(a[4])
    ```

    As with vectors, you can use an integer vector to select by
    position, or a character vector to select by name.

*   `[[` extracts a single component from a list. It removes a level of
    hierarchy from the list.

    ```{r}
    str(a[[1]])
    str(a[[4]])
    ```

*   `$` is a shorthand for extracting named elements of a list. It works
    similarly to `[[` except that you don't need to use quotes.

    ```{r}
    a$a
    a[["b"]]
    ```

Or visually:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-subsetting.png")
```

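
One way to see the difference between the three operators is to check the type of what comes back. The following is a small base R sketch; the list `pantry` is made up for this illustration and is not part of the chapter's data.

```{r}
# `pantry` is a hypothetical list used only for this illustration.
pantry <- list(salt = "fine", pepper = c(1.5, 2.5), jars = list("jam", "honey"))

# `[` keeps the list wrapper: the result is a list of length 1.
is.list(pantry["pepper"])
length(pantry["pepper"])

# `[[` and `$` remove one level: the result is the element itself.
is.list(pantry[["pepper"]])
identical(pantry[["pepper"]], pantry$pepper)

# Chaining `[[` drills into a nested list.
pantry[["jars"]][[2]]
```
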
### Lists of condiments

It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:

```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
```

If this pepper shaker is your list `x`, then `x[1]` is a pepper shaker containing a single pepper packet:

```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-1.jpg")
```

`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.

`x[[1]]` is:

```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-2.jpg")
```

If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:

```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-3.jpg")
```

### Exercises

1.  Draw the following lists as nested sets.

1.  Generate the lists corresponding to these nested set diagrams.

1.  What happens if you subset a data frame as if you're subsetting a list?
    What are the key differences between a list and a data frame?

## For loops

To illustrate for loops, we'll start by creating a stereotypical list: an eight-element list where each element contains a random vector of random length. (You'll learn about `rerun()` later.)

```{r}
x <- rerun(8, runif(sample(5, 1)))
str(x)
```

Imagine we want to compute the length of each element in this list. One way to do that is with a for loop:

```{r}
results <- vector("integer", length(x))
for (i in seq_along(x)) {
  results[i] <- length(x[[i]])
}
results
```

If you do this sort of thing a lot, you should probably make a function for it. For example, here's a function that computes the median of each column of a data frame:

```{r}
col_medians <- function(df) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- median(df[[i]])
  }
  out
}
col_medians(mtcars)
```

Now imagine that you also want to compute the interquartile range of each column. How would you change the function? What if you also wanted to calculate the min and max?

```{r}
col_min <- function(df) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- min(df[[i]])
  }
  out
}
col_max <- function(df) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- max(df[[i]])
  }
  out
}
```

I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. If you look at these functions, you'll notice that they are very similar: the only difference is the function that gets called.

But these are only three of the many functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`median()`, `min()`, or `max()`) that's actually important.

What would you do if you saw a set of functions like this:

```{r}
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```

Hopefully, you'd notice that there's a lot of duplication, and extract it out into an additional argument:

```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
```

You've reduced the chance of bugs (because you now have 1/3 of the code), and made it easy to generalise to new situations. We can do exactly the same thing with `col_medians()`, `col_min()` and `col_max()`.

I mentioned earlier that R is a functional programming language. Practically, what this means is that you can not only pass vectors and data frames to functions, but you can also pass other functions. So you can generalise these `col_*` functions by adding an additional argument:

```{r}
col_summary <- function(df, fun) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(mtcars, median)
col_summary(mtcars, min)
```

We can take this one step further and use another cool feature of R functions: "`...`". "`...`" just takes any additional arguments and allows you to pass them on to another function:

```{r}
col_summary <- function(df, fun, ...) {
  out <- vector("numeric", ncol(df))
  for (i in 1:ncol(df)) {
    out[i] <- fun(df[[i]], ...)
  }
  out
}
col_summary(mtcars, median, na.rm = TRUE)
```

Instead of hardcoding the summary function, we allow it to vary, by adding an additional argument that is a function. It can take a while to wrap your head around this, but it's a very powerful technique. This is one of the reasons that R is known as a "functional" programming language.

### Exercises

1.  Read the documentation for `apply()`. In the 2d case, what two for loops
    does it generalise?

1.  Adapt `col_summary()` so that it only applies to numeric columns.
    You might want to start with an `is_numeric()` function that returns
    a logical vector that has a `TRUE` corresponding to each numeric column.

## The map functions

This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output, so there are seven variations based on what sort of result you want:

* `map()` returns a list.
* `map_lgl()` returns a logical vector.
* `map_int()` returns an integer vector.
* `map_dbl()` returns a double vector.
* `map_chr()` returns a character vector.
* `map_df()` returns a data frame.
* `walk()` returns nothing. Walk is a little different to the others because
  it's called exclusively for its side effects, so it's described in more detail
  later in [walk](#walk).

Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific function available, using `map()` only as a fallback when there is no specialised equivalent.

We can use these functions to perform the same computations as the previous for loops:

```{r}
map_int(x, length)
map_dbl(x, mean)
map_dbl(x, median)
```

Compared to using a for loop, the focus is on the operation being performed (i.e. `length()`, `mean()`, or `median()`), not the book-keeping required to loop over every element and store the results.

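
To make the hidden book-keeping concrete, here is a minimal sketch of a `map_dbl()`-like function written as a plain for loop. The name `my_map_dbl()` and the list `scores` are hypothetical and exist only for this illustration; this is not how purrr itself is implemented.

```{r}
# A hypothetical, simplified stand-in for map_dbl(), written in base R.
my_map_dbl <- function(.x, .f, ...) {
  out <- vector("double", length(.x))
  for (i in seq_along(.x)) {
    res <- .f(.x[[i]], ...)
    # Insist on a single number so the output type is predictable.
    stopifnot(is.numeric(res), length(res) == 1)
    out[i] <- res
  }
  # Carry the names of the input over to the output.
  names(out) <- names(.x)
  out
}

scores <- list(first = c(1, 2, 3), second = c(10, 20))
my_map_dbl(scores, mean)
```

All of the loop machinery lives inside the helper, so the call site only names the data and the operation.
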
There are a few differences between `map_*()` and `col_summary()`:

*   All purrr functions are implemented in C. This means you can't easily
    understand their code, but it makes them a little faster.

*   The second argument, `.f`, the function to apply, can be a formula, a
    character vector, or an integer vector. You'll learn about those handy
    shortcuts in the next section.

*   Any arguments after `.f` will be passed on to it each time it's called:

    ```{r}
    map_dbl(x, mean, trim = 0.5)
    ```

*   The map functions also preserve names:

    ```{r}
    z <- list(x = 1:3, y = 4:5)
    map_int(z, length)
    ```

### Shortcuts

There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:

```{r}
models <- mtcars %>%
  split(.$cyl) %>%
  map(function(df) lm(mpg ~ wt, data = df))
```

The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.

```{r}
models <- mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .))
```

Here I've used `.` as a pronoun: it refers to the "current" list element (in the same way that `i` referred to the number in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!

When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:

```{r}
models %>%
  map(summary) %>%
  map_dbl(~.$r.squared)
```

But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.

```{r}
models %>%
  map(summary) %>%
  map_dbl("r.squared")
```

You can also use a numeric vector to select elements by position:

```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```

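
The shortcuts are different spellings of the same operation. As a small sketch, the four calls below extract the same field from a made-up list of records (`parcels` is hypothetical, not data from this chapter) and should give identical results.

```{r}
library(purrr)

# `parcels` is a made-up list of records for illustration.
parcels <- list(
  list(id = "p1", weight = 2.5),
  list(id = "p2", weight = 0.4),
  list(id = "p3", weight = 7)
)

by_function <- map_dbl(parcels, function(p) p$weight)  # anonymous function
by_formula  <- map_dbl(parcels, ~ .x$weight)           # one-sided formula
by_name     <- map_dbl(parcels, "weight")              # string: extract by name
by_position <- map_dbl(parcels, 2)                     # number: extract by position

identical(by_function, by_formula)
identical(by_formula, by_name)
identical(by_name, by_position)
```
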
### Base R

If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:

*   `lapply()` is basically identical to `map()`. There's no advantage to using
    `map()` over `lapply()` except that it's consistent with all the other
    functions in purrr.

*   The base `sapply()` is a wrapper around `lapply()` that automatically tries
    to simplify the results. This is useful for interactive work but is
    problematic in a function because you never know what sort of output
    you'll get:

    ```{r}
    x1 <- list(
      c(0.27, 0.37, 0.57, 0.91, 0.20),
      c(0.90, 0.94, 0.66, 0.63, 0.06),
      c(0.21, 0.18, 0.69, 0.38, 0.77)
    )
    x2 <- list(
      c(0.50, 0.72, 0.99, 0.38, 0.78),
      c(0.93, 0.21, 0.65, 0.13, 0.27),
      c(0.39, 0.01, 0.38, 0.87, 0.34)
    )

    threshhold <- function(x, cutoff = 0.8) x[x > cutoff]
    str(sapply(x1, threshhold))
    str(sapply(x2, threshhold))
    ```

*   `vapply()` is a safe alternative to `sapply()` because you supply an additional
    argument that defines the type. The only problem with `vapply()` is that
    it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
    `map_lgl(df, is.numeric)`.

    One advantage of `vapply()` over the map functions is that it can also
    produce matrices - the map functions only ever produce vectors.

*   `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
    but under the hood is much more efficient.

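
The type-stability point can be checked directly in base R. This sketch uses two made-up inputs (`no_columns` and `mixed`) to contrast `sapply()` with `vapply()` on an empty list and on a result of the wrong type.

```{r}
# With no input, `sapply()` silently returns an empty list ...
no_columns <- list()
str(sapply(no_columns, length))

# ... while `vapply()` returns the type that was asked for.
str(vapply(no_columns, length, integer(1)))

# `vapply()` also fails loudly when a result has the wrong type.
mixed <- list(1, "a")
outcome <- tryCatch(
  vapply(mixed, function(v) v, numeric(1)),
  error = function(e) "error"
)
outcome
```
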
### Exercises

1.  How can you determine which columns in a data frame are factors?
    (Hint: data frames are lists.)

1.  What happens when you use the map functions on vectors that aren't lists?
    What does `map(1:5, runif)` do? Why?

1.  What does `map(-2:2, rnorm, n = 5)` do? Why?

1.  Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
    anonymous function.

## Handling hierarchy {#hierarchy}

The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:

* You can extract deeply nested elements in a single call by supplying
  a character vector to the map functions.

* You can remove a level of the hierarchy with the flatten functions.

* You can flip levels of the hierarchy with the transpose function.

### Extracting deeply nested elements

Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:

```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```

There are eight issues, and each issue is a nested list:

```{r}
length(issues)
str(issues[[1]])
```

To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:

```{r}
issues %>% map_int("id")
issues %>% map_lgl("locked")
issues %>% map_chr("state")
```

You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:

```{r}
users <- issues %>% map("user")
users %>% map_chr("login")
users %>% map_int("id")
```

But by supplying a character _vector_ to `map_*`, you can do it in one:

```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```

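
If you don't have `issues.json` to hand, the same idea can be tried on a small hand-made list. In this sketch `tickets` is a made-up stand-in for the parsed JSON, and its field values are invented.

```{r}
library(purrr)

# A hand-made stand-in for the parsed JSON; the values are invented.
tickets <- list(
  list(id = 1L, user = list(login = "ana", id = 101L)),
  list(id = 2L, user = list(login = "bo", id = 102L))
)

# Two steps: pull out `user`, then `login`.
two_steps <- tickets %>% map("user") %>% map_chr("login")

# One step: a character vector indexes level by level.
one_step <- tickets %>% map_chr(c("user", "login"))

identical(two_steps, one_step)
```
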
### Removing a level of hierarchy

As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.

```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)

y <- flatten(x)
str(y)
flatten_dbl(y)
```

Graphically, that sequence of operations looks like:

```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
```

Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.

Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work, silently producing the wrong type of output. This tends to create problems that are frustrating to debug.

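
The silent change of type is easy to reproduce in base R. In the sketch below, `strict_dbl()` is a hypothetical helper written for this example; it shows one way to turn the silent coercion into a visible error by checking types up front.

```{r}
# Both calls succeed, but the type of the result depends on the data.
str(unlist(list(1, 2, 3)))
str(unlist(list(1, "2", 3)))

# A hypothetical helper: refuse anything that is not a double before unlisting.
strict_dbl <- function(lst) {
  stopifnot(all(vapply(lst, is.double, logical(1))))
  unlist(lst)
}
strict_dbl(list(1, 2, 3))
```
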
### Switching levels in the hierarchy

Other times the hierarchy feels "inside out". You can use `transpose()` to flip the first and second levels of a list:

```{r}
x <- list(
  x = list(a = 1, b = 3, c = 5),
  y = list(a = 2, b = 4, c = 6)
)
x %>% str()
x %>% transpose() %>% str()
```

Graphically, this looks like:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-transpose.png")
```

You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.

It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.

Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:

```{r}
df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))
df %>% transpose() %>% str()
```

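
The two identities can be checked on a small example. This is a sketch using a made-up list called `nested`; it assumes the installed purrr still exports `transpose()`.

```{r}
library(purrr)

# `nested` is an illustrative two-level list.
nested <- list(
  first = list(a = 1, b = 3),
  second = list(a = 2, b = 4)
)
flipped <- transpose(nested)

# x[[i]][[j]] matches transpose(x)[[j]][[i]].
identical(nested[["second"]][["a"]], flipped[["a"]][["second"]])

# Transposing twice gets back to the starting point.
identical(transpose(flipped), nested)
```
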
### Exercises

## Dealing with failure

When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?

In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:

1. `result` is the original result. If there was an error, this will be `NULL`.

1. `error` is an error object. If the operation was successful this will be
   `NULL`.

(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)

Let's illustrate this with a simple example: `log()`:

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```

When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.

`safely()` is designed to work with map:

```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
```

This would be easier to work with if we had two lists: one of all the errors and one of all the results. That's easy to get with `transpose()`.

```{r}
y <- y %>% transpose()
str(y)
```

It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of `y` that are ok:

```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```

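
To demystify what an adverb like `safely()` does, here is a rough base R sketch built on `tryCatch()`. `my_safely()` is a hypothetical name for illustration; the real purrr function may differ in its details.

```{r}
# A hypothetical, simplified version of safely(): wrap `f` so that errors
# are captured and returned instead of being thrown.
my_safely <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

my_safe_log <- my_safely(log)
str(my_safe_log(1))
str(my_safe_log("a"))
```

Because the wrapper always returns a list with the same two elements, downstream code can treat successes and failures uniformly.
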
purrr provides two other useful adverbs:

*   Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
    because you give it a default value to return when there is an error.

    ```{r}
    x <- list(1, 10, "a")
    x %>% map_dbl(possibly(log, NA_real_))
    ```

*   `quietly()` performs a similar role to `safely()`, but instead of capturing
    errors, it captures printed output, messages, and warnings:

    ```{r}
    x <- list(1, -1)
    x %>% map(quietly(log)) %>% str()
    ```

### Exercises

1.  Challenge: read all the csv files in this directory. Which ones failed
    and why?

    ```{r, eval = FALSE}
    files <- dir("data", pattern = "\\.csv$")
    files %>%
      set_names(., basename(.)) %>%
      map_df(safely(readr::read_csv), .id = "filename")
    ```

## Parallel maps

So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:

```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```

What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:

```{r}
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 10)
```

`map2()` generates this series of function calls:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-map2.png")
```

The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.

Like `map()`, `map2()` is just a wrapper around a for loop:

```{r}
map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}
```

You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc., but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:

```{r}
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
```

That looks like:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```

However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.

```{r}
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
```

That generates longer, but safer, calls:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-pmap-named.png")
```

Since the arguments are all the same length, it makes sense to store them in a data frame:

```{r}
params <- dplyr::data_frame(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
```

As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.

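
Just as `map2()` can be pictured as a for loop, `pmap()` can be sketched the same way. The helper below, `my_pmap()`, is a hypothetical base R illustration rather than purrr's actual implementation: for each position it collects the i-th element of every argument list and hands them to the function with `do.call()`, so names in the outer list become argument names.

```{r}
# A hypothetical, simplified version of pmap() in base R.
my_pmap <- function(l, f, ...) {
  len <- length(l[[1]])
  out <- vector("list", len)
  for (i in seq_len(len)) {
    # Take the i-th element of each argument list; names are preserved.
    args_i <- lapply(l, function(v) v[[i]])
    out[[i]] <- do.call(f, c(args_i, list(...)))
  }
  out
}

# Unnamed arguments are matched by position ...
joined <- my_pmap(list(list("a", "b"), list("x", "y")), paste0)
str(joined)

# ... and named arguments are matched by name, whatever their order.
draws <- my_pmap(
  list(sd = list(1, 5, 10), mean = list(5, 10, -3), n = list(1, 3, 5)),
  rnorm
)
lengths(draws)
```
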
### Invoking different functions
2015-11-09 20:33:07 +08:00
2015-11-21 03:31:32 +08:00
There's one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
2015-11-13 03:43:06 +08:00
```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
list(min = -1, max = 1),
list(sd = 5),
list(lambda = 10)
)
```
To handle this case, you can use `invoke_map()`:
```{r}
2015-12-09 00:11:53 +08:00
invoke_map(f, param, n = 5) %>% str()
2015-11-13 03:43:06 +08:00
```
2015-12-12 03:28:10 +08:00
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-invoke.png")
```
2015-11-13 03:43:06 +08:00
2015-12-09 00:11:53 +08:00
The first argument is a list of functions or a character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.

You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- dplyr::frame_data(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
  samples = invoke_map(f, params, n = 10)
)
```
## Walk {#walk}

Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk; the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
  walk(print)
```
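To make the split between side effect and return value concrete, the following sketch captures what gets printed and compares the returned object with the input. The object names are illustrative only.

```{r}
library(purrr)

walk_input <- list(1, "a", 3)

# capture.output() records what print() writes to the console, while the
# assignment inside it keeps hold of walk()'s (invisible) return value.
printed <- capture.output(walk_result <- walk(walk_input, print))

printed                             # one printed line per element
identical(walk_result, walk_input)  # the input comes back unchanged
```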
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each plot to the corresponding location on disk:
```{r}
library(ggplot2)
plots <- mtcars %>%
  split(.$cyl) %>%
  map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- paste0(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
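The same pattern works without ggplot2. As a rough sketch, `walk2()` can write each piece of a split data frame to its own file. The directory and file names below are made up for illustration, and everything is written under `tempfile()` so nothing is left in the working directory.

```{r}
library(purrr)

# Hypothetical output directory inside the session's temp folder
out_dir <- tempfile("walk-demo-")
dir.create(out_dir)

pieces <- split(mtcars, mtcars$cyl)
csv_paths <- file.path(out_dir, paste0("cyl-", names(pieces), ".csv"))

# .x is one data frame, .y is the matching path; only the side effect matters
walk2(pieces, csv_paths, ~ write.csv(.x, .y, row.names = FALSE))

file.exists(csv_paths)
```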
`walk()`, `walk2()` and `pwalk()` all invisibly return `.x`, the first argument. This makes them suitable for use in the middle of pipelines.

## Predicates

Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:

1. Find all numeric columns.
1. Summarise each column.

In code, that would look like:
```{r}
col_sum <- function(df, f) {
  is_num <- df %>% map_lgl(is_numeric)
  df[is_num] %>% map_dbl(f)
}
```
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:

* `keep()` and `discard()` keep/discard list elements where the predicate is
  true.

* `head_while()` and `tail_while()` keep elements from the start/end of a list
  for as long as the predicate is true, stopping at the first element where
  it is false.

* `some()` and `every()` determine if the predicate is true for any or all of
  the elements.

* `detect()` and `detect_index()` return the first element where the predicate
  is true, or its position.

We could use `keep()` to simplify the summary function to:
```{r}
col_sum <- function(df, f) {
  df %>%
    keep(is.numeric) %>%
    map_dbl(f)
}
```
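Alongside `keep()`, the other predicate helpers listed above can be tried on a small list. This is only a sketch: `mixed` is a made-up example, and base R's `is.numeric()` and `is.character()` stand in as the predicates.

```{r}
library(purrr)

mixed <- list(1, 2, "a", 4)

kept      <- keep(mixed, is.numeric)        # the three numbers
discarded <- discard(mixed, is.numeric)     # just "a"

head_nums <- head_while(mixed, is.numeric)  # 1 and 2: stops at "a"
tail_nums <- tail_while(mixed, is.numeric)  # 4: stops at "a" from the end

any_chr <- some(mixed, is.character)        # is at least one element a string?
all_num <- every(mixed, is.numeric)         # are all elements numeric?

first_chr     <- detect(mixed, is.character)        # the first string
first_chr_pos <- detect_index(mixed, is.character)  # and where it sits

str(list(kept = kept, discarded = discarded, head = head_nums, tail = tail_nums))
c(any_chr = any_chr, all_num = all_num, first_chr_pos = first_chr_pos)
```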
I like this formulation because you can easily read the sequence of steps.

### Built-in predicates

Purrr comes with a number of predicate functions built-in:

|                  | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()`   |  x  |     |     |     |      |      |
| `is_integer()`   |     |  x  |     |     |      |      |
| `is_double()`    |     |     |  x  |     |      |      |
| `is_numeric()`   |     |  x  |  x  |     |      |      |
| `is_character()` |     |     |     |  x  |      |      |
| `is_atomic()`    |  x  |  x  |  x  |  x  |      |      |
| `is_list()`      |     |     |     |     |  x   |      |
| `is_vector()`    |  x  |  x  |  x  |  x  |  x   |      |
| `is_null()`      |     |     |     |     |      |  x   |

Unlike their base R counterparts, these predicates inspect only the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)
is_atomic(NULL)
is.vector(factor("a"))
is_vector(factor("a"))
```
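The distinction between type and attributes can be seen with base R alone. In this small sketch, `fct` is an illustrative object: a factor is stored as an integer vector with extra attributes attached, and the base predicates take those attributes into account.

```{r}
fct <- factor("a")

fct_type  <- typeof(fct)             # the underlying storage type
fct_attrs <- names(attributes(fct))  # what is attached on top of it

fct_type
fct_attrs

# Base R predicates look beyond the storage type:
is.integer(fct)    # FALSE, even though the storage type is integer
is.numeric(fct)    # FALSE as well

# A check based on the type alone gives a different answer
fct_type == "integer"
```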
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```
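As a rough mental model, the scalar and bare variants add one extra condition to the type check. The helpers below are hypothetical approximations written in base R, not purrr's implementation, and the `_sketch` names are invented for this example.

```{r}
# Hypothetical helpers that approximate the idea
is_integer_sketch        <- function(x) typeof(x) == "integer"
is_scalar_integer_sketch <- function(x) is_integer_sketch(x) && length(x) == 1
is_bare_integer_sketch   <- function(x) is_integer_sketch(x) && !is.object(x)

fct_abc <- factor(c("a", "b", "c"))

is_integer_sketch(fct_abc)         # type is integer
is_scalar_integer_sketch(fct_abc)  # but the length is 3, not 1
is_bare_integer_sketch(fct_abc)    # and it carries a class attribute

# A plain integer vector passes the bare check; a single value passes the scalar one
is_bare_integer_sketch(1:3)
is_scalar_integer_sketch(1L)
```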
### Exercises
1.  A possible base R equivalent of `col_sum()` is:

    ```{r}
    col_sum3 <- function(df, f) {
      is_num <- sapply(df, is.numeric)
      df_num <- df[, is_num]
      sapply(df_num, f)
    }
    ```

    But it has a number of bugs, as illustrated with the following inputs:

    ```{r, eval = FALSE}
    df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
    # OK
    col_sum3(df, mean)
    # Has problems: doesn't always return a numeric vector
    col_sum3(df[1:2], mean)
    col_sum3(df[1], mean)
    col_sum3(df[0], mean)
    ```

    What causes the bugs?

1.  Carefully read the documentation of `is.vector()`. What does it actually
    test for?