In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. Lists allow you to do this because, unlike vectors, a list can contain other lists.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specific details. This is the same idea as the apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc), but purrr is more consistent and easier to learn.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
## Visualising lists
It's helpful to have a visual representation of lists, so I'll use a nested set representation where each level of the hierarchy is nested in the previous. I'll always use rounded rectangles to represent lists, and regular rectangles to represent vectors. Note that single numbers (e.g. 1, 2), also called scalars, are not top-level objects in R and must always live inside a vector.
```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
To make it easier to see the levels in the list, I colour each level a little darker than the previous. The orientation of the elements (i.e. rows or columns) isn't important to the structure of the list (just the order of the elements), so I pick a row or column orientation to either save space or illustrate an important property of the operation.
(Unfortunately there's no way to draw these diagrams automatically - I did them by hand, carefully picking the arrangement that I think best illustrates the point I'm trying to make)
### Subsetting
There are three ways to subset a list, which I'll illustrate with this list:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
It's easy to get confused between `[` and `[[`, but understanding the difference is critical when working with lists. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
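In code, the difference can be sketched as follows. This is a minimal illustration that redefines the list `a` from above so the chunk runs on its own; the particular elements pulled out are my choice, not part of the original text.

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# `[` extracts a sub-list: the result is always a list
sub <- a[1:2]
class(sub)

# `[[` extracts a single component, removing one level of hierarchy
a[[1]]
a[["d"]][[2]]

# `$` is shorthand for `[[` with a name
a$b
```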
Let's start by creating a stereotypical list: an eight element list where each element contains a random vector of random length. (You'll learn about `rerun()` later.)
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(l)`. There's one important difference in behaviour. If you have a zero-length vector, `seq_along()` does the right thing:
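A small sketch of that difference, using an empty numeric vector `y` (a name chosen here purely for illustration):

```{r}
y <- numeric(0)

seq_along(y)   # an empty integer vector: a loop over this runs zero times
1:length(y)    # counts down from 1 to 0: a loop over this runs twice

# Count how many times the loop body runs with seq_along()
iterations <- 0
for (i in seq_along(y)) {
  iterations <- iterations + 1
}
iterations
```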
Figuring out the length of the elements of a list is a common operation, so it makes sense to turn it into a function so we can reuse it again and again:
(And in fact base R has this already: it's called `lengths()`.)
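As a rough sketch of what such a function might look like: the body below is my assumption, written as a plain for loop, and the name `compute_length()` simply follows the one used later in the text.

```{r}
# Hypothetical reconstruction: the body is assumed, not taken from the text
compute_length <- function(x) {
  out <- vector("integer", length(x))   # pre-allocate the output
  for (i in seq_along(x)) {
    out[[i]] <- length(x[[i]])
  }
  out
}

l <- list(1:3, letters, numeric(0))
compute_length(l)
lengths(l)   # the base R equivalent
```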
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`? You could create variations of `compute_length()` like this:
But this is only two functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`length()`, `mean()`, or `median()`) that's actually important.
You've reduced the chance of bugs (because you now have 1/3 less code), and made it easy to generalise to new situations. We can do exactly the same thing with `compute_length()`, `compute_median()` and `compute_mean()`:
Instead of hardcoding the summary function, we allow it to vary by adding an additional argument that is a function. It can take a while to wrap your head around this, but it's a very powerful technique. This is one of the reasons that R is known as a "functional" programming language.
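Here is one way that idea could look. `compute_summary()` and its argument `f` are hypothetical names, and the sketch assumes every summary returns a single number.

```{r}
# Hypothetical helper: `f` is the summary function applied to each element
compute_summary <- function(x, f) {
  out <- vector("double", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]])
  }
  out
}

l <- list(c(1, 2, 3), c(10, 20), 5)
compute_summary(l, mean)
compute_summary(l, median)
compute_summary(l, length)
```

The loop is written once, and the only thing that changes between calls is the function you pass in.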
This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output so there are six variations based on what sort of result you want:
Each of these functions takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The following code uses purrr to do the same computations as the previous for loops:
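A minimal sketch, assuming a small hand-made list `l` rather than the random list used earlier:

```{r}
library(purrr)

l <- list(c(1, 2, 3), c(10, 20), 5)

map_int(l, length)   # an integer vector of lengths
map_dbl(l, mean)     # a double vector of means
map_dbl(l, median)   # a double vector of medians
```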
Compared to using a for loop, the focus is on the operation being performed (i.e. `length()`, `mean()`, or `median()`), not the book-keeping required to loop over every element and store the results.
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits the `mtcars` dataset into three pieces and fits the same linear model to each piece:
Here I've used `.` as a pronoun: it refers to the "current" list element (in the same way that `i` referred to the number in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
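The sketch below shows the pronoun in use. Splitting `mtcars` by `cyl` and the model `mpg ~ wt` are assumptions made for this illustration; base `split()` is used so the chunk needs only purrr.

```{r}
library(purrr)

pieces <- split(mtcars, mtcars$cyl)   # one data frame per cylinder count

# `~` starts the one-sided formula shorthand; `.` is the current piece
models <- map(pieces, ~ lm(mpg ~ wt, data = .))

# The same thing written as a regular anonymous function
models2 <- map(pieces, function(df) lm(mpg ~ wt, data = df))
```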
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
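For instance, under the same assumptions as before (models of `mpg ~ wt` fitted to `mtcars` split by `cyl`), the extraction might look like this:

```{r}
library(purrr)

models <- map(split(mtcars, mtcars$cyl), ~ lm(mpg ~ wt, data = .))

# Run summary() on each model, then pull out the r.squared component
r2 <- map_dbl(models, ~ summary(.)$r.squared)
r2
```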
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. The only problem with `vapply()` is that
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`.
One advantage of `vapply()` over the map functions is that it can also
produce matrices - the map functions always produce vectors.
* `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
but under the hood is much more efficient.
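To make the `vapply()` comparison concrete, here is a short sketch using `mtcars` as a stand-in for `df`:

```{r}
library(purrr)

# The same check written both ways
by_vapply <- vapply(mtcars, is.numeric, logical(1))
by_map <- map_lgl(mtcars, is.numeric)
all(by_vapply == by_map)

# vapply() can return a matrix when each result has length greater than 1
ranges <- vapply(mtcars, range, numeric(2))
dim(ranges)   # one row each for min and max, one column per variable
```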
### Exercises
1. How can you determine which columns in a data frame are factors?
(Hint: data frames are lists.)
1. What happens when you use the map functions on vectors that aren't lists?
What does `map(1:5, runif)` do? Why?
1. What does `map(-2:2, rnorm, n = 5)` do? Why?
1. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
anonymous function.
## Handling hierarchy {#hierarchy}
As you start to use these functions more frequently, you'll find that you start to create quite complex trees. The techniques in this section will help you work with those structures.
Sometimes you get data structures that are very deeply nested. A common source of hierarchical data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little. Here I'm going to show you how to do it by hand, so I set `simplifyVector = FALSE`:
You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:
```{r}
users <- issues %>% map("user")
users %>% map_chr("login")
users %>% map_int("id")
```
Or by using a character vector, you can do it in one:
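A sketch of the one-step version. The `issues` list below is a tiny made-up stand-in for the parsed JSON, with invented logins and ids, so that the chunk runs without the file.

```{r}
library(purrr)

# Made-up stand-in for the parsed `issues` list
issues <- list(
  list(number = 1L, user = list(login = "alice", id = 101L)),
  list(number = 2L, user = list(login = "bob", id = 102L))
)

# Each element of the character vector indexes one level deeper
map_chr(issues, c("user", "login"))
map_int(issues, c("user", "id"))
```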
As well as indexing deeply into a hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work silently, giving the wrong answer.
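A small base R illustration of that risk, using two made-up lists that differ in a single value:

```{r}
# With the structure we expect, unlist() gives a double vector
expected <- unlist(list(list(1, 2), list(3, 4)))
typeof(expected)

# If one value unexpectedly becomes a string, unlist() still succeeds,
# but every value is silently coerced to character
changed <- unlist(list(list(1, 2), list(3, "4")))
typeof(changed)
```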
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
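The index swap can be checked on a small example. The list `x` here is invented for illustration.

```{r}
library(purrr)

x <- list(
  list(a = 1, b = "x"),
  list(a = 2, b = "y")
)
tx <- transpose(x)

# Outer and inner indices swap places
identical(x[[2]][["a"]], tx[["a"]][[2]])
str(tx)
```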
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
```{r}
df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))
```
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function and returns a modified function. In this case, the modified function returns a list with elements `result` (the original result) and `error` (the error object, if an error occurred). For any given run, one will always be `NULL`.
(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
When the function succeeds the `result` element contains the result and the error element is empty. When the function fails, the result element is empty and the error element contains the error.
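A minimal sketch with `log()`, where `safe_log` is just an illustrative name:

```{r}
library(purrr)

safe_log <- safely(log)

ok <- safe_log(10)
ok$result    # the value of log(10)
ok$error     # NULL, because nothing went wrong

bad <- safe_log("a")
bad$result   # NULL, because the call failed
bad$error    # a condition object describing the failure
```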
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of `y` that are ok:
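Both approaches can be sketched like this, assuming a small input list `x` with one bad value and using `transpose()` to gather the results and the errors into two separate lists:

```{r}
library(purrr)

x <- list(1, 10, "a")
y <- transpose(map(x, safely(log)))   # list(result = ..., error = ...)

is_ok <- map_lgl(y$error, is.null)

failed_inputs <- x[!is_ok]                    # inputs that failed
good <- map_dbl(y$result[is_ok], identity)    # results that succeeded
good
```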
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
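A sketch of both cases side by side. The values in `mu` and `sigma` and the sample size of 5 are arbitrary choices for illustration.

```{r}
library(purrr)

mu <- c(5, 10, -3)
sigma <- c(1, 5, 10)

# One varying input: `n` is supplied by name, so each element of `mu`
# is matched to rnorm()'s next argument, `mean`
one <- map(mu, rnorm, n = 5)

# Two varying inputs, iterated in parallel: `mean` and `sd`
two <- map2(mu, sigma, rnorm, n = 5)

str(two)
```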
You could imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
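A minimal sketch, again with arbitrary values. Naming the list elements is my choice here: it lets `pmap()` match them to `rnorm()`'s arguments by name rather than by position.

```{r}
library(purrr)

args <- list(
  mean = c(5, 10, -3),
  sd = c(1, 5, 10),
  n = c(1, 3, 5)
)

samples <- pmap(args, rnorm)
lengths(samples)   # one sample size per set of arguments
```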
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
The first argument is a list of functions or character vector of function names, the second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
You can use `dplyr::frame_data()` to create these matching pairs a little easier:
`is_numeric()` is a __predicate__: a function that returns `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
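As an illustrative selection (not necessarily the full set the chapter has in mind), here are four purrr functions that take a predicate. The sketch uses the base R predicates `is.numeric()` and `is.character()` so that it doesn't depend on which predicate helpers your purrr version exports.

```{r}
library(purrr)

x <- list(1:3, "a", c(2.5, 4), TRUE)

keep(x, is.numeric)      # elements where the predicate is TRUE
discard(x, is.numeric)   # elements where the predicate is FALSE
some(x, is.character)    # is the predicate TRUE for at least one element?
every(x, is.numeric)     # is the predicate TRUE for all elements?
```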
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
A natural application of `map2()` is handling test-training pairs when doing model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to because it's going to make you overconfident. Instead, it's better to divide the data up and use one piece to fit the model and the other piece to evaluate it. A popular family of techniques for this is cross-validation. In the version used here, you randomly hold out x% of the data and fit the model to the rest. You need to repeat this a few times because of random variation.
If we wanted, we could extract the coefficients using broom, and make a single data frame with `map_df()` and then visualise the distributions with ggplot2:
But we're most interested in the quality of the models, so we make predictions for each test data set and compute the mean squared distance between predicted and actual:
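The whole workflow might be sketched as follows. Everything specific here is an assumption made for illustration: the `holdout()` helper, the 20% hold-out proportion, the five repetitions, and the `mpg ~ wt` model on `mtcars`.

```{r}
library(purrr)

set.seed(1014)

# Hypothetical helper: randomly hold out a proportion of the rows
holdout <- function(df, prop = 0.2) {
  test_rows <- sample(nrow(df), size = ceiling(prop * nrow(df)))
  list(
    train = df[-test_rows, , drop = FALSE],
    test = df[test_rows, , drop = FALSE]
  )
}

# Repeat the random split a few times
splits <- map(1:5, ~ holdout(mtcars))
train <- map(splits, "train")
test <- map(splits, "test")

# Fit on each training set, then evaluate on the matching test set
models <- map(train, ~ lm(mpg ~ wt, data = .x))
mse <- map2_dbl(models, test, function(mod, df) {
  mean((predict(mod, newdata = df) - df$mpg)^2)
})
mse
```

`map2_dbl()` keeps each model paired with its own test set, which is the book-keeping you would otherwise do by hand with an index.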