1. You'll use list-columns to store arbitrary data structures in a data frame.
This, for example, will allow you to have a column of your data frame that
consists of linear models.
1. You'll turn models into tidy data with the broom package, by David Robinson.
This is a powerful technique for working with large numbers of models
because once you have tidy data, you can apply many techniques that you've
learned about in earlier chapters of this book.
These ideas are particularly powerful in conjunction with the ideas of functional programming, so make sure you've read [iteration] and [handling hierarchy] before starting this chapter.
We'll start by diving into a motivating example using data about life expectancy. It's a small dataset, but it illustrates how important modelling can be for improving your visualisations. The following sections go into more detail about the individual techniques: list-columns, and nesting and unnesting.
This chapter focusses on models generated from subsets of your data. This is a powerful technique to help you understand your data, and is often a key step on the way to creating a single complex model that combines the information from all subsets. In the next chapter, you'll learn about another family of techniques for generating multiple models: resampling. Resampling is a powerful tool to help you understand the inferential properties of a model.
To motivate the power of many simple models, we're going to look into the "Gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, I strongly recommend that you stop reading this chapter and go watch one of his videos. He is a fantastic data presenter! A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.
The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. The data is easy to access in R, thanks to the work of Jenny Bryan, who created the gapminder package:
In this case study, we're going to focus on just three variables to answer the question "How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?". A good place to start is with a plot:
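One way to draw such a plot is sketched below. It assumes the gapminder and ggplot2 packages are installed; the transparency value is an arbitrary choice, not something the data dictates.

```{r}
library(gapminder)
library(ggplot2)

# One line per country: life expectancy against year.
# alpha = 1/3 is an arbitrary choice to soften the overplotting.
p <- ggplot(gapminder, aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1/3)
p
```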
This is a small dataset: it only has around 1,700 observations and three variables. But it's still hard to see what's going on in this plot! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?
One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see the smaller pattern. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.
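For a single country, that decomposition might look like the sketch below. The choice of New Zealand is arbitrary, and the object names (`nz`, `nz_mod`, `nz_fit`) are made up for illustration; `add_predictions()` and `add_residuals()` come from the modelr package.

```{r}
library(gapminder)
library(dplyr)
library(modelr)

# Arbitrary choice of country for illustration
nz <- filter(gapminder, country == "New Zealand")

# Linear trend in life expectancy over time
nz_mod <- lm(lifeExp ~ year, data = nz)

# Add the fitted trend (pred) and what's left over (resid) as columns
nz_fit <- nz %>%
  add_predictions(nz_mod) %>%
  add_residuals(nz_mod)
nz_fit
```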
You could imagine copying and pasting that code multiple times. But you've already learned a better way! Extract the common code into a function and repeat it using a map function from purrr.
This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and "nest" it:
This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames. This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.
The `data` column is a little tricky to look at because it's a moderately complicated list (we're still working on better tools to explore these objects). But if you look at one of the elements of the `data` column you'll see that it contains all the data for that country (Afghanistan in this case).
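The nesting step and the inspection of a single element can be sketched as follows, assuming the gapminder, dplyr and tidyr packages are installed. The name `by_country` matches the one used in the rest of this chapter.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)

# Group by country (continent is constant within a country, so it just
# comes along as an extra key), then bundle the remaining columns
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()
by_country

# Each element of the list-column is an ordinary data frame
by_country$data[[1]]
```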
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about this nested dataset is that an observation is now the complete time course for a country, rather than a single point in time.
Now that we have our nested data frame, we're in a good position to fit some models because we can think about transforming each data frame into a model. Transforming each element of a list is the job of `purrr::map()`:
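A sketch of that step, under the assumption that each country gets the same linear trend model. The helper `country_model()` is a name chosen here for illustration.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)

by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()

# Illustrative helper: fit a linear trend to one country's data
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# One model per element of the list-column
models <- map(by_country$data, country_model)
length(models)
```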
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a variable in the `by_country` data frame. This is why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. dplyr takes care of that for you:
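As a sketch of what this buys you (again with an illustrative `country_model()` helper), the models travel with their data through `filter()` and `arrange()`:

```{r}
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)

# Illustrative helper: fit a linear trend to one country's data
country_model <- function(df) lm(lifeExp ~ year, data = df)

# Store the models next to the data they were fitted to
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(model = map(data, country_model))

# Rows are filtered or re-ordered as a unit, so each model stays
# attached to its own country
europe <- by_country %>% filter(continent == "Europe")
europe
by_country %>% arrange(continent, country)
```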
If your list of data frames and list of models were separate objects, you'd have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `add_residuals()` with each model-data pair:
But how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into a nested data frame; now we need to do the opposite with `unnest()`:
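Computing the residuals for each model-data pair and then unnesting them could look like the sketch below. Selecting the key columns before `unnest()` is a choice made here so that the result doesn't depend on how a particular tidyr version treats the other list-columns; `country_model()` and the plot settings are illustrative.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
library(ggplot2)

# Illustrative helper: fit a linear trend to one country's data
country_model <- function(df) lm(lifeExp ~ year, data = df)

by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(
    model  = map(data, country_model),
    # add_residuals(data, model) is called once per country
    resids = map2(data, model, add_residuals)
  )

# Back to one row per country-year, now with a resid column
resids <- by_country %>%
  select(country, continent, resids) %>%
  unnest(resids)
resids

# Residuals over time, one panel per continent
ggplot(resids, aes(year, resid)) +
  geom_line(aes(group = country), alpha = 1/3) +
  facet_wrap(~continent)
```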
It looks like overall we've missed some mild quadratic pattern. There's also something interesting going on in Africa: we see some very large residuals, which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package.
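A sketch of that approach: `broom::glance()` summarises one model as a one-row data frame, so it can be mapped over the `model` column and the results unnested. The `country_model()` helper and the name `with_glance` are illustrative.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Illustrative helper: fit a linear trend to one country's data
country_model <- function(df) lm(lifeExp ~ year, data = df)

by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(model = map(data, country_model))

# For a single model: one row of model-level summaries
glance(by_country$model[[1]])

# For every model: a list-column of one-row data frames, then unnested
with_glance <- by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance)
with_glance
```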
This isn't quite the output we want, because it still includes all the list-columns. This is the default behaviour when `unnest()` works on single-row data frames. To suppress these columns we use `.drop = TRUE`:
The worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:
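The check can be sketched as follows. To avoid relying on `.drop`, whose behaviour I believe differs between tidyr versions, this sketch removes the other list-columns with `select()` before unnesting. The names `country_model()` and `fit_quality`, and the jitter width, are illustrative choices.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)

# Illustrative helper: fit a linear trend to one country's data
country_model <- function(df) lm(lifeExp ~ year, data = df)

fit_quality <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(glance = map(map(data, country_model), broom::glance)) %>%
  # Keep only the keys and the summaries, then flatten
  select(country, continent, glance) %>%
  unnest(glance) %>%
  ungroup()

# Worst-fitting models first
worst <- fit_quality %>% arrange(r.squared)
worst

# R-squared by continent; jitter spreads out overlapping points
ggplot(fit_quality, aes(continent, r.squared)) +
  geom_jitter(width = 0.5)
```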
Now that you've seen a basic workflow for managing many models, let's dive back into some of the details. In this section, we'll explore the notion of the list-column in a little more detail, and then we'll give a few more details about `nest()`/`unnest()`.
It's only in the last year that I've really appreciated the idea of the list-column. List-columns are implicit in the definition of the data frame: a data frame is a named list of equal-length vectors. A list is a vector, so it's always been legitimate to use a list as a column of a data frame.
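This can be checked directly. In the sketch below (column names and values are made up), `tibble()` accepts a list as a column as-is, while base `data.frame()` needs the list wrapped in `I()` so that it isn't spread across several columns.

```{r}
library(tibble)

# A tibble takes a list as a column directly
tb <- tibble(
  x = list(1:3, 3:5),
  y = c("1, 2", "3, 4, 5")
)
tb

# With base data.frame(), I() protects the list so it stays one column
df <- data.frame(
  x = I(list(1:3, 3:5)),
  y = c("1, 2", "3, 4, 5")
)
str(df)
```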
List-columns are often most useful as an intermediate data structure. They're hard to work with directly, because most R functions work with atomic vectors or data frames, but the advantage of keeping related items together in a data frame is worth a little hassle.
Generally there are three parts of an effective list-column pipeline:
1. You'll create the list-column using one of `nest()`, `summarise()` + `list()`,
or `mutate()` + a map function, as described in [Creating list-columns].
1. You'll create other intermediate list-columns by transforming existing
list-columns with `map()`, `map2()` or `pmap()`. For example,
in the case study above, we created a list-column of models by transforming
a list-column of data frames.
1. You'll collapse the list-column back down to a data frame or atomic vector.
Generally, when creating list-columns, you should make sure they're homogeneous: each element should contain the same type of thing. There are no checks to make sure this is true, but if you use purrr and remember what you've learned about type-stable functions you should find it happens naturally.
`nest()` creates a specific type of list-column: a list-column of data frames. There are two ways to use it. So far you've seen how to use it with a grouped data frame. When applied to a grouped data frame, `nest()` keeps the grouping columns as is, and bundles everything else into the list-column:
```{r}
gapminder %>%
  group_by(country, continent) %>%
  nest()
```
You can also use it on an ungrouped data frame, specifying which columns you want to nest:
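A sketch of the ungrouped form is below. It assumes tidyr 1.0.0 or later, where the columns to nest are passed as a named argument and the name (`data` here) becomes the name of the new list-column. As far as I know, older releases took the column selection unnamed, so adjust to your installed version.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)

# Assumes tidyr >= 1.0.0: nest everything from year to gdpPercap
# into a list-column called data; country and continent stay as keys
nested <- gapminder %>%
  nest(data = year:gdpPercap)
nested
```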
Some useful functions take an atomic vector and return a list. For example, earlier you learned about `stringr::str_split()` which takes a character vector and returns a list of character vectors.
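Used inside `mutate()`, such a function gives you a list-column. A minimal sketch with made-up strings:

```{r}
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

df <- tibble(x1 = c("a,b,c", "d,e,f,g"))

# str_split() returns a list, so mutate() stores it as a list-column
split_df <- df %>%
  mutate(x2 = str_split(x1, ","))
split_df

# unnest() then gives one row per piece
pieces <- split_df %>% unnest(x2)
pieces
```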
Another common pattern is to use the map family of functions from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:
```{r}
sim <- tibble::tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>%
  mutate(sims = invoke_map(f, params, n = 10))
```
Note that technically `sims` isn't homogeneous because it contains some double vectors and some integer vectors! However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
It's also common to create list-columns by transforming existing list-columns. You'll learn about that in the next section.
One restriction of `summarise()` is that it only works with aggregate functions that return a single value. That means that you can't use it with functions like `quantile()` that return a vector of arbitrary length:
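A common workaround is to wrap the result in `list()`, so that each group contributes a single list element. The sketch below uses the built-in `mtcars` data and an arbitrary set of probabilities, both chosen only for illustration.

```{r}
library(dplyr)
library(tidyr)

probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)

# Each group yields one list element holding all five quantiles
q_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(q = list(quantile(mpg, probs)))
q_by_cyl

# unnest() then gives one row per quantile per group
q_long <- q_by_cyl %>% unnest(q)
q_long
```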
To apply the techniques of data manipulation and visualisation you've learned in this book, you'll need to collapse the list-column back to a regular column, or set of columns. The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:
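1. If you want a single value, use `mutate()` with `map_lgl()`,
`map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.
1. If you want many values, use `unnest()` to convert list-columns back
to regular columns, repeating the rows as many times as necessary.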
These are described in more detail below.
### List to vector
If you can reduce your list-column to an atomic vector, then it will be a regular column. For example, you can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have.
```{r}
df <- data_frame(
  x = list(
    letters,
    1:4,
    runif(10)
  )
)
df %>% mutate(
  type = map_chr(x, typeof),
  length = map_int(x, length)
)
```
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you've somehow ended up with a heterogeneous list, and want to filter out the parts that you don't need.
Don't forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`.
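A small sketch of the shortcut, with made-up data in which each element of `x` is a named list:

```{r}
library(tibble)
library(dplyr)
library(purrr)

# Made-up data: each element of x is a small named list
df <- tibble(
  x = list(
    list(apple = "red", n = 1),
    list(apple = "green", n = 2)
  )
)

# A string in place of a function extracts the element with that name
out <- df %>% mutate(
  apple = map_chr(x, "apple"),
  n = map_dbl(x, "n")
)
out
```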
### Unnesting
`unnest()` works by repeating the regular columns once for each element of the list-column. For example, in the following very simple example we repeat the first row 4 times (because the first element of `y` has length four), and the second row once:
```{r}
data_frame(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
```
This means that you can't simultaneously unnest two columns that contain different numbers of elements:
```{r, error = TRUE}
# Ok, because y and z have the same number of elements in
# every row
df1 <- data_frame(
  x = 1:2,
  y = list(c("a", "b"), "c"),
  z = list(1:2, 3)
)
df1
df1 %>% unnest(y, z)
# Doesn't work because y and z have different number of elements
df2 <- data_frame(
  x = 1:2,
  y = list("a", c("b", "c")),
  z = list(1:2, 3)
)
df2
df2 %>% unnest(y, z)
```
The same principle applies when unnesting list-columns of data frames. You can unnest multiple list-columns as long as all the data frames in each row have the same number of rows.
### Exercises
1. Why might the `lengths()` function be useful for creating atomic