Continue to polish many models

hadley 2016-06-20 08:30:47 -05:00
parent 17ef519058
commit c1ec9d4f1a
2 changed files with 145 additions and 89 deletions

@ -39,6 +39,7 @@ Remotes:
hadley/modelr,
hadley/purrr,
hadley/stringr,
hadley/tibble,
hadley/ggplot2,
hadley/nycflights13,
yihui/knitr,

@ -1,25 +1,43 @@
# Many models
In this chapter you're going to learn two powerful ideas that help you to work with large numbers of models with ease:
In this chapter you're going to learn three powerful ideas that help you to work with large numbers of models with ease:
1. You'll use list-columns to store arbitrary data structures in a data frame.
This, for example, will allow you to have a column of your data frame that
consists of linear models.
1. Using many simple models to better understand complex datasets.
1. Using list-columns to store arbitrary data structures in a data frame.
For example, this will allow you to have a column that contains linear
models.
1. Turn models into tidy data with the broom package, by David Robinson.
This is a powerful technique for working with large numbers of models
because once you have tidy data, you can apply many techniques that you've
learned about in early chapters of this book.
1. Using the __broom__ package, by David Robinson, to turn models into tidy
data. This is a powerful technique for working with large numbers of models
because once you have tidy data, you can apply all of the techniques that
you've learned about earlier in the book.
These ideas are particularly powerful in conjunction with the ideas of functional programming, so make sure you've read [iteration] and [handling hierarchy] before starting this chapter.
We'll start by diving in to a motivating example using data about life expectancy. It's a small dataset but it illustrates how important modelling can be for improving your visualisation. The following sections dive into more detail about the individual techniques: list-columns and nesting & unnesting.
We'll start by diving in to a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
This chapter focusses on models generated from subsets of your data. This is a powerful technique to help you understand your data, and is often a key step on the way to creating a single complex model that combines the information from all subsets. In the next chapter, you'll learn about another family of techniques for generating multiple models: resampling. Resampling is a powerful tool to help you understand the inferential properties of a model.
The following sections will dive into more detail about the individual techniques:
### Prerequisites
1. In [list-columns], you'll learn more about the list-column data structure,
and why it's valid to put lists in data frames.
1. In [creating list-columns], you'll learn the three main ways in which you'll
create list-columns.
1. In [simplifying list-columns] you'll learn how to convert list-columns back
to regular atomic vectors (or sets of atomic vectors) so you can work
with them more easily.
1. In [making tidy data with broom], you'll learn about the full set of tools
provided by broom, and see how they can be applied to other types of
data structure.
Working with many models requires a combination of packages that you're familiar with from data exploration, wrangling, programming, and modelling.
This chapter focusses on models generated from subsets of your data (exhaustive partition defined by the value of one or more variables). This is a powerful technique for data exploration, and is often a key step on the way to a single complex model that combines the information from all subsets. In the next chapter, you'll learn about another set of techniques that generates multiple models: resampling. Resampling is a powerful tool to help you understand the inferential properties of a model.
## Prerequisites
Working with many models requires a combination of packages that you're already familiar with from data exploration, wrangling, programming, and modelling.
```{r setup, message = FALSE}
# Standard data manipulation and visualisation
@ -36,9 +54,9 @@ library(tidyr)
## gapminder
To motivate the power of many simple models, we're going to look into the "Gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, I strongly recommend that you stop reading this chapter and go watch one of his videos. He is a fantastic data presenter! A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.
To motivate the power of many simple models, we're going to look into the "gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, stop reading this chapter right now and go watch one of his videos! He is a fantastic data presenter and illustrates how you can use data to present a compelling story. A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.
The gapminder data summarises the progression of countries over time, looking at statistics like life expentancy and GDP. The data is easy to access in R, thanks to the work of Jenny Bryan who created the gapminder package:
The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:
```{r}
library(gapminder)
@ -53,9 +71,9 @@ gapminder %>%
geom_line(alpha = 1/3)
```
This is a small dataset: it only has around 1,700 observations and three variables. But it's still hard to see what's going on in this plot! Overall, it looks like life expectency has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?
This is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?
One way is to use the same approach as in the last chapter: there's a strong signal (over all linear growth) that makes it hard to see the smaller pattern. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.
One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.
You already know how to do that if we had a single country:
@ -81,13 +99,11 @@ nz %>%
ggtitle("Remaining pattern")
```
But how can we easily fit that model to every country?
How can we easily fit that model to every country?
### Nested data
You could imagine copy and pasting that code multiple times. But you've already learned a better way! Extract out the common code with a function and repeat using a map function from purrr.
This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and "nest" it:
You could imagine copy and pasting that code multiple times; but you've already learned a better way! Extract out the common code with a function and repeat using a map function from purrr. This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and "nest" it:
```{r}
by_country <- gapminder %>%
@ -97,32 +113,36 @@ by_country <- gapminder %>%
by_country
```
(I'm cheating a little by grouping by both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)
(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)
This creates an data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames. This seems like crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.
This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.
The `data` column is a little tricky to look at because it's a moderately complicated list (we're still working on better tools to explore these objects). But if you look at one of the elements of the `data` column you'll see that it contains all the data for that country (Afghanastan in this case).
The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).
```{r}
by_country$data[[1]]
```
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about this nested dataset is we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about a nested dataset is we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.
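As a rough sketch of that difference, here is the same idea on the built-in `mtcars` data. That dataset is my choice for illustration and is not part of the gapminder example, and the names `grouped` and `nested` are made up. It assumes dplyr and tidyr are installed:

```r
library(dplyr)
library(tidyr)

# Grouped: still one row per observation (one per car)
grouped <- mtcars %>% group_by(cyl)

# Nested: one row per group (one per value of cyl), with the
# remaining columns bundled into the `data` list-column
nested <- grouped %>% nest()

nrow(grouped)
nrow(nested)
```

`mtcars` has 32 rows and three distinct values of `cyl`, so the grouped data frame has 32 rows and the nested one has 3.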
### List-columns
Now that we have our nested data frame, we're in a good position to fit some models because we can think about transforming each data frame into a model. Transforming each element of a list is the job of `purrr:map()`:
Now that we have our nested data frame, we're in a good position to fit some models. We have a model-fitting function:
```{r}
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
```
And we want to apply it to every data frame. The data frames are in a list, so we can use `purrr::map()` to apply `country_model` to each element:
```{r}
models <- map(by_country$data, country_model)
```
However, rather than leaving leaving the list of models as a free-floating object, I think it's better to store it as a variable in the `by_country` data frame. This is why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
%>%
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:
```{r}
@ -131,14 +151,14 @@ by_country <- by_country %>%
by_country
```
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. Dplyr takes take of that for you:
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame take care of that for you:
```{r}
by_country %>% filter(continent == "Europe")
by_country %>% arrange(continent, country)
```
If your list of data frames and list of models where separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
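Here is a small sketch of that failure mode, using `mtcars` as a stand-in for the country data. This is an assumption for illustration, and the names `by_cyl`, `loose_models`, and `sorted` are made up:

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per value of cyl, with a model stored next to its data
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# A free-floating copy of the models, in the current row order
loose_models <- by_cyl$model

# Re-ordering the data frame moves `data` and `model` together...
sorted <- by_cyl %>% arrange(cyl)
in_sync <- map_dbl(sorted$model, nobs) == map_dbl(sorted$data, nrow)

# ...but `loose_models` keeps its old order, so it may no longer
# line up with the rows of `sorted`
still_matches <- map_dbl(loose_models, nobs) == map_dbl(sorted$data, nrow)
```

`in_sync` is `TRUE` for every row because `arrange()` re-ordered both columns at once. Nothing gives `loose_models` the same guarantee, and no error is raised if it has drifted out of step.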
### Unnesting
@ -151,14 +171,16 @@ by_country <- by_country %>% mutate(
by_country
```
But how you can plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into an nested data frame, now we need to do the opposite with `unnest()`:
But how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into a nested data frame, and now we do the opposite with `unnest()`:
```{r}
resids <- unnest(by_country, resids)
resids
```
Then we can plot the residuals. Facetting by continent is partiuclarly revealing:
Note that each regular column is repeated once for each row in the nested column.
Now that we have a regular data frame, we can plot the residuals:
```{r}
resids %>%
@ -166,31 +188,22 @@ resids %>%
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
```
Facetting by continent is particularly revealing:
```{r}
resids %>%
ggplot(aes(year, resid, group = country)) +
geom_line(alpha = 1 / 3) +
facet_wrap(~continent)
```
It looks like overall we've missed some mild quadratic pattern. There's also something intersting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section attacking it from a slightly different angle.
It looks like we've missed some mild quadratic pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
### Model quality
Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package.
The broom package provides three general tools for turning models in to tidy data frames:
1. `broom::glance(model)` returns a row for each model. Each column gives a
model summary: either a measure of model quality, or complexity, or a
combination of the two.
1. `broom:tidy(model)` returns a row for each coefficient in the model. Each
column gives information about the estimate or its variability.
1. `broom::augment(model, data)` returns a row for each row in `data`, adding
extra values like residuals, and influence statistics.
Here we'll use `broom::glance()` to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:
Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package. The broom package provides a general set of functions to turn models into tidy data. Here we'll use `broom::glance()` to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:
```{r}
broom::glance(nz_mod)
@ -204,8 +217,6 @@ by_country %>%
unnest(glance)
```
(Pay attention to the variables that aren't printed: there's a lot of useful stuff there.)
This isn't quite the output we want, because it still includes all the list columns. This is the default behaviour when `unnest()` works on single row data frames. To suppress these columns we use `.drop = TRUE`:
```{r}
@ -215,6 +226,8 @@ glance <- by_country %>%
glance
```
(Pay attention to the variables that aren't printed: there's a lot of useful stuff there.)
With this data frame in hand, we can start to look for models that don't fit well:
```{r}
@ -245,12 +258,14 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic, and the Rw
### Exercises
1. A linear trend seems to be slightly too simple for the overall trend.
Can you do better with a natural spline with two or three degrees of
freedom?
Can you do better with a quadratic polynomial? How can you interpret
the coefficients of the quadratic? (Hint: you might want to transform
`year` so that it has mean zero.)
1. Explore other methods for visualsiation the distribution of $R^2$ per
continent. You might want to try `ggbeeswarm`, which provides similar
methods for avoiding overlaps as jitter, but with less randomness.
1. Explore other methods for visualising the distribution of $R^2$ per
continent. You might want to try the ggbeeswarm package, which provides
similar methods for avoiding overlaps as jitter, but uses deterministic
methods.
1. To create the last plot (showing the data for the countries with the
worst model fits), we needed two steps: we created a data frame with
@ -260,11 +275,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic, and the Rw
## List-columns
Now that you've seen a basic workflow for managing many models, lets dive back into some of the details. In this section, we'll dive into the notional of the list-column in a little more detail, and then we'll give a few more details about `nest()`/`unnest()`.
It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the defintion of the data frame: a data frame is a named list of equal length vectors. A list is a vector, so it's always been legitimate to put use a list as a column of a data frame.
However, base R doesn't make it easier to create list-columns, and `data.frame()` treats a list as a list of columns:.
Now that you've seen a basic workflow for managing many models, let's dive back into some of the details. In this section, we'll explore the list-column data structure in a little more detail. It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors. A list is a vector, so it's always been legitimate to use a list as a column of a data frame. However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:
```{r}
data.frame(x = list(1:3, 3:5))
@ -279,7 +290,7 @@ data.frame(
)
```
Tibble alleviates this problem by not messing with the inputs to `data_frame()`, and by providing a better print method:
Tibble alleviates this problem by being lazier (`data_frame()` doesn't modify its inputs) and by providing a better print method:
```{r}
data_frame(
@ -305,25 +316,24 @@ Generally there are three parts of an effective list-column pipeline:
## Creating list-columns
Typically, you won't create list-columns by hand. There are three primary ways of creating list-colums:
Typically, you won't create list-columns with `data_frame()`. Instead, you'll create them from regular columns, using one of three methods:
1. Using `tidyr::nest()` to convert a grouped data frame into a nested data
1. With `tidyr::nest()` to convert a grouped data frame into a nested data
frame where you have a list-column of data frames.
1. With `mutate()` and vectorised functions that return a list.
1. With `summarise()` and aggregate functions that return an arbitrary
number of results.
1. With `summarise()` and summary functions that return multiple results.
1. From a named-list.
Alternatively, you might create them from a named list, using `tibble::enframe()`.
Generally, when creating list-columns, you should make sure they're homogeneous: each element should contain the same type of thing. There are no checks to make sure this is true, but if you use purrr and remember what you've learned about type-stable functions you should find it happens naturally.
These are described below.
### With nesting
`nest()` creates a specific type of list-column: a list-column of data frames. There are two ways to use it. So far you've seen how to use it with a grouped data frame. When applied to a grouped data frame, `nest()` keeps the grouping columns as is, and bundles everything else into the list-column:
`nest()` creates a nested data frame, which is a data frame with a list-column of data frames. In a nested data frame each row is a meta-observation: the other columns give variables that define the observation (like country and continent above), and the list-column of data frames gives the individual observations that make up the meta-observation.
There are two ways to use `nest()`. So far you've seen how to use it with a grouped data frame. When applied to a grouped data frame, `nest()` keeps the grouping columns as is, and bundles everything else into the list-column:
```{r}
gapminder %>%
@ -338,11 +348,9 @@ gapminder %>%
nest(year:gdpPercap)
```
To be precise, a nested data frame is a data frame with a list-column of data frames. In a nested data frame each row is a meta-observation: the other columns give variables that define the observation (like country and continent above), and the list-column of data frames gives the individual observations that make up the meta-observation.
### From vectorised functions
Some useful fuctions take an atomic vector and return a list. For example, earlier you learned about `stringr::str_split()` which takes a character vector and returns a list of charcter vectors.
Some useful functions take an atomic vector and return a list. For example, in [strings] you learned about `stringr::str_split()` which takes a character vector and returns a list of character vectors. If you use that inside `mutate()`, you'll get a list-column:
```{r}
df <- data_frame(x1 = c("a,b,c", "d,e,f,g"))
@ -361,7 +369,7 @@ df %>%
(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()`, which is a wrapper around this common pattern.)
Another common pattern is to use the map family of from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use summarise:
Another example of this pattern is using `map()`, `map2()`, and `pmap()` from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:
```{r}
sim <- tibble::frame_data(
@ -375,13 +383,11 @@ sim %>%
mutate(sims = invoke_map(f, params, n = 10))
```
Note that technically `sim` isn't homogenous because it contains both double vectors and some integer vectors! However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
It's also common to create list-columns by transforming existing list-columns. You'll learn about that in the next section.
Note that technically `sim` isn't homogeneous because it contains both double and integer vectors. However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
### From multivalued summaries
One restriction of `summarise()` is that it only works with aggregate functions that return a single value. That means that you can't use it with functions like `quantile()` that return a vector of arbitrary length:
One restriction of `summarise()` is that it only works with summary functions that return a single value. That means that you can't use it with functions like `quantile()` that return a vector of arbitrary length:
```{r, error = TRUE}
mtcars %>%
@ -389,7 +395,7 @@ mtcars %>%
summarise(q = quantile(mpg))
```
You can however, wrap the result in a list! This obeys the contract of `summarise()`, because each summary is now a vector (a list) of length 1.
You can, however, wrap the result in a list! This obeys the contract of `summarise()`, because each summary is now a list (a vector) of length 1.
```{r}
mtcars %>%
@ -409,14 +415,34 @@ mtcars %>%
### From a named list
Data frames with list-columns provide a solution to a common problem: what do you do if you want to iterate over both the contents of a list and its elements? Instead of trying to jam everything into one object make a data frame: one column can contain the elements, and one column can contain the list. An easy way to create such a data frame from a list is tibble::enframe(). The advantage of this structure is that it generalises in a straightforward way - names are useful if you have character vector of metadata, but don't help if you have other types of data, or multiple vectors.
Data frames with list-columns provide a solution to a common problem: what do you do if you want to iterate over both the contents of a list and its elements? Instead of trying to jam everything into one object, it's often easier to make a data frame: one column can contain the elements, and one column can contain the list. An easy way to create such a data frame from a list is `tibble::enframe()`.
```{r}
x <- list(
a = 1:5,
b = 3:4,
c = 5:6
)
df <- tibble::enframe(x)
df
```
The advantage of this structure is that it generalises in a straightforward way: names are useful if you have a character vector of metadata, but don't help if you have other types of data, or multiple vectors.
Now if you want to iterate over names and values in parallel, you can use `map2()`:
```{r}
df %>% mutate(smry = map2_chr(name, value, ~ paste0(.x, ": ", .y[1])))
```
### Exercises
1. List all the functions that you can think of that take an atomic vector and
return a list.
1. Brainstorm useful summary functions that, like `quantile()` return
1. Brainstorm useful summary functions that, like `quantile()`, return
multiple values.
1. What's missing in the following data frame? How does `quantile()` return
@ -437,11 +463,9 @@ Data frames with list-columns provide a solution to a common problem: what do yo
summarise_each(funs(list))
```
## Simplifying list-columns
## Collapsing list-columns
To apply the techniques of data manipulation and visualisation you've learned in this book, you'll need to collapse the list-column back to a regular column, or set of columns. The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:
To apply the techniques of data manipulation and visualisation you've learned in this book, you'll need to simplify the list-column back to a regular column (an atomic vector), or set of columns. The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:
1. If you want a single value, use `mutate()` with `map_lgl()`,
`map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.
@ -453,14 +477,14 @@ These are described in more detail below.
### List to vector
If you can reduce you list column to an atomic vector, that will be a regular column. For example, you can always summarise an object with it's type and length, so this code will work regardless of what sort of list-column you have.
If you can reduce your list column to an atomic vector then it will be a regular column. For example, you can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have:
```{r}
df <- data_frame(
x = list(
letters,
1:4,
runif(10)
letters[1:5],
1:3,
runif(5)
)
)
@ -470,9 +494,22 @@ df %>% mutate(
)
```
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you've somehow ended up with a heterogenous list, and want to filter out the parts that you don't need.
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you have a heterogeneous list, and want to filter out the parts that aren't working for you.
Don't forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`.
Don't forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`. This is useful for pulling apart nested lists into regular columns. Use the `.null` argument to provide a value to use if the element is missing (instead of returning `NULL`):
```{r}
df <- data_frame(
x = list(
list(a = 1, b = 2),
list(a = 2, c = 4)
)
)
df %>% mutate(
a = map_dbl(x, "a"),
b = map_dbl(x, "b", .null = NA_real_)
)
```
### Unnesting
@ -511,3 +548,21 @@ The same principle applies when unnesting list-columns of data frames. You can u
1. Why might the `lengths()` function be useful for creating atomic
vector columns from list-columns?
1. List the most common types of vector found in a data frame. What makes
lists different?
## Making tidy data with broom
The broom package provides three general tools for turning models into tidy data frames:
1. `broom::glance(model)` returns a row for each model. Each column gives a
model summary: either a measure of model quality, or complexity, or a
combination of the two.
1. `broom::tidy(model)` returns a row for each coefficient in the model. Each
column gives information about the estimate or its variability.
1. `broom::augment(model, data)` returns a row for each row in `data`, adding
extra values like residuals, and influence statistics.
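
As a quick sketch of the three functions side by side, here they are applied to a simple linear model on `mtcars`. The model and the names `mod`, `model_summary`, `coefs`, and `obs` are my own illustration, and this assumes the broom package is installed:

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

# One row summarising the whole model (R^2, AIC, and so on)
model_summary <- glance(mod)

# One row per coefficient: the intercept and the slope for wt
coefs <- tidy(mod)

# One row per row of the original data, with fitted values
# and residuals added as extra columns
obs <- augment(mod, data = mtcars)
```

Because all three results are ordinary data frames, they can be stored in list-columns and unnested just like the residuals earlier in the chapter.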