Rewriting gapminder write-up

This commit is contained in:
hadley 2016-06-15 10:16:21 -05:00
parent 0e2ccfeba3
commit 726ffcf6e5
1 changed file with 94 additions and 47 deletions

# Many models

In this chapter you're going to learn two powerful ideas that help you work with large numbers of models with ease:

1. You'll use list-columns to store arbitrary data structures in a data frame.
This, for example, will allow you to have a column of your data frame that
consists of linear models.
1. You'll turn models into tidy data with the broom package, by David Robinson.
   This is a powerful technique for working with large numbers of models
   because once you have tidy data, you can apply many of the techniques that
   you've learned about in earlier chapters of this book.

These ideas are particularly powerful in conjunction with functional programming, so make sure you've read [iteration] and [handling hierarchy] before starting this chapter.

We'll start by diving into a motivating example using data about life expectancy. It's a small dataset but it illustrates how important modelling can be for improving your visualisation. The following sections dive into more detail about the individual techniques: list-columns, and nesting & unnesting.

This chapter focusses on models generated from subsets of your data. This is a powerful technique to help you understand your data, and is often a key step on the way to creating a single complex model that combines the information from all subsets. In the next chapter, you'll learn about another family of techniques for generating multiple models: resampling. Resampling is a powerful tool to help you understand the inferential properties of a model.
### Prerequisites
Working with many models requires a combination of packages that you're familiar with from data exploration, wrangling, programming, and modelling.
```{r setup, message = FALSE}
# Standard data manipulation and visualisation
library(dplyr)
library(ggplot2)
# Tools for working with models
library(broom)
library(modelr)
# Tools for working with lots of models
library(purrr)
library(tidyr)
```

## gapminder
To motivate the power of many simple models, we're going to look into the "Gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, I strongly recommend that you stop reading this chapter and go watch one of his videos. He is a fantastic data presenter! A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.

The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. The data is easy to access in R, thanks to the work of Jenny Bryan, who created the gapminder package:

```{r}
library(gapminder)
gapminder
```
In this case study, we're going to focus on just three variables to answer the question "How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?". A good place to start is with a plot:

```{r}
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1/3)
```
This is a small dataset: it only has around 1,700 observations and three variables. But it's still hard to see what's going on in this plot! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?

One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see the smaller patterns. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.

You already know how to do that for a single country:

```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.show = "hold"}
nz <- filter(gapminder, country == "New Zealand")
nz %>%
  ggplot(aes(year, lifeExp)) +
  geom_line() +
  ggtitle("Full data = ")

nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>%
  add_predictions(pred = nz_mod) %>%
  ggplot(aes(year, pred)) +
  geom_line() +
  ggtitle("Linear trend + ")

nz %>%
  add_residuals(resid = nz_mod) %>%
  ggplot(aes(year, resid)) +
  geom_hline(yintercept = 0, colour = "white", size = 3) +
  geom_line() +
  ggtitle("Remaining pattern")
```
But how can we easily fit that model to every country?

### Nested data
You could imagine copy and pasting that code multiple times. But you've already learned a better way! Extract out the common code with a function and repeat yourself using a map function from purrr.

This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and "nest" it:

```{r}
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()
by_country
```
This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames. This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.

The `data` column is a little tricky to look at because it's a moderately complicated list (we're still working on better tools to explore these objects). But if you look at one of the elements of the `data` column you'll see that it contains all the data for that country (Afghanistan in this case).

```{r}
by_country$data[[1]]
```
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about this nested dataset is that an observation is now the complete time course for a country, rather than a single point in time.

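To make that change in shape concrete, here's a quick check (an addition, not part of the original write-up) comparing the row counts of the two representations:

```{r}
# One row per observation vs. one row per group:
nrow(gapminder)   # 1,704 country-year observations
nrow(by_country)  # 142 rows, one per country
```
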
### List-columns
Now that we have our nested data frame, we're in a good position to fit some models, because we can think about transforming each data frame into a model. Transforming each element of a list is the job of `purrr::map()`:

```{r, eval = FALSE}
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
models <- map(by_country$data, country_model)
```
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a variable in the `by_country` data frame. This is why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?

In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:

```{r}
by_country <- by_country %>%
  mutate(model = map(data, country_model))
by_country
```
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. dplyr takes care of that for you:

```{r}
by_country %>% filter(continent == "Europe")
by_country %>% arrange(continent, country)
```
If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!

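To see the failure mode, here's a sketch using hypothetical free-floating objects (`data_list` and `model_list` are illustrative names, not objects used elsewhere in this chapter):

```{r, eval = FALSE}
# Parallel objects kept outside the data frame:
data_list  <- by_country$data
model_list <- map(data_list, country_model)

# Subset one of them...
data_list <- data_list[by_country$continent == "Europe"]

# ...and forget the other: model_list still has 142 elements, so
# data_list[[i]] and model_list[[i]] no longer describe the same country.
```
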
### Unnesting
Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `add_residuals()` with each model-data pair:

```{r}
by_country <- by_country %>% mutate(
  resids = map2(data, model, ~ add_residuals(.x, resid = .y))
)
by_country
```
But how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into a nested data frame, and now we'll do the opposite with `unnest()`:

```{r}
resids <- unnest(by_country, resids)
resids
```
Then we can plot the residuals. Facetting by continent is particularly revealing:

```{r}
resids %>%
  ggplot(aes(year, resid)) +
  geom_line(aes(group = country), alpha = 1 / 3) +
  geom_smooth(se = FALSE)

resids %>%
  ggplot(aes(year, resid, group = country)) +
  geom_line(alpha = 1 / 3) +
  facet_wrap(~continent)
```
It looks like overall we've missed some mild quadratic pattern. There's also something interesting going on in Africa: we see some very large residuals, which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.

### Model quality
Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package.

The broom package provides three general tools for turning models into tidy data frames:

1. `broom::glance(model)` returns a row for each model. Each column gives a
model summary: either a measure of model quality, or complexity, or a
combination of the two.
1. `broom::tidy(model)` returns a row for each coefficient in the model. Each
   column gives information about the estimate or its variability.
1. `broom::augment(model, data)` returns a row for each row in `data`, adding
extra values like residuals, and influence statistics.
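To give a concrete sense of the first two, here's what they return for the New Zealand model we fitted earlier (a quick illustration; the exact columns depend on your version of broom):

```{r}
broom::tidy(nz_mod)        # one row per coefficient: (Intercept) and year
broom::augment(nz_mod, nz) # one row per observation in `nz`
```
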
Here we'll use `broom::glance()` to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:
```{r}
glance(nz_mod)
```
We can use `mutate()` and `unnest()` to create a data frame with a row for each country:

```{r}
by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance)
```
This isn't quite the output we want, because it still includes all the list-columns. This is the default behaviour when `unnest()` works on single-row data frames. To suppress these columns we use `.drop = TRUE`:

```{r}
glance <- by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance, .drop = TRUE)
glance
```
With this data frame in hand, we can start to look for models that don't fit well:

```{r}
glance %>% arrange(r.squared)
```
The worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:

```{r}
glance %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)
```

We could pull out the countries with particularly bad $R^2$ and plot the data:

```{r}
bad_fit <- filter(glance, r.squared < 0.25)
bad_fit
gapminder %>%
  semi_join(bad_fit, by = "country") %>%
  ggplot(aes(year, lifeExp, colour = country)) +
  geom_line()
```
We see two main effects here: the tragedies of the HIV/AIDS epidemic, and the Rwandan genocide.

### Exercises
1. A linear trend seems to be slightly too simple for the overall pattern.
Can you do better with a natural spline with two or three degrees of
freedom?
1. Explore other methods for visualising the distribution of $R^2$ per
   continent. You might want to try the ggbeeswarm package, which provides
   similar methods for avoiding overlaps as jitter, but with less randomness.
1. To create the last plot (showing the data for the countries with the
worst model fits), we needed two steps: we created a data frame with
one row per country and then semi-joined it to the original dataset.
It's possible to avoid this join if we use `unnest()` instead of
`unnest(.drop = TRUE)`. How?
## List-columns
The idea of a list column is powerful. The contract of a data frame is that it's a named list of vectors, where each vector has the same length. A list is a vector, and a list can contain anything, so you can put anything in a list-column of a data frame.
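
Here's a minimal sketch of that contract in action (it assumes the tibble package, since base `data.frame()` makes constructing list-columns awkward):

```{r}
# Both columns have length two, but the list-column `objects` holds a
# completely different kind of thing in each element.
tibble::tibble(
  x = 1:2,
  objects = list(1:5, lm(lifeExp ~ year, data = nz))
)
```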