r4ds/model-basics.Rmd

667 lines
27 KiB
Plaintext

# Model
A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
Family of models vs fitted model. Set of possible values, vs. one specific model. A fitted model = family of models plus a dataset.
This chapter will explain how to build useful models with R.
## Outline
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.
*Section 2* will show you the best ways to use R's model output, which often requires additional wrangling.
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
*Section 4* will explain how to use categorical variables in your models and how to interpret the results of models that use categorical variables. Here you will learn about interaction effects, as well as logistic models.
*Section 5* will present a logical way to extend linear models to describe non-linear relationships.
### Prerequisites
To access the functions and data sets that we will use in the chapter, load the following packages:
```{r message = FALSE}
# Modelling functions
library(modelr)
library(broom)
# Modelling requires plently of visualisation and data manipulation
library(ggplot2)
library(dplyr)
library(tidyr)
# Options that make your life easier
options(
contrasts = c("contr.treatment", "contr.treatment"),
na.option = na.exclude
)
```
## Linear models
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
You can load the latest cross-section of NLS data, collected in 2013 with the code below.
```{r echo = FALSE}
heights <- tibble::as_data_frame(readRDS("data/heights.RDS"))
heights
n <- nrow(heights)
n
```
I've narrowed the data down to 10 variables:
* `id` - A number to identify each subject
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in pounds
* `sex` - The sex of each subject
* `race` - The race of each subject
* `education` - The number of years of education completed by each subject
* `asvab` - Each subject's score on the Armed Services Vocational Aptitude Battery (ASVAB), an intelligence assessment, out of 100.
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
```{r warnings = FALSE}
ggplot(heights, aes(height, income)) +
geom_point()
```
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
```{r}
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
```
Also, you can see that heights have been rounded to the nearest inch so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typically heights:
```{r}
heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n
ggplot(heights, aes(height, income, group = height)) +
geom_boxplot()
```
(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you've got the first simple model working, you really should come back and all look at the full dataset. Is removing the data still a good idea?)
You can see there seems to be a fairly weak relationship: as height increase the median wage also seems to increase. But how could we summarise that more quantitiatively?
One option is the __correlation__, $r$, from statistics, which measures how strongly the values of two variables are related. The sign of the correlation describes whether the variables have a positive or negative relationship. The magnitude of the correlation describes how strongly the values of one variable determine the values of the second. A correlation of 1 or -1 implies that the value of one variable completely determines the value of the second variable.
```{r echo = FALSE, cache=TRUE, fig.height = 2}
x1 <- rnorm(100)
y1 <- .5 * x1 + rnorm(100, sd = .5)
y2 <- -.5 * x1 + rnorm(100, sd = .5)
cordat <- data_frame(
x = rep(x1, 5),
y = c(-x1, y2, rnorm(100), y1, x1),
cor = factor(
rep(1:5, each = 100),
labels = paste0("Correlation = ", c(-1, -0.5, 0, 0.5, 1))
)
)
ggplot(cordat, aes(x, y)) +
geom_point() +
facet_grid(. ~ cor) +
coord_fixed() +
xlab(NULL) +
ylab(NULL)
```
In R, we can compute the correlation with `cor()`:
```{r}
cor(heights$height, heights$income)
```
The correlation suggests that heights may have a small effect on income.
Another way to summarise the relationship is with a linear model.
Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula, two or more variables separated by a `~`. You've seen formulas before, we used them in Chapter 2 to facet graphs.
```{r}
income ~ height
h <- lm(income ~ height, data = heights)
h
```
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
geom_smooth(method = lm, se = FALSE)
```
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predict the response.
Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`$. Each one inch increase of height above zero is associated with an increase of $`r coef(h)[2]`$ in income.
```{r}
summary(h)
```
### Exercises
1. What variables in `heights` do you expect to be most highly correlated with
income? Use `cor()` plus `purrr::map_dbl()` to check your guesses.
1. Correlation only summarises the linear relationship between two continuous
variables. There are some famous drawbacks to the correlation. What
are they? Hint: google for Anscombe's quartet, read <https://xkcd.com/552/>.
## Understanding the model
For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you'll spend a lot of time doing that. Here, however, we're going to take a different tack. In this book, we're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand simple linear models or complex random forrests. We'll see that advantage later on when we explore some other families of models.
1. Make predictions across a uniform grid to understand the "shape" of
the model.
1. Subtract predictions from the actual values to look at the residuals
and to learn what the model misses.
1. Create succinct numerical summaries when you need to compare
many models.
### Predictions
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `tidyr::expand()`. It's first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
```{r}
grid <- heights %>% expand(height)
grid
```
(This will get more interesting when we start to add more variables to our model.)
Next we add predicitons. We'll use `modelr::add_predictions()` which works in exactly the same way as `add_residuals()`, but just compute predictions (so doesn't need a data frame that contains the response variable:)
```{r}
grid <- grid %>% add_predictions(h, "income")
grid
```
And then we plot the predictions. Here the choice of plots is pretty obvious because we only havce two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
geom_line(data = grid, colour = "red", size = 1)
```
### Residuals
The flip-side of predictions are residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. We can compute residuals with `add_residuals()`. Note that we computing residuals, you'll use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?
```{r}
heights <- heights %>% add_residuals(h)
```
There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
```{r}
ggplot(heights, aes(resid)) +
geom_freqpoly(binwidth = 2000)
```
(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)
Here you can see that the range of the residuals is quite large. (Note that by the formal definiton of the linear model, the mean of the residuals will always be zero).
For many problems, the sign of the residual (i.e. whether the prediction is too high or too low) isn't important, and you might just want to focus on the magnitude of the residuals. You can do that by plotting the absolute value:
```{r}
ggplot(heights, aes(abs(resid))) +
geom_freqpoly(binwidth = 2000, boundary = 0)
```
You can also explore how the residuals vary with other variables in the data:
```{r}
ggplot(heights, aes(height, resid)) + geom_point()
```
Iterative plotting the residuals instead of the original response leads to a natual way of building up a complex model in simple steps, which we'll explore in detail in the next chapter.
### Numeric summaries of model quality
When you start dealing with many models, it's helpful to have some rough way of comparing them so you can spend your time looking at the models that do the best job of capturing important features in the data.
One way to capture the quality of the model is to summarise the distribution of the residuals. For example, you could look at the quantiles of the absolute residuals. For this dataset, 25% of predictions are less than \$7,400 away, and 75% are less than \$25,800 away. That seems like quite a bit of error when predicting someone's income!
```{r}
qae(h, heights)
range(heights$income)
```
You might be familiar with the $R^2$. That's a single number summary that rescales the variance of the residuals to between 0 (very bad) and 1 (very good):
```{r}
rsquare(h, heights)
```
$R^2$ can be interpreted as the amount of variation in the data explained by the model. Here we're explaining 3% of the total variation - not a lot! But I don't think worrying about the relative amount of variation explained is that useful; instead I think you need to consider whether the absolute amount of variation explained is useful for your project.
It's called the $R^2$ because for simple models like this, it's just the square of the correlation between the variables:
```{r}
cor(heights$income, heights$height) ^ 2
```
The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're asssessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
### Exercises
1. In the plot of the absolute residuals, there's a bin smaller than
zero. What does this bin represent, and why is it necessary?
1. Explore how the distribution of residuals varies with the sex and
race. What makes understanding the residuals split up by race
particularly challenging?
1. P-values are an important part of model interpretation, particularly in
science, that we're not going to talk much about. <https://xkcd.com/882/>
1. It's often useful to recreate your initial EDA plots using residuals
instead of the original missing values. How does visualising `resid`
instead of `height` change your understanding of the heights data?
## Multiple predictors
In most cases, like this one, a single variable is not enough to generate a useful model. Instead we need multiple variables in the model, which you can include with either `+` or `*` in the modelling formula.
The distinction between `+` and `*` is important:
* `x + y` adds the variables independently: the model will estimate the
effect of `x` and the effect of `y` completely separately.
* `x * y` will estimate the effect of the model together.
We'll explore what this means in the sections below, as we add more categorical and continuous variables to the dataset.
### Categorical
Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.
```{r}
ggplot(heights, aes(height, colour = sex)) +
geom_freqpoly(binwidth = 1)
ggplot(heights, aes(income, colour = sex)) +
geom_freqpoly(binwidth = 5000)
```
What happens if we also include `sex` in the model?
```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
add_predictions(h2, "income")
ggplot(heights, aes(height, income)) +
geom_point() +
geom_line(data = grid) +
facet_wrap(~sex)
```
Need to commment about predictions for tall women and short men - there is not a lot of data there. Need to be particularly sceptical.
`*` vs `+`.
```{r}
h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
gather_predictions(h2, h3)
ggplot(grid, aes(height, pred, colour = sex)) +
geom_line() +
facet_wrap(~model)
```
#### Factors
R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
```{r}
fac <- factor(c("c", "a", "b"),
levels = c("a", "b", "c"),
labels = c("blond", "brunette", "red"))
fac
unclass(fac)
```
Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.
If you use factors outside of a model, you will notice some limiting behavior:
* You cannot add values to a factor that do not appear in its levels.
* Factors retain all of their levels when you subset them. To avoid this use `drop = TRUE`.
```{r}
fac[1]
fac[1, drop = TRUE]
```
* If you coerce a factor to a number with `as.numeric()`, R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.
```{r}
num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
num_fac
as.numeric(num_fac)
```
To coerce the labels that you see to a new data type, first coerce the factor to a character string with `as.character()`
```{r}
as.numeric(as.character(num_fac))
```
#### Interpretation
Add categorical variables to a model in the same way that you would add continuous variables.
```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```
Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.
```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```
```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```
```{r}
heights %>%
group_by(sex) %>%
do(glance(lm(income ~ height, data = .)))
```
```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```
### Nested variables
Another case that occassionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:
```{r}
students <- tibble::frame_data(
~student_id, ~school_id,
1, 1,
2, 1,
1, 2,
1, 3,
2, 3,
3, 3
)
```
The student id only makes sense in the context of the school: it doesn't make sense to generate every combination of student and school. You can use `nesting()` for this case:
```{r}
students %>% expand(nesting(school_id, student_id))
```
### Continuous
There appears to be a relationship between a person's education and how poorly the model predicts their income. If we graph the model residuals against `education` above, we see that the more a person is educated, the worse the model underestimates their income:
But before we add a variable to our model, we need to do a little EDA + cleaning:
```{r}
ggplot(heights, aes(education)) + geom_bar()
heights_ed <- heights %>% filter(education >= 12)
nrow(heights) / n
```
We could improve the model by adding education:
```{r}
he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)
```
How can we visualise the results of this model? One way to think about it as a surface: we have a 2d grid of height and education, and point on that grid gets a predicted income.
```{r}
grid <- heights_ed %>%
expand(height, education) %>%
gather_predictions(he1, he2)
ggplot(grid, aes(height, education, fill = pred)) +
geom_raster() +
facet_wrap(~model)
```
It's easier to see what's going on in a line plot:
```{r}
ggplot(grid, aes(height, pred, group = education)) +
geom_line() +
facet_wrap(~model)
ggplot(grid, aes(education, pred, group = height)) +
geom_line() +
facet_wrap(~model)
```
One of the big advantages to `+` instead of `*` is that because the terms are independent we display them using two simple plots instead of one complex plot:
```{r}
heights_ed %>%
expand(
height = seq_range(height, 10),
education = mean(education, na.rm = TRUE)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(height, income)) +
geom_line()
heights_ed %>%
expand(
height = mean(height, na.rm = TRUE),
education = seq_range(education, 10)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(education, income)) +
geom_line()
```
The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.
What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?
You can imagine that if you had a model with four continuous predictions all interacting, that it would be pretty complicated to understand what's going in the model! And certainly you don't have to - it's totally fine to use a model simply as a tool for predicting new values, and in the next chapters you'll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understand of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to undertand every last nuance of the model, but instead to understand more than what you did previously.
condvis.
### Splines
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:
```{r}
ggplot(heights_ed, aes(education, income)) +
geom_boxplot(aes(group = education)) +
geom_smooth(se = FALSE)
```
One way to introduce non-linearity into our model is to use transformed variants of the predictors.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
This is a bit clunky because we have to surround each transformation with `I()`. This is because the rules of model algebra are a little different to usual algebra. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fit all all major terms and second order interactions of x, y, and z.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
However: there's one major problem with using `poly()`: outside the range of the data, polynomials are going to rapidly shoot off to positive or negative infinity.
```{r}
data_frame(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Splines avoid this problem by linearly interpolating outside the range of the data. This isn't great either, but it's a safer default when you don't know for sure what's going to happen.
```{r}
library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)
data_frame(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variables has an long tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Interpolation vs extrapolation
One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.
As the number of variables in your model grows ... "the curse of dimensionality": as the number of variables increases the average distance between points increases. That means most of the space is very sparse, and you have to rely on strong assumptions.
To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making prediction "in between" existing data points, or extrapolating, making predictions about preivously unobserved slices of the data.
One way to do this is to use `condvis::visualweight()`.
<https://cran.rstudio.com/web/packages/condvis/>
## Response variations
### Transformations
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point()
```
```{r}
lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
### Genearlised linear models
So far the y variable of our models has been a continuous variable, `income`. You can use linear regression to model a categorical y variable by transforming y into a continuous variable with a _link function_. Then model fit a model to the results of the link function and use the link function to back transform and interpret the results.
The most common link function is the logit function, which transforms a bivariate y variable into a continuous range.
Use `glm()` to perform logistic regression in R.
```{r}
she <- glm(sex ~ height + education, family = binomial(link = "logit"), data = heights)
tidy(she)
```
## Other model families
### Robust models
Iteratively re-fit the model down-weighting outlying points (points with high residuals).
### Additive models
```{r}
library(mgcv)
gam(income ~ s(education), data = heights)
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```
```{r eval = FALSE}
# Linear z
gam(y ~ s(x) + z, data = df)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)
```
### Random forrests
## Summary
We've avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.
There are other types of modeling algorithms; each provides a valid description of the data.
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?