A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
This chapter will explain how to build useful models with R.
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.
*Section 2* will show you the best ways to use R's model output, which often requires additional wrangling.
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
*Section 4* will explain how to use categorical variables in your models and how to interpret the results of models that use categorical variables. Here you will learn about interaction effects, as well as logistic models.
*Section 5* will present a logical way to extend linear models to describe non-linear relationships.
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
You can load the latest cross-section of NLS data, collected in 2013 with the code below.
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in pounds
* `sex` - The sex of each subject
* `race` - The race of each subject
* `education` - The number of years of education completed by each subject
* `asvab` - Each subject's score on the Armed Services Vocational Aptitude Battery (ASVAB), an intelligence assessment, out of 100.
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
Also, you can see that heights have been rounded to the nearest inch so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typically heights:
You can see there seems to be a fairly weak relationship: as height increase the median wage also seems to increase. But how could we summarise that more quantitiatively?
One option is the __correlation__, $r$, from statistics, which measures how strongly the values of two variables are related. The sign of the correlation describes whether the variables have a positive or negative relationship. The magnitude of the correlation describes how strongly the values of one variable determine the values of the second. A correlation of 1 or -1 implies that the value of one variable completely determines the value of the second variable.
A model describes the relationship between two or more variables. There are multiple ways to describe any relationship. Which is best?
A common choice: decide the form of the relationship, then minimize residuals.
Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula, two or more variables separated by a `~`. You've seen formulas before, we used them in Chapter 2 to facet graphs.
```{r}
income ~ height
h <- lm(income ~ height, data = heights)
h
```
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. It then spots the linear function that best fits the data.
Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`$. Each one inch increase of height above zero is associated with an increase of $`r coef(h)[2]`$ in income.
For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you'll spend a lot of time doing that. Here, however, we're going to take a different tack. In this book, we're going to focus on understanding a model by looking at its predictions.
To do that, we first need to generate a grid of values to compute predictions for. The easiest way to do that is to use `tidyr::expand()`. It's first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
```{r}
grid <- heights %>% expand(height)
grid
```
(This will get more interesting when we start to add more variables to our model.)
Next we add predicitons. We'll use `modelr::add_predictions()` which works in exactly the same way as `add_residuals()`, but just compute predictions (so doesn't need a data frame that contains the response variable:)
```{r}
grid <-
grid %>%
add_predictions(income = h)
grid
```
And then we plot the predictions. Plotting predictions is usually the hardest bit and you'll often need to try a few alternatives before you get a plot that is most informative. Depending on your model it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. It's pretty simple here because there are only two variables involved.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
geom_line(data = grid, colour = "red", size = 1)
```
Need a summary of model quality: I think the standard error of the residuals is probably a reasonable
## Multivariate models
### Categorical
Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.
```{r}
ggplot(heights, aes(height, colour = sex)) +
geom_freqpoly(binwidth = 1)
ggplot(heights, aes(income, colour = sex)) +
geom_freqpoly(binwidth = 5000)
```
What happens if we also include `sex` in the model?
```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
add_predictions(income = h2)
ggplot(heights, aes(height, income)) +
geom_point() +
geom_line(data = grid) +
facet_wrap(~sex)
```
Need to commment about predictions for tall women and short men - there is not a lot of data there. Need to be particularly sceptical.
There appears to be a relationship between a person's education and how poorly the model predicts their income. When we graphed the model residuals against `education` above, we see that the more a person is educated, the worse the model underestimates their income.
Patterns in the residuals suggest that relationships exist between y and other variables, even when the effect of heights is accounted for.
Add variables to a model by adding variables to the right-hand side of the model formula.
```{r}
he1 <- lm(income ~ height + education, data = heights)
he2 <- lm(income ~ height * education, data = heights)
grid <- heights %>%
expand(height, education) %>%
add_predictions(he1 = he1, he2 = he2) %>%
gather(model, prediction, he1:he2)
ggplot(grid, aes(height, education, fill = prediction)) +
geom_raster() +
facet_wrap(~model)
```
It's easier to see what's going on in a line plot:
```{r}
ggplot(grid, aes(height, prediction, group = education)) +
geom_line() +
facet_wrap(~model)
```
The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.
What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?
R's model output is not very tidy. It is designed to provide a data store from which you can extract information with helper functions. You will learn more about tidy data in Tidy Data.
```{r}
coef(h)
predict(h)[1:5]
resid(h)[1:5]
```
The `broom` package provides the most useful helper functions for working with R models. `broom` functions return the most useful model information as data frames, which lets you quickly embed the information into your data science workflow.
What about sex? Many sources have observed that there is a difference in income between genders. Might this explain the height effect? We can find the effect of height independent of sex by adding sex to the model; however, sex is a categorical variable.
### Factors
R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
```{r}
fac <- factor(c("c", "a", "b"),
levels = c("a", "b", "c"),
labels = c("blond", "brunette", "red"))
fac
unclass(fac)
```
Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.
If you use factors outside of a model, you will notice some limiting behavior:
* You cannot add values to a factor that do not appear in its levels.
* Factors retain all of their levels when you subset them. To avoid this use `drop = TRUE`.
```{r}
fac[1]
fac[1, drop = TRUE]
```
* If you coerce a factor to a number with `as.numeric()`, R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.
So far the y variable of our models has been a continuous variable, `income`. You can use linear regression to model a categorical y variable by transforming y into a continuous variable with a _link function_. Then model fit a model to the results of the link function and use the link function to back transform and interpret the results.
The most common link function is the logit function, which transforms a bivariate y variable into a continuous range.
Use `glm()` to perform logistic regression in R.
```{r}
she <- glm(sex ~ height + education, family = binomial(link = "logit"), data = heights)
tidy(she)
```
## Non-linear models
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear.
```{r}
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_boxplot(aes(group = education)) +
geom_smooth() +
coord_cartesian(ylim = c(0, 125000))
```
You can still use linear regression to model non-linear relationships.
### Transformations
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point()
```
```{r}
lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
### Splines
```{r eval = FALSE}
bs(degree = 1) # linear splines
bs() # cubic splines
ns() # natural splines
```
```{r}
library(splines)
tidy(lm(income ~ ns(education, knots = c(10, 17)), data = heights))
tidy(lm(income ~ ns(education, df = 4), data = heights))
```
```{r}
ggplot(data = heights, mapping = aes(x= education, y = income)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ ns(x, df = 4)) +
coord_cartesian(ylim = c(0, 125000))
```
### Additive models
```{r}
gam(income ~ s(education), data = heights)
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```
```{r eval = FALSE}
# Linear z
gam(y ~ s(x) + z, data = df)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)
```
## Summary
We've avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.
There are other types of modeling algorithms; each provides a valid description of the data.
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?