The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
This chapter of the book is unique because it only uses simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
A fitted model is a function: you give it a set of inputs (often called the predictors), and it returns a value (typically called the prediction). Fitting a model takes place in two steps:
1. You define a __family of models__ that you think will capture an
interesting pattern in your data. This class will look like a mathematical
formula that relates your predictors to the response. For a simple 1d
model, this might look like `y = a * x + b` or `y = a * x ^ b`.
1. You generate a __fitted model__ by finding the one model from the family
that is the closest to your data. This takes the generic model family
and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.
It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism that you've probably heard before:
> All models are wrong, but some are useful.
It's worth reading the fuller context of the quote:
The goal of a model is not to uncover Truth, but to discover a simple approximation that is still useful. This tension is at the heart of modelling: you want your model to be simpler than reality (so you can understand it more easily) while still generating good predictions. There is almost always a tradeoff between understandability and quality of predictions.
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA. I also recommend setting some options that tweak the default linear model behaviour to be a little less surprising.
You can see a strong pattern in the data and we can use a model to capture it. This data looks like it comes from a linear famiy, i.e. `y = a * x + b`. Let's start by randomly generating a few models from that family and overlaying them on the data:
There are an infinite number of possible models, but some are obviously better than others. Intuitively, it seems reasonable that a good model should generate a line that's close to the data. How can we make that intuition concrete?
And then overlay only the 10 best models on to the data. I've coloured the points by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) are coloured brightly.
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. A more efficient approach is to use a numerical minimisation tool called a Newton-Rhapson search. The intuition of Newton-Rhapson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat.
You can that basic approach for any family of model that you can write an equation for. However, the specfic model we fit is a part of a broader class called linear models. We can use the `lm()` function to solve this specific problem. `lm()` has a special way to specify the model family: a formula like `y ~ x` which `lm()` translates to `y = a * x + b`. We can fit the model and look at the output:
Generally, model functions will do something more sophisticated than a gradient search: for linear models based on linear algebra. Using connection between geometry, calculus, and linear algebra: the closest linear model to the data can always be found by constructing and then inverting a matrix.
For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you'll spend a lot of time doing that. Here, however, we're going to take a different tack. In this book, we're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand simple linear models or complex random forrests. We'll see that advantage later on when we explore some other families of models.
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `tidyr::expand()`. It's first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
(This will get more interesting when we start to add more variables to our model.)
Next we add predicitons. We'll use `modelr::add_predictions()` which works in exactly the same way as `add_residuals()`, but just compute predictions (so doesn't need a data frame that contains the response variable:)
And then we plot the predictions. Here the choice of plots is pretty obvious because we only havee two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.
The flip-side of predictions are residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. The residuals are just the distances between the observed and predicted values that we computed above!
We can compute residuals with `add_residuals()`. Note that we computing residuals, you'll use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?
There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)
For many problems, the sign of the residual (i.e. whether the prediction is too high or too low) isn't important, and you might just want to focus on the magnitude of the residuals. You can do that by plotting the absolute value:
In most cases, like this one, a single variable is not enough to generate a useful model. Instead we need multiple variables in the model, which you can include with either `+` or `*` in the modelling formula.
The distinction between `+` and `*` is important:
* `x + y` adds the variables independently: the model will estimate the
effect of `x` and the effect of `y` completely separately.
* `x * y` will estimate the effect of the model together.
We'll explore what this means in the sections below, as we add more categorical and continuous variables to the dataset.
R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
```{r}
fac <- factor(c("c", "a", "b"),
levels = c("a", "b", "c"),
labels = c("blond", "brunette", "red"))
fac
unclass(fac)
```
Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.
If you use factors outside of a model, you will notice some limiting behavior:
* You cannot add values to a factor that do not appear in its levels.
* Factors retain all of their levels when you subset them. To avoid this use `drop = TRUE`.
```{r}
fac[1]
fac[1, drop = TRUE]
```
* If you coerce a factor to a number with `as.numeric()`, R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.
Another case that occassionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:
```{r}
students <- tibble::frame_data(
~student_id, ~school_id,
1, 1,
2, 1,
1, 2,
1, 3,
2, 3,
3, 3
)
```
The student id only makes sense in the context of the school: it doesn't make sense to generate every combination of student and school. You can use `nesting()` for this case:
```{r}
students %>% expand(nesting(school_id, student_id))
```
### Interpolation vs extrapolation
One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.
As the number of variables in your model grows ... "the curse of dimensionality": as the number of variables increases the average distance between points increases. That means most of the space is very sparse, and you have to rely on strong assumptions.
To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making prediction "in between" existing data points, or extrapolating, making predictions about preivously unobserved slices of the data.
One way to do this is to use `condvis::visualweight()`.
So far the y variable of our models has been a continuous variable, `income`. You can use linear regression to model a categorical y variable by transforming y into a continuous variable with a _link function_. Then model fit a model to the results of the link function and use the link function to back transform and interpret the results.