Small edits to model.Rmd
parent 6803e565fd
commit fd1f530d6c
26
model.Rmd
|
@@ -1,6 +1,4 @@
|
|||
---
|
||||
output: pdf_document
|
||||
---
|
||||
|
||||
# Model
|
||||
|
||||
A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
|
||||
|
@@ -11,7 +9,7 @@ This chapter will explain how to build useful models with R.
|
|||
|
||||
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.
|
||||
|
||||
*Section 2* will show you the best ways to use R's model output, which is often requires additional wrangling.
|
||||
*Section 2* will show you the best ways to use R's model output, which often requires additional wrangling.
|
||||
|
||||
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
|
||||
|
||||
|
@@ -37,7 +35,7 @@ library(broom)
|
|||
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
|
||||
|
||||
|
||||
Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
|
||||
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
|
||||
|
||||
You can load the latest cross-section of NLS data, collected in 2013 with the code below.
|
||||
|
||||
|
@@ -144,7 +142,7 @@ lm(income ~ 0 + height, data = heights)
|
|||
|
||||
## Using model output
|
||||
|
||||
R's model output is not very tidy. It is designed to provide a data store from which you can extract information with helper functions.
|
||||
R's model output is not very tidy. It is designed to provide a data store from which you can extract information with helper functions. You will learn more about tidy data in Tidy Data.
|
||||
|
||||
```{r}
|
||||
coef(h)
|
||||
|
@@ -213,12 +211,14 @@ What about sex? Many sources have observed that there is a difference in income
|
|||
|
||||
### Factors
|
||||
|
||||
R stores categorical data as factors or character strings. If you add a string to a model, R will convert it to a factor for the purposes of the model.
|
||||
R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.
|
||||
|
||||
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
|
||||
|
||||
```{r}
|
||||
fac <- factor(c("c", "a", "b"), levels = c("a", "b", "c"), labels = c("blond", "brunette", "red"))
|
||||
fac <- factor(c("c", "a", "b"),
|
||||
levels = c("a", "b", "c"),
|
||||
labels = c("blond", "brunette", "red"))
|
||||
fac
|
||||
unclass(fac)
|
||||
```
|
||||
|
@@ -227,19 +227,19 @@ Each level of the factor (i.e. unique value) is encoded as an integer and displa
|
|||
|
||||
If you use factors outside of a model, you will notice some limiting behavior:
|
||||
|
||||
* You cannot add to a factor values that do not appear in its levels attribute
|
||||
* factors retain all of their levels attribute when you subset them. To avoid this use `drop = TRUE`.
|
||||
* You cannot add values to a factor that do not appear in its levels.
|
||||
* Factors retain all of their levels when you subset them. To avoid this use `drop = TRUE`.
|
||||
```{r}
|
||||
fac[1]
|
||||
fac[1, drop = TRUE]
|
||||
```
|
||||
* If you coerce a factor to a numeric, R will convert the integer vector that underlies the factor, not the level labels that you see when you print the factor.
|
||||
* If you coerce a factor to a number with `as.numeric()`, R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.
|
||||
```{r}
|
||||
num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
|
||||
num_fac
|
||||
as.numeric(num_fac)
|
||||
```
|
||||
To coerce these labels to a different data type, first coerce the factor to a character string with `as.character()`
|
||||
To coerce the labels that you see to a new data type, first coerce the factor to a character string with `as.character()`
|
||||
```{r}
|
||||
as.numeric(as.character(num_fac))
|
||||
```
|
||||
|
@@ -292,7 +292,7 @@ tidy(she)
|
|||
|
||||
## Non-linear models
|
||||
|
||||
But what if the relationship between variables is not linear. For example, the relationship between income and education does not seem to be linear.
|
||||
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear.
|
||||
|
||||
```{r}
|
||||
ggplot(data = heights, mapping = aes(x = education, y = income)) +
|
||||
|
|