Merge pull request #51 from radugrosu/patch-9

Update model.Rmd
Hadley Wickham 2016-03-21 08:52:09 -05:00
commit ca7960a32f
1 changed files with 13 additions and 13 deletions

@ -8,7 +8,7 @@ This chapter will explain how to build useful models with R.
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.
-*Section 2* will show you the best ways to use R's model output, which is often reguires additional wrangling.
+*Section 2* will show you the best ways to use R's model output, which is often requires additional wrangling.
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
@ -33,7 +33,7 @@ library(broom)
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that directly inflates the income of the vertically gifted. Do you think this is true?
-Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationhip between height and income, that's just a lucky accident.
+Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
You can load the latest cross-section of NLS data, collected in 2013 with the code below.
@ -43,7 +43,7 @@ heights <- readRDS("data/heights.RDS")
I've narrowed the data down to 10 variables:
-* `id` - A number ot identify each subject
+* `id` - A number to identify each subject
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in pounds
@ -66,7 +66,7 @@ ggplot(data = heights, mapping = aes(x = height, y = income)) +
geom_point()
```
-First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Burea of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
+First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
Also, you can see that heights have been rounded to the nearest inch.
@ -78,9 +78,9 @@ cor(heights$height, heights$income, use = "na")
A model describes the relationship between two or more variables. There are multiple ways to describe any relationship. Which is best?
-A common choice: decide form of relationship, then minimize residuals.
+A common choice: decide the form of the relationship, then minimize residuals.
-Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula, two or more varibles separated by a `~`. You've seen forumlas before, we used them in Chapter 2 to facet graphs.
+Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula, two or more variables separated by a `~`. You've seen formulas before, we used them in Chapter 2 to facet graphs.
```{r}
income ~ height
@ -113,7 +113,7 @@ lm(income ~ 0 + height, data = heights)
## Using model output
-R model output is not very tidy. It is designed to provide a data store that you can extract information from with helper functions.
+R's model output is not very tidy. It is designed to provide a data store from which you can extract information with helper functions.
```{r}
coef(h)
@ -121,7 +121,7 @@ predict(h)[1:5]
resid(h)[1:5]
```
-The `broom` package provides the most useful helper functions for working with R models. `broom` functions return the most useful model information as a data frames, which lets you quickly embed the information into your data science workflow.
+The `broom` package provides the most useful helper functions for working with R models. `broom` functions return the most useful model information as data frames, which lets you quickly embed the information into your data science workflow.
### tidy()
@ -155,7 +155,7 @@ There appears to be a relationship between a person's education and how poorly t
Patterns in the residuals suggest that relationships exist between y and other variables, even when the effect of heights is accounted for.
-Add variables to a model by adding variables to the righthand side of the model formula.
+Add variables to a model by adding variables to the right-hand side of the model formula.
```{r}
income ~ height + education
@ -165,7 +165,7 @@ tidy(he)
### Interpretation
-The coefficient of each variable displays the change of income that is associated with a one unit change in the variable _when all other variables are held constant_.
+The coefficient of each variable represents the increase in income associated with a one unit increase in the variable _when all other variables are held constant_.
### Interaction effects
@ -196,7 +196,7 @@ Each level of the factor (i.e. unique value) is encoded as an integer and displa
If you use factors outside of a model, you will notice some limiting behavior:
-* You cannot add values to a factor that do not appear in its levels attribute
+* You cannot add to a factor values that do not appear in its levels attribute
* factors retain all of their levels attribute when you subset them. To avoid this use `drop = TRUE`.
```{r}
fac[1]
@ -208,7 +208,7 @@ num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
num_fac
as.numeric(num_fac)
```
-To coerce these labels to a different data type, first coerce the factor to a charater string with `as.character()`
+To coerce these labels to a different data type, first coerce the factor to a character string with `as.character()`
```{r}
as.numeric(as.character(num_fac))
```
@ -333,7 +333,7 @@ gam(y ~ s(x, z), data = df)
We've avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.
-There are other types of modeling algorithms; each provides a valid description about the data.
+There are other types of modeling algorithms; each provides a valid description of the data.
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?