Delete unused chapters
Move extra bits to own directory
extra/clustering.Rmd (new file, 152 lines)
@@ -0,0 +1,152 @@
## Visualizing three or more variables

In general, outliers, clusters, and patterns become easier to spot as you look at the interaction of more and more variables. However, as you include more variables in your plot, the data becomes harder to visualize.

You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three-dimensional" graph that you can rotate with your mouse. Below is an example from plotly, displayed as a static image.

```{r eval = FALSE}
library(plotly)
plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Width,
        color = ~Species, type = "scatter3d", mode = "markers")
```

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```

You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two-dimensional retinas. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of 3D space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.

This doesn't mean that you should ignore complex interactions in your data. You can explore multivariate relationships in several ways. You can

* visualize each combination of variables in a multivariate relationship, two at a time

* use aesthetics and faceting to add additional variables to a 2D plot

* use a clustering algorithm to spot clusters in multivariate space

* use a modeling algorithm to spot patterns and outliers in multivariate space

## Clusters

Cluster algorithms are automated tools that seek out clusters in n-dimensional space for you. Base R provides two easy-to-use clustering algorithms: hierarchical clustering and k-means clustering.

### Hierarchical clustering
Hierarchical clustering uses a simple algorithm to locate groups of points that are near each other in n-dimensional space:

1. Identify the two points that are closest to each other
2. Combine these points into a cluster
3. Treat the new cluster as a point
4. Repeat until all of the points are grouped into a single cluster

You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters. The figure below demonstrates how the algorithm would proceed in a two-dimensional dataset.

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-hclust.png")
```

To use hierarchical clustering in R, begin by selecting the numeric columns from your data; you can only apply hierarchical clustering to numeric data. Then apply the `dist()` function to the data and pass the results to `hclust()`. `dist()` computes the distances between your points in the n-dimensional space defined by your numeric vectors. `hclust()` performs the clustering algorithm.

```{r}
small_iris <- sample_n(iris, 50)

iris_hclust <- small_iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  dist() %>%
  hclust(method = "complete")
```

Use `plot()` to visualize the results as a dendrogram. Each observation in the dataset will appear at the bottom of the dendrogram labeled by its rowname. You can use the `labels` argument to set the labels to something more informative.

```{r fig.height = 4}
plot(iris_hclust, labels = small_iris$Species)
```

To see how near two data points are to each other, trace the paths of the data points up through the tree until they intersect. The y value of the intersection displays how far apart the points are in n-dimensional space. Points that are close to each other will intersect at a small y value; points that are far from each other will intersect at a large y value. Groups of points that are near each other will look like "leaves" that all grow on the same "branch." The ordering of the x axis in the dendrogram is somewhat arbitrary (think of the tree as a mobile: each horizontal branch can spin around meaninglessly).

You can split your data into any number of clusters by drawing a horizontal line across the tree. Each vertical branch that the line crosses will represent a cluster that contains all of the points downstream from the branch. Move the line up the y axis to intersect fewer branches (and create fewer clusters); move the line down the y axis to intersect more branches (and create more clusters).
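The sketch below makes the "horizontal line" idea concrete: `rect.hclust()` (a base R helper) redraws the dendrogram's clusters as boxes for a given cut. The choice of three clusters here is just an assumption for illustration.

```{r fig.height = 4}
# A sketch: outline the three clusters that a horizontal cut of the dendrogram
# would create
plot(iris_hclust, labels = small_iris$Species)
rect.hclust(iris_hclust, k = 3)
```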
`cutree()` provides a useful way to split data points into clusters. Give `cutree()` the output of `hclust()` as well as the number of clusters that you want to split the data into. `cutree()` will return a vector of cluster labels for your dataset. To visualize the results, map the output of `cutree()` to an aesthetic.

```{r}
(clusters <- cutree(iris_hclust, 3))

ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = factor(clusters)))
```

You can modify the hierarchical clustering algorithm by setting the `method` argument of `hclust()` to one of "complete", "single", "average", or "centroid". The method determines how to measure the distance between two clusters or a lone point and a cluster, a measurement that affects the outcome of the algorithm.

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-linkage.png")
```

* *complete* - Measures the greatest distance between any two points in the separate clusters. Tends to create distinct clusters and subclusters.

* *single* - Measures the smallest distance between any two points in the separate clusters. Tends to add points one at a time to existing clusters, creating ambiguously defined clusters.

* *average* - Measures the average distance between all combinations of points in the separate clusters. Tends to add points one at a time to existing clusters.

* *centroid* - Measures the distance between the average location of the points in each cluster.
```{r fig.height = 4}
small_iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  dist() %>%
  hclust(method = "single") %>%
  plot(labels = small_iris$Species)
```

### K-means clustering

K-means clustering provides a simulation-based alternative to hierarchical clustering. It identifies the "best" way to group your data into a predefined number of clusters. The figure below visualizes (in two-dimensional space) the k-means algorithm:

1. Randomly assign each data point to one of $k$ groups
2. Compute the centroid of each group
3. Reassign each point to the group whose centroid it is nearest to
4. Repeat steps 2 and 3 until group memberships cease to change

```{r, echo = FALSE}
knitr::include_graphics("images/EDA-kmeans.png")
```

Use `kmeans()` to perform k-means clustering with R. As with hierarchical clustering, you can only apply k-means clustering to numeric data. Pass your numeric data to the `kmeans()` function, then set `centers` to the number of clusters to search for ($k$) and `nstart` to the number of simulations to run. Since the results of k-means clustering depend on the initial assignment of points to groups, which is random, R will run `nstart` simulations and then return the best results (as measured by the minimum sum of squared distances between each point and the centroid of the group it is assigned to). Finally, set the maximum number of iterations to let each simulation run in case the simulation cannot quickly find a stable grouping.

```{r}
iris_kmeans <- small_iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  kmeans(centers = 3, nstart = 20, iter.max = 50)

iris_kmeans$cluster
```

Unlike `hclust()`, the k-means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a vector of cluster assignments for your dataset, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.

```{r}
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = factor(iris_kmeans$cluster)))

small_iris %>%
  group_by(iris_kmeans$cluster) %>%
  summarise(n_obs = n(), avg_width = mean(Sepal.Width), avg_length = mean(Sepal.Length))
```

### Asking questions about clustering

Ask the same questions about clusters that you find with `hclust()` and `kmeans()` that you would ask about clusters that you find with a graph. Ask yourself:

* Do the clusters seem to identify real differences between your points? How can you tell?

* Are the points within each cluster similar in some way?

* Are the points in separate clusters different in some way?

* Might there be a mismatch between the number of clusters that you found and the number that exist in real life? Are only a couple of the clusters meaningful? Are there more clusters in the data than you found?

* How stable are the clusters if you rerun the algorithm? (See the sketch below.)

Keep in mind that both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about the results. They can be quite insightful, but there is no reason to treat them as fact without doing further research.
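As a quick, hedged check of that last question, you can rerun `kmeans()` with `nstart = 1`, so each run starts from a single random assignment, and cross-tabulate the assignments from two runs. The sketch below assumes the `small_iris` sample from above; well-separated clusters give a nearly diagonal table (up to relabeling), while unstable clusters scatter across cells.

```{r}
# A sketch: two independent single-start runs of k-means on the same data,
# compared with a cross-tabulation of their cluster assignments
numeric_iris <- small_iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
run1 <- kmeans(numeric_iris, centers = 3, nstart = 1)
run2 <- kmeans(numeric_iris, centers = 3, nstart = 1)
table(run1$cluster, run2$cluster)
```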
extra/databases.Rmd (new file, 21 lines)
@@ -0,0 +1,21 @@
# Databases

### Two-table verbs

Each two-table verb has a straightforward SQL equivalent:

| R                 | SQL
|-------------------|--------
| `inner_join()`    | `SELECT * FROM x JOIN y ON x.a = y.a`
| `left_join()`     | `SELECT * FROM x LEFT JOIN y ON x.a = y.a`
| `right_join()`    | `SELECT * FROM x RIGHT JOIN y ON x.a = y.a`
| `full_join()`     | `SELECT * FROM x FULL JOIN y ON x.a = y.a`
| `semi_join()`     | `SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
| `anti_join()`     | `SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
| `intersect(x, y)` | `SELECT * FROM x INTERSECT SELECT * FROM y`
| `union(x, y)`     | `SELECT * FROM x UNION SELECT * FROM y`
| `setdiff(x, y)`   | `SELECT * FROM x EXCEPT SELECT * FROM y`
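You can see these translations for yourself with `show_query()`. The sketch below assumes the DBI, RSQLite, and dbplyr packages are available, and uses two small made-up tables:

```{r, eval = FALSE}
library(dplyr)

# A sketch: copy two toy tables into an in-memory SQLite database, then ask
# dplyr to show the SQL it generates for a couple of joins
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
x <- copy_to(con, tibble::tibble(a = 1:3, b = letters[1:3]), "x")
y <- copy_to(con, tibble::tibble(a = 2:4, c = LETTERS[2:4]), "y")

inner_join(x, y, by = "a") %>% show_query()
semi_join(x, y, by = "a") %>% show_query()
```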
`x` and `y` don't have to be tables in the same database. If you specify `copy = TRUE`, dplyr will copy the `y` table into the same location as the `x` variable. This is useful if you've downloaded a summarised dataset and determined a subset of interest that you now want the full data for. You can use `semi_join(x, y, copy = TRUE)` to upload the indices of interest to a temporary table in the same database as `x`, and then perform an efficient semi join in the database.

If you're working with large data, it may also be helpful to set `auto_index = TRUE`. That will automatically add an index on the join variables to the temporary table.
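Putting those two arguments together, a hedged sketch (reusing the connection and the remote `x` table from the sketch above, with a made-up local table of keys):

```{r, eval = FALSE}
# A sketch: `keys` is a local data frame of the subset you care about.
# copy = TRUE uploads it to a temporary table in the same database as the
# remote table before the semi join runs there; auto_index = TRUE adds an
# index on the join variable to that temporary table.
keys <- tibble::tibble(a = c(1L, 3L))

tbl(con, "x") %>%
  semi_join(keys, by = "a", copy = TRUE, auto_index = TRUE)
```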
extra/heights.Rmd (new file, 336 lines)
@@ -0,0 +1,336 @@
## Heights data

Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?

Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.

A small sample of the full dataset is included in modelr:

```{r}
heights
```

As well as `height` and `income` there are some other variables that might affect someone's income: `age`, `sex`, `race`, years of `education`, and their score on the `afqt` (Armed Forces Qualification Test).

Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?

```{r warning = FALSE}
ggplot(heights, aes(height, income)) +
  geom_point()
```

First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above \$180,331. In fact, there is a line of unusual values at exactly \$180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.

```{r}
n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
```
I'm going to record the original number of observations in `n`. We'll come back to this every now and then to make sure that we haven't thrown out too much of our data.

Also, you can see that heights have been rounded to the nearest inch, so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typical heights:

```{r}
heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n

ggplot(heights, aes(height, income, group = height)) +
  geom_boxplot()
```

(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you've got the first simple model working, you really should come back and look at the full dataset. Is removing the data still a good idea?)

You can see there seems to be a fairly weak relationship: as height increases the median wage also seems to increase. But how could we summarise that more quantitatively?

## Linear models

One way is to use a linear model. A linear model is a very broad family of models: it encompasses all models that are a weighted sum of variables.

The formula specifies a family of models: for example, `income ~ height` describes the family of models specified by `x1 * height + x0`, where `x0` and `x1` are real numbers.

```{r}
income ~ height
```

We fit the model by supplying the family of models (the formula), and the data, to a model fitting function, `lm()`. `lm()` finds the single model in the family of models that is closest to the data:

```{r}
h <- lm(income ~ height, data = heights)
h
```

We can extract the coefficients of this fitted model and write down the model it specifies:

```{r}
coef(h)
```

This tells us the model is $`r coef(h)[1]` + `r coef(h)[2]` * height$. In other words, a one inch increase in height is associated with an increase of \$937 in income.

The definition that `lm()` uses for closeness is that it looks for a model that minimises the "root mean squared error".
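As a rough, hedged check, you can compute that quantity by hand from the residuals that `lm()` stores on the fitted model object:

```{r}
# A sketch: the root mean squared error of the fitted model `h`
sqrt(mean(residuals(h) ^ 2))
```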
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.

```{r}
ggplot(heights, aes(height, income)) +
  geom_boxplot(aes(group = height)) +
  geom_smooth(method = lm, se = FALSE)
```

`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predicts the response.

### Exercises

1.  What variables in `heights` do you expect to be most highly correlated with
    income? Use `cor()` plus `purrr::map_dbl()` to check your guesses.

1.  Correlation only summarises the linear relationship between two continuous
    variables. There are some famous drawbacks to the correlation. What
    are they? Hint: google for Anscombe's quartet, read <https://xkcd.com/552/>.
### Categorical

Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.

```{r}
ggplot(heights, aes(height, colour = sex)) +
  geom_freqpoly(binwidth = 1)
ggplot(heights, aes(income, colour = sex)) +
  geom_freqpoly(binwidth = 5000)
```

What happens if we also include `sex` in the model?

```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
  expand(height, sex) %>%
  add_predictions(h2, "income")

ggplot(heights, aes(height, income)) +
  geom_point() +
  geom_line(data = grid) +
  facet_wrap(~sex)
```

We need to be particularly sceptical about the predictions for tall women and short men: there is not a lot of data there.

How do `*` and `+` differ? Compare the interaction model `h2` with an additive model `h3`:

```{r}
h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>%
  expand(height, sex) %>%
  gather_predictions(h2, h3)

ggplot(grid, aes(height, pred, colour = sex)) +
  geom_line() +
  facet_wrap(~model)
```
### Continuous

There appears to be a relationship between a person's education and how poorly the model predicts their income. If we graph the model residuals against `education`, we see that the more a person is educated, the worse the model underestimates their income.

But before we add a variable to our model, we need to do a little EDA + cleaning:

```{r}
ggplot(heights, aes(education)) + geom_bar()
heights_ed <- heights %>% filter(education >= 12)
nrow(heights_ed) / n
```

We could improve the model by adding education:

```{r}
he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)
```

How can we visualise the results of this model? One way to think about it is as a surface: we have a 2d grid of height and education, and each point on that grid gets a predicted income.
```{r}
grid <- heights_ed %>%
  expand(height, education) %>%
  gather_predictions(he1, he2)

ggplot(grid, aes(height, education, fill = pred)) +
  geom_raster() +
  facet_wrap(~model)
```

It's easier to see what's going on in a line plot:

```{r}
ggplot(grid, aes(height, pred, group = education)) +
  geom_line() +
  facet_wrap(~model)
ggplot(grid, aes(education, pred, group = height)) +
  geom_line() +
  facet_wrap(~model)
```

One of the big advantages of `+` over `*` is that because the terms are independent we can display them using two simple plots instead of one complex plot:

```{r}
heights_ed %>%
  expand(
    height = seq_range(height, 10),
    education = mean(education, na.rm = TRUE)
  ) %>%
  add_predictions(he1, "income") %>%
  ggplot(aes(height, income)) +
  geom_line()

heights_ed %>%
  expand(
    height = mean(height, na.rm = TRUE),
    education = seq_range(education, 10)
  ) %>%
  add_predictions(he1, "income") %>%
  ggplot(aes(education, income)) +
  geom_line()
```

The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.

What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?

You can imagine that if you had a model with four continuous predictors all interacting, it would be pretty complicated to understand what's going on in the model! And certainly you don't have to - it's totally fine to use a model simply as a tool for predicting new values, and in the next chapters you'll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understanding of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to understand every last nuance of the model, but instead to understand more than what you did previously.
condvis.

### Categorical

```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```

Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
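A quick, hedged way to see that coding is to look at the model matrix that `lm()` builds behind the scenes:

```{r}
# A sketch: one indicator column per non-baseline level of sex; the baseline
# level is absorbed into the intercept
model.matrix(income ~ sex, data = heights) %>% head()
```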
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.

```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```

```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```

```{r}
heights %>%
  group_by(sex) %>%
  do(glance(lm(income ~ height, data = .)))
```

```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```

### Splines

But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:

```{r}
ggplot(heights_ed, aes(education, income)) +
  geom_boxplot(aes(group = education)) +
  geom_smooth(se = FALSE)
```

One way to introduce non-linearity into our model is to use transformed variants of the predictors.

```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)

heights_ed %>%
  expand(education) %>%
  gather_predictions(mod_e1, mod_e2) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_point() +
  geom_line()
```

This is a bit clunky because we have to surround each transformation with `I()`. This is because the rules of model algebra are a little different to usual algebra. `x ^ 2` is equivalent to `x * x`, which in the modelling algebra is equivalent to `x + x + x:x`, which is the same as `x`. This is useful because `(x + y + z) ^ 2` fits all main terms and second-order interactions of x, y, and z.
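You can see that expansion directly by asking R to enumerate the terms each formula generates (a small sketch with placeholder variables x, y, and z):

```{r}
# A sketch: (x + y + z)^2 expands to the main terms plus all pairwise
# interactions, while x * x collapses back down to just x
attr(terms(~ (x + y + z) ^ 2), "term.labels")
attr(terms(~ x * x), "term.labels")
```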
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)

heights_ed %>%
  expand(education) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_point() +
  geom_line()
```

However, there's one major problem with using `poly()`: outside the range of the data, polynomials are going to rapidly shoot off to positive or negative infinity.

```{r}
tibble(education = seq(5, 25)) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_line()
```

Splines avoid this problem by extrapolating linearly outside the range of the data. This isn't great either, but it's a safer default when you don't know for sure what's going to happen.

```{r}
library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)

tibble(education = seq(5, 25)) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_line()
```

Other useful arguments to `seq_range()`:

* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
  nice to the human eye. This is useful if you want to produce tables of
  output:

    ```{r}
    seq_range(c(0.0123, 0.923423), n = 5)
    seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
    ```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
  variable has a long-tailed distribution and you want to focus on generating
  values near the center:

    ```{r}
    x <- rcauchy(100)
    seq_range(x, n = 5)
    seq_range(x, n = 5, trim = 0.10)
    seq_range(x, n = 5, trim = 0.25)
    seq_range(x, n = 5, trim = 0.50)
    ```

### Additive models

```{r, dev = "png"}
library(mgcv)
gam(income ~ s(education), data = heights)

ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))
```
extra/robust-code.Rmd (new file, 227 lines)
@@ -0,0 +1,227 @@
```{r, include = FALSE}
library(magrittr)
```

# Robust code

(This is an advanced topic. You shouldn't worry too much about it when you first start writing functions. Instead you should focus on getting a function that works right for the easiest 80% of the problem. Then in time, you'll learn how to get to 99% with minimal extra effort. The defaults in this book should steer you in the right direction: we avoid teaching you functions with major surprises.)

In this section you'll learn an important principle that lends itself to reliable and readable code: favour code that can be understood with a minimum of context. On one extreme, take this code:

```{r, eval = FALSE}
baz <- foo(bar, qux)
```

What does it do? You can glean only a little from the context: `foo()` is a function that takes (at least) two arguments, and it returns a result we store in `baz`. But apart from that, you have no idea. To understand what this function does, you need to read the definitions of `foo()`, `bar`, and `qux`. Using better variable names helps a lot:

```{r, eval = FALSE}
df2 <- arrange(df, qux)
```

It's now much easier to see what's going on! Function and variable names are important because they tell you about (or at least jog your memory of) what the code does. That helps you understand code in isolation, even if you don't completely understand all the details. Unfortunately naming things is hard, and it's hard to give concrete advice apart from giving objects short but evocative names. As autocomplete in RStudio has gotten better, I've tended to use longer names that are more descriptive. Short names are faster to type, but you write code relatively infrequently compared to the number of times that you read it.

The idea of minimising the context needed to understand your code goes beyond just good naming. You also want to favour functions with predictable behaviour and few surprises. If a function does radically different things when its inputs differ slightly, you'll need to carefully read the surrounding context in order to predict what it will do. The goal of this section is to educate you about the most common ways R functions can be surprising and to provide you with unsurprising alternatives.

There are three common classes of surprises in R:

1.  Unstable types: What will `df[, x]` return? You can assume that `df`
    is a data frame and `x` is a vector because of their names. But you don't
    know whether this code will return a data frame or a vector because the
    behaviour of `[` depends on the length of `x`.

1.  Non-standard evaluation: What will `filter(df, x == y)` do? It depends on
    whether `x` or `y` or both are variables in `df` or variables in the current
    environment.

1.  Hidden arguments: What sort of variable will `data.frame(x = "a")`
    create? It will be either a character vector or a factor depending on
    the value of the global `stringsAsFactors` option.

Avoiding these three types of functions helps you to write code that is easy to understand and fails obviously with unexpected input. If this behaviour is so important, why do any functions behave differently? It's because R is not just a programming language, but it's also an environment for interactive data analysis. Some things make sense for interactive use (where you quickly check the output and guessing what you want is ok) but don't make sense for programming (where you want errors to arise as quickly as possible).

You might notice that these issues revolve around data frames. That's unfortunate because data frames are the data structure you'll use most commonly. It's ironic that the most frustrating things about programming in R are features that were originally designed to make your data analysis easier! Data frames try very hard to be helpful:
```{r}
df <- data.frame(xy = c("x", "y"))
# Character vectors were hard to work with for a long time, so R
# helpfully converts to a factor for you:
class(df$xy)

# If you're only selecting a single column, R tries to be helpful
# and give you that column, rather than giving you a single column
# data frame
class(df[, "xy"])

# If you have long variable names, R is "helpful" and lets you select
# them with a unique prefix
df$x
```

These features all made sense at the time they were added to R, but computing environments have changed a lot, and these features now tend to cause a lot of problems. tibble disables them for you:

```{r, error = TRUE}
df <- tibble::tibble(xy = c("x", "y"))
class(df$xy)
class(df[, "xy"])
df$x
```

### Unpredictable types

One of the aspects most frustrating for programming is that `[` returns a vector if the result has a single column, and returns a data frame otherwise. In other words, if you see code like `df[x, ]` you can't predict what it will return without knowing how many columns `df` has. This can trip you up in surprising ways. For example, imagine you've written this function to return the last row of a data frame:
```{r}
last_row <- function(df) {
  df[nrow(df), ]
}
```

It's not always going to return a row! If you give it a single column data frame, it will return a single number:

```{r}
df <- data.frame(x = 1:3)
last_row(df)
```

There are two ways to avoid this problem:

* Use `drop = FALSE`: `df[x, , drop = FALSE]`.
* Subset the data frame like a list: `df[x]`.

Using one of those techniques for `last_row()` makes it more predictable: you know it will always return a data frame.

```{r}
last_row <- function(df) {
  df[nrow(df), , drop = FALSE]
}
last_row(df)
```

Another common cause of problems is the `sapply()` function. If you've never heard of it before, feel free to skip this bit: just remember to avoid it! The problem with `sapply()` is that it tries to guess what the simplest form of output is, and it always succeeds.

The following code shows how `sapply()` can produce three different types of data depending on the input.

```{r}
df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

df[1:4] %>% sapply(class) %>% str()
df[1:2] %>% sapply(class) %>% str()
df[3:4] %>% sapply(class) %>% str()
```

In the next chapter, you'll learn about the purrr package which provides a variety of alternatives. In this case, you could use `map_chr()`, which always returns a character vector: if it can't, it will throw an error. Another option is the base `vapply()` function, which takes a third argument indicating what the output should look like.
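A hedged sketch of both alternatives, reusing the `df` defined above (note that the two-class columns now produce an error instead of a differently shaped result):

```{r, error = TRUE}
# A sketch: vapply() and purrr::map_chr() insist on exactly one character
# value per element, so columns whose class vector has length two fail loudly
df[1:2] %>% vapply(class, character(1))
df[3:4] %>% vapply(class, character(1))
df[3:4] %>% purrr::map_chr(class)
```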
This doesn't make `sapply()` bad and `vapply()` and `map_chr()` good. `sapply()` is nice because you can use it interactively without having to think about what `f` will return. 95% of the time it will do the right thing, and if it doesn't you can quickly fix it. `map_chr()` is more important when you're programming, because a clear error message is more valuable when an operation is buried deep inside a tree of function calls.

### Non-standard evaluation

You've learned a number of functions that implement special lookup rules:

```{r, eval = FALSE}
ggplot(mpg, aes(displ, cty)) + geom_point()
filter(mpg, displ > 10)
```

These are called "non-standard evaluation", or NSE for short, because the usual lookup rules don't apply. In both cases above neither `displ` nor `cty` are present in the global environment. Instead both ggplot2 and dplyr look for them first in a data frame. This is great for interactive use, but can cause problems inside a function because they'll fall back to the global environment if the variable isn't found.

[Talk a little bit about the standard scoping rules]

For example, take this function:

```{r}
big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}
```

There are two ways in which this function can fail:

1.  `df$x` might not exist. There are two potential failure modes:

    ```{r, error = TRUE}
    big_x(mtcars, 10)

    x <- 1
    big_x(mtcars, 10)
    ```

    The second failure mode is particularly pernicious because it doesn't
    throw an error, but instead silently returns an incorrect result. It
    works because by design `filter()` looks in both the data frame and
    the parent environment.

    It is unlikely that the variable you care about will both be missing where
    you expect it, and present where you don't expect it. But I think it's
    worth weighing heavily in your analysis of potential failure modes because
    it's a failure that's easy to miss (since it just silently gives a bad
    result), and hard to track down (since you need to read a lot of context).

1.  `df$threshold` might exist:

    ```{r}
    df <- tibble::tibble(x = 1:10, threshold = 100)
    big_x(df, 5)
    ```

    Again, this is bad because it silently gives an unexpected result.

How can you avoid this problem? Currently, you need to do this:

```{r}
big_x <- function(df, threshold) {
  if (!"x" %in% names(df))
    stop("`df` must contain variable called `x`.", call. = FALSE)

  if ("threshold" %in% names(df))
    stop("`df` must not contain variable called `threshold`.", call. = FALSE)

  dplyr::filter(df, x > threshold)
}
```

dplyr currently has no way to force a name to be interpreted as either a local or a parent variable; as I've only just realised, that's really why you should avoid NSE. In a future version you should be able to do:
```{r, eval = FALSE}
big_x <- function(df, threshold) {
  dplyr::filter(df, local(x) > parent(threshold))
}
```

Another option is to implement it yourself using base subsetting:

```{r}
big_x <- function(df, threshold) {
  rows <- df$x > threshold
  df[!is.na(rows) & rows, , drop = FALSE]
}
```

The challenge is remembering that `filter()` also drops missing values, and you also need to remember to use `drop = FALSE`!

### Relying on global options

Functions are easiest to reason about if they have two properties:

1. Their output only depends on their inputs.
1. They don't affect the outside world except through their return value.

The first property is particularly important. If a function has hidden additional inputs, it's very difficult to even know where the important context is!

The biggest breakers of this rule in base R are functions that create data frames. Most of these functions have a `stringsAsFactors` argument that defaults to `getOption("stringsAsFactors")`. This means that a global option affects the operation of a very large number of functions, and you need to be aware that, depending on an external state, a function might produce either a character vector or a factor. In this book, we steer you away from that problem by recommending functions like `readr::read_csv()` and `tibble::tibble()` that don't rely on this option. But be aware of it! Generally, if a function is affected by a global option, you should avoid setting it.

Only use `options()` to control the side-effects of a function. The value of an option should never affect the return value of a function. There are only three violations of this rule in base R: `stringsAsFactors`, `encoding`, and `na.action`. For example, base R lets you control the number of digits printed in default displays with (e.g.) `options(digits = 3)`. This is a good use of an option because it's something that people frequently want control over, but it doesn't affect the computation of a result, just its display. Follow this principle with your own use of options.
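A small sketch of that distinction: changing `digits` alters how a value prints, but not the value that downstream computations see.

```{r}
# A sketch: the printed value is rounded, the underlying value is not
old <- options(digits = 3)
print(pi)
pi == 3.14
options(old)  # restore the previous setting
```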
### Trying too hard

Another class of problems is functions that try really really hard to always return a useful result. Unfortunately they try so hard that they never throw error messages, so you never find out if the input is really weird.

### Exercises

1.  Look at the `encoding` argument to `file()`, `url()`, `gzfile()` etc.
    What is the default value? Why should you avoid setting the default
    value on a global level?