Use the visual editor (#923)

* Sentence wrap + visual editor Rmd style updates

* Move files not called in bookdown.yml to extra

* UK spelling + canonical source + sentence wrap

* Rename .rmd -> .Rmd

* Sentence wrap + visual editor Rmd style

* Fix capitalization
Mine Cetinkaya-Rundel 2021-02-21 15:40:40 +00:00 committed by GitHub
parent 1795c94855
commit 32c625c6ef
38 changed files with 5550 additions and 4179 deletions

EDA.Rmd

@ -2,17 +2,25 @@
## Introduction
This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short.
EDA is an iterative cycle.
You:
1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules.
More than anything, EDA is a state of mind.
During the initial phases of EDA you should feel free to investigate every idea that occurs to you.
Some of these ideas will pan out, and some will be dead ends.
As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.
To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
### Prerequisites
@ -24,65 +32,80 @@ library(tidyverse)
## Questions
> "There are no routine statistical questions, only questionable statistical
> routines." --- Sir David Cox
> "There are no routine statistical questions, only questionable statistical routines." --- Sir David Cox
> "Far better an approximate answer to the right question, which is often
> vague, than an exact answer to the wrong question, which can always be made
> precise." --- John Tukey
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." --- John Tukey
Your goal during EDA is to develop an understanding of your data.
The easiest way to do this is to use questions as tools to guide your investigation.
When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process.
And like most creative processes, the key to asking *quality* questions is to generate a large *quantity* of questions.
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.
On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research.
However, two types of questions will always be useful for making discoveries within your data.
You can loosely word these questions as:
1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?
The rest of this chapter will look at these two questions.
I'll explain what variation and covariation are, and I'll show you several ways to answer each question.
To make the discussion easier, let's define some terms:
- A **variable** is a quantity, quality, or property that you can measure.
- A **value** is the state of a variable when you measure it.
The value of a variable may change from measurement to measurement.
- An **observation** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
An observation will contain several values, each associated with a different variable.
I'll sometimes refer to an observation as a data point.
- **Tabular data** is a set of values, each associated with a variable and an observation.
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
So far, all of the data that you've seen has been tidy.
In real life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
## Variation
**Variation** is the tendency of the values of a variable to change from measurement to measurement.
You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results.
This is true even if you measure quantities that are constant, like the speed of light.
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information.
The best way to understand that pattern is to visualise the distribution of the variable's values.
### Visualising distributions
How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.
A variable is **categorical** if it can only take one of a small set of values.
In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, use a bar chart:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
The height of the bars displays how many observations occurred with each x value.
You can compute these values manually with `dplyr::count()`:
```{r}
diamonds %>%
count(cut)
```
A variable is **continuous** if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables.
To examine the distribution of a continuous variable, use a histogram:
```{r}
ggplot(data = diamonds) +
@ -96,9 +119,12 @@ diamonds %>%
count(cut_width(carat, 0.5))
```
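The `geom_histogram()` call itself is truncated in the hunk above; a minimal sketch of what it presumably looks like (the 0.5 binwidth is an assumption, chosen to match the `cut_width(carat, 0.5)` count shown alongside it):

```{r, eval = FALSE}
# Sketch of the elided histogram call; binwidth here is an assumption
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```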
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable.
You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
```{r}
smaller <- diamonds %>%
@ -108,7 +134,9 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`.
`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, uses lines instead.
It's much easier to understand overlapping lines than bars.
```{r}
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
@ -117,71 +145,89 @@ ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
There are a few challenges with this type of plot, which we will come back to in [visualising a categorical and a continuous variable](#cat-cont).
Now that you can visualise variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).
### Typical values
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values.
Places that do not have bars reveal values that were not seen in your data.
To turn this information into useful questions, look for anything unexpected:
- Which values are the most common?
Why?
- Which values are rare?
Why?
Does that match your expectations?
- Can you see any unusual patterns?
What might explain them?
As an example, the histogram below suggests several interesting questions:
- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
- Why are there no diamonds bigger than 3 carats?
```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
```
Clusters of similar values suggest that subgroups exist in your data.
To understand the subgroups, ask:
- How are the observations within each cluster similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?
The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park.
Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
```
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable.
We'll get to that shortly.
### Unusual values
Outliers are observations that are unusual; data points that don't seem to fit the pattern.
Sometimes outliers are data entry errors; other times outliers suggest important new science.
When you have a lot of data, outliers are sometimes difficult to see in a histogram.
For example, take the distribution of the `y` variable from the diamonds dataset.
The only evidence of outliers is the unusually wide limits on the x-axis.
```{r}
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```
There are so many observations in the common bins that the rare bins are so short that you can't see them (although maybe if you stare intently at 0 you'll spot something).
To make it easy to see the unusual values, we need to zoom to small values of the y-axis with `coord_cartesian()`:
```{r}
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
```
(`coord_cartesian()` also has an `xlim()` argument for when you need to zoom into the x-axis.
ggplot2 also has `xlim()` and `ylim()` functions that work slightly differently: they throw away the data outside the limits.)
This allows us to see that there are three unusual values: 0, \~30, and \~60.
We pluck them out with dplyr:
```{r, include = FALSE}
old <- options(tibble.print_max = 10, tibble.print_min = 10)
@ -199,28 +245,33 @@ unusual
options(old)
```
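A sketch of the dplyr step this refers to; the columns kept and the ordering are assumptions for illustration:

```{r, eval = FALSE}
# Sketch: pull out the diamonds with implausible y values
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
```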
The `y` variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can't have a width of 0mm, so these values must be incorrect.
We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
It's good practice to repeat your analysis with and without the outliers.
If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on.
However, if they have a substantial effect on your results, you shouldn't drop them without justification.
You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
### Exercises
1. Explore the distribution of each of the `x`, `y`, and `z` variables in `diamonds`.
What do you learn?
Think about a diamond and how you might decide which dimension is the length, width, and depth.
2. Explore the distribution of `price`.
Do you discover anything unusual or surprising?
(Hint: Carefully think about the `binwidth` and make sure you try a wide range of values.)
3. How many diamonds are 0.99 carat?
How many are 1 carat?
What do you think is the cause of the difference?
4. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when zooming in on a histogram.
What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
## Missing values
If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.
@ -231,25 +282,27 @@ If you've encountered unusual values in your dataset, and simply want to move on
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
```
I don't recommend this option: just because one measurement is invalid doesn't mean all the measurements are.
Additionally, if you have low-quality data, by the time you've applied this approach to every variable you might find that you don't have any data left!
2. Instead, I recommend replacing the unusual values with missing values.
The easiest way to do this is to use `mutate()` to replace the variable with a modified copy.
You can use the `ifelse()` function to replace unusual values with `NA`:
```{r}
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
```
`ifelse()` has three arguments.
The first argument `test` should be a logical vector.
The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is false.
As an alternative to `ifelse()`, you can use `dplyr::case_when()`.
`case_when()` is particularly useful inside `mutate()` when you want to create a new variable that relies on a complex combination of existing variables.
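A minimal sketch of the same replacement written with `case_when()` (the condition mirrors the `ifelse()` call above):

```{r, eval = FALSE}
# Sketch: replace implausible widths with NA using case_when()
diamonds2 <- diamonds %>%
  mutate(y = case_when(
    y < 3 | y > 20 ~ NA_real_,  # implausible values become missing
    TRUE ~ y                    # everything else is kept as is
  ))
```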
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing.
It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:
```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@ -263,7 +316,10 @@ ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
```
Other times you want to understand what makes observations with missing values different to observations with recorded values.
For example, in `nycflights13::flights`, missing values in the `dep_time` variable indicate that the flight was cancelled.
So you might want to compare the scheduled departure times for cancelled and non-cancelled times.
You can do this by making a new variable with `is.na()`.
```{r}
nycflights13::flights %>%
@ -277,22 +333,30 @@ nycflights13::flights %>%
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
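The `mutate()` step that creates `cancelled` and converts the scheduled departure time is truncated in the hunk above; a sketch of the full pipeline, assuming departure times are stored as HMM integers:

```{r, eval = FALSE}
# Sketch: flag cancelled flights and convert sched_dep_time (HMM) to decimal hours
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```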
However this plot isn't great because there are many more non-cancelled flights than cancelled flights.
In the next section we'll explore some techniques for improving this comparison.
### Exercises
1. What happens to missing values in a histogram?
What happens to missing values in a bar chart?
Why is there a difference?
2. What does `na.rm = TRUE` do in `mean()` and `sum()`?
## Covariation
If variation describes the behavior *within* a variable, covariation describes the behavior *between* variables.
**Covariation** is the tendency for the values of two or more variables to vary together in a related way.
The best way to spot covariation is to visualise the relationship between two or more variables.
How you do that should again depend on the type of variables involved.
### A categorical and continuous variable {#cat-cont}
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon.
The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count.
That means if one of the groups is much smaller than the others, it's hard to see the differences in shape.
For example, let's explore how the price of a diamond varies with its quality:
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
@ -306,30 +370,30 @@ ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
```
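For reference, a sketch of the frequency polygon the text describes (the 500-dollar binwidth is an assumption, mirroring the density version below):

```{r, eval = FALSE}
# Sketch: price distribution by cut, with counts on the y-axis
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```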
To make the comparison easier we need to swap what is displayed on the y-axis.
Instead of displaying count, we'll display **density**, which is the count standardised so that the area under each frequency polygon is one.
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!
But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot.
A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians.
Each boxplot consists of:
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution.
These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box.
These outlying points are unusual so are plotted individually.
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/EDA-boxplot.png")
@ -342,11 +406,16 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
```
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot).
It supports the counterintuitive finding that better quality diamonds are cheaper on average!
In the exercises, you'll be challenged to figure out why.
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on.
Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display.
One way to do that is with the `reorder()` function.
For example, take the `class` variable in the `mpg` dataset.
You might be interested to know how highway mileage varies across classes:
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@ -360,7 +429,8 @@ ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
If you have long variable names, `geom_boxplot()` will work better if you flip it 90°.
You can do that with `coord_flip()`.
```{r}
ggplot(data = mpg) +
@ -370,43 +440,40 @@ ggplot(data = mpg) +
#### Exercises
1. Use what you've learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
2. What variable in the diamonds dataset is most important for predicting the price of a diamond?
How is that variable correlated with cut?
Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
3. Exchange x variable and y variable in a vertical boxplot, and create a horizontal boxplot.
How does this compare to using `coord_flip()`?
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of "outlying values".
One approach to remedy this problem is the letter value plot.
Install the lvplot package, and try using `geom_lv()` to display the distribution of price vs cut.
What do you learn?
How do you interpret the plots?
5. Compare and contrast `geom_violin()` with a facetted `geom_histogram()`, or a coloured `geom_freqpoly()`.
What are the pros and cons of each method?
6. If you have a small dataset, it's sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable.
The ggbeeswarm package provides a number of methods similar to `geom_jitter()`.
List them and briefly describe what each one does.
### Two categorical variables
To visualise the covariation between categorical variables, you'll need to count the number of observations for each combination.
One way to do that is to rely on the built-in `geom_count()`:
```{r}
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
```
The size of each circle in the plot displays how many observations occurred at each combination of values.
Covariation will appear as a strong correlation between specific x values and specific y values.
Another approach is to compute the count with dplyr:
@ -424,23 +491,24 @@ diamonds %>%
geom_tile(mapping = aes(fill = n))
```
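Spelled out in full, the pipeline this hunk truncates presumably counts each colour/cut combination and maps the count to fill; the exact aesthetic mapping below is an assumption based on the exercise that follows:

```{r, eval = FALSE}
# Sketch: count colour/cut combinations and display the counts as a tile plot
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
```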
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.
For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
#### Exercises
1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
2. Use `geom_tile()` together with dplyr to explore how average flight delays vary by destination and month of year.
What makes the plot difficult to read?
How could you improve it?
3. Why is it slightly better to use `aes(x = color, y = cut)` rather than `aes(x = cut, y = color)` in the example above?
### Two continuous variables
You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`.
You can see covariation as a pattern in the points.
For example, you can see an exponential relationship between the carat size and price of a diamond.
```{r, dev = "png"}
ggplot(data = diamonds) +
@ -455,9 +523,15 @@ ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```
But using transparency can be challenging for very large datasets.
Another solution is to use binning.
Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension.
Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.
`geom_bin2d()` creates rectangular bins.
`geom_hex()` creates hexagonal bins.
You will need to install the hexbin package to use `geom_hex()`.
```{r, fig.asp = 1, out.width = "50%", fig.align = "default", message = FALSE}
ggplot(data = smaller) +
@ -468,16 +542,21 @@ ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
```
Another option is to bin one continuous variable so it acts like a categorical variable.
Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about.
For example, you could bin `carat` and then for each group, display a boxplot:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`.
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points.
One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`.
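A sketch of the same binned boxplot with variable-width boxes:

```{r, eval = FALSE}
# Sketch: box widths proportional to the number of observations in each bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```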
Another approach is to display approximately the same number of points in each bin.
That's the job of `cut_number()`:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
@ -486,56 +565,63 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
#### Exercises
1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon.
What do you need to consider when using `cut_width()` vs `cut_number()`?
How does that impact a visualisation of the 2d distribution of `carat` and `price`?
2. Visualise the distribution of carat, partitioned by price.
3. How does the price distribution of very large diamonds compare to small diamonds?
Is it as you expect, or does it surprise you?
4. Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price.
5. Two dimensional plots reveal outliers that are not visible in one dimensional plots.
For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.
```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
Why is a scatterplot a better display than a binned plot for this case?
## Patterns and models
Patterns in your data provide clues about relationships.
If a systematic relationship exists between two variables it will appear as a pattern in the data.
If you spot a pattern, ask yourself:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions.
The scatterplot also displays the two clusters that we noticed above.
```{r fig.height = 2}
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation.
If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.
If two variables covary, you can use the values of one variable to make better predictions about the values of the second.
If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Models are a tool for extracting patterns out of data.
For example, consider the diamonds data.
It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.
It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
The following code fits a model that predicts `price` from `carat` and then computes the residuals (the difference between the predicted value and the actual value).
The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
```{r, dev = "png"}
library(modelr)
@ -550,25 +636,31 @@ ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
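The model-fitting steps are truncated in the hunk above; a sketch of the full sequence, assuming a log-log linear fit for the price/carat relationship described in the text:

```{r, eval = FALSE}
# Sketch: model price as a function of carat, then inspect the residuals
library(modelr)

mod <- lm(log(price) ~ log(carat), data = diamonds)  # log-log fit is an assumption

diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))  # back-transform so residuals are on the price scale

ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))
```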
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r}
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
```
You'll learn how models, and the modelr package, work in the final part of the book, [model](#model-intro).
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
## ggplot2 calls
As we move on from these introductory chapters, we'll transition to a more concise expression of ggplot2 code.
So far we've been very explicit, which is helpful when you are learning:
```{r, eval = FALSE}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
Typically, the first one or two arguments to a function are so important that you should know them by heart.
The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`.
In the remainder of the book, we won't supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
That's a really important programming concern that we'll come back to in [functions].
Rewriting the previous plot more concisely yields:
@ -577,7 +669,9 @@ ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
Sometimes we'll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from `%>%` to `+`.
I wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
```{r, eval = FALSE}
diamonds %>%
@ -588,8 +682,12 @@ diamonds %>%
## Learning more
If you want to learn more about the mechanics of ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>.
It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation.
Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
Another useful resource is the [*R Graphics Cookbook*](https://amzn.com/1449316956) by Winston Chang.
Much of the contents are available online at <http://www.cookbook-r.com/Graphs/>.
I also recommend [*Graphical Data Analysis with R*](https://amzn.com/1498715230), by Antony Unwin.
This is a book-length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.


@ -2,15 +2,29 @@
## Introduction
In [exploratory data analysis], you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know---even before looking---which variables the plot will display.
You made each plot for a purpose, could quickly look at it, and then move on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.
Now that you understand your data, you need to *communicate* your understanding to others.
Your audience will likely not share your background knowledge and will not be deeply invested in the data.
To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible.
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.
This chapter focuses on the tools you need to create good graphics. I assume that you know what you want, and just need to know how to do it. For that reason, I highly recommend pairing this chapter with a good general visualisation book. I particularly like [_The Truthful Art_](https://amzn.com/0321934075), by Albert Cairo. It doesn't teach the mechanics of creating visualisations, but instead focuses on what you need to think about in order to create effective graphics.
This chapter focuses on the tools you need to create good graphics.
I assume that you know what you want, and just need to know how to do it.
For that reason, I highly recommend pairing this chapter with a good general visualisation book.
I particularly like [*The Truthful Art*](https://amzn.com/0321934075), by Albert Cairo.
It doesn't teach the mechanics of creating visualisations, but instead focuses on what you need to think about in order to create effective graphics.
### Prerequisites
In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including __ggrepel__ and __viridis__. Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don't forget you'll need to install those packages with `install.packages()` if you don't already have them.
In this chapter, we'll focus once again on ggplot2.
We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including **ggrepel** and **viridis**.
Rather than loading those extensions here, we'll refer to their functions explicitly, using the `::` notation.
This will help make it clear which functions are built into ggplot2, and which come from other packages.
Don't forget you'll need to install those packages with `install.packages()` if you don't already have them.
```{r, message = FALSE}
library(tidyverse)
@ -18,7 +32,9 @@ library(tidyverse)
## Label
The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the `labs()` function. This example adds a plot title:
The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels.
You add labels with the `labs()` function.
This example adds a plot title:
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
@ -27,14 +43,14 @@ ggplot(mpg, aes(displ, hwy)) +
labs(title = "Fuel efficiency generally decreases with engine size")
```
The purpose of a plot title is to summarise the main finding. Avoid titles that just describe what the plot is, e.g. "A scatterplot of engine displacement vs. fuel economy".
The purpose of a plot title is to summarise the main finding.
Avoid titles that just describe what the plot is, e.g. "A scatterplot of engine displacement vs. fuel economy".
If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above (which should be available by the time you're reading this book):
* `subtitle` adds additional detail in a smaller font beneath the title.
- `subtitle` adds additional detail in a smaller font beneath the title.
* `caption` adds text at the bottom right of the plot, often used to describe
the source of the data.
- `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
@ -47,7 +63,8 @@ ggplot(mpg, aes(displ, hwy)) +
)
```
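For example, a fuller `labs()` call might combine all three; the subtitle and caption text here are just illustrative placeholders.

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )
```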
You can also use `labs()` to replace the axis and legend titles. It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
You can also use `labs()` to replace the axis and legend titles.
It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
@ -60,7 +77,8 @@ ggplot(mpg, aes(displ, hwy)) +
)
```
It's possible to use mathematical equations instead of text strings. Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
It's possible to use mathematical equations instead of text strings.
Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
```{r, fig.asp = 1, out.width = "50%", fig.width = 3}
df <- tibble(
@ -77,22 +95,23 @@ ggplot(df, aes(x, y)) +
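As a minimal sketch, with made-up data and a couple of arbitrary plotmath expressions:

```{r, eval = FALSE}
df <- tibble(x = runif(10), y = runif(10))

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )
```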
### Exercises
1. Create one plot on the fuel economy data with customised `title`,
`subtitle`, `caption`, `x`, `y`, and `colour` labels.
1. Create one plot on the fuel economy data with customised `title`, `subtitle`, `caption`, `x`, `y`, and `colour` labels.
1. The `geom_smooth()` is somewhat misleading because the `hwy` for
large engines is skewed upwards due to the inclusion of lightweight
sports cars with big engines. Use your modelling tools to fit and display
a better model.
2. The `geom_smooth()` is somewhat misleading because the `hwy` for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines.
Use your modelling tools to fit and display a better model.
1. Take an exploratory graphic that you've created in the last month, and add
informative titles to make it easier for others to understand.
3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
## Annotations
In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. The first tool you have at your disposal is `geom_text()`. `geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`. This makes it possible to add textual labels to your plots.
In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
The first tool you have at your disposal is `geom_text()`.
`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
This makes it possible to add textual labels to your plots.
There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:
There are two possible sources of labels.
First, you might have a tibble that provides labels.
The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:
```{r}
best_in_class <- mpg %>%
@ -104,7 +123,9 @@ ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model), data = best_in_class)
```
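A fuller sketch of that approach, assuming "most efficient" means the highest `hwy` within each `class`:

```{r, eval = FALSE}
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)
```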
This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to `geom_label()` which draws a rectangle behind the text. We also use the `nudge_y` parameter to move the labels slightly above the corresponding points:
This is hard to read because the labels overlap with each other, and with the points.
We can make things a little better by switching to `geom_label()` which draws a rectangle behind the text.
We also use the `nudge_y` parameter to move the labels slightly above the corresponding points:
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -112,7 +133,11 @@ ggplot(mpg, aes(displ, hwy)) +
geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)
```
That helps a bit, but if you look closely in the top-left hand corner, you'll notice that there are two labels practically on top of each other. This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. There's no way that we can fix these by applying the same transformation for every label. Instead, we can use the __ggrepel__ package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don't overlap:
That helps a bit, but if you look closely in the top-left hand corner, you'll notice that there are two labels practically on top of each other.
This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same.
There's no way that we can fix these by applying the same transformation for every label.
Instead, we can use the **ggrepel** package by Kamil Slowikowski.
This useful package will automatically adjust labels so that they don't overlap:
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -123,7 +148,9 @@ ggplot(mpg, aes(displ, hwy)) +
Note another handy technique used here: I added a second layer of large, hollow points to highlight the points that I've labelled.
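Putting those pieces together, a sketch that combines **ggrepel** with the hollow highlight layer (it reuses the `best_in_class` data frame sketched above):

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  # hollow circles to highlight the labelled points
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
```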
You can sometimes use the same idea to replace the legend with labels placed directly on the plot. It's not wonderful for this plot, but it isn't too bad. (`theme(legend.position = "none"`) turns the legend off --- we'll talk about it more shortly.)
You can sometimes use the same idea to replace the legend with labels placed directly on the plot.
It's not wonderful for this plot, but it isn't too bad.
(`theme(legend.position = "none")` turns the legend off --- we'll talk about it more shortly.)
```{r}
class_avg <- mpg %>%
@ -144,7 +171,8 @@ ggplot(mpg, aes(displ, hwy, colour = class)) +
theme(legend.position = "none")
```
Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame. Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using `summarise()` to compute the maximum values of x and y.
Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame.
Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using `summarise()` to compute the maximum values of x and y.
```{r}
label <- mpg %>%
@ -159,7 +187,8 @@ ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = label), data = label, vjust = "top", hjust = "right")
```
If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`. Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame:
If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`.
Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame:
```{r}
label <- tibble(
@ -173,7 +202,8 @@ ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = label), data = label, vjust = "top", hjust = "right")
```
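A complete sketch of that trick might look like this:

```{r, eval = FALSE}
label <- tibble(
  displ = Inf,
  hwy = Inf,
  label = "Increasing engine size is \nrelated to decreasing fuel economy."
)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(aes(label = label), data = label, vjust = "top", hjust = "right")
```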
In these examples, I manually broke the label up into lines using `"\n"`. Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line:
In these examples, I manually broke the label up into lines using `"\n"`.
Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line:
```{r}
"Increasing engine size is related to decreasing fuel economy." %>%
@ -181,7 +211,8 @@ In these examples, I manually broke the label up into lines using `"\n"`. Anothe
writeLines()
```
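The full pipeline might look like this; the width of 40 characters is an arbitrary choice.

```{r, eval = FALSE}
"Increasing engine size is related to decreasing fuel economy." %>%
  stringr::str_wrap(width = 40) %>%
  writeLines()
```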
Note the use of `hjust` and `vjust` to control the alignment of the label. Figure \@ref(fig:just) shows all nine possible combinations.
Note the use of `hjust` and `vjust` to control the alignment of the label.
Figure \@ref(fig:just) shows all nine possible combinations.
```{r just, echo = FALSE, fig.cap = "All nine combinations of `hjust` and `vjust`.", fig.asp = 0.5, fig.width = 4.5, out.width = "60%"}
vjust <- c(bottom = 0, center = 0.5, top = 1)
@ -201,44 +232,45 @@ ggplot(df, aes(x, y)) +
labs(x = NULL, y = NULL)
```
Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 available to help annotate your plot. A few ideas:
Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 available to help annotate your plot.
A few ideas:
* Use `geom_hline()` and `geom_vline()` to add reference lines. I often make
them thick (`size = 2`) and white (`colour = white`), and draw them
underneath the primary data layer. That makes them easy to see, without
drawing attention away from the data.
- Use `geom_hline()` and `geom_vline()` to add reference lines.
I often make them thick (`size = 2`) and white (`colour = "white"`), and draw them underneath the primary data layer.
That makes them easy to see, without drawing attention away from the data.
* Use `geom_rect()` to draw a rectangle around points of interest. The
boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`,
`ymin`, `ymax`.
- Use `geom_rect()` to draw a rectangle around points of interest.
The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.
* Use `geom_segment()` with the `arrow` argument to draw attention
to a point with an arrow. Use aesthetics `x` and `y` to define the
starting location, and `xend` and `yend` to define the end location.
- Use `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow.
Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location.
The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!
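As a rough sketch of those geoms in action (all positions, sizes, and colours below are made up purely for illustration):

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  # white reference lines drawn underneath the points
  geom_hline(yintercept = 30, size = 2, colour = "white") +
  geom_vline(xintercept = 5, size = 2, colour = "white") +
  geom_point() +
  # a rectangle around an arbitrary region of interest
  geom_rect(
    data = tibble(xmin = 5.8, xmax = 7.2, ymin = 22, ymax = 27),
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    inherit.aes = FALSE, fill = NA, colour = "grey40"
  ) +
  # an arrow pointing at an arbitrary spot
  geom_segment(
    data = tibble(x = 3, y = 40, xend = 2.2, yend = 44),
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE, arrow = arrow()
  )
```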
### Exercises
1. Use `geom_text()` with infinite positions to place text at the
four corners of the plot.
1. Use `geom_text()` with infinite positions to place text at the four corners of the plot.
1. Read the documentation for `annotate()`. How can you use it to add a text
label to a plot without having to create a tibble?
2. Read the documentation for `annotate()`.
How can you use it to add a text label to a plot without having to create a tibble?
1. How do labels with `geom_text()` interact with faceting? How can you
add a label to a single facet? How can you put a different label in
each facet? (Hint: think about the underlying data.)
3. How do labels with `geom_text()` interact with faceting?
How can you add a label to a single facet?
How can you put a different label in each facet?
(Hint: think about the underlying data.)
1. What arguments to `geom_label()` control the appearance of the background
box?
4. What arguments to `geom_label()` control the appearance of the background box?
1. What are the four arguments to `arrow()`? How do they work? Create a series
of plots that demonstrate the most important options.
5. What are the four arguments to `arrow()`?
How do they work?
Create a series of plots that demonstrate the most important options.
## Scales
The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive. Normally, ggplot2 automatically adds scales for you. For example, when you type:
The third way you can make your plot better for communication is to adjust the scales.
Scales control the mapping from data values to things that you can perceive.
Normally, ggplot2 automatically adds scales for you.
For example, when you type:
```{r default-scales, fig.show = "hide"}
ggplot(mpg, aes(displ, hwy)) +
@ -255,21 +287,25 @@ ggplot(mpg, aes(displ, hwy)) +
scale_colour_discrete()
```
Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales which you'll learn about below.
Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale.
The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.
There are lots of non-default scales which you'll learn about below.
The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:
The default scales have been carefully chosen to do a good job for a wide range of inputs.
Nevertheless, you might want to override the defaults for two reasons:
* You might want to tweak some of the parameters of the default scale.
This allows you to do things like change the breaks on the axes, or the
key labels on the legend.
- You might want to tweak some of the parameters of the default scale.
This allows you to do things like change the breaks on the axes, or the key labels on the legend.
* You might want to replace the scale altogether, and use a completely
different algorithm. Often you can do better than the default because
you know more about the data.
- You might want to replace the scale altogether, and use a completely different algorithm.
Often you can do better than the default because you know more about the data.
### Axis ticks and legend keys
There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of `breaks` is to override the default choice:
There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`.
Breaks controls the position of the ticks, or the values associated with the keys.
Labels controls the text label associated with each tick/key.
The most common use of `breaks` is to override the default choice:
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -277,7 +313,8 @@ ggplot(mpg, aes(displ, hwy)) +
scale_y_continuous(breaks = seq(15, 40, by = 5))
```
You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether. This is useful for maps, or for publishing plots where you can't share the absolute numbers.
You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether.
This is useful for maps, or for publishing plots where you can't share the absolute numbers.
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -286,9 +323,12 @@ ggplot(mpg, aes(displ, hwy)) +
scale_y_continuous(labels = NULL)
```
You can also use `breaks` and `labels` to control the appearance of legends. Collectively axes and legends are called __guides__. Axes are used for x and y aesthetics; legends are used for everything else.
You can also use `breaks` and `labels` to control the appearance of legends.
Collectively axes and legends are called **guides**.
Axes are used for x and y aesthetics; legends are used for everything else.
Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.
Another use of `breaks` is when you have relatively few data points and want to highlight exactly where the observations occur.
For example, take this plot that shows when each US president started and ended their term.
```{r}
presidential %>%
@ -301,16 +341,18 @@ presidential %>%
Note that the specification of breaks and labels for date and datetime scales is a little different:
* `date_labels` takes a format specification, in the same form as
`parse_datetime()`.
- `date_labels` takes a format specification, in the same form as `parse_datetime()`.
* `date_breaks` (not shown here), takes a string like "2 days" or "1 month".
- `date_breaks` (not shown here) takes a string like "2 days" or "1 month"; see the sketch below.
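For example, here's a sketch that reuses the presidential plot described above; the `33 + row_number()` step just numbers the presidents for the y axis, and the `"'%y"` format prints two-digit years.

```{r, eval = FALSE}
presidential %>%
  mutate(id = 33 + row_number()) %>%
  ggplot(aes(start, id)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y")
```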
### Legend layout
You will most often use `breaks` and `labels` to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.
You will most often use `breaks` and `labels` to tweak the axes.
While they both also work for legends, there are a few other techniques you are more likely to use.
To control the overall position of the legend, you need to use a `theme()` setting. We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting `legend.position` controls where the legend is drawn:
To control the overall position of the legend, you need to use a `theme()` setting.
We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot.
The theme setting `legend.position` controls where the legend is drawn:
```{r fig.asp = 1, fig.align = "default", out.width = "50%", fig.width = 4}
base <- ggplot(mpg, aes(displ, hwy)) +
@ -324,7 +366,9 @@ base + theme(legend.position = "right") # the default
You can also use `legend.position = "none"` to suppress the display of the legend altogether.
To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colourbar()`. The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low `alpha` to display many points on a plot.
To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colourbar()`.
The following example shows two important settings: controlling the number of rows the legend uses with `nrow`, and overriding one of the aesthetics to make the points bigger.
This is particularly useful if you have used a low `alpha` to display many points on a plot.
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -336,9 +380,12 @@ ggplot(mpg, aes(displ, hwy)) +
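A sketch of those two settings together: a one-row legend placed at the bottom, with the legend keys enlarged via `override.aes`.

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
```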
### Replacing a scale
Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you're mostly likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
Instead of just tweaking the details a little, you can instead replace the scale altogether.
There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales.
Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
It's very useful to plot transformations of your variable.
For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r, fig.align = "default", out.width = "50%"}
ggplot(diamonds, aes(carat, price)) +
@ -348,7 +395,9 @@ ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d()
```
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot.
Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale.
This is visually identical, except the axes are labelled on the original data scale.
```{r}
ggplot(diamonds, aes(carat, price)) +
@ -357,7 +406,10 @@ ggplot(diamonds, aes(carat, price)) +
scale_y_log10()
```
Another scale that is frequently customised is colour. The default categorical scale picks colours that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
Another scale that is frequently customised is colour.
The default categorical scale picks colours that are evenly spaced around the colour wheel.
Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness.
The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
```{r, fig.align = "default", out.width = "50%"}
ggplot(mpg, aes(displ, hwy)) +
@ -368,7 +420,9 @@ ggplot(mpg, aes(displ, hwy)) +
scale_colour_brewer(palette = "Set1")
```
Don't forget simpler techniques. If there are just a few colours, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.
Don't forget simpler techniques.
If there are just a few colours, you can add a redundant shape mapping.
This will also help ensure your plot is interpretable in black and white.
```{r}
ggplot(mpg, aes(displ, hwy)) +
@ -376,14 +430,18 @@ ggplot(mpg, aes(displ, hwy)) +
scale_colour_brewer(palette = "Set1")
```
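For example, a sketch that maps `drv` to both colour and shape, so the groups stay distinguishable even in black and white:

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv, shape = drv)) +
  scale_colour_brewer(palette = "Set1")
```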
The ColorBrewer scales are documented online at <http://colorbrewer2.org/> and made available in R via the __RColorBrewer__ package, by Erich Neuwirth. Figure \@ref(fig:brewer) shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle". This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
The ColorBrewer scales are documented online at <http://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
Figure \@ref(fig:brewer) shows the complete list of all palettes.
The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
```{r brewer, fig.asp = 2.5, echo = FALSE, fig.cap = "All ColourBrewer scales."}
par(mar = c(0, 3, 0, 0))
RColorBrewer::display.brewer.all()
```
When you have a predefined mapping between values and colours, use `scale_colour_manual()`. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:
When you have a predefined mapping between values and colours, use `scale_colour_manual()`.
For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:
```{r}
presidential %>%
@ -394,9 +452,15 @@ presidential %>%
scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))
```
For continuous colour, you can use the built-in `scale_colour_gradient()` or `scale_fill_gradient()`. If you have a diverging scale, you can use `scale_colour_gradient2()`. That allows you to give, for example, positive and negative values different colours. That's sometimes also useful if you want to distinguish points above or below the mean.
For continuous colour, you can use the built-in `scale_colour_gradient()` or `scale_fill_gradient()`.
If you have a diverging scale, you can use `scale_colour_gradient2()`.
That allows you to give, for example, positive and negative values different colours.
That's sometimes also useful if you want to distinguish points above or below the mean.
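A sketch with made-up data, centring a diverging palette on zero via the `midpoint` argument:

```{r, eval = FALSE}
set.seed(1)
df <- tibble(x = rnorm(100), y = rnorm(100))

ggplot(df, aes(x, y, colour = x + y)) +
  geom_point() +
  scale_colour_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)
```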
Another option is `scale_colour_viridis()` provided by the __viridis__ package. It's a continuous analog of the categorical ColorBrewer scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored a continuous colour scheme that has good perceptual properties. Here's an example from the viridis vignette.
Another option is `scale_colour_viridis()` provided by the **viridis** package.
It's a continuous analog of the categorical ColorBrewer scales.
The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored a continuous colour scheme that has good perceptual properties.
Here's an example from the viridis vignette.
```{r, fig.align = "default", fig.asp = 1, out.width = "50%", fig.width = 4}
df <- tibble(
@ -426,17 +490,18 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f
coord_fixed()
```
1. What is the first argument to every scale? How does it compare to `labs()`?
2. What is the first argument to every scale?
How does it compare to `labs()`?
1. Change the display of the presidential terms by:
3. Change the display of the presidential terms by:
1. Combining the two variants shown above.
1. Improving the display of the y axis.
1. Labelling each term with the name of the president.
1. Adding informative plot labels.
1. Placing breaks every 4 years (this is trickier than it seems!).
1. Combining the two variants shown above.
2. Improving the display of the y axis.
3. Labelling each term with the name of the president.
4. Adding informative plot labels.
5. Placing breaks every 4 years (this is trickier than it seems!).
1. Use `override.aes` to make the legend on the following plot easier to see.
4. Use `override.aes` to make the legend on the following plot easier to see.
```{r, dev = "png", out.width = "50%"}
ggplot(diamonds, aes(carat, price)) +
@ -447,11 +512,12 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f
There are three ways to control the plot limits:
1. Adjusting what data are plotted
1. Setting the limits in each scale
1. Setting `xlim` and `ylim` in `coord_cartesian()`
1. Adjusting what data are plotted
2. Setting the limits in each scale
3. Setting `xlim` and `ylim` in `coord_cartesian()`
To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`. Compare the following two plots:
To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`.
Compare the following two plots:
```{r out.width = "50%", fig.align = "default", message = FALSE}
ggplot(mpg, mapping = aes(displ, hwy)) +
@ -466,7 +532,10 @@ mpg %>%
geom_smooth()
```
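Spelled out in full, the comparison might look like this (the limits are arbitrary):

```{r, eval = FALSE}
# Zooming keeps all the data, so the smooth is fit to every observation
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Subsetting drops data, so the smooth is fit to the filtered points only
mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth()
```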
You can also set the `limits` on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want _expand_ the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.
You can also set the `limits` on individual scales.
Reducing the limits is basically equivalent to subsetting the data.
It is generally more useful if you want to *expand* the limits, for example, to match scales across different plots.
For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.
```{r out.width = "50%", fig.align = "default", fig.width = 4}
suv <- mpg %>% filter(class == "suv")
@ -512,83 +581,109 @@ ggplot(mpg, aes(displ, hwy)) +
theme_bw()
```
ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes). Many more are included in add-on packages like __ggthemes__ (<https://github.com/jrnold/ggthemes>), by Jeffrey Arnold.
ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes).
Many more are included in add-on packages like **ggthemes** (<https://github.com/jrnold/ggthemes>), by Jeffrey Arnold.
```{r themes, echo = FALSE, fig.cap = "The eight themes built-in to ggplot2."}
knitr::include_graphics("images/visualization-themes.png")
```
Many people wonder why the default theme has a grey background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgements), but they have little visual impact and we can easily tune them out. The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.
Many people wonder why the default theme has a grey background.
This was a deliberate choice because it puts the data forward while still making the grid lines visible.
The white grid lines are visible (which is important because they significantly aid position judgements), but they have little visual impact and we can easily tune them out.
The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background.
Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.
It's also possible to control individual components of each theme, like the size and colour of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://amzn.com/331924275X) for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.
It's also possible to control individual components of each theme, like the size and colour of the font used for the y axis.
Unfortunately, this level of detail is outside the scope of this book, so you'll need to read the [ggplot2 book](https://amzn.com/331924275X) for the full details.
You can also create your own themes, if you are trying to match a particular corporate or journal style.
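To give a flavour, here's a sketch that tweaks a couple of y-axis elements with `theme()`; the sizes and colours are arbitrary.

```{r, eval = FALSE}
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  theme(
    axis.title.y = element_text(size = 14, colour = "grey30"),
    axis.text.y = element_text(colour = "grey50")
  )
```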
## Saving your plots
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr. `ggsave()` will save the most recent plot to disk:
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr.
`ggsave()` will save the most recent plot to disk:
```{r, fig.show = "none"}
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
```
```{r, include = FALSE}
file.remove("my-plot.pdf")
```
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device. For reproducible code, you'll want to specify them.
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device.
For reproducible code, you'll want to specify them.
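For example (the dimensions are in inches by default):

```{r, eval = FALSE}
ggsave("my-plot.pdf", width = 6, height = 4)
```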
Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics. You can learn more about `ggsave()` in the documentation.
Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics.
You can learn more about `ggsave()` in the documentation.
### Figure sizing
The biggest challenge of graphics in R Markdown is getting your figures the right size and shape. There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
The biggest challenge of graphics in R Markdown is getting your figures the right size and shape.
There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`.
Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
I only ever use three of the five options:
* I find it most aesthetically pleasing for plots to have a consistent
width. To enforce this, I set `fig.width = 6` (6") and `fig.asp = 0.618`
(the golden ratio) in the defaults. Then in individual chunks, I only
adjust `fig.asp`.
- I find it most aesthetically pleasing for plots to have a consistent width.
To enforce this, I set `fig.width = 6` (6") and `fig.asp = 0.618` (the golden ratio) in the defaults.
Then in individual chunks, I only adjust `fig.asp`.
* I control the output size with `out.width` and set it to a percentage
of the line width. I default to `out.width = "70%"`
and `fig.align = "center"`. That give plots room to breathe, without taking
up too much space.
- I control the output size with `out.width` and set it to a percentage of the line width.
I default to `out.width = "70%"` and `fig.align = "center"`.
That gives plots room to breathe, without taking up too much space.
* To put multiple plots in a single row I set the `out.width` to
`50%` for two plots, `33%` for 3 plots, or `25%` to 4 plots, and set
`fig.align = "default"`. Depending on what I'm trying to illustrate (e.g.
show data or show plot variations), I'll also tweak `fig.width`, as
discussed below.
- To put multiple plots in a single row I set the `out.width` to `50%` for two plots, `33%` for 3 plots, or `25%` for 4 plots, and set `fig.align = "default"`.
Depending on what I'm trying to illustrate (e.g. show data or show plot variations), I'll also tweak `fig.width`, as discussed below.
(These defaults are collected into a setup-chunk sketch after this list.)
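Collected into a setup chunk, those defaults might look something like this:

```{r, eval = FALSE}
knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = 0.618,
  out.width = "70%",
  fig.align = "center"
)
```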
If you find that you're having to squint to read the text in your plot, you need to tweak `fig.width`. If `fig.width` is larger than the size the figure is rendered in the final doc, the text will be too small; if `fig.width` is smaller, the text will be too big. You'll often need to do a little experimentation to figure out the right ratio between the `fig.width` and the eventual width in your document. To illustrate the principle, the following three plots have `fig.width` of 4, 6, and 8 respectively:
If you find that you're having to squint to read the text in your plot, you need to tweak `fig.width`.
If `fig.width` is larger than the size the figure is rendered in the final doc, the text will be too small; if `fig.width` is smaller, the text will be too big.
You'll often need to do a little experimentation to figure out the right ratio between the `fig.width` and the eventual width in your document.
To illustrate the principle, the following three plots have `fig.width` of 4, 6, and 8 respectively:
```{r, include = FALSE}
plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()
```
```{r, fig.width = 4, echo = FALSE}
plot
```
```{r, fig.width = 6, echo = FALSE}
plot
```
```{r, fig.width = 8, echo = FALSE}
plot
```
If you want to make sure the font size is consistent across all your figures, whenever you set `out.width`, you'll also need to adjust `fig.width` to maintain the same ratio with your default `out.width`. For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you set `out.width = "50%"` you'll need to set `fig.width` to 4.3 (6 * 0.5 / 0.7).
If you want to make sure the font size is consistent across all your figures, whenever you set `out.width`, you'll also need to adjust `fig.width` to maintain the same ratio with your default `out.width`.
For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you set `out.width = "50%"` you'll need to set `fig.width` to 4.3 (6 \* 0.5 / 0.7).
### Other important options
When mingling code and text, like I do in this book, I recommend setting `fig.show = "hold"` so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.
When mingling code and text, like I do in this book, I recommend setting `fig.show = "hold"` so that plots are shown after the code.
This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.
To add a caption to the plot, use `fig.cap`. In R Markdown this will change the figure from inline to "floating".
To add a caption to the plot, use `fig.cap`.
In R Markdown this will change the figure from inline to "floating".
If you're producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set `dev = "png"` to force the use of PNGs. They are slightly lower quality, but will be much more compact.
If you're producing PDF output, the default graphics type is PDF.
This is a good default because PDFs are high quality vector graphics.
However, they can produce very large and slow plots if you are displaying thousands of points.
In that case, set `dev = "png"` to force the use of PNGs.
They are slightly lower quality, but will be much more compact.
It's a good idea to name code chunks that produce figures, even if you don't routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (i.e. if you want to quickly drop a single plot into an email or a tweet).
It's a good idea to name code chunks that produce figures, even if you don't routinely label other chunks.
The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (i.e. if you want to quickly drop a single plot into an email or a tweet).
## Learning more
The absolute best place to learn more is the ggplot2 book: [_ggplot2: Elegant graphics for data analysis_](https://amzn.com/331924275X). It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems. Unfortunately, the book is not available online for free, although you can find the source code at <https://github.com/hadley/ggplot2-book>.
The absolute best place to learn more is the ggplot2 book: [*ggplot2: Elegant graphics for data analysis*](https://amzn.com/331924275X).
It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.
Unfortunately, the book is not available online for free, although you can find the source code at <https://github.com/hadley/ggplot2-book>.
Another great resource is the ggplot2 extensions gallery <https://exts.ggplot2.tidyverse.org/gallery/>. This site lists many of the packages that extend ggplot2 with new geoms and scales. It's a great place to start if you're trying to do something that seems hard with ggplot2.
Another great resource is the ggplot2 extensions gallery <https://exts.ggplot2.tidyverse.org/gallery/>.
This site lists many of the packages that extend ggplot2 with new geoms and scales.
It's a great place to start if you're trying to do something that seems hard with ggplot2.

@ -1,8 +1,9 @@
# (PART) Communicate {-}
# (PART) Communicate {.unnumbered}
# Introduction {#communicate-intro}
So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation, visualisation and modelling. However, it doesn't matter how great your analysis is unless you can explain it to others: you need to __communicate__ your results.
So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation, visualisation and modelling.
However, it doesn't matter how great your analysis is unless you can explain it to others: you need to **communicate** your results.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-communicate.png")
@ -10,23 +11,15 @@ knitr::include_graphics("diagrams/data-science-communicate.png")
Communication is the theme of the following four chapters:
* In [R Markdown], you will learn about R Markdown, a tool for integrating
prose, code, and results. You can use R Markdown in notebook mode for
analyst-to-analyst communication, and in report mode for
analyst-to-decision-maker communication. Thanks to the power of R Markdown
formats, you can even use the same document for both purposes.
* In [Graphics for communication], you will learn how to take your exploratory
graphics and turn them into expository graphics, graphics that help the
newcomer to your analysis understand what's going on as quickly and
easily as possible.
* In [R Markdown formats], you'll learn a little about the many other varieties
of outputs you can produce using R Markdown, including dashboards, websites,
and books.
* We'll finish up with [R Markdown workflow], where you'll learn about the
"analysis notebook" and how to systematically record your successes and
failures so that you can learn from them.
- In [R Markdown], you will learn about R Markdown, a tool for integrating prose, code, and results.
You can use R Markdown in notebook mode for analyst-to-analyst communication, and in report mode for analyst-to-decision-maker communication.
Thanks to the power of R Markdown formats, you can even use the same document for both purposes.
Unfortunately, these chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are lot of other great books about communication, which we'll point you to at the end of each chapter.
- In [Graphics for communication], you will learn how to take your exploratory graphics and turn them into expository graphics, graphics that help the newcomer to your analysis understand what's going on as quickly and easily as possible.
- In [R Markdown formats], you'll learn a little about the many other varieties of outputs you can produce using R Markdown, including dashboards, websites, and books.
- We'll finish up with [R Markdown workflow], where you'll learn about the "analysis notebook" and how to systematically record your successes and failures so that you can learn from them.
Unfortunately, these chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans.
However, there are a lot of other great books about communication, which we'll point you to at the end of each chapter.

contribute.Rmd (new file)
@ -0,0 +1,19 @@
# Contributing
This book has been developed in the open, and it wouldn't be nearly as good without your contributions.
There are a number of ways you can help make the book even better:
- If you don't understand something, please [let me know](mailto:h.wickham@gmail.com).
Your feedback on what is confusing or hard to understand is valuable.
- If you spot a typo, feel free to edit the underlying page and send a pull request.
If you've never done this before, the process is very easy:
- Click the edit this page link on the sidebar.
- Make the changes using GitHub's in-page editor and save.
- Submit a pull request and include a brief description of your changes.
"Fixing typos" is perfectly adequate.
- If you make significant changes, include the phrase "I assign the copyright of this contribution to Hadley Wickham" - I need this so I can publish the printed book.

@ -1,23 +0,0 @@
# Contributing
This book has been developed in the open, and it wouldn't be nearly as good
without your contributions. There are a number of ways you can help make the
book even better:
* If you don't understand something, please
[let me know](mailto:h.wickham@gmail.com). Your feedback on what is confusing
or hard to understand is valuable.
* If you spot a typo, feel free to edit the underlying page and send a pull
request. If you've never done this before, the process is very easy:
* Click the edit this page on the sidebar.
* Make the changes using github's in-page editor and save.
* Submit a pull request and include a brief description of your changes.
"Fixing typos" is perfectly adequate.
* If you make significant changes, include the phrase "I assign the
copyright of this contribution to Hadley Wickham" - I need this so I can
publish the printed book.

@ -2,19 +2,28 @@
## Introduction
This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:
This chapter will show you how to work with dates and times in R.
At first glance, dates and times seem simple.
You use them all the time in your regular life, and they don't seem to cause much confusion.
However, the more you learn about dates and times, the more complicated they seem to get.
To warm up, try these three seemingly simple questions:
* Does every year have 365 days?
* Does every day have 24 hours?
* Does every minute have 60 seconds?
- Does every year have 365 days?
- Does every day have 24 hours?
- Does every minute have 60 seconds?
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? (It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25. You might not have known that some minutes have 61 seconds because every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
(It has three parts.)
You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25.
You might not have known that some minutes have 61 seconds because every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.
This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
### Prerequisites
This chapter will focus on the __lubridate__ package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you're working with dates/times. We will also need nycflights13 for practice data.
This chapter will focus on the **lubridate** package, which makes it easier to work with dates and times in R.
lubridate is not part of core tidyverse because you only need it when you're working with dates/times.
We will also need nycflights13 for practice data.
```{r setup, message = FALSE}
library(tidyverse)
@ -27,18 +36,22 @@ library(nycflights13)
There are three types of date/time data that refer to an instant in time:
* A __date__. Tibbles print this as `<date>`.
- A **date**.
Tibbles print this as `<date>`.
* A __time__ within a day. Tibbles print this as `<time>`.
- A **time** within a day.
Tibbles print this as `<time>`.
* A __date-time__ is a date plus a time: it uniquely identifies an
instant in time (typically to the nearest second). Tibbles print this
as `<dttm>`. Elsewhere in R these are called POSIXct, but I don't think
that's a very useful name.
In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times. If you need one, you can use the __hms__ package.
- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second).
Tibbles print this as `<dttm>`.
Elsewhere in R these are called POSIXct, but I don't think that's a very useful name.
You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we'll come back to at the end of the chapter.
In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times.
If you need one, you can use the **hms** package.
You should always use the simplest possible data type that works for your needs.
That means if you can use a date instead of a date-time, you should.
Date-times are substantially more complicated because of the need to handle time zones, which we'll come back to at the end of the chapter.
To get the current date or date-time you can use `today()` or `now()`:
@ -49,15 +62,21 @@ now()
Otherwise, there are three ways you're likely to create a date/time:
* From a string.
* From individual date-time components.
* From an existing date/time object.
- From a string.
- From individual date-time components.
- From an existing date/time object.
They work as follows.
### From strings
Date/time data often comes as strings. You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes). Another approach is to use the helpers provided by lubridate. They automatically work out the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange "y", "m", and "d" in the same order. That gives you the name of the lubridate function that will parse your date. For example:
Date/time data often comes as strings.
You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes).
Another approach is to use the helpers provided by lubridate.
They automatically work out the format once you specify the order of the component.
To use them, identify the order in which year, month, and day appear in your dates, then arrange "y", "m", and "d" in the same order.
That gives you the name of the lubridate function that will parse your date.
For example:
```{r}
ymd("2017-01-31")
@ -65,13 +84,16 @@ mdy("January 31st, 2017")
dmy("31-Jan-2017")
```
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. `ymd()` is short and unambiguous:
These functions also take unquoted numbers.
This is the most concise way to create a single date/time object, as you might need when filtering date/time data.
`ymd()` is short and unambiguous:
```{r}
ymd(20170131)
```
`ymd()` and friends create dates. To create a date-time, add an underscore and one or more of "h", "m", and "s" to the name of the parsing function:
`ymd()` and friends create dates.
To create a date-time, add an underscore and one or more of "h", "m", and "s" to the name of the parsing function:
```{r}
ymd_hms("2017-01-31 20:11:59")
@ -86,7 +108,8 @@ ymd(20170131, tz = "UTC")
### From individual components
Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns.
This is what we have in the flights data:
```{r}
flights %>%
@ -101,7 +124,9 @@ flights %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
```
Let's do the same thing for each of the four time columns in `flights`. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
Let's do the same thing for each of the four time columns in `flights`.
The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components.
Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
```{r}
make_datetime_100 <- function(year, month, day, time) {
@ -138,18 +163,21 @@ flights_dt %>%
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```
Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.
Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day.
For dates, 1 means 1 day.
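For instance, a quick sketch of the same data binned by day, first as date-times (binwidth in seconds) and then as dates (using `as_date()`, covered just below):

```{r}
flights_dt %>%
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 86400) # 86400 s = 1 day

flights_dt %>%
  mutate(dep_date = as_date(dep_time)) %>%
  ggplot(aes(dep_date)) +
  geom_freqpoly(binwidth = 1) # 1 = 1 day
```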
### From other types
You may want to switch between a date-time and a date. That's the job of `as_datetime()` and `as_date()`:
You may want to switch between a date-time and a date.
That's the job of `as_datetime()` and `as_date()`:
```{r}
as_datetime(today())
as_date(now())
```
Sometimes you'll get date/times as numeric offsets from the "Unix Epoch", 1970-01-01. If the offset is in seconds, use `as_datetime()`; if it's in days, use `as_date()`.
Sometimes you'll get date/times as numeric offsets from the "Unix Epoch", 1970-01-01.
If the offset is in seconds, use `as_datetime()`; if it's in days, use `as_date()`.
```{r}
as_datetime(60 * 60 * 10)
@ -164,9 +192,10 @@ as_date(365 * 10 + 2)
ymd(c("2010-10-10", "bananas"))
```
1. What does the `tzone` argument to `today()` do? Why is it important?
2. What does the `tzone` argument to `today()` do?
Why is it important?
1. Use the appropriate lubridate function to parse each of the following dates:
3. Use the appropriate lubridate function to parse each of the following dates:
```{r}
d1 <- "January 1, 2010"
@ -178,11 +207,13 @@ as_date(365 * 10 + 2)
## Date-time components
Now that you know how to get date-time data into R's date-time data structures, let's explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.
Now that you know how to get date-time data into R's date-time data structures, let's explore what you can do with them.
This section will focus on the accessor functions that let you get and set individual components.
The next section will look at how arithmetic works with date-times.
### Getting components
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
@ -195,7 +226,8 @@ yday(datetime)
wday(datetime)
```
For `month()` and `wday()` you can set `label = TRUE` to return the abbreviated name of the month or day of the week. Set `abbr = FALSE` to return the full name.
For `month()` and `wday()` you can set `label = TRUE` to return the abbreviated name of the month or day of the week.
Set `abbr = FALSE` to return the full name.
```{r}
month(datetime, label = TRUE)
@ -211,7 +243,8 @@ flights_dt %>%
geom_bar()
```
There's an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
There's an interesting pattern if we look at the average departure delay by minute within the hour.
It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
```{r}
flights_dt %>%
@ -224,7 +257,7 @@ flights_dt %>%
geom_line()
```
Interestingly, if we look at the _scheduled_ departure time we don't see such a strong pattern:
Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
```{r}
sched_dep <- flights_dt %>%
@ -238,7 +271,9 @@ ggplot(sched_dep, aes(minute, avg_delay)) +
geom_line()
```
So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!
So why do we see that pattern with the actual departure times?
Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times.
Always be alert for this sort of pattern whenever you work with data that involves human judgement!
```{r}
ggplot(sched_dep, aes(minute, n)) +
@ -247,7 +282,9 @@ ggplot(sched_dep, aes(minute, n)) +
### Rounding
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with `floor_date()`, `round_date()`, and `ceiling_date()`. Each function takes a vector of dates to adjust and then the name of the unit round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with `floor_date()`, `round_date()`, and `ceiling_date()`.
Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to.
This, for example, allows us to plot the number of flights per week:
```{r}
flights_dt %>%
@ -260,7 +297,7 @@ Computing the difference between a rounded and unrounded date can be particularl
### Setting components
You can also use each accessor function to set the components of a date/time:
```{r}
(datetime <- ymd_hms("2016-07-08 12:34:56"))
@ -273,7 +310,8 @@ hour(datetime) <- hour(datetime) + 1
datetime
```
Alternatively, rather than modifying in place, you can create a new date-time with `update()`. This also allows you to set multiple values at once.
Alternatively, rather than modifying in place, you can create a new date-time with `update()`.
This also allows you to set multiple values at once.
```{r}
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
@ -288,7 +326,7 @@ ymd("2015-02-01") %>%
update(hour = 400)
```
You can use `update()` to show the distribution of flights across the course of the day for every day of the year:
```{r}
flights_dt %>%
@ -301,36 +339,35 @@ Setting larger components of a date to a constant is a powerful technique that a
### Exercises
1. How does the distribution of flight times within a day change over the
course of the year?
1. Compare `dep_time`, `sched_dep_time` and `dep_delay`. Are they consistent?
1. How does the distribution of flight times within a day change over the course of the year?
2. Compare `dep_time`, `sched_dep_time` and `dep_delay`.
Are they consistent?
Explain your findings.
1. Compare `air_time` with the duration between the departure and arrival.
Explain your findings. (Hint: consider the location of the airport.)
1. How does the average delay time change over the course of a day?
Should you use `dep_time` or `sched_dep_time`? Why?
3. Compare `air_time` with the duration between the departure and arrival.
Explain your findings.
(Hint: consider the location of the airport.)
1. On what day of the week should you leave if you want to minimise the
chance of a delay?
4. How does the average delay time change over the course of a day?
Should you use `dep_time` or `sched_dep_time`?
Why?
1. What makes the distribution of `diamonds$carat` and
`flights$sched_dep_time` similar?
5. On what day of the week should you leave if you want to minimise the chance of a delay?
1. Confirm my hypothesis that the early departures of flights in minutes
20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight
was delayed.
6. What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar?
7. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight was delayed.
## Time spans
Next you'll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you'll learn about three important classes that represent time spans:
Next you'll learn about how arithmetic with dates works, including subtraction, addition, and division.
Along the way, you'll learn about three important classes that represent time spans:
* __durations__, which represent an exact number of seconds.
* __periods__, which represent human units like weeks and months.
* __intervals__, which represent a starting and ending point.
- **durations**, which represent an exact number of seconds.
- **periods**, which represent human units like weeks and months.
- **intervals**, which represent a starting and ending point.
### Durations
@ -342,7 +379,8 @@ h_age <- today() - ymd(19791014)
h_age
```
A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the __duration__.
A difftime class object records a time span of seconds, minutes, hours, days, or weeks.
This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the **duration**.
```{r}
as.duration(h_age)
@ -359,7 +397,8 @@ dweeks(3)
dyears(1)
```
Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in day, 7 days in a week, 365 days in a year).
Durations always record the time span in seconds.
Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, 365 days in a year).
You can add and multiply durations:
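For instance, a quick sketch (nothing beyond lubridate itself is assumed):

```{r}
2 * dyears(1)
dyears(1) + dweeks(12) + dhours(15)
```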
@ -384,18 +423,22 @@ one_pm
one_pm + ddays(1)
```
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full days worth of seconds we end up with a different time.
Why is one day after 1pm on March 12, 2pm on March 13?!
If you look carefully at the date you might also notice that the time zones have changed.
Because of DST, March 12 only has 23 hours, so if we add a full day's worth of seconds we end up with a different time.
### Periods
To solve this problem, lubridate provides __periods__. Periods are time spans but don't have a fixed length in seconds, instead they work with "human" times, like days and months. That allows them to work in a more intuitive way:
To solve this problem, lubridate provides **periods**.
Periods are time spans but don't have a fixed length in seconds; instead they work with "human" times, like days and months.
That allows them to work in a more intuitive way:
```{r}
one_pm
one_pm + days(1)
```
Like durations, periods can be created with a number of friendly constructor functions.
```{r}
seconds(15)
@ -414,7 +457,8 @@ You can add and multiply periods:
days(50) + hours(25) + minutes(2)
```
And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:
And of course, add them to dates.
Compared to durations, periods are more likely to do what you expect:
```{r}
# A leap year
@ -426,14 +470,17 @@ one_pm + ddays(1)
one_pm + days(1)
```
Let's use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination _before_ they departed from New York City.
Let's use periods to fix an oddity related to our flight dates.
Some planes appear to have arrived at their destination *before* they departed from New York City.
```{r}
flights_dt %>%
filter(arr_time < dep_time)
```
These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding `days(1)` to the arrival time of each overnight flight.
These are overnight flights.
We used the same date information for both the departure and the arrival times, but these flights arrived on the following day.
We can fix this by adding `days(1)` to the arrival time of each overnight flight.
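For example, a sketch of one way to do this (the `overnight` column below is a helper introduced for illustration):

```{r}
flights_dt <- flights_dt %>%
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
```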
```{r}
flights_dt <- flights_dt %>%
@ -455,13 +502,17 @@ flights_dt %>%
It's obvious what `dyears(1) / ddays(365)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365 days worth of seconds.
What should `years(1) / days(1)` return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There's not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate, with a warning:
What should `years(1) / days(1)` return?
Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366!
There's not quite enough information for lubridate to give a single clear answer.
What it does instead is give an estimate, with a warning:
```{r}
years(1) / days(1)
```
If you want a more accurate measurement, you'll have to use an __interval__. An interval is a duration with a starting point: that makes it precise so you can determine exactly how long it is:
If you want a more accurate measurement, you'll have to use an **interval**.
An interval is a duration with a starting point: that makes it precise so you can determine exactly how long it is:
```{r}
next_year <- today() + years(1)
@ -476,7 +527,9 @@ To find out how many periods fall into an interval, you need to use integer divi
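For example, a sketch (reusing `next_year` from the chunk above):

```{r}
(today() %--% next_year) / ddays(1)
(today() %--% next_year) %/% days(1)
```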
### Summary
How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
How do you pick between duration, periods, and intervals?
As always, pick the simplest data structure that solves your problem.
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
@ -488,25 +541,35 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")
1. Why is there `months()` but no `dmonths()`?
1. Explain `days(overnight * 1)` to someone who has just started
learning R. How does it work?
2. Explain `days(overnight * 1)` to someone who has just started learning R.
How does it work?
1. Create a vector of dates giving the first day of every month in 2015.
Create a vector of dates giving the first day of every month
in the _current_ year.
3. Create a vector of dates giving the first day of every month in 2015.
Create a vector of dates giving the first day of every month in the *current* year.
1. Write a function that given your birthday (as a date), returns
how old you are in years.
4. Write a function that given your birthday (as a date), returns how old you are in years.
1. Why can't `(today() %--% (today() + years(1))) / months(1)` work?
5. Why can't `(today() %--% (today() + years(1))) / months(1)` work?
## Time zones
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
Time zones are an enormously complicated topic because of their interaction with geopolitical entities.
Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
The first challenge is that everyday names of time zones tend to be ambiguous.
For example, if you're American you're probably familiar with EST, or Eastern Standard Time.
However, both Australia and Canada also have EST!
To avoid confusion, R uses the international standard IANA time zones.
These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent).
Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of time zone rules. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that name needs to reflect not only to the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country.
This is because the IANA database has to record decades worth of time zone rules.
In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same.
Another problem is that the name needs to reflect not only the current behaviour, but also the complete history.
For example, there are time zones for both "America/New_York" and "America/Detroit".
These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name.
It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
You can find out what R thinks your current time zone is with `Sys.timezone()`:
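For example (the result depends on your system configuration):

```{r}
Sys.timezone()
```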
@ -523,7 +586,8 @@ length(OlsonNames())
head(OlsonNames())
```
In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:
In R, the time zone is an attribute of the date-time that only controls printing.
For example, these three objects represent the same instant in time:
```{r}
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
@ -538,7 +602,11 @@ x1 - x2
x1 - x3
```
Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:
Unless otherwise specified, lubridate always uses UTC.
UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time).
It does not have DST, which makes it a convenient representation for computation.
Operations that combine date-times, like `c()`, will often drop the time zone.
In that case, the date-times will display in your local time zone:
```{r}
x4 <- c(x1, x2, x3)
@ -547,22 +615,19 @@ x4
You can change the time zone in two ways:
* Keep the instant in time the same, and change how it's displayed.
Use this when the instant is correct, but you want a more natural
display.
- Keep the instant in time the same, and change how it's displayed.
Use this when the instant is correct, but you want a more natural display.
```{r}
x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
x4a - x4
```
(This also illustrates another challenge of times zones: they're not
all integer hour offsets!)
* Change the underlying instant in time. Use this when you have an
instant that has been labelled with the incorrect time zone, and you
need to fix it.
(This also illustrates another challenge of time zones: they're not all integer hour offsets!)
- Change the underlying instant in time.
Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.
```{r}
x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
@ -1,29 +1,27 @@
# (PART) Explore {-}
# (PART) Explore {.unnumbered}
# Introduction {#explore-intro}
The goal of the first part of this book is to get you up to speed with the basic tools of __data exploration__ as quickly as possible. Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
The goal of the first part of this book is to get you up to speed with the basic tools of **data exploration** as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-explore.png")
```
In this part of the book you will learn some useful tools that have an immediate payoff:
* Visualisation is a great place to start with R programming, because the
payoff is so clear: you get to make elegant and informative plots that help
you understand data. In [data visualisation] you'll dive into visualisation,
learning the basic structure of a ggplot2 plot, and powerful techniques for
turning data into plots.
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In [data visualisation] you'll dive into visualisation, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
* Visualisation alone is typically not enough, so in [data transformation]
you'll learn the key verbs that allow you to select important variables,
filter out key observations, create new variables, and compute summaries.
* Finally, in [exploratory data analysis], you'll combine visualisation and
transformation with your curiosity and scepticism to ask and answer
interesting questions about data.
- Visualisation alone is typically not enough, so in [data transformation] you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. We'll come back to it in [modelling](#model-intro), once you're better equipped with more data wrangling and programming tools.
- Finally, in [exploratory data analysis], you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data.
Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow. In [workflow: basics], [workflow: scripts], and [workflow: projects] you'll learn good practices for writing and organising your R code. These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet.
We'll come back to it in [modelling](#model-intro), once you're better equipped with more data wrangling and programming tools.
Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow.
In [workflow: basics], [workflow: scripts], and [workflow: projects] you'll learn good practices for writing and organising your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.
extra/hierarchy.Rmd Normal file
@ -0,0 +1,217 @@
# Hierarchical data {#hierarchy}
## Introduction
This chapter belongs in [wrangle](#wrangle-intro): it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON.
However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration.
Now you have those tools under your belt, you can learn how to work with hierarchical data.
The
As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists.
There are three common sources of such data:
- JSON and XML
-
The map functions apply a function to every element in a list.
They are the most commonly used part of purrr, but not the only part.
Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
- You can extract deeply nested elements in a single call by supplying a character vector to the map functions.
- You can remove a level of the hierarchy with the flatten functions.
- You can flip levels of the hierarchy with the transpose function.
### Prerequisites
This chapter focusses mostly on purrr.
As well as the tools for iteration that you've already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data.
```{r setup}
library(purrr)
```
## Initial exploration
Sometimes you get data structures that are very deeply nested.
A common source of such data is JSON from a web API.
I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`.
Now I'm going to load it into a list with jsonlite.
By default `fromJSON()` tries to be helpful and simplifies the structure a little for you.
Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```
You might be tempted to use `str()` on this data.
Unfortunately, however, `str()` is not designed for lists that are both deep and wide, and you'll tend to get overwhelmed by the output.
A better strategy is to pull the list apart piece by piece.
First, figure out how many elements are in the list, take a look at one, and then check they're all the same structure.
In this case there are eight elements, and the first element is another list.
```{r}
length(issues)
str(issues[[1]])
```
(In this case we got lucky and the structure is (just) simple enough to print out with `str()`.
If you're unlucky, you may need to repeat this procedure.)
```{r}
tibble::tibble(
i = seq_along(issues),
names = issues %>% map(names)
) %>%
tidyr::unnest(names) %>%
table() %>%
t()
```
Another alternative is the **listviewer** package, <https://github.com/timelyportfolio/listviewer>.
## Extracting deeply nested elements
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
```{r}
issues %>% map_int("id")
issues %>% map_lgl("locked")
issues %>% map_chr("state")
```
You can use the same technique to extract more deeply nested structure.
For example, imagine you want to extract the name and id of the user.
You could do that in two steps:
```{r}
users <- issues %>% map("user")
users %>% map_chr("login")
users %>% map_int("id")
```
But by supplying a character *vector* to `map_*`, you can do it in one:
```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
What happens if that path is missing in some of the elements?
For example, lets try and extract the HTML url to the pull request:
```{r, error = TRUE}
issues %>% map_chr(c("pull_request", "html_url"))
```
Unfortunately that doesn't work.
Whenever you see an error from purrr complaining about the "type" of the result, it's because it's trying to shove it into a simple vector (here a character).
You can diagnose the problem more easily if you use `map()`:
```{r}
issues %>% map(c("pull_request", "html_url"))
```
To get the results into a character vector, we need to tell purrr what it should change `NULL` to.
You can do that with the `.null` argument.
The most common value to use is `NA`:
```{r}
issues %>% map_chr(c("pull_request", "html_url"), .null = NA)
```
(You might wonder why that isn't the default value since it's so useful.
Well, if it was the default, you'd never get an error message if you had a typo in the names.
You'd just get a vector of missing values.
That would be annoying to debug because it's a silent failure.)
It's possible to mix position and named indexing by using a list:
```{r}
issues %>% map_chr(list("pull_request", 1), .null = NA)
```
## Removing a level of hierarchy
As well as indexing deeply into hierarchy, it's sometimes useful to flatten it.
That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`.
In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)
y <- flatten(x)
str(y)
flatten_dbl(y)
```
Graphically, that sequence of operations looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
```
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds.
Even if your data structure accidentally changes, `unlist()` will continue to work, silently giving you the wrong type of output.
This tends to create problems that are frustrating to debug.
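For instance, a small sketch of the difference (the `mixed` list below is made up for illustration):

```{r, error = TRUE}
mixed <- list(list(a = 1, b = 2), list(c = 3, d = "oops"))
unlist(mixed)               # silently coerces everything to character
flatten_dbl(flatten(mixed)) # fails loudly instead
```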
## Switching levels in the hierarchy {#transpose}
Other times the hierarchy feels "inside out".
You can use `transpose()` to flip the first and second levels of a list:
```{r}
x <- list(
x = list(a = 1, b = 3, c = 5),
y = list(a = 2, b = 4, c = 6)
)
x %>% str()
x %>% transpose() %>% str()
```
Graphically, this looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-transpose.png")
```
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
It's called transpose by analogy to matrices.
When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`.
It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`.
Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
Transpose is also useful when working with JSON APIs.
Many JSON APIs represent data frames in a row-based format, rather than R's column-based format.
`transpose()` makes it easy to switch between the two:
```{r}
df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
df %>% transpose() %>% str()
```
## Turning lists into data frames
- Have a deeply nested list with missing pieces
- Need a tidy data frame so you can visualise, transform, model etc.
- What do you do?
- By hand with purrr, talk about `fromJSON` and `tidyJSON`
- tidyjson
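As a rough sketch of the "by hand with purrr" idea, reusing the `issues` list and the extractions shown earlier (the choice of columns is arbitrary):

```{r}
tibble::tibble(
  id    = issues %>% map_int("id"),
  state = issues %>% map_chr("state"),
  login = issues %>% map_chr(c("user", "login"))
)
```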
### Exercises
1. Challenge: read all the CSV files in a directory.
Which ones failed and why?
```{r, eval = FALSE}
files <- dir("data", pattern = "\\.csv$")
files %>%
set_names(., basename(.)) %>%
map_df(safely(readr::read_csv), .id = "filename")
```
extra/model-assess.Rmd Normal file
@ -0,0 +1,325 @@
# Model assessment
In this chapter, you'll turn the tools of multiple models towards model assessment: learning how the model performs when given new data.
So far we've focussed on models as tools for description, using models to help us understand the patterns in the data we have collected so far.
But ideally a model will do more than just describe what we have seen so far - it will also help predict what will come next.
In other words, we want a model that doesn't just perform well on the sample, but also accurately summarises the underlying population.
In some industries this is the primary use of models: you spend relatively little time fitting the model compared to how many times you use it.
There are two basic ways that a model can fail with new data:
- You can under- or over-fit the model.
Underfitting is where you fail to model an important trend: you leave too much in the residuals, and not enough in the model.
Overfitting is the opposite: you fit a trend to what is actually random noise: you've put too much in the model and not left enough in the residuals.
Generally overfitting tends to be more of a problem than underfitting.
- The process that generates the data might change.
There's nothing the model can do about this.
You can protect yourself against this to some extent by creating models that you understand and applying your knowledge to the problem.
Are these fundamentals likely to change?
If you have a model that you are going to use again and again for a long time, you need to plan to maintain the model, regularly checking that it still makes sense.
i.e. is the population the same?
<http://research.google.com/pubs/pub43146.html> <http://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/>
The most common problem with a model that causes it to do poorly with new data is overfitting.
Obviously, there's a bit of a problem here: we don't have new data with which to check the model, and even if we did, we'd presumably use it to make the model better in the first place.
One powerful family of approaches can help us get around this problem: resampling.
There are two main resampling techniques that we're going to cover.
- We will use **cross-validation** to assess model quality.
In cross-validation, you split the data into test and training sets.
You fit the model to the training set, and evaluate it on the test set.
This avoids the intrinsic bias of using the same data to both fit the model and assess its quality.
However it introduces a new bias: you're not using all the data to fit the model so it's not going to be quite as good as it could be.
- We will use **bootstrapping** to understand how stable (or how variable) the model is.
If you sample data from the same population multiple times, how much does your model vary?
Instead of going back to collect new data, you can use the best estimate of the population data: the data you've collected so far.
The amazing idea of the bootstrap is that you can resample from the data you already have.
There are lots of high-level helpers to do these resampling methods in R.
We're going to use the tools provided by the modelr package because they are explicit - you'll see exactly what's going on at each step.
<http://topepo.github.io/caret>.
[Applied Predictive Modeling](https://amzn.com/1461468485), by Max Kuhn and Kjell Johnson.
If you're competing in competitions, like Kaggle, that are predominantly about creating good predictions, developing a good strategy for avoiding overfitting is very important.
Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.
There is a closely related family that uses a similar idea: model ensembles.
However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well.
In general, you can think of model ensemble techniques as functions that take a list of models, and return a single model that attempts to take the best part of each.
### Prerequisites
```{r setup, message = FALSE}
# Standard data manipulation and visualisation
library(dplyr)
library(ggplot2)
# Tools for working with models
library(broom)
library(modelr)
library(splines)
# Tools for working with lots of models
library(purrr)
library(tidyr)
```
```{r}
# Options that make your life easier
options(
contrasts = c("contr.treatment", "contr.treatment"),
  na.action = na.exclude
)
```
## Overfitting
Both bootstrapping and cross-validation help us to spot and remedy the problem of **overfitting**, where the model fits the data we've seen so far extremely well, but does a bad job of generalising to new data.
A classic example of overfitting is using a polynomial with too many degrees of freedom.
Bias - variance tradeoff.
Simpler = more biased.
Complex = more variable.
Occam's razor.
```{r}
true_model <- function(x) {
1 + 2 * x + rnorm(length(x), sd = 0.25)
}
df <- tibble(
x = seq(0, 1, length = 20),
y = true_model(x)
)
df %>%
ggplot(aes(x, y)) +
geom_point()
```
We can create a model that fits this data very well:
```{r, message = FALSE}
library(splines)
my_model <- function(df) {
lm(y ~ poly(x, 7), data = df)
}
mod <- my_model(df)
rmse(mod, df)
grid <- df %>%
expand(x = seq_range(x, 50))
preds <- grid %>%
add_predictions(mod, var = "y")
df %>%
ggplot(aes(x, y)) +
geom_line(data = preds) +
geom_point()
```
As we fit progressively more and more complicated models, the model error decreases:
```{r}
fs <- list(
y ~ x,
y ~ poly(x, 2),
y ~ poly(x, 3),
y ~ poly(x, 4),
y ~ poly(x, 5),
y ~ poly(x, 6),
y ~ poly(x, 7)
)
models <- tibble(
n = 1:7,
f = fs,
mod = map(f, lm, data = df),
rmse = map2_dbl(mod, list(df), rmse)
)
models %>%
ggplot(aes(n, rmse)) +
geom_line(colour = "grey70") +
geom_point(size = 3)
```
But do you think this model will do well if we apply it to new data from the same population?
In real life you can't easily go out and recollect your data.
There are two approaches to help you get around this problem.
I'll introduce them briefly here, and then we'll go into more depth in the following sections.
```{r}
boot <- bootstrap(df, 100) %>%
mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot %>%
unnest(pred) %>%
ggplot(aes(x, pred, group = .id)) +
geom_line(alpha = 1/3)
```
It's a little easier to see what's going on if we zoom on the y axis:
```{r}
last_plot() +
coord_cartesian(ylim = c(0, 5))
```
(You might notice that while each individual model varies a lot, the average of all the models seems like it might not be that bad.
That gives rise to a model ensemble technique called model averaging.)
Bootstrapping is a useful tool to help us understand how the model might vary if we'd collected a different sample from the population.
A related technique is cross-validation which allows us to explore the quality of the model.
It works by repeatedly splitting the data into two pieces.
One piece, the training set, is used to fit, and the other piece, the test set, is used to measure the model quality.
The following code generates 100 test-training splits, holding out 20% of the data for testing each time.
We then fit a model to the training set, and evaluate the error on the test set:
```{r}
cv <- crossv_mc(df, 100) %>%
mutate(
mod = map(train, my_model),
rmse = map2_dbl(mod, test, rmse)
)
cv
```
Obviously, a plot is going to help us see the distribution more easily.
I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and testing), and you can see it's very optimistic.
```{r}
cv %>%
ggplot(aes(rmse)) +
geom_ref_line(v = rmse(mod, df)) +
geom_freqpoly(binwidth = 0.2) +
geom_rug()
```
The distribution of errors is highly skewed: there are a few cases which have very high errors.
These represent samples where we ended up with few cases at low or high values of x.
Let's take a look:
```{r}
filter(cv, rmse > 1.5) %>%
unnest(map(train, as.data.frame)) %>%
ggplot(aes(x, .id)) +
geom_point() +
xlim(0, 1)
```
All of the models that fit particularly poorly were fit to samples that either missed the first one or two or the last one or two observations.
Because polynomials shoot off to positive and negative infinity, they give very bad predictions for those values.
Now that we've given you a quick overview and intuition for these techniques, let's dive in more detail.
## Resamples
### Building blocks
Both the bootstrap and cross-validation are built on top of a "resample" object.
In modelr, you can access these low-level tools directly with the `resample_*` functions.
These functions return an object of class "resample", which represents the resample in a memory efficient way.
Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset.
This makes resamples take up much less memory.
```{r}
x <- resample_bootstrap(as_tibble(mtcars))
class(x)
x
```
Most modelling functions call `as.data.frame()` on the `data` argument.
This generates a resampled data frame.
Because it's called automatically you can just pass the object.
```{r}
lm(mpg ~ wt, data = x)
```
If you get a strange error, it's probably because the modelling function doesn't do this, and you need to do it yourself.
You'll also need to do it yourself if you want to `unnest()` the data so you can visualise it.
If you want to just get the rows selected, you can use `as.integer()`.
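A quick sketch, reusing the resample `x` from above:

```{r}
head(as.integer(x))    # just the selected row indices
head(as.data.frame(x)) # the materialised resampled data frame
```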
### Dataframe API
`bootstrap()` and `crossv_mc()` are built on top of these simpler primitives.
They are designed to work naturally in a model exploration environment by returning data frames.
Each row of the data frame represents a single sample.
They return slightly different columns:
- `bootstrap()` returns a data frame with two columns:
```{r}
bootstrap(df, 3)
```
`strap` gives the bootstrap sample dataset, and `.id` assigns a unique identifier to each model (this is often useful for plotting)
- `crossv_mc()` returns a data frame with three columns:
```{r}
crossv_mc(df, 3)
```
`train` contains the data that you should use to fit (train) the model, and `test` contains the data you should use to validate the model.
Together, the test and train columns form an exclusive partition of the full dataset.
## Numeric summaries of model quality
When you start dealing with many models, it's helpful to have some rough way of comparing them so you can spend your time looking at the models that do the best job of capturing important features in the data.
One way to capture the quality of the model is to summarise the distribution of the residuals.
For example, you could look at the quantiles of the absolute residuals.
For this dataset, 25% of predictions are less than \$7,400 away, and 75% are less than \$25,800 away.
That seems like quite a bit of error when predicting someone's income!
```{r}
heights <- tibble::as_tibble(readRDS("data/heights.RDS"))
h <- lm(income ~ height, data = heights)
h
qae(h, heights)
range(heights$income)
```
You might be familiar with the $R^2$.
That's a single number summary that rescales the variance of the residuals to between 0 (very bad) and 1 (very good):
```{r}
rsquare(h, heights)
```
$R^2$ can be interpreted as the amount of variation in the data explained by the model.
Here we're explaining 3% of the total variation - not a lot!
But I don't think worrying about the relative amount of variation explained is that useful; instead I think you need to consider whether the absolute amount of variation explained is useful for your project.
It's called the $R^2$ because for simple models like this, it's just the square of the correlation between the variables:
```{r}
cor(heights$income, heights$height) ^ 2
```
The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data.
As you'll also learn later, it's also a rather optimistic interpretation of the model.
Because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
## Bootstrapping
## Cross-validation
@ -2,13 +2,18 @@
## Introduction
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they're not actually helpful. Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
Historically, factors were much easier to work with than characters.
As a result, many of the functions in base R automatically convert characters to factors.
This means that factors often crop up in places where they're not actually helpful.
Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
### Prerequisites
To work with factors, we'll use the __forcats__ package, which is part of the core tidyverse. It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
To work with factors, we'll use the **forcats** package, which is part of the core tidyverse.
It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
```{r setup, message = FALSE}
library(tidyverse)
@ -16,7 +21,9 @@ library(tidyverse)
### Learning more
If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Hortons paper, [_Wrangling categorical data in R_](https://peerj.com/preprints/3163/). This paper lays out some of the history discussed in [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) and [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper help motivate and scope the forcats package; thanks Amelia & Nick!
If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
## Creating factors
@ -28,20 +35,20 @@ x1 <- c("Dec", "Apr", "Jan", "Mar")
Using a string to record this variable has two problems:
1. There are only twelve possible months, and there's nothing saving you
from typos:
1. There are only twelve possible months, and there's nothing saving you from typos:
```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```
1. It doesn't sort in a useful way:
2. It doesn't sort in a useful way:
```{r}
sort(x1)
```
You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid __levels__:
You can fix both of these problems with a factor.
To create a factor you must start by creating a list of the valid **levels**:
```{r}
month_levels <- c(
@ -77,7 +84,8 @@ If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```
Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:
Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data.
You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:
```{r}
f1 <- factor(x1, levels = unique(x1))
@ -95,7 +103,9 @@ levels(f2)
## General Social Survey
For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
For the rest of this chapter, we're going to focus on `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
```{r}
gss_cat
@ -103,7 +113,8 @@ gss_cat
(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)
When factors are stored in a tibble, you can't see their levels so easily. One way to see them is with `count()`:
When factors are stored in a tibble, you can't see their levels so easily.
One way to see them is with `count()`:
```{r}
gss_cat %>%
@ -117,7 +128,8 @@ ggplot(gss_cat, aes(race)) +
geom_bar()
```
By default, ggplot2 will drop levels that don't have any values. You can force them to display with:
By default, ggplot2 will drop levels that don't have any values.
You can force them to display with:
```{r}
ggplot(gss_cat, aes(race)) +
@ -125,7 +137,8 @@ ggplot(gss_cat, aes(race)) +
scale_x_discrete(drop = FALSE)
```
These levels represent valid values that simply did not occur in this dataset. In dplyr::count() set the `.drop` option to `FALSE`, to show these.
These levels represent valid values that simply did not occur in this dataset.
In `dplyr::count()`, set the `.drop` option to `FALSE` to show these.
```{r}
gss_cat %>%
@ -133,22 +146,26 @@ gss_cat %>%
.drop = FALSE)
```
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
Those operations are described in the sections below.
### Exercise
1. Explore the distribution of `rincome` (reported income). What makes the
default bar chart hard to understand? How could you improve the plot?
1. Explore the distribution of `rincome` (reported income).
What makes the default bar chart hard to understand?
How could you improve the plot?
1. What is the most common `relig` in this survey? What's the most
common `partyid`?
2. What is the most common `relig` in this survey?
What's the most common `partyid`?
1. Which `relig` does `denom` (denomination) apply to? How can you find
out with a table? How can you find out with a visualisation?
3. Which `relig` does `denom` (denomination) apply to?
How can you find out with a table?
How can you find out with a visualisation?
## Modifying factor order
It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
It's often useful to change the order of the factor levels in a visualisation.
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
```{r}
relig_summary <- gss_cat %>%
@ -162,12 +179,13 @@ relig_summary <- gss_cat %>%
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```
It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:
It is difficult to interpret this plot because there's no overall pattern.
We can improve it by reordering the levels of `relig` using `fct_reorder()`.
`fct_reorder()` takes three arguments:
* `f`, the factor whose levels you want to modify.
* `x`, a numeric vector that you want to use to reorder the levels.
* Optionally, `fun`, a function that's used if there are multiple values of
`x` for each value of `f`. The default value is `median`.
- `f`, the factor whose levels you want to modify.
- `x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
```{r}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
@ -176,7 +194,8 @@ ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step.
For example, you could rewrite the plot above as:
```{r, eval = FALSE}
relig_summary %>%
@ -184,6 +203,7 @@ relig_summary %>%
ggplot(aes(tvhours, relig)) +
geom_point()
```
What if we create a similar plot looking at how average age varies across reported income level?
```{r}
@ -198,9 +218,13 @@ rincome_summary <- gss_cat %>%
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```
Here, arbitrarily reordering the levels isn't a good idea! That's because `rincome` already has a principled order that we shouldn't mess with. Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.
Here, arbitrarily reordering the levels isn't a good idea!
That's because `rincome` already has a principled order that we shouldn't mess with.
Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.
However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
However, it does make sense to pull "Not applicable" to the front with the other special levels.
You can use `fct_relevel()`.
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
@ -209,7 +233,9 @@ ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
Why do you think the average age for "Not applicable" is so high?
Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
Another type of reordering is useful when you are colouring the lines on a plot.
`fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
This makes the plot easier to read because the line colours line up with the legend.
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
by_age <- gss_cat %>%
@ -226,7 +252,8 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
labs(colour = "marital")
```
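Written out in full, the pipeline behind a plot like this might look roughly as follows (a sketch, assuming the `age` and `marital` columns of `gss_cat`):

```{r, eval = FALSE}
# A sketch: compute the proportion of each marital status within each age,
# then let fct_reorder2() order the legend by the right-hand ends of the lines.
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```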
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. You may want to combine it with `fct_rev()`.
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
You may want to combine it with `fct_rev()`.
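For example, a sketch with the `marital` factor:

```{r, eval = FALSE}
# A sketch: order marital by frequency, then reverse with fct_rev() so the
# bars run from least to most common, left to right.
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()
```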
```{r}
gss_cat %>%
@ -237,24 +264,27 @@ gss_cat %>%
### Exercises
1. There are some suspiciously high numbers in `tvhours`. Is the mean a good
summary?
1. There are some suspiciously high numbers in `tvhours`.
Is the mean a good summary?
1. For each factor in `gss_cat` identify whether the order of the levels is
arbitrary or principled.
2. For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.
1. Why did moving "Not applicable" to the front of the levels move it to the
bottom of the plot?
3. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
## Modifying factor levels
More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take the `gss_cat$partyid`:
More powerful than changing the orders of the levels is changing their values.
This allows you to clarify labels for publication, and collapse levels for high-level displays.
The most general and powerful tool is `fct_recode()`.
It allows you to recode, or change, the value of each level.
For example, take the `gss_cat$partyid`:
```{r}
gss_cat %>% count(partyid)
```
The levels are terse and inconsistent. Let's tweak them to be longer and use a parallel construction.
The levels are terse and inconsistent.
Let's tweak them to be longer and use a parallel construction.
```{r}
gss_cat %>%
@ -291,7 +321,8 @@ gss_cat %>%
You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`. For each new variable, you can provide a vector of old levels:
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
For each new variable, you can provide a vector of old levels:
```{r}
gss_cat %>%
@ -304,7 +335,8 @@ gss_cat %>%
count(partyid)
```
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:
Sometimes you just want to lump together all the small groups to make a plot or table simpler.
That's the job of `fct_lump()`:
```{r}
gss_cat %>%
@ -312,7 +344,8 @@ gss_cat %>%
count(relig)
```
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed.
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.
In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we've probably over-collapsed.
Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:
@ -325,9 +358,9 @@ gss_cat %>%
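For example (a sketch):

```{r, eval = FALSE}
# A sketch: keep the 10 most common religions and lump the rest into "Other".
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE)
```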
### Exercises
1. How have the proportions of people identifying as Democrat, Republican, and
Independent changed over time?
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
1. How could you collapse `rincome` into a small set of categories?
1. Notice there are 9 groups (excluding other) in the `fct_lump` example above.
Why not 10?
(Hint: type `?fct_lump`, and find the default for the argument `other_level` is "Other".)

View File

@ -1,21 +1,26 @@
# Functions
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Writing a function has three big advantages over using copy-and-paste:
1. You can give a function an evocative name that makes your code easier to
understand.
1. You can give a function an evocative name that makes your code easier to understand.
1. As requirements change, you only need to update code in one place, instead
of many.
2. As requirements change, you only need to update code in one place, instead of many.
1. You eliminate the chance of making incidental mistakes when you copy and
paste (i.e. updating a variable name in one place, but not in another).
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
Writing good functions is a lifetime journey.
Even after using R for many years I still learn new techniques and better ways of approaching old problems.
The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. Youcanmanagewithoutit, but it sure makes things easier to read! As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code.
Good code style is like correct punctuation.
Youcanmanagewithoutit, but it sure makes things easier to read!
As with styles of punctuation, there are many possible variations.
Here we present the style we use in our code, but the most important thing is to be consistent.
### Prerequisites
@ -23,7 +28,9 @@ The focus of this chapter is on writing functions in base R, so you won't need a
## When should you write a function?
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
For example, take a look at this code.
What does it do?
```{r}
df <- tibble::tibble(
@ -43,30 +50,38 @@ df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? I made an error when copying-and-pasting the code for `df$b`: I forgot to change an `a` to a `b`. Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
I made an error when copying-and-pasting the code for `df$b`: I forgot to change an `a` to a `b`.
Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
To write a function you need to first analyse the code. How many inputs does it have?
To write a function you need to first analyse the code.
How many inputs does it have?
```{r, eval = FALSE}
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
This code only has one input: `df$a`. (If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.) To make the inputs more clear, it's a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, so I'll call it `x`:
This code only has one input: `df$a`.
(If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.)
To make the inputs more clear, it's a good idea to rewrite the code using temporary variables with general names.
Here this code only requires a single numeric vector, so I'll call it `x`:
```{r}
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
There is some duplication in this code. We're computing the range of the data three times, so it makes sense to do it in one step:
There is some duplication in this code.
We're computing the range of the data three times, so it makes sense to do it in one step:
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing. Now that I've simplified the code, and checked that it still works, I can turn it into a function:
Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing.
Now that I've simplified the code, and checked that it still works, I can turn it into a function:
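Put together, the function might look like this (a sketch consistent with the snippets above):

```{r, eval = FALSE}
# A sketch: compute the range once, then rescale to lie between 0 and 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```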
```{r}
rescale01 <- function(x) {
@ -78,17 +93,17 @@ rescale01(c(0, 5, 10))
There are three key steps to creating a new function:
1. You need to pick a __name__ for the function. Here I've used `rescale01`
because this function rescales a vector to lie between 0 and 1.
1. You need to pick a **name** for the function.
Here I've used `rescale01` because this function rescales a vector to lie between 0 and 1.
1. You list the inputs, or __arguments__, to the function inside `function`.
Here we have just one argument. If we had more the call would look like
`function(x, y, z)`.
2. You list the inputs, or **arguments**, to the function inside `function`.
Here we have just one argument.
If we had more the call would look like `function(x, y, z)`.
1. You place the code you have developed in the __body__ of the function, a
`{` block that immediately follows `function(...)`.
3. You place the code you have developed in the **body** of the function, a `{` block that immediately follows `function(...)`.
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input. It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input.
It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
At this point it's a good idea to check your function with a few different inputs:
@ -97,7 +112,9 @@ rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
As you write more and more functions you'll eventually want to convert these informal, interactive tests into formal, automated tests. That process is called unit testing. Unfortunately, it's beyond the scope of this book, but you can learn about it in <http://r-pkgs.had.co.nz/tests.html>.
As you write more and more functions you'll eventually want to convert these informal, interactive tests into formal, automated tests.
That process is called unit testing.
Unfortunately, it's beyond the scope of this book, but you can learn about it in <http://r-pkgs.had.co.nz/tests.html>.
We can simplify the original example now that we have a function:
@ -108,9 +125,12 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we're doing the same thing to multiple columns. We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [vectors].
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors.
There is still quite a bit of duplication since we're doing the same thing to multiple columns.
We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [vectors].
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
Another advantage of functions is that if our requirements change, we only need to make the change in one place.
For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
```{r}
x <- c(1:10, Inf)
@ -127,63 +147,68 @@ rescale01 <- function(x) {
rescale01(x)
```
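Because the scaling logic lives in a single function, the fix is a one-line change; a sketch, relying on `range()`'s `finite` argument:

```{r, eval = FALSE}
# A sketch: finite = TRUE makes range() ignore infinite values,
# so Inf and -Inf pass through the rescaling unchanged.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(1:10, Inf))
```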
This is an important part of the "do not repeat yourself" (or DRY) principle. The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
This is an important part of the "do not repeat yourself" (or DRY) principle.
The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
### Exercises
1. Why is `TRUE` not a parameter to `rescale01()`? What would happen if
`x` contained a single missing value, and `na.rm` was `FALSE`?
1. Why is `TRUE` not a parameter to `rescale01()`?
What would happen if `x` contained a single missing value, and `na.rm` was `FALSE`?
1. In the second variant of `rescale01()`, infinite values are left
unchanged. Rewrite `rescale01()` so that `-Inf` is mapped to 0, and
`Inf` is mapped to 1.
2. In the second variant of `rescale01()`, infinite values are left unchanged.
Rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1.
1. Practice turning the following code snippets into functions. Think about
what each function does. What would you call it? How many arguments does it
need? Can you rewrite it to be more expressive or less duplicative?
3. Practice turning the following code snippets into functions.
Think about what each function does.
What would you call it?
How many arguments does it need?
Can you rewrite it to be more expressive or less duplicative?
```{r, eval = FALSE}
mean(is.na(x))
x / sum(x, na.rm = TRUE)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
```
4. Write your own functions to compute the variance and skewness of a numeric vector.
Variance is defined as
$$
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
$$
where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
Skewness is defined as
$$
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
$$
1. Write `both_na()`, a function that takes two vectors of the same length
and returns the number of positions that have an `NA` in both vectors.
5. Write `both_na()`, a function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
6. What do the following functions do?
Why are they useful even though they are so short?
1. What do the following functions do? Why are they useful even though they
are so short?
```{r}
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
```
1. Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo)
to "Little Bunny Foo Foo". There's a lot of duplication in this song.
Extend the initial piping example to recreate the complete song, and use
functions to reduce the duplication.
7. Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to "Little Bunny Foo Foo".
There's a lot of duplication in this song.
Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
## Functions are for humans and computers
It's important to remember that functions are not just for the computer, but are also for humans. R doesn't care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.
It's important to remember that functions are not just for the computer, but are also for humans.
R doesn't care what your function is called, or what comments it contains, but these are important for human readers.
This section discusses some things that you should bear in mind when writing functions that humans can understand.
The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. That's hard! But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
The name of a function is important.
Ideally, the name of your function will be short, but clearly evoke what the function does.
That's hard!
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`). A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine". Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`).
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
```{r, eval = FALSE}
# Too short
@ -197,7 +222,11 @@ impute_missing()
collapse_years()
```
If your function name is composed of multiple words, I recommend using "snake\_case", where each lowercase word is separated by an underscore. camelCase is a popular alternative. It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it. R itself is not very consistent, but there's nothing you can do about that. Make sure you don't fall into the same trap by making your code as consistent as possible.
If your function name is composed of multiple words, I recommend using "snake_case", where each lowercase word is separated by an underscore.
camelCase is a popular alternative.
It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it.
R itself is not very consistent, but there's nothing you can do about that.
Make sure you don't fall into the same trap by making your code as consistent as possible.
```{r, eval = FALSE}
# Never do this!
@ -205,7 +234,9 @@ col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}
```
If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
If you have a family of functions that do similar things, make sure they have consistent names and arguments.
Use a common prefix to indicate that they are connected.
That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
```{r, eval = FALSE}
# Good
@ -221,7 +252,8 @@ text_input()
A good example of this design is the stringr package: if you don't remember exactly which function you need, you can type `str_` and jog your memory.
Where possible, avoid overriding existing functions and variables. It's impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.
Where possible, avoid overriding existing functions and variables.
It's impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.
```{r, eval = FALSE}
# Don't do this!
@ -230,9 +262,17 @@ c <- 10
mean <- function(x) sum(x)
```
Use comments, lines starting with `#`, to explain the "why" of your code. You generally should avoid comments that explain the "what" or the "how". If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear. Do you need to add some intermediate variables with useful names? Do you need to break out a subcomponent of a large function so you can name it? However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative? What else did you try that didn't work? It's a great idea to capture that sort of thinking in a comment.
Use comments, lines starting with `#`, to explain the "why" of your code.
You generally should avoid comments that explain the "what" or the "how".
If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear.
Do you need to add some intermediate variables with useful names?
Do you need to break out a subcomponent of a large function so you can name it?
However, your code can never capture the reasoning behind your decisions: why did you choose this approach instead of an alternative?
What else did you try that didn't work?
It's a great idea to capture that sort of thinking in a comment.
Another important use of comments is to break up your file into easily readable chunks. Use long lines of `-` and `=` to make it easy to spot the breaks.
Another important use of comments is to break up your file into easily readable chunks.
Use long lines of `-` and `=` to make it easy to spot the breaks.
```{r, eval = FALSE}
# Load data --------------------------------------
@ -248,9 +288,8 @@ knitr::include_graphics("screenshots/rstudio-nav.png")
### Exercises
1. Read the source code for each of the following three functions, puzzle out
what they do, and then brainstorm better names.
1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
```{r}
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
@ -263,19 +302,19 @@ knitr::include_graphics("screenshots/rstudio-nav.png")
rep(y, length.out = length(x))
}
```
1. Take a function that you've written recently and spend 5 minutes
brainstorming a better name for it and its arguments.
1. Compare and contrast `rnorm()` and `MASS::mvrnorm()`. How could you make
them more consistent?
1. Make a case for why `norm_r()`, `norm_d()` etc would be better than
`rnorm()`, `dnorm()`. Make a case for the opposite.
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
3. Compare and contrast `rnorm()` and `MASS::mvrnorm()`.
How could you make them more consistent?
4. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
## Conditional execution
An `if` statement allows you to conditionally execute code. It looks like this:
An `if` statement allows you to conditionally execute code.
It looks like this:
```{r, eval = FALSE}
if (condition) {
@ -285,9 +324,11 @@ if (condition) {
}
```
To get help on `if` you need to surround it in backticks: `` ?`if` ``. The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it!
To get help on `if` you need to surround it in backticks: `` ?`if` ``.
The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it!
Here's a simple function that uses an `if` statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
Here's a simple function that uses an `if` statement.
The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
```{r}
has_name <- function(x) {
@ -300,11 +341,14 @@ has_name <- function(x) {
}
```
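One way to write such a function (a sketch, not necessarily the exact definition above):

```{r, eval = FALSE}
# A sketch: if the vector has no names attribute, no element is named;
# otherwise test each name for NA or the empty string.
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}
has_name(c(a = 1, 2, b = 3))
```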
This function takes advantage of the standard return rule: a function returns the last value that it computed. Here that is either one of the two branches of the `if` statement.
This function takes advantage of the standard return rule: a function returns the last value that it computed.
Here that is either one of the two branches of the `if` statement.
### Conditions
The `condition` must evaluate to either `TRUE` or `FALSE`. If it's a vector, you'll get a warning message; if it's an `NA`, you'll get an error. Watch out for these messages in your own code:
The `condition` must evaluate to either `TRUE` or `FALSE`.
If it's a vector, you'll get a warning message; if it's an `NA`, you'll get an error.
Watch out for these messages in your own code:
```{r, error = TRUE}
if (c(TRUE, FALSE)) {}
@ -312,9 +356,17 @@ if (c(TRUE, FALSE)) {}
if (NA) {}
```
You can use `||` (or) and `&&` (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else. As soon as `&&` sees the first `FALSE` it returns `FALSE`. You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values (that's why you use them in `filter()`). If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
You can use `||` (or) and `&&` (and) to combine multiple logical expressions.
These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else.
As soon as `&&` sees the first `FALSE` it returns `FALSE`.
You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values (that's why you use them in `filter()`).
If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
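For example (a sketch):

```{r, eval = FALSE}
# A sketch: collapse a logical vector to a single TRUE/FALSE before if().
x <- c(1, 5, 12)
if (any(x > 10)) {
  message("at least one value is large")
}
if (all(x > 0)) {
  message("every value is positive")
}
```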
Be careful when testing for equality. `==` is vectorised, which means that it's easy to get more than one output. Either check the length is already 1, collapse with `all()` or `any()`, or use the non-vectorised `identical()`. `identical()` is very strict: it always returns either a single `TRUE` or a single `FALSE`, and doesn't coerce types. This means that you need to be careful when comparing integers and doubles:
Be careful when testing for equality.
`==` is vectorised, which means that it's easy to get more than one output.
Either check the length is already 1, collapse with `all()` or `any()`, or use the non-vectorised `identical()`.
`identical()` is very strict: it always returns either a single `TRUE` or a single `FALSE`, and doesn't coerce types.
This means that you need to be careful when comparing integers and doubles:
```{r}
identical(0L, 0)
@ -347,7 +399,9 @@ if (this) {
}
```
But if you end up with a very long series of chained `if` statements, you should consider rewriting. One useful technique is the `switch()` function. It allows you to evaluate selected code based on position or name.
But if you end up with a very long series of chained `if` statements, you should consider rewriting.
One useful technique is the `switch()` function.
It allows you to evaluate selected code based on position or name.
```{r, echo = FALSE}
function(x, y, op) {
@ -361,13 +415,17 @@ function(x, y, op) {
}
```
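A named version of the same idea might look like this (a sketch; `arith()` is a made-up name):

```{r, eval = FALSE}
# A sketch: dispatch on the value of op by name, with a fall-through error.
arith <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}
arith(3, 4, "times")
```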
Another useful function that can often eliminate long chains of `if` statements is `cut()`. It's used to discretise continuous variables.
Another useful function that can often eliminate long chains of `if` statements is `cut()`.
It's used to discretise continuous variables.
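For example (a sketch; the breaks and labels are made up for illustration):

```{r, eval = FALSE}
# A sketch: bucket temperatures into labelled ranges with cut()
# instead of a chain of if/else statements.
temps <- c(-5, 12, 23, 31)
cut(temps,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot")
)
```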
### Code style
Both `if` and `function` should (almost) always be followed by squiggly brackets (`{}`), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
Both `if` and `function` should (almost) always be followed by squiggly brackets (`{}`), and the contents should be indented by two spaces.
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it's followed by `else`. Always indent the code inside curly braces.
An opening curly brace should never go on its own line and should always be followed by a new line.
A closing curly brace should always go on its own line, unless it's followed by `else`.
Always indent the code inside curly braces.
```{r, eval = FALSE}
# Good
@ -400,7 +458,8 @@ y <- 10
x <- if (y < 20) "Too low" else "Too high"
```
I recommend this only for very brief `if` statements. Otherwise, the full form is easier to read:
I recommend this only for very brief `if` statements.
Otherwise, the full form is easier to read:
```{r}
if (y < 20) {
@ -412,21 +471,22 @@ if (y < 20) {
### Exercises
1. What's the difference between `if` and `ifelse()`? Carefully read the help
and construct three examples that illustrate the key differences.
1. What's the difference between `if` and `ifelse()`?
Carefully read the help and construct three examples that illustrate the key differences.
1. Write a greeting function that says "good morning", "good afternoon",
or "good evening", depending on the time of day. (Hint: use a time
argument that defaults to `lubridate::now()`. That will make it
easier to test your function.)
2. Write a greeting function that says "good morning", "good afternoon", or "good evening", depending on the time of day.
(Hint: use a time argument that defaults to `lubridate::now()`.
That will make it easier to test your function.)
1. Implement a `fizzbuzz` function. It takes a single number as input. If
the number is divisible by three, it returns "fizz". If it's divisible by
five it returns "buzz". If it's divisible by three and five, it returns
"fizzbuzz". Otherwise, it returns the number. Make sure you first write
working code before you create the function.
1. How could you use `cut()` to simplify this set of nested if-else statements?
3. Implement a `fizzbuzz` function.
It takes a single number as input.
If the number is divisible by three, it returns "fizz".
If it's divisible by five it returns "buzz".
If it's divisible by three and five, it returns "fizzbuzz".
Otherwise, it returns the number.
Make sure you first write working code before you create the function.
4. How could you use `cut()` to simplify this set of nested if-else statements?
```{r, eval = FALSE}
if (temp <= 0) {
@ -441,14 +501,15 @@ if (y < 20) {
"hot"
}
```
How would you change the call to `cut()` if I'd used `<` instead of `<=`?
What is the other chief advantage of `cut()` for this problem? (Hint:
what happens if you have many values in `temp`?)
What is the other chief advantage of `cut()` for this problem?
(Hint: what happens if you have many values in `temp`?)
1. What happens if you use `switch()` with numeric values?
5. What happens if you use `switch()` with numeric values?
1. What does this `switch()` call do? What happens if `x` is "e"?
6. What does this `switch()` call do?
What happens if `x` is "e"?
```{r, eval = FALSE}
switch(x,
@ -458,25 +519,25 @@ if (y < 20) {
d = "cd"
)
```
Experiment, then carefully read the documentation.
## Function arguments
The arguments to a function typically fall into two broad sets: one set supplies the __data__ to compute on, and the other supplies arguments that control the __details__ of the computation. For example:
The arguments to a function typically fall into two broad sets: one set supplies the **data** to compute on, and the other supplies arguments that control the **details** of the computation.
For example:
* In `log()`, the data is `x`, and the detail is the `base` of the logarithm.
- In `log()`, the data is `x`, and the detail is the `base` of the logarithm.
* In `mean()`, the data is `x`, and the details are how much data to trim
from the ends (`trim`) and how to handle missing values (`na.rm`).
- In `mean()`, the data is `x`, and the details are how much data to trim from the ends (`trim`) and how to handle missing values (`na.rm`).
* In `t.test()`, the data are `x` and `y`, and the details of the test are
`alternative`, `mu`, `paired`, `var.equal`, and `conf.level`.
* In `str_c()` you can supply any number of strings to `...`, and the details
of the concatenation are controlled by `sep` and `collapse`.
Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values. You specify a default value in the same way you call a function with a named argument:
- In `t.test()`, the data are `x` and `y`, and the details of the test are `alternative`, `mu`, `paired`, `var.equal`, and `conf.level`.
- In `str_c()` you can supply any number of strings to `...`, and the details of the concatenation are controlled by `sep` and `collapse`.
Generally, data arguments should come first.
Detail arguments should go on the end, and usually should have default values.
You specify a default value in the same way you call a function with a named argument:
```{r}
# Compute confidence interval around mean using normal approximation
@ -491,9 +552,13 @@ mean_ci(x)
mean_ci(x, conf = 0.99)
```
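A helper like that might be written as follows (a sketch, not necessarily the exact definition used above):

```{r, eval = FALSE}
# A sketch: conf defaults to 0.95 and only needs to be named when overridden.
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
mean_ci(x, conf = 0.99)
```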
The default value should almost always be the most common value. The few exceptions to this rule are to do with safety. For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important. Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
The default value should almost always be the most common value.
The few exceptions to this rule are to do with safety.
For example, it makes sense for `na.rm` to default to `FALSE` because missing values are important.
Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea to silently ignore missing values by default.
When you call a function, you typically omit the names of the data arguments, because they are used so commonly. If you override the default value of a detail argument, you should use the full name:
When you call a function, you typically omit the names of the data arguments, because they are used so commonly.
If you override the default value of a detail argument, you should use the full name:
```{r, eval = FALSE}
# Good
@ -506,7 +571,8 @@ mean(, TRUE, x = c(1:10, NA))
You can refer to an argument by its unique prefix (e.g. `mean(x, n = TRUE)`), but this is generally best avoided given the possibilities for confusion.
Notice that when you call a function, you should place a space around `=` in function calls, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.
Notice that when you call a function, you should place a space around `=` in function calls, and always put a space after a comma, not before (just like in regular English).
Using whitespace makes it easier to skim the function for the important components.
```{r, eval = FALSE}
# Good
@ -518,20 +584,27 @@ average<-mean(feet/12+inches,na.rm=TRUE)
### Choosing names
The names of the arguments are also important. R doesn't care, but the readers of your code (including future-you!) will. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It's worth memorising these:
The names of the arguments are also important.
R doesn't care, but the readers of your code (including future-you!) will.
Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names.
It's worth memorising these:
* `x`, `y`, `z`: vectors.
* `w`: a vector of weights.
* `df`: a data frame.
* `i`, `j`: numeric indices (typically rows and columns).
* `n`: length, or number of rows.
* `p`: number of columns.
- `x`, `y`, `z`: vectors.
- `w`: a vector of weights.
- `df`: a data frame.
- `i`, `j`: numeric indices (typically rows and columns).
- `n`: length, or number of rows.
- `p`: number of columns.
Otherwise, consider matching names of arguments in existing R functions. For example, use `na.rm` to determine if missing values should be removed.
Otherwise, consider matching names of arguments in existing R functions.
For example, use `na.rm` to determine if missing values should be removed.
### Checking values
As you start to write more functions, you'll eventually get to the point where you don't remember exactly how your function works. At this point it's easy to call your function with invalid inputs. To avoid this problem, it's often useful to make constraints explicit. For example, imagine you've written some functions for computing weighted summary statistics:
As you start to write more functions, you'll eventually get to the point where you don't remember exactly how your function works.
At this point it's easy to call your function with invalid inputs.
To avoid this problem, it's often useful to make constraints explicit.
For example, imagine you've written some functions for computing weighted summary statistics:
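A first version with no input checks might look like this (a sketch of `wt_mean()`):

```{r, eval = FALSE}
# A sketch: a weighted mean with no checks on its inputs.
wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
```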
```{r}
wt_mean <- function(x, w) {
@ -552,7 +625,7 @@ What happens if `x` and `w` are not the same length?
wt_mean(1:6, 1:3)
```
In this case, because of R's vector recycling rules, we don't get an error.
It's good practice to check important preconditions, and throw an error (with `stop()`), if they are not true:
@ -565,7 +638,9 @@ wt_mean <- function(x, w) {
}
```
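A version with an explicit precondition might look like this (a sketch):

```{r, eval = FALSE}
# A sketch: check the lengths up front and fail with a clear message.
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(x * w) / sum(w)
}
```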
Be careful not to take this too far. There's a tradeoff between how much time you spend making your function robust, versus how long you spend writing it. For example, if you also added a `na.rm` argument, I probably wouldn't check it carefully:
Be careful not to take this too far.
There's a tradeoff between how much time you spend making your function robust, versus how long you spend writing it.
For example, if you also added a `na.rm` argument, I probably wouldn't check it carefully:
```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
@ -588,7 +663,8 @@ wt_mean <- function(x, w, na.rm = FALSE) {
}
```
This is a lot of extra work for little additional gain. A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
This is a lot of extra work for little additional gain.
A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
```{r, error = TRUE}
wt_mean <- function(x, w, na.rm = FALSE) {
@ -616,9 +692,13 @@ sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
stringr::str_c("a", "b", "c", "d", "e", "f")
```
How do these functions work? They rely on a special argument: `...` (pronounced dot-dot-dot). This special argument captures any number of arguments that aren't otherwise matched.
How do these functions work?
They rely on a special argument: `...` (pronounced dot-dot-dot).
This special argument captures any number of arguments that aren't otherwise matched.
It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around `str_c()`:
It's useful because you can then send those `...` on to another function.
This is a useful catch-all if your function primarily wraps another function.
For example, I commonly create these helper functions that wrap around `str_c()`:
```{r}
commas <- function(...) stringr::str_c(..., collapse = ", ")
@ -632,7 +712,10 @@ rule <- function(..., pad = "-") {
rule("Important output")
```
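One way `rule()` could be written (a sketch, not necessarily the exact definition above):

```{r, eval = FALSE}
# A sketch: ... supplies the title pieces, and pad fills the rest of the line.
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
```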
Here `...` lets me forward on any arguments that I don't want to deal with to `str_c()`. It's a very convenient technique. But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
Here `...` lets me forward on any arguments that I don't want to deal with to `str_c()`.
It's a very convenient technique.
But it does come at a price: any misspelled arguments will not raise an error.
This makes it easy for typos to go unnoticed:
```{r}
x <- c(1, 2)
@ -643,33 +726,41 @@ If you just want to capture the values of the `...`, use `list(...)`.
### Lazy evaluation
Arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called. This is an important property of R as a programming language, but is generally not important when you're writing your own functions for data analysis. You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.
Arguments in R are lazily evaluated: they're not computed until they're needed.
That means if they're never used, they're never called.
This is an important property of R as a programming language, but is generally not important when you're writing your own functions for data analysis.
You can read more about lazy evaluation at <http://adv-r.had.co.nz/Functions.html#lazy-evaluation>.
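A quick way to see laziness in action (a sketch):

```{r, eval = FALSE}
# A sketch: y is never used inside the body, so its default
# (which would throw an error) is never evaluated.
f <- function(x, y = stop("`y` was never supplied")) {
  x * 2
}
f(10)
```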
### Exercises
1. What does `commas(letters, collapse = "-")` do? Why?
1. What does `commas(letters, collapse = "-")` do?
Why?
1. It'd be nice if you could supply multiple characters to the `pad` argument,
e.g. `rule("Title", pad = "-+")`. Why doesn't this currently work? How
could you fix it?
1. What does the `trim` argument to `mean()` do? When might you use it?
2. It'd be nice if you could supply multiple characters to the `pad` argument, e.g. `rule("Title", pad = "-+")`.
Why doesn't this currently work?
How could you fix it?
1. The default value for the `method` argument to `cor()` is
`c("pearson", "kendall", "spearman")`. What does that mean? What
value is used by default?
3. What does the `trim` argument to `mean()` do?
When might you use it?
4. The default value for the `method` argument to `cor()` is `c("pearson", "kendall", "spearman")`.
What does that mean?
What value is used by default?
## Return values
Figuring out what your function should return is usually straightforward: it's why you created the function in the first place! There are two things you should consider when returning a value:
Figuring out what your function should return is usually straightforward: it's why you created the function in the first place!
There are two things you should consider when returning a value:
1. Does returning early make your function easier to read?
1. Does returning early make your function easier to read?
2. Can you make your function pipeable?
2. Can you make your function pipeable?
### Explicit return statements
The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using `return()`. I think it's best to save the use of `return()` to signal that you can return early with a simpler solution. A common reason to do this is because the inputs are empty:
The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using `return()`.
I think it's best to save the use of `return()` to signal that you can return early with a simpler solution.
A common reason to do this is because the inputs are empty:
```{r}
complicated_function <- function(x, y, z) {
@ -682,7 +773,8 @@ complicated_function <- function(x, y, z) {
```
Another reason is that you have an `if` statement with one complex block and one simple block. For example, you might write an if statement like this:
Another reason is that you have an `if` statement with one complex block and one simple block.
For example, you might write an if statement like this:
```{r, eval = FALSE}
f <- function() {
@ -701,7 +793,8 @@ f <- function() {
}
```
But if the first block is very long, by the time you get to the `else`, you've forgotten the `condition`. One way to rewrite it is to use an early return for the simple case:
But if the first block is very long, by the time you get to the `else`, you've forgotten the `condition`.
One way to rewrite it is to use an early return for the simple case:
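For instance (a sketch with a made-up `summarise_scores()` function):

```{r, eval = FALSE}
# A sketch: handle the simple case first and return(), so the long branch
# is not buried inside an else.
summarise_scores <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)
  }

  # imagine many lines of work on non-empty input here
  mean(x, na.rm = TRUE)
}
```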
```{r, eval = FALSE}
@ -725,9 +818,16 @@ This tends to make the code easier to understand, because you don't need quite s
### Writing pipeable functions
If you want to write your own pipeable functions, it's important to think about the return value. Knowing the return value's object type will mean that your pipeline will "just work". For example, with dplyr and tidyr the object type is the data frame.
If you want to write your own pipeable functions, it's important to think about the return value.
Knowing the return value's object type will mean that your pipeline will "just work".
For example, with dplyr and tidyr the object type is the data frame.
There are two basic types of pipeable functions: transformations and side-effects. With __transformations__, an object is passed to the function's first argument and a modified object is returned. With __side-effects__, the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file. Side-effects functions should "invisibly" return the first argument, so that while they're not printed they can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame:
There are two basic types of pipeable functions: transformations and side-effects.
With **transformations**, an object is passed to the function's first argument and a modified object is returned.
With **side-effects**, the passed object is not transformed.
Instead, the function performs an action on the object, like drawing a plot or saving a file.
Side-effects functions should "invisibly" return the first argument, so that while they're not printed they can still be used in a pipeline.
For example, this simple function prints the number of missing values in a data frame:
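Such a function might look like this (a sketch of `show_missings()`):

```{r, eval = FALSE}
# A sketch: report the count as a side-effect, then invisibly return the
# input so the function can sit in the middle of a pipe.
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)
}
```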
```{r}
show_missings <- function(df) {
@ -757,6 +857,7 @@ And we can still use it in a pipe:
```{r, include = FALSE}
library(dplyr)
```
```{r}
mtcars %>%
show_missings() %>%
@ -766,7 +867,11 @@ mtcars %>%
## Environment
The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work. The environment of a function controls how R finds the value associated with a name. For example, take this function:
The last component of a function is its environment.
This is not something you need to understand deeply when you first start writing functions.
However, it's important to know a little bit about environments because they are crucial to how functions work.
The environment of a function controls how R finds the value associated with a name.
For example, take this function:
```{r}
f <- function(x) {
@ -774,7 +879,9 @@ f <- function(x) {
}
```
In many programming languages, this would be an error, because `y` is not defined inside the function. In R, this is valid code because R uses rules called __lexical scoping__ to find the value associated with a name. Since `y` is not defined inside the function, R will look in the __environment__ where the function was defined:
In many programming languages, this would be an error, because `y` is not defined inside the function.
In R, this is valid code because R uses rules called **lexical scoping** to find the value associated with a name.
Since `y` is not defined inside the function, R will look in the **environment** where the function was defined:
```{r}
y <- 100
@ -784,9 +891,12 @@ y <- 1000
f(10)
```
This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn't cause too many problems (especially if you regularly restart R to get to a clean slate).
The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`. This allows you to do devious things like:
The advantage of this behaviour is that from a language standpoint it allows R to be very consistent.
Every name is looked up using the same set of rules.
For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
This allows you to do devious things like:
```{r}
`+` <- function(x, y) {
@ -800,4 +910,9 @@ table(replicate(1000, 1 + 2))
rm(`+`)
```
This is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make best use of this flexibility is beyond the scope of this book, but you can read about it in [_Advanced R_](http://adv-r.had.co.nz).
This is a common phenomenon in R.
R places few limits on your power.
You can do many things that you can't do in other programming languages.
You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!).
But this power and flexibility is what makes tools like ggplot2 and dplyr possible.
Learning how to make best use of this flexibility is beyond the scope of this book, but you can read about it in [*Advanced R*](http://adv-r.had.co.nz).

View File

@ -1,188 +0,0 @@
# Hierarchical data {#hierarchy}
## Introduction
This chapter belongs in [wrangle](#wrangle-intro): it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON. However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration. Now you have those tools under your belt, you can learn how to work with hierarchical data.
The
As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists. There are three common sources of such data:
* JSON and XML
*
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
* You can extract deeply nested elements in a single call by supplying
a character vector to the map functions.
* You can remove a level of the hierarchy with the flatten functions.
* You can flip levels of the hierarchy with the transpose function.
### Prerequisites
This chapter focusses mostly on purrr. As well as the tools for iteration that you've already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data.
```{r setup}
library(purrr)
```
## Initial exploration
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```
You might be tempted to use `str()` on this data. Unfortunately, however, `str()` is not designed for lists that are both deep and wide, and you'll tend to get overwhelmed by the output. A better strategy is to pull the list apart piece by piece.
First, figure out how many elements are in the list, take a look at one, and then check they're all the same structure. In this case there are eight elements, and the first element is another list.
```{r}
length(issues)
str(issues[[1]])
```
(In this case we got lucky and the structure is (just) simple enough to print out with `str()`. If you're unlucky, you may need to repeat this procedure.)
```{r}
tibble::tibble(
i = seq_along(issues),
names = issues %>% map(names)
) %>%
tidyr::unnest(names) %>%
table() %>%
t()
```
Another alternative is the __listviewer__ package, <https://github.com/timelyportfolio/listviewer>.
## Extracting deeply nested elements
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
```{r}
issues %>% map_int("id")
issues %>% map_lgl("locked")
issues %>% map_chr("state")
```
You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:
```{r}
users <- issues %>% map("user")
users %>% map_chr("login")
users %>% map_int("id")
```
But by supplying a character _vector_ to `map_*`, you can do it in one:
```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
What happens if that path is missing in some of the elements? For example, let's try to extract the HTML URL to the pull request:
```{r, error = TRUE}
issues %>% map_chr(c("pull_request", "html_url"))
```
Unfortunately that doesn't work. Whenever you see an error from purrr complaining about the "type" of the result, it's because purrr is trying to coerce the results into a simple atomic vector (here a character vector) and some of the elements are missing. You can diagnose the problem more easily if you use `map()`:
```{r}
issues %>% map(c("pull_request", "html_url"))
```
To get the results into a character vector, we need to tell purrr what it should change `NULL` to. You can do that with the `.null` argument. The most common value to use is `NA`:
```{r}
issues %>% map_chr(c("pull_request", "html_url"), .null = NA)
```
(You might wonder why that isn't the default value, since it's so useful. Well, if it were the default, you'd never get an error message if you had a typo in the names. You'd just get a vector of missing values, which would be annoying to debug because it's a silent failure.)
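For example, with a hypothetical typo (`"logn"` instead of `"login"`), the two behaviours would look something like this:
```{r, error = TRUE}
# Without .null, the typo fails loudly...
issues %>% map_chr(c("user", "logn"))
# ...but with .null = NA it silently returns missing values
issues %>% map_chr(c("user", "logn"), .null = NA)
```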
It's possible to mix positional and named indexing by supplying a list:
```{r}
issues %>% map_chr(list("pull_request", 1), .null = NA)
```
## Removing a level of hierarchy
As well as indexing deeply into a hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, and then to a double vector.
```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)
y <- flatten(x)
str(y)
flatten_dbl(y)
```
Graphically, that sequence of operations looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
```
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will keep working, silently returning the wrong type of output. This tends to create problems that are frustrating to debug.
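For example (a small illustration, not from the original code), if a list that should contain only numbers accidentally picks up a string, the two functions behave very differently:
```{r, error = TRUE}
z <- list(1, 2, "oops")  # the third element is accidentally a string
unlist(z)                # succeeds silently, coercing everything to character
flatten_dbl(z)           # fails loudly, so you notice the problem immediately
```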
## Switching levels in the hierarchy {#transpose}
Other times the hierarchy feels "inside out". You can use `transpose()` to flip the first and second levels of a list:
```{r}
x <- list(
x = list(a = 1, b = 3, c = 5),
y = list(a = 2, b = 4, c = 6)
)
x %>% str()
x %>% transpose() %>% str()
```
Graphically, this looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-transpose.png")
```
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
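You can verify both identities with the toy list `x` defined above:
```{r}
# x[[i]][[j]] matches transpose(x)[[j]][[i]]
identical(x[[1]][[2]], transpose(x)[[2]][[1]])
# Transposing twice returns the original list
identical(transpose(transpose(x)), x)
```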
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
```{r}
df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
df %>% transpose() %>% str()
```
## Turning lists into data frames
* Have a deeply nested list with missing pieces.
* Need a tidy data frame so you can visualise, transform, model, etc.
* What do you do?
* By hand with purrr (see the sketch below); talk about `fromJSON()` and tidyjson.
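One possible sketch of the by-hand approach, using the `issues` list from earlier in this chapter (the name `issues_df` and the choice of columns are just for illustration):
```{r, eval = FALSE}
# Extract one vector per column of interest, then assemble a tibble
issues_df <- tibble::tibble(
  id     = issues %>% map_int("id"),
  state  = issues %>% map_chr("state"),
  user   = issues %>% map_chr(c("user", "login")),
  pr_url = issues %>% map_chr(c("pull_request", "html_url"), .null = NA)
)
issues_df
```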
### Exercises
1. Challenge: read all the CSV files in a directory. Which ones failed and why?
```{r, eval = FALSE}
# List the csv files in data/, keeping the full path so they can be read
files <- dir("data", pattern = "\\.csv$", full.names = TRUE)
# Read each file, capturing failures with safely() instead of stopping
results <- files %>%
  set_names(., basename(.)) %>%
  map(safely(readr::read_csv)) %>%
  transpose()
# Which ones failed, and why?
results$error %>% compact()
```
View File
@ -2,11 +2,14 @@
## Introduction
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data. We'll finish with a few pointers to packages that are useful for other types of data.
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
### Prerequisites
In this chapter, you'll learn how to load flat files in R with the __readr__ package, which is part of the core tidyverse.
In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
```{r setup, message = FALSE}
library(tidyverse)
@ -16,21 +19,18 @@ library(tidyverse)
Most of readr's functions are concerned with turning flat files into data frames:
* `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon
separated files (common in countries where `,` is used as the decimal place),
`read_tsv()` reads tab delimited files, and `read_delim()` reads in files
with any delimiter.
- `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal place), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.
* `read_fwf()` reads fixed width files. You can specify fields either by their
widths with `fwf_widths()` or their position with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns
are separated by white space.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns are separated by white space.
* `read_log()` reads Apache style log files. (But also check out
[webreadr](https://github.com/Ironholds/webreadr) which is built on top
of `read_log()` and provides many more helpful tools.)
- `read_log()` reads Apache style log files.
(But also check out [webreadr](https://github.com/Ironholds/webreadr) which is built on top of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
These functions all have similar syntax: once you've mastered one, you can use the others with ease.
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
The first argument to `read_csv()` is the most important: it's the path to the file to read.
@ -38,9 +38,11 @@ The first argument to `read_csv()` is the most important: it's the path to the f
heights <- read_csv("data/heights.csv")
```
When you run `read_csv()` it prints out a column specification that gives the name and type of each column. That's an important part of readr, which we'll come back to in [parsing a file].
When you run `read_csv()` it prints out a column specification that gives the name and type of each column.
That's an important part of readr, which we'll come back to in [parsing a file].
You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
```{r}
read_csv("a,b,c
@ -48,37 +50,34 @@ read_csv("a,b,c
4,5,6")
```
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behaviour:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
1. Sometimes there are a few lines of metadata at the top of the file. You can
use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop
all lines that start with (e.g.) `#`.
```{r}
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
```
1. The data might not have column names. You can use `col_names = FALSE` to
tell `read_csv()` not to treat the first row as headings, and instead
label them sequentially from `X1` to `Xn`:
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
```{r}
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more
about it and other types of string escape in [string basics].)
Alternatively you can pass `col_names` a character vector which will be
used as the column names:
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [string basics].)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
```{r}
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
@ -89,48 +88,46 @@ Another option that commonly needs tweaking is `na`: this specifies the value (o
read_csv("a,b,c\n1,2,.", na = ".")
```
This is all you need to know to read ~75% of CSV files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`. There are a few good reasons to favour readr functions over the base equivalents:
If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favour readr functions over the base equivalents:
* They are typically much faster (~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`. It doesn't fit
quite so well into the tidyverse, but it can be quite a bit faster.
- They are typically much faster (\~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`.
It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.
* They produce tibbles, they don't convert character vectors to factors,
use row names, or munge the column names. These are common sources of
frustration with the base R functions.
- They produce tibbles, they don't convert character vectors to factors, use row names, or munge the column names.
These are common sources of frustration with the base R functions.
* They are more reproducible. Base R functions inherit some behaviour from
your operating system and environment variables, so import code that works
on your computer might not work on someone else's.
- They are more reproducible.
Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
### Exercises
1. What function would you use to read a file where fields were separated with
"|"?
1. Apart from `file`, `skip`, and `comment`, what other arguments do
`read_csv()` and `read_tsv()` have in common?
1. What are the most important arguments to `read_fwf()`?
1. Sometimes strings in a CSV file contain commas. To prevent them from
causing problems they need to be surrounded by a quoting character, like
`"` or `'`. By default, `read_csv()` assumes that the quoting
character will be `"`. What argument to `read_csv()` do you need to specify
to read the following text into a data frame?
1. What function would you use to read a file where fields were separated with\
"\|"?
2. Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
3. What are the most important arguments to `read_fwf()`?
4. Sometimes strings in a CSV file contain commas.
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
What argument to `read_csv()` do you need to specify to read the following text into a data frame?
```{r, eval = FALSE}
"x,y\n1,'a,b'"
```
1. Identify what is wrong with each of the following inline CSV files.
5. Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
```{r, eval = FALSE}
read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
@ -141,7 +138,8 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
## Parsing a vector
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions. These functions take a character vector and return a more specialised vector like a logical, integer, or date:
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
@ -149,7 +147,8 @@ str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
```
These functions are useful in their own right, but are also an important building block for readr. Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
@ -169,32 +168,29 @@ And the failures will be missing in the output:
x
```
If there are many parsing failures, you'll need to use `problems()` to get the complete set. This returns a tibble, which you can then manipulate with dplyr.
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
```{r}
problems(x)
```
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input. There are eight particularly important parsers:
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are eight particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers
respectively. There's basically nothing that can go wrong with these
parsers so I won't describe them here further.
1. `parse_double()` is a strict numeric parser, and `parse_number()`
is a flexible numeric parser. These are more complicated than you might
expect because different parts of the world write numbers in different
ways.
1. `parse_character()` seems so simple that it shouldn't be necessary. But
one complication makes it quite important: character encodings.
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
1. `parse_factor()` create factors, the data structure that R uses to represent
categorical variables with fixed and known values.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
1. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to
parse various date & time specifications. These are the most complicated
because there are so many different ways of writing dates.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
@ -202,26 +198,28 @@ The following sections describe these parsers in more detail.
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional
parts of a real number, while others use `,`.
1. Numbers are often surrounded by other characters that provide some
context, like "$1000" or "10%".
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
1. Numbers often contain "grouping" characters to make them easier to read,
like "1,000,000", and these grouping characters vary around the world.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
```{r}
parse_number("$100")
@ -244,17 +242,31 @@ parse_number("123'456'789", locale = locale(grouping_mark = "'"))
### Strings {#readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using `charToRaw()`:
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it's the __American__ Standard Code for Information Interchange.
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding. For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages). In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but will fail for data produced by older systems that don't understand UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you'll get complete gibberish. For example:
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
@ -271,7 +283,11 @@ parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding? If you're lucky, it'll be included somewhere in the data documentation. Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out. It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start. Expect to try a few different encodings before you find the right one.
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
guess_encoding(charToRaw(x1))
@ -280,90 +296,104 @@ guess_encoding(charToRaw(x2))
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and I've only scratched the surface here. If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#readr-factors}
R uses factors to represent categorical variables that have a known set of possible values. Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
```{r}
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
```
But if you have many problematic entries, it's often easier to leave as character vectors and then use the tools you'll learn about in [strings] and [factors] to clean them up.
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
* `parse_datetime()` expects an ISO8601 date-time. ISO8601 is an
international standard in which the components of a date are
organised from biggest to smallest: year, month, day, hour, minute,
second.
```{r}
parse_datetime("2010-10-01T2010")
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
```
This is the most important date/time standard, and if you work with
dates and times frequently, I recommend reading
<https://en.wikipedia.org/wiki/ISO_8601>
* `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-`
or `/`, then the day:
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <https://en.wikipedia.org/wiki/ISO_8601>
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
```{r}
parse_date("2010-10-01")
```
* `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds,
and an optional am/pm specifier:
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
```{r}
library(hms)
parse_time("01:10 am")
parse_time("20:10:01")
```
Base R doesn't have a great built in class for time data, so we use
the one provided in the hms package.
Base R doesn't have a great built in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.
Month
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
Day
: `%d` (2 digits).
: `%e` (optional leading space).
: `%d` (2 digits).
: `%e` (optional leading space).
Time
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware of abbreviations:
if you're American, note that "EST" is a Canadian time zone that does not
have daylight savings time. It is _not_ Eastern Standard Time! We'll
come back to this [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
It is *not* Eastern Standard Time!
We'll come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
Non-digits
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions. For example:
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
```{r}
parse_date("01/02/15", "%m/%d/%y")
@ -371,7 +401,8 @@ parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
If you're using `%b` or `%B` with non-English month names, you'll need to set the `lang` argument to `locale()`. See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument to `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
@ -379,29 +410,26 @@ parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
### Exercises
1. What are the most important arguments to `locale()`?
1. What are the most important arguments to `locale()`?
1. What happens if you try and set `decimal_mark` and `grouping_mark`
to the same character? What happens to the default value of
`grouping_mark` when you set `decimal_mark` to ","? What happens
to the default value of `decimal_mark` when you set the `grouping_mark`
to "."?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
1. I didn't discuss the `date_format` and `time_format` options to
`locale()`. What do they do? Construct an example that shows when
they might be useful.
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
1. If you live outside the US, create a new locale object that encapsulates
the settings for the types of file you read most commonly.
1. What's the difference between `read_csv()` and `read_csv2()`?
1. What are the most common encodings used in Europe? What are the
most common encodings used in Asia? Do some googling to find out.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
1. Generate the correct format string to parse each of the following
dates and times:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
@ -414,14 +442,16 @@ parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file. There are two new things that you'll learn about in this section:
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
1. How to override the default specification.
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess a type for each one.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
```{r}
guess_parser("2010-10-01")
@ -435,28 +465,26 @@ str(parse_guess("2010-10-10"))
The heuristic tries each of the following types, stopping when it finds a match:
* logical: contains only "F", "T", "FALSE", or "TRUE".
* integer: contains only numeric characters (and `-`).
* double: contains only valid doubles (including numbers like `4.5e-5`).
* number: contains valid doubles with the grouping mark inside.
* time: matches the default `time_format`.
* date: matches the default `date_format`.
* date-time: any ISO8601 date.
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
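For instance, you could probe the heuristic directly with a few quick calls (a small illustration; the exact guesses depend on your version of readr and your locale):
```{r}
guess_parser("TRUE")
guess_parser("15:01")
guess_parser(c("1,000", "2,000"))
guess_parser("banana")  # no rule matches, so it stays a character
```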
### Problems
These defaults don't always work for larger files. There are two basic problems:
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses
a type that is not sufficiently general. For example, you might have
a column of doubles that only contains integers in the first 1000 rows.
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
1. The column might contain a lot of missing values. If the first 1000
rows contain only `NA`s, readr will guess that it's a logical
vector, whereas you probably want to parse it as something more
specific.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
@ -466,19 +494,23 @@ challenge <- read_csv(readr_example("challenge.csv"))
(Note the use of `readr_example()`, which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
```{r}
problems(challenge)
```
A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the `y` column. If we look at the last few rows, you'll see that they're dates stored in a character vector:
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If you look at the last few rows, you'll see that they're dates stored in a character vector:
```{r}
tail(challenge)
```
That suggests we need to use a date parser instead. To fix the call, start by copying and pasting the column specification into your original call:
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
challenge <- read_csv(
@ -503,34 +535,34 @@ challenge <- read_csv(
tail(challenge)
```
Every `parse_xyz()` function has a corresponding `col_xyz()` function. You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr. This ensures that you have a consistent and reproducible data import script. If you rely on the default guesses and your data changes, readr will continue to read it in. If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
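A minimal sketch of that stricter workflow (reusing the `challenge` file from above; not evaluated here because with the default guesses it would deliberately error):
```{r, eval = FALSE}
challenge <- read_csv(readr_example("challenge.csv"))
# Throws an error if any value in any column failed to parse
stop_for_problems(challenge)
```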
### Other strategies
There are a few other general strategies to help you parse files:
* In the previous example, we just got unlucky: if we look at just
one more row than the default, we can correctly parse in one shot:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
challenge2
```
* Sometimes it's easier to diagnose problems if you just read in all
the columns as character vectors:
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
```
This is particularly useful in conjunction with `type_convert()`,
which applies the parsing heuristics to the character columns in a data
frame.
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
```{r}
df <- tribble(
@ -540,33 +572,30 @@ There are a few other general strategies to help you parse files:
"3", "4.56"
)
df
# Note the column types
type_convert(df)
```
* If you're reading a very large file, you might want to set `n_max` to
a smallish number like 10,000 or 100,000. That will accelerate your
iterations while you eliminate common problems.
* If you're having major parsing problems, sometimes it's easier
to just read into a character vector of lines with `read_lines()`,
or even a character vector of length 1 with `read_file()`. Then you
can use the string parsing skills you'll learn later to parse
more exotic formats.
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems.
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
## Writing to a file
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`. Both functions increase the chances of the output file being read back in correctly by:
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:
* Always encoding strings in UTF-8.
* Saving dates and date-times in ISO8601 format so they are easily
parsed elsewhere.
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it). You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
```{r, eval = FALSE}
write_csv(challenge, "challenge.csv")
@ -580,20 +609,19 @@ write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
```
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in. There are two alternatives:
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load the data back in.
There are two alternatives:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base functions `readRDS()` and `saveRDS()`.
These store data in R's custom binary format called RDS:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base
functions `readRDS()` and `saveRDS()`. These store data in R's custom
binary format called RDS:
```{r}
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
```
1. The feather package implements a fast binary file format that can
be shared across programming languages:
2. The feather package implements a fast binary file format that can be shared across programming languages:
```{r, eval = FALSE}
library(feather)
write_feather(challenge, "challenge.feather")
@ -610,7 +638,8 @@ This makes CSVs a little unreliable for caching interim results---you need to re
#> # ... with 1,994 more rows
```
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you'll learn about in [many models]); feather currently does not.
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [many models]); feather currently does not.
```{r, include = FALSE}
file.remove("challenge-2.csv")
@ -619,16 +648,17 @@ file.remove("challenge.rds")
## Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They're certainly not perfect, but they are a good place to start. For rectangular data:
To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:
* __haven__ reads SPSS, Stata, and SAS files.
- **haven** reads SPSS, Stata, and SAS files.
* __readxl__ reads excel files (both `.xls` and `.xlsx`).
- **readxl** reads excel files (both `.xls` and `.xlsx`).
* __DBI__, along with a database specific backend (e.g. __RMySQL__,
__RSQLite__, __RPostgreSQL__ etc) allows you to run SQL queries against a
database and return a data frame.
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for json, and __xml2__ for XML. Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [__rio__](https://github.com/leeper/rio) package.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
View File
@ -11,16 +11,23 @@ site: bookdown::bookdown_site
documentclass: book
---
# Welcome {-}
# Welcome {.unnumbered}
<a href="http://amzn.to/2aHLAQ1"><img src="cover.png" width="250" height="375" alt="Buy from amazon" class="cover" /></a> This is the website for the work-in-progress 2nd edition of __"R for Data Science"__. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data.
<a href="http://amzn.to/2aHLAQ1"><img src="cover.png" alt="Buy from amazon" class="cover" width="250" height="375"/></a> This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
In this book, you will find a practicum of skills for data science.
Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides.
These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R.
You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time.
You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data.
This website is (and will always be) __free to use__, and is licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0](http://creativecommons.org/licenses/by-nc-nd/3.0/us/) License. If you'd like a __physical copy__ of the book, you can order it from [amazon](http://amzn.to/2aHLAQ1); it was published by O'Reilly in January 2017. If you'd like to __give back__
please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered native NZ parrot; there are only 213 left.
This website is (and will always be) **free to use**, and is licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0](http://creativecommons.org/licenses/by-nc-nd/3.0/us/) License.
If you'd like a **physical copy** of the book, you can order it from [amazon](http://amzn.to/2aHLAQ1); it was published by O'Reilly in January 2017.
If you'd like to **give back** please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered native NZ parrot; there are only 213 left.
Please note that R4DS uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this book, you agree to abide by its terms.
Please note that R4DS uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
By contributing to this book, you agree to abide by its terms.
## Acknowledgements {-}
## Acknowledgements {.unnumbered}
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)

334
intro.Rmd
View File
# Introduction
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science.
After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
## What you will learn
Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. Our model of the tools needed in a typical data science project looks something like this:
Data science is a huge field, and there's no way you can master it by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools.
Our model of the tools needed in a typical data science project looks something like this:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science.png")
```
First you must __import__ your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!
First you must **import** your data into R.
This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to __tidy__ it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
Once you've imported your data, it is a good idea to **tidy** it.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
In brief, when your data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
Once you have tidy data, a common first step is to __transform__ it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called __wrangling__, because getting your data in a form that's natural to work with often feels like a fight!
Once you have tidy data, a common first step is to **transform** it.
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling.
These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
__Visualisation__ is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you're asking the wrong question, or you need to collect different data. Visualisations can surprise you, but don't scale particularly well because they require a human to interpret them.
**Visualisation** is a fundamentally human activity.
A good visualisation will show you things that you did not expect, or raise new questions about the data.
A good visualisation might also hint that you're asking the wrong question, or you need to collect different data.
Visualisations can surprise you, but don't scale particularly well because they require a human to interpret them.
__Models__ are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
**Models** are complementary tools to visualisation.
Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well.
Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains!
But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
That means a model cannot fundamentally surprise you.
The last step of data science is **communication**, an absolutely critical part of any data analysis project.
It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
Surrounding all these tools is **programming**.
Programming is a cross-cutting tool that you use in every part of the project.
You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
Throughout this book we'll point you to resources where you can learn more.
## How this book is organised
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
In our experience, however, this is not the best way to learn them:
- Starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
- Some topics are best explained with other tools.
For example, we believe that it's easier to understand how models work if you already know about visualisation, tidy data, and programming.
- Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems.
We'll give you a selection of programming tools in the middle of the book, and then you'll see how they can combine with the data science tools to tackle interesting modelling problems.
Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
## What you won't learn
There are some important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible.
That means this book can't cover every important topic.
### Big data
This book proudly focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data.
If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface which makes it harder to learn since it offers fewer linguistic cues.
But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
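To give a flavour of the trade-off, here is the same summary written both ways. This is only an illustrative sketch: it assumes a data frame `flights` (with `dest` and `dep_delay` columns) and an equivalent data.table `flights_dt`, neither of which has been introduced yet.

```{r, eval = FALSE}
# dplyr: more verbose, but every step is spelled out
flights %>%
  group_by(dest) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

# data.table: very concise, with fewer linguistic cues
flights_dt[, .(delay = mean(dep_delay, na.rm = TRUE)), by = dest]
```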
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise.
While the complete data might be big, often the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
Another possibility is that your big data problem is actually a large number of small data problems.
Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
That would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer the question for a single subset using the tools described in this book, you learn new tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
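As a rough sketch of the "many small problems" idea (the data frame `df` and the columns `person_id`, `outcome`, and `predictor` are hypothetical):

```{r, eval = FALSE}
# Split one large data frame into many small, independent pieces,
# then fit the same model to each piece
models <- df %>%
  split(df$person_id) %>%
  lapply(function(piece) lm(outcome ~ predictor, data = piece))
```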
### Python, Julia, and friends
In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science.
This isn't because we think these tools are bad.
They're not!
And in practice, most data science teams use a mix of languages, often at least R and Python.
However, we strongly believe that it's best to master one tool at a time.
You will get better faster if you dive deep, rather than spreading yourself thinly over many topics.
This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language, but it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
### Non-rectangular data
This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation.
There are lots of datasets that do not naturally fit in this paradigm, including images, sounds, trees, and text.
But rectangular data frames are extremely common in science and industry, and we believe that they are a great place to start your data science journey.
### Hypothesis confirmation
It's possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis).
The focus of this book is unabashedly on hypothesis generation, or data exploration.
Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does.
You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
The complement of hypothesis generation is hypothesis confirmation.
Hypothesis confirmation is hard for two reasons:
1. You need a precise mathematical model in order to generate falsifiable predictions.
This often requires considerable statistical sophistication.
2. You can only use an observation once to confirm a hypothesis.
As soon as you use it more than once you're back to doing exploratory analysis.
This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.
We'll talk a little about some strategies you can use to make this easier in [modelling](#model-intro).
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation.
But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation.
The key difference is how often do you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
## Prerequisites
We've made a few assumptions about what you already know in order to get the most out of this book.
You should be generally numerically literate, and it's helpful if you have some programming experience already.
If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
Packages are the fundamental units of reproducible R code.
They include reusable functions, the documentation that describes how to use them, and sample data.
### R
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.
### RStudio
RStudio is an integrated development environment, or IDE, for R programming.
Download and install it from <http://www.rstudio.com/download>.
RStudio is updated a couple of times a year.
When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 1.0.0.
When you start RStudio, you'll see two key regions in the interface:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/rstudio-console.png")
```
For now, all you need to know is that you type R code in the console pane, and press enter to run it.
You'll learn more as we go along!
### The tidyverse
You'll also need to install some R packages.
An R **package** is a collection of functions, data, and documentation that extends the capabilities of base R.
Using packages is key to the successful use of R.
The majority of the packages that you will learn in this book are part of the so-called tidyverse.
The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.
You can install the complete tidyverse with a single line of code:
```{r, eval = FALSE}
install.packages("tidyverse")
```
On your own computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on to your computer.
If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it with `library()`.
Once you have installed a package, you can load it with the `library()` function:
```{r}
library(tidyverse)
```
This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.
These are considered to be the **core** of the tidyverse because you'll use them in almost every analysis.
Packages in the tidyverse change fairly frequently.
You can see if updates are available, and optionally install them, by running `tidyverse_update()`.
### Other packages
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles.
This doesn't make them better or worse, just different.
In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In this book we'll use three data packages from outside the tidyverse:
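The package names are not listed in this excerpt, but judging from the description that follows they are presumably nycflights13 (airline flights), gapminder (world development), and Lahman (baseball). If so, they can be installed like any other packages:

```{r, eval = FALSE}
install.packages(c("nycflights13", "gapminder", "Lahman"))
```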
These packages provide data on airline flights, world development, and baseball.
## Running R code
The previous section showed you a couple of examples of running R code.
Code in the book looks like this:
```{r, eval = TRUE}
1 + 2
```
If you run the same code in your local console, it will look like this:
```
> 1 + 2
[1] 3
```
There are two main differences.
In your console, you type after the `>`, called the **prompt**; we don't show the prompt in the book.
In the book, output is commented out with `#>`; in your console it appears directly after your code.
These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
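For instance, the earlier `1 + 2` chunk would appear in the book roughly like this, with the result shown as a `#>` comment rather than after a prompt:

```{r, eval = FALSE}
1 + 2
#> [1] 3
```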
Throughout the book we use a consistent set of conventions to refer to code:
- Functions are in a code font and followed by parentheses, like `sum()`, or `mean()`.
- Other R objects (like data or function arguments) are in a code font, without parentheses, like `flights` or `x`.
- If we want to make it clear what package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()`, or `nycflights13::flights`. This is also valid R code.
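For example, both of the following lines call the same function; the second spells out which packages the function and the data come from (the `speed` variable is just an illustration):

```{r, eval = FALSE}
mutate(nycflights13::flights, speed = distance / air_time * 60)
dplyr::mutate(nycflights13::flights, speed = distance / air_time * 60)
```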
## Getting help and learning more
This book is not an island; there is no single resource that will allow you to master R.
As you start to apply the techniques described in this book to your own data you will soon find questions that we do not answer.
This section describes a few tips on how to get help, and to help you keep learning.
If you get stuck, start with Google.
Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available.
Google is particularly useful for error messages.
If you get an error message and you have no idea what it means, try googling it!
Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If Google doesn't help, try [Stack Overflow](http://stackoverflow.com).
Start by spending a little time searching for an existing answer, including `[R]` to restrict your search to questions and answers that use R.
If you don't find anything useful, prepare a minimal reproducible example or **reprex**.
A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** in a question is to use `dput()` to generate the R code to recreate it.
For example, to recreate the `mtcars` dataset in R, I'd perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <-` then paste.
Try and find the smallest subset of your data that still reveals the problem.
3. Spend a little bit of time ensuring that your **code** is easy for others to read:
- Make sure you've used spaces and your variable names are concise, yet informative.
- Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand, and the easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
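Putting the three pieces together, a minimal reproducible example might look something like this (the data and the problem are made up purely for illustration):

```{r, eval = FALSE}
# Packages: loaded at the top so it's clear what the example needs
library(tidyverse)

# Data: small, and recreated inline so anyone can run the code
df <- tibble(x = c(1, 2, NA), y = c("a", "b", "b"))

# Code: the smallest amount that still shows the problem
df %>%
  group_by(y) %>%
  summarise(mean_x = mean(x))
```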
You should also spend some time preparing yourself to solve problems before they occur.
Investing a little time in learning R each day will pay off handsomely in the long run.
One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses.
You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world.
If you're an active Twitter user, follow the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag.
Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.
## Acknowledgements
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our dumb questions and helping us to better think about data science:
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.
- The three chapters on workflow were adapted (with permission), from <http://stat545.com/block002_hello-r-workspace-wd-project.html> by Jenny Bryan.
- Genevera Allen for discussions about models, modelling, the statistical learning perspective, and the difference between hypothesis generation and hypothesis confirmation.
- Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown) package, and for tirelessly responding to my feature requests.
- Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.
- The \#rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.
- Tal Galili for augmenting his dendextend package to support a section on clustering that did not make it into the final draft.
This book was written in the open, and many people contributed pull requests to fix minor problems.
Special thanks goes to everyone who contributed via GitHub:
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)
```
## Colophon
An online version of this book is available at <http://r4ds.had.co.nz>.
It will continue to evolve in between reprints of the physical book.
The source of the book is available at <https://github.com/hadley/r4ds>.
The book is powered by <https://bookdown.org> which makes it easy to turn R markdown files into HTML, PDF, and EPUB.
This book was built with:


@ -2,21 +2,23 @@
## Introduction
In [functions], we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
2. It's easier to respond to changes in requirements.
As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
3. You're likely to have fewer bugs because each line of code is used in more places.
One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming.
On the imperative side you have tools like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening.
However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop.
Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function.
Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
### Prerequisites
```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```
We want to compute the median of each column.
You *could* do this with copy-and-paste:
```{r}
median(df$a)
median(df$b)
median(df$c)
median(df$d)
```
But that breaks our rule of thumb: never copy and paste more than twice.
Instead, we could use a for loop:
```{r}
output <- vector("double", ncol(df)) # 1. output
for (i in seq_along(df)) {           # 2. sequence
  output[[i]] <- median(df[[i]])     # 3. body
}
output
```
Every for loop has three components:
1. The **output**: `output <- vector("double", length(x))`.
Before you start the loop, you must always allocate sufficient space for the output.
This is very important for efficiency: if you grow the output at each iteration using `c()` (for example), your for loop will be very slow.
A general way of creating an empty vector of given length is the `vector()` function.
It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc) and the length of the vector.
2. The **sequence**: `i in seq_along(df)`.
This determines what to loop over: each run of the for loop will assign `i` to a different value from `seq_along(df)`.
It's useful to think of `i` as a pronoun, like "it".
You might not have seen `seq_along()` before.
It's a safe version of the familiar `1:length(x)`, with an important difference: if you have a zero-length vector, `seq_along()` does the right thing:
```{r}
y <- vector("double", 0)
seq_along(y)
1:length(y)
```
You probably won't create a zero-length vector deliberately, but it's easy to create them accidentally.
If you use `1:length(x)` instead of `seq_along(x)`, you're likely to get a confusing error message.
3. The **body**: `output[[i]] <- median(df[[i]])`.
This is the code that does the work.
It's run repeatedly, each time with a different value for `i`.
The first iteration will run `output[[1]] <- median(df[[1]])`, the second will run `output[[2]] <- median(df[[2]])`, and so on.
That's all there is to the for loop!
Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below.
Then we'll move on to some variations of the for loop that help you solve other problems that will crop up in practice.
### Exercises
1. Write for loops to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`.
3. Compute the number of unique values in each column of `iris`.
4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
Think about the output, sequence, and body **before** you start writing the loop.
2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:
```{r, eval = FALSE}
out <- ""
for (x in letters) {
out <- stringr::str_c(out, x)
}
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}
```
3. Combine your function writing and for loop skills:
1. Write a for loop that `print()`s the lyrics to the children's song "Alice the camel".
2. Convert the nursery rhyme "ten in the bed" to a function.
Generalise it to any number of people in any sleeping structure.
3. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on any surface.
4. It's common to see for loops that don't preallocate the output and instead increase the length of a vector at each step:
```{r, eval = FALSE}
output <- vector("integer", 0)
for (i in seq_along(x)) {
  output <- c(output, lengths(x[[i]]))
}
output
```
How does this affect performance?
Design and execute an experiment.
## For loop variations
Once you have the basic for loop under your belt, there are some variations that you should be aware of.
These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.
There are four variations on the basic theme of the for loop:
1. Modifying an existing object, instead of creating a new object.
2. Looping over names or values, instead of indices.
3. Handling outputs of unknown length.
4. Handling sequences of unknown length.
### Modifying an existing object
Sometimes you want to use a for loop to modify an existing object.
For example, remember our challenge from [functions].
We wanted to rescale every column in a data frame:
```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
To solve this with a for loop we again think about the three components:
1. **Output**: we already have the output --- it's the same as the input!
2. **Sequence**: we can think about a data frame as a list of columns, so we can iterate over each column with `seq_along(df)`.
3. **Body**: apply `rescale01()`.
This gives us:
```{r}
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`.
You might have spotted that I used `[[` in all my for loops: I think it's better to use `[[` even for atomic vectors because it makes it clear that I want to work with a single element.
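If the difference between `[` and `[[` is unfamiliar, this small sketch (not from the book) shows why `[[` is the right tool inside the loop:

```{r, eval = FALSE}
x <- list(1:3, letters)
str(x[1])    # `[` returns a list containing the first element
str(x[[1]])  # `[[` extracts the first element itself
```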
### Looping patterns
There are three basic ways to loop over a vector.
So far I've shown you the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `x[[i]]`.
There are two other forms:
1. Loop over the elements: `for (x in xs)`.
This is most useful if you only care about side-effects, like plotting or saving a file, because it's difficult to save the output efficiently.
2. Loop over the names: `for (nm in names(xs))`.
This gives you the name, which you can use to access the value with `x[[nm]]`.
This is useful if you want to use the name in a plot title or a file name.
If you're creating named output, make sure to name the results vector like so:
```{r, eval = FALSE}
results <- vector("list", length(x))
names(results) <- names(x)
```
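A fuller sketch of this pattern, assuming `x` is a named list and using `summary()` as a stand-in for whatever you actually want to compute:

```{r, eval = FALSE}
results <- vector("list", length(x))
names(results) <- names(x)
for (nm in names(x)) {
  results[[nm]] <- summary(x[[nm]])
}
```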
### Unknown output length
Sometimes you might not know how long the output will be.
For example, imagine you want to simulate some random vectors of random lengths.
You might be tempted to solve this problem by progressively growing the vector:
```{r}
means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)
```
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations.
In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine ($3^2$) times as long to run.
A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:
```{r}
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)
str(unlist(out))
```
Here I've used `unlist()` to flatten a list of vectors into a single vector.
A stricter option is to use `purrr::flatten_dbl()` --- it will throw an error if the input isn't a list of doubles.
This pattern occurs in other places too:
1. You might be generating a long string.
Instead of `paste()`ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with `paste(output, collapse = "")`.
2. You might be generating a big data frame.
Instead of sequentially `rbind()`ing in each iteration, save the output in a list, then use `dplyr::bind_rows(output)` to combine the output into a single data frame.
Watch out for this pattern.
Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
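For instance, the string version of the pattern might look like this sketch: collect the pieces in a pre-allocated character vector, then combine them once at the end:

```{r, eval = FALSE}
pieces <- vector("character", length(letters))
for (i in seq_along(letters)) {
  pieces[[i]] <- toupper(letters[[i]])
}
paste(pieces, collapse = "")
```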
### Unknown sequence length
Sometimes you don't even know how long the input sequence should run for.
This is common when doing simulations.
For example, you might want to loop until you get three heads in a row.
You can't do that sort of iteration with the for loop.
Instead, you can use a while loop.
A while loop is simpler than a for loop because it only has two components, a condition and a body:
```{r, eval = FALSE}
while (condition) {
  # body
}
```
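As a concrete illustration of the three-heads example, a simulation along these lines could be written as follows. This is a sketch with a made-up `flip()` helper, not necessarily the exact code used in the book:

```{r, eval = FALSE}
flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```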
I mention while loops only briefly, because I hardly ever use them.
They're most often used for simulation, which is outside the scope of this book.
However, it is good to know they exist so that you're prepared for problems where the number of iterations is not known in advance.
### Exercises
1. Imagine you have a directory full of CSV files that you want to read in.
You have their paths in a vector, `files <- dir("data/", pattern = "\\.csv$", full.names = TRUE)`, and now want to read each one with `read_csv()`.
Write the for loop that will load them into a single data frame.
2. What happens if you use `for (nm in names(x))` and `x` has no names?
What if only some of the elements are named?
What if the names are not unique?
3. Write a function that prints the mean of each numeric column in a data frame, along with its name.
For example, `show_mean(iris)` would print:
```{r, eval = FALSE}
show_mean(iris)
#> Sepal.Length: 5.84
@ -348,11 +355,11 @@ I mention while loops only briefly, because I hardly ever use them. They're most
#> Petal.Length: 3.76
#> Petal.Width: 1.20
```
(Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
4. What does this code do?
How does it work?
```{r, eval = FALSE}
trans <- list(
@ -368,7 +375,8 @@ I mention while loops only briefly, because I hardly ever use them. They're most
## For loops vs. functionals
For loops are not as important in R as they are in other languages because R is a functional programming language.
This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
To see why this is important, consider (again) this simple data frame:
@ -381,7 +389,8 @@ df <- tibble(
)
```
Imagine you want to compute the mean of every column.
You could do that with a for loop:
```{r}
output <- vector("double", length(df))
@ -422,7 +431,9 @@ col_sd <- function(df) {
}
```
Uh oh!
You've copied-and-pasted this code twice, so it's time to think about how to generalise it.
Notice that most of this code is for-loop boilerplate and it's hard to see the one thing (`mean()`, `median()`, `sd()`) that is different between the functions.
What would you do if you saw a set of functions like this:
@ -438,7 +449,7 @@ Hopefully, you'd notice that there's a lot of duplication, and extract it out in
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations.
We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()` by adding an argument that supplies the function to apply to each column:
@ -454,47 +465,54 @@ col_summary(df, median)
col_summary(df, mean)
```
The idea of passing a function to another function is an extremely powerful idea, and it's one of the behaviours that makes R a functional programming language.
It might take you a while to wrap your head around the idea, but it's worth the investment.
In the rest of the chapter, you'll learn about and use the **purrr** package, which provides functions that eliminate the need for many common for loops.
The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and thus is easier to learn.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list?
Once you've solved that problem, purrr takes care of generalising your solution to every element in the list.
2. If you're solving a complex problem, how can you break it down into bite-sized pieces that allow you to advance one small step towards a solution?
With purrr, you get lots of small pieces that you can compose together with the pipe.
This structure makes it easier to solve new problems.
It also makes it easier to understand your solutions to old problems when you re-read your old code.
### Exercises
1. Read the documentation for `apply()`.
In the 2d case, what two for loops does it generalise?
2. Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function that returns a logical vector that has a TRUE corresponding to each numeric column.
## The map functions
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you.
There is one function for each type of output:
- `map()` makes a list.
- `map_lgl()` makes a logical vector.
- `map_int()` makes an integer vector.
- `map_dbl()` makes a double vector.
- `map_chr()` makes a character vector.
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that's the same length (and has the same names) as the input.
The type of the vector is determined by the suffix to the map function.
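For example, assuming the small numeric `df` from earlier in the chapter, each suffix returns a different type of vector:

```{r, eval = FALSE}
map_lgl(df, is.double)  # logical: one TRUE/FALSE per column
map_int(df, n_distinct) # integer: number of distinct values per column
map_dbl(df, mean)       # double: mean of each column
map_chr(df, typeof)     # character: storage type of each column
```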
Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a for loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
Some people will tell you to avoid for loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as for loops haven't been slow for many years.)
The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.
We can use these functions to perform the same computations as the last for loop.
Those summary functions returned doubles, so we need to use `map_dbl()`:
```{r}
map_dbl(df, mean)
@ -502,7 +520,8 @@ map_dbl(df, median)
map_dbl(df, sd)
```
Compared to using a for loop, focus is on the operation being performed (i.e. `mean()`, `median()`, `sd()`), not the bookkeeping required to loop over every element and store the output.
This is even more apparent if we use the pipe:
```{r}
df %>% map_dbl(mean)
@ -512,21 +531,19 @@ df %>% map_dbl(sd)
There are a few differences between `map_*()` and `col_summary()`:
- All purrr functions are implemented in C.
This makes them a little faster at the expense of readability.
- The second argument, `.f`, the function to apply, can be a formula, a character vector, or an integer vector.
You'll learn about those handy shortcuts in the next section.
- `map_*()` uses ... ([dot dot dot]) to pass along additional arguments to `.f` each time it's called:
```{r}
map_dbl(df, mean, trim = 0.5)
```
- The map functions also preserve names:
```{r}
z <- list(x = 1:3, y = 4:5)
@ -535,7 +552,9 @@ There are a few differences between `map_*()` and `col_summary()`:
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing.
Imagine you want to fit a linear model to each group in a dataset.
The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
@ -551,9 +570,12 @@ models <- mtcars %>%
map(~lm(mpg ~ wt, data = .x))
```
Here I've used `.x` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop).
`.x` in a one-sided formula corresponds to an argument in an anonymous function.
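In other words, the two calls in this sketch are equivalent; the formula is just a more compact way to write the anonymous function:

```{r, eval = FALSE}
# Anonymous function vs. one-sided formula shorthand
mtcars %>%
  split(.$cyl) %>%
  map(function(df) lm(mpg ~ wt, data = df))

mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .x))
```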
When you're looking at many models, you might want to extract a summary statistic like the $R^2$.
To do that we need to first run `summary()` and then extract the component called `r.squared`.
We could do that using the shorthand for anonymous functions:
```{r}
models %>%
@ -569,7 +591,7 @@ models %>%
map_dbl("r.squared")
```
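Spelled out, that pipeline runs `summary()` on each model and then pulls out the `r.squared` component; here is a sketch of the formula shorthand alongside purrr's string shortcut for extracting named components:

```{r, eval = FALSE}
models %>%
  map(summary) %>%
  map_dbl(~.x$r.squared)

# Extracting a named component is so common that you can use a string instead
models %>%
  map(summary) %>%
  map_dbl("r.squared")
```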
You can also use an integer to select elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
@ -577,18 +599,14 @@ x %>% map_dbl(2)
```
### Base R
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
- `lapply()` is basically identical to `map()`, except that `map()` is consistent with all the other functions in purrr, and you can use the shortcuts for `.f`.
- Base `sapply()` is a wrapper around `lapply()` that automatically simplifies the output.
This is useful for interactive work but is problematic in a function because you never know what sort of output you'll get:
```{r}
x1 <- list(
c(0.27, 0.37, 0.57, 0.91, 0.20),
@ -600,19 +618,15 @@ If you're familiar with the apply family of functions in base R, you might have
c(0.93, 0.21, 0.65, 0.13, 0.27),
c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
x1 %>% sapply(threshold) %>% str()
x2 %>% sapply(threshold) %>% str()
```
- `vapply()` is a safe alternative to `sapply()` because you supply an additional argument that defines the type.
The only problem with `vapply()` is that it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to `map_lgl(df, is.numeric)`.
One advantage of `vapply()` over purrr's map functions is that it can also produce matrices --- the map functions only ever produce vectors.
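For example, a quick sketch of that difference, reusing the numeric `df` from earlier:

```{r, eval = FALSE}
# quantile() returns five values per column, so vapply() builds a matrix
vapply(df, quantile, numeric(5))

# The map functions would give you a list of vectors instead
df %>% map(quantile) %>% str()
```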
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in the future will provide easy parallelism and progress bars.
@ -620,35 +634,44 @@ I focus on purrr functions here because they have more consistent names and argu
1. Write code that uses one of the map functions to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`.
3. Compute the number of unique values in each column of `iris`.
4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
2. How can you create a single vector that for each column in a data frame indicates whether or not it's a factor?
3. What happens when you use the map functions on vectors that aren't lists?
What does `map(1:5, runif)` do?
Why?
4. What does `map(-2:2, rnorm, n = 5)` do?
Why?
What does `map_dbl(-2:2, rnorm, n = 5)` do?
Why?
5. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the anonymous function.
## Dealing with failure
When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail.
When this happens, you'll get an error message, and no output.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`.
`safely()` is an adverb: it takes a function (a verb) and returns a modified version.
In this case, the modified function will never throw an error.
Instead, it always returns a list with two elements:
1. `result` is the original result.
If there was an error, this will be `NULL`.
2. `error` is an error object.
If the operation was successful, this will be `NULL`.
(You might be familiar with the `try()` function in base R.
It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
Let's illustrate this with a simple example: `log()`:
@ -658,7 +681,8 @@ str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds, the `result` element contains the result and the `error` element is `NULL`.
When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with map:
@ -668,7 +692,8 @@ y <- x %>% map(safely(log))
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the output.
That's easy to get with `purrr::transpose()`:
```{r}
y <- y %>% transpose()
@ -685,17 +710,16 @@ y$result[is_ok] %>% flatten_dbl()
Purrr provides two other useful adverbs:
- Like `safely()`, `possibly()` always succeeds.
It's simpler than `safely()`, because you give it a default value to return when there is an error.
```{r}
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
```
- `quietly()` performs a similar role to `safely()`, but instead of capturing errors, it captures printed output, messages, and warnings:
```{r}
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
@ -703,7 +727,11 @@ Purrr provides two other useful adverbs:
## Mapping over multiple arguments
So far we've mapped along a single input.
But often you have multiple related inputs that you need to iterate along in parallel.
That's the job of the `map2()` and `pmap()` functions.
For example, imagine you want to simulate some random normals with different means.
You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
@ -712,7 +740,8 @@ mu %>%
str()
```
What if you also want to vary the standard deviation?
One way to do that would be to iterate over the indices and index into vectors of means and sds:
```{r}
sigma <- list(1, 5, 10)
@ -721,7 +750,8 @@ seq_along(mu) %>%
str()
```
But that obfuscates the intent of the code.
Instead we could use `map2()` which iterates over two vectors in parallel:
```{r}
map2(mu, sigma, rnorm, n = 5) %>% str()
@ -733,7 +763,7 @@ map2(mu, sigma, rnorm, n = 5) %>% str()
knitr::include_graphics("diagrams/lists-map2.png")
```
Note that the arguments that vary for each call come *before* the function; arguments that are the same for every call come *after*.
Like `map()`, `map2()` is just a wrapper around a for loop:
@ -747,7 +777,9 @@ map2 <- function(x, y, f, ...) {
}
```
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly.
Instead, purrr provides `pmap()` which takes a list of arguments.
You might use that if you wanted to vary the mean, standard deviation, and number of samples:
```{r}
n <- list(1, 3, 5)
@ -763,7 +795,8 @@ That looks like:
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```
If you don't name the list's elements, `pmap()` will use positional matching when calling the function.
That's a little fragile, and makes the code harder to read, so it's better to name the arguments:
```{r, eval = FALSE}
args2 <- list(mean = mu, sd = sigma, n = n)
@ -816,7 +849,9 @@ invoke_map(f, param, n = 5) %>% str()
knitr::include_graphics("diagrams/lists-invoke.png")
```
The first argument is a list of functions or character vector of function names.
The second argument is a list of lists giving the arguments that vary for each function.
The subsequent arguments are passed on to every function.
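A sketch of how those pieces fit together; the particular functions and parameters here are just an illustration:

```{r, eval = FALSE}
f <- c("runif", "rnorm", "rpois")
param <- list(
  list(min = -1, max = 1), # arguments for runif()
  list(sd = 5),            # arguments for rnorm()
  list(lambda = 10)        # arguments for rpois()
)
invoke_map(f, param, n = 5) %>% str() # n = 5 is passed to every call
```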
And again, you can use `tribble()` to make creating these matching pairs a little easier:
@ -833,7 +868,9 @@ sim %>%
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value.
You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value.
Here's a very simple example:
```{r}
x <- list(1, "a", 3)
@ -842,7 +879,8 @@ x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`.
For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r, eval = FALSE}
library(ggplot2)
@ -854,15 +892,19 @@ paths <- stringr::str_c(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
`walk()`, `walk2()` and `pwalk()` all invisibly return `.`, the first argument.
This makes them suitable for use in the middle of pipelines.
## Other patterns of for loops
Purrr provides a number of other functions that abstract over other types of for loops.
You'll use them less frequently than the map functions, but they're useful to know about.
The goal here is to briefly illustrate each function, so hopefully it will come to mind if you see a similar problem in the future.
Then you can go look up the documentation for more details.
### Predicate functions
A number of functions work with **predicate** functions that return either a single `TRUE` or `FALSE`.
`keep()` and `discard()` keep elements of the input where the predicate is `TRUE` or `FALSE` respectively:
@ -876,8 +918,7 @@ iris %>%
str()
```
`some()` and `every()` determine if the predicate is true for any or for all of the elements.
```{r}
x <- list(1:5, letters, list(10))
@ -914,7 +955,9 @@ x %>%
### Reduce and accumulate
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton.
This is useful if you want to apply a two-table dplyr verb to multiple tables.
For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
```{r}
dfs <- list(
@ -940,7 +983,8 @@ vs %>% reduce(intersect)
`reduce()` takes a "binary" function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left.
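For example, a small sketch with a hypothetical list of data frames keyed by `name`:

```{r, eval = FALSE}
# Repeatedly apply a two-table join until a single data frame remains
dfs <- list(
  tibble(name = c("John", "Mary"), age = c(30, 25)),
  tibble(name = c("John", "Mary"), sex = c("M", "F")),
  tibble(name = "Mary", treatment = "A")
)
dfs %>% reduce(full_join, by = "name")
```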
`accumulate()` is similar but it keeps all the interim results.
You could use it to implement a cumulative sum:
```{r}
x <- sample(10)
@ -950,13 +994,13 @@ x %>% accumulate(`+`)
### Exercises
1. Implement your own version of `every()` using a for loop.
Compare it with `purrr::every()`.
What does purrr's version do that your version doesn't?
2. Create an enhanced `col_summary()` that applies a summary function to every numeric column in a data frame.
3. A possible base R equivalent of `col_summary()` is:
```{r}
col_sum3 <- function(df, f) {
@ -966,9 +1010,9 @@ x %>% accumulate(`+`)
sapply(df_num, f)
}
```
But it has a number of bugs as illustrated with the following inputs:
```{r, eval = FALSE}
df <- tibble(
x = 1:3,
@ -982,5 +1026,5 @@ x %>% accumulate(`+`)
col_sum3(df[1], mean)
col_sum3(df[0], mean)
```
What causes the bugs?


@ -1,301 +0,0 @@
# Model assessment
In this chapter, you'll turn the tools of multiple models towards model assessment: learning how the model performs when given new data. So far we've focussed on models as tools for description, using models to help us understand the patterns in the data we have collected so far. But ideally a model will do more than just describe what we have seen so far - it will also help predict what will come next.
In other words, we want a model that doesn't just perform well on the sample, but also accurately summarises the underlying population.
In some industries this is the primary use of models: you spend relatively little time fitting the model compared to how many times you use it.
There are two basic ways that a model can fail with new data:
* You can under- or over-fit the model. Underfitting is where you fail
to model an important trend: you leave too much in the residuals, and not
enough in the model. Overfitting is the opposite: you fit a trend to
what is actually random noise: you've put too much in the model and not
left enough in the residuals. Generally overfitting tends to be more of a
problem than underfitting.
* The process that generates the data might change. There's nothing the
model can do about this. You can protect yourself against this to some
extent by creating models that you understand and applying your knowledge
to the problem. Are these fundamentals likely to change? If you have
a model that you are going to use again and again for a long time, you
need to plan to maintain the model, regularly checking that it still
makes sense, i.e. is the population still the same?
<http://research.google.com/pubs/pub43146.html>
<http://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/>
The most common problem with a model that causes it to do poorly with new data is overfitting.
Obviously, there's a bit of a problem here: we don't have new data with which to check the model, and even if we did, we'd presumably use it to make the model better in the first place. One powerful family of approaches can help us get around this problem: resampling.
There are two main resampling techniques that we're going to cover.
* We will use __cross-validation__ to assess model quality. In
cross-validation, you split the data into test and training sets. You fit
the model to the training set, and evaluate it on the test set. This avoids
the intrinsic bias of using the same data to both fit the model and assess its
quality. However, it introduces a new bias: you're not using all the data to
fit the model, so it's not going to be quite as good as it could be.
* We will use __bootstrapping__ to understand how stable (or how variable)
the model is. If you sample data from the same population multiple times,
how much does your model vary? Instead of going back to collect new data,
you can use the best estimate of the population data: the data you've
collected so far. The amazing idea of the bootstrap is that you can resample
from the data you already have.
There are lots of high-level helpers to do these resampling methods in R. We're going to use the tools provided by the modelr package because they are explicit - you'll see exactly what's going on at each step.
<http://topepo.github.io/caret>. [Applied Predictive Modeling](https://amzn.com/1461468485), by Max Kuhn and Kjell Johnson.
If you're competing in competitions, like Kaggle, that are predominantly about creating good predictions, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.
There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and return a single model that attempts to take the best parts of each.
### Prerequisites
```{r setup, message = FALSE}
# Standard data manipulation and visualisation
library(dplyr)
library(ggplot2)
# Tools for working with models
library(broom)
library(modelr)
library(splines)
# Tools for working with lots of models
library(purrr)
library(tidyr)
```
```{r}
# Options that make your life easier
options(
contrasts = c("contr.treatment", "contr.treatment"),
na.action = na.exclude
)
```
## Overfitting
Both bootstrapping and cross-validation help us to spot and remedy the problem of __overfitting__, where the model fits the data we've seen so far extremely well, but does a bad job of generalising to new data.
A classic example of overfitting is using a polynomial with too many degrees of freedom.
This is the bias-variance tradeoff: simpler models are more biased, while more complex models are more variable. Occam's razor suggests preferring the simpler model when the fits are comparable.
```{r}
true_model <- function(x) {
1 + 2 * x + rnorm(length(x), sd = 0.25)
}
df <- tibble(
x = seq(0, 1, length = 20),
y = true_model(x)
)
df %>%
ggplot(aes(x, y)) +
geom_point()
```
We can create a model that fits this data very well:
```{r, message = FALSE}
library(splines)
my_model <- function(df) {
lm(y ~ poly(x, 7), data = df)
}
mod <- my_model(df)
rmse(mod, df)
grid <- df %>%
expand(x = seq_range(x, 50))
preds <- grid %>%
add_predictions(mod, var = "y")
df %>%
ggplot(aes(x, y)) +
geom_line(data = preds) +
geom_point()
```
As we fit progressively more and more complicated models, the model error decreases:
```{r}
fs <- list(
y ~ x,
y ~ poly(x, 2),
y ~ poly(x, 3),
y ~ poly(x, 4),
y ~ poly(x, 5),
y ~ poly(x, 6),
y ~ poly(x, 7)
)
models <- tibble(
n = 1:7,
f = fs,
mod = map(f, lm, data = df),
rmse = map2_dbl(mod, list(df), rmse)
)
models %>%
ggplot(aes(n, rmse)) +
geom_line(colour = "grey70") +
geom_point(size = 3)
```
But do you think this model will do well if we apply it to new data from the same population?
In real life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.
```{r}
boot <- bootstrap(df, 100) %>%
mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot %>%
unnest(pred) %>%
ggplot(aes(x, pred, group = .id)) +
geom_line(alpha = 1/3)
```
It's a little easier to see what's going on if we zoom on the y axis:
```{r}
last_plot() +
coord_cartesian(ylim = c(0, 5))
```
(You might notice that while each individual model varies a lot, the average of all the models seems like it might not be that bad. That gives rise to a model ensemble technique called model averaging.)
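As a rough sketch of that idea, you could average the bootstrapped predictions at each value of x:

```{r, eval = FALSE}
# A crude form of model averaging: average the bootstrap predictions at each x
boot %>%
  unnest(pred) %>%
  group_by(x) %>%
  summarise(pred = mean(pred)) %>%
  ggplot(aes(x, pred)) +
  geom_line()
```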
Bootstrapping is a useful tool to help us understand how the model might vary if we'd collected a different sample from the population. A related technique is cross-validation which allows us to explore the quality of the model. It works by repeatedly splitting the data into two pieces. One piece, the training set, is used to fit, and the other piece, the test set, is used to measure the model quality.
The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evaluate the error on the test set:
```{r}
cv <- crossv_mc(df, 100) %>%
mutate(
mod = map(train, my_model),
rmse = map2_dbl(mod, test, rmse)
)
cv
```
Obviously, a plot is going to help us see the distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and testing), and you can see it's very optimistic.
```{r}
cv %>%
ggplot(aes(rmse)) +
geom_ref_line(v = rmse(mod, df)) +
geom_freqpoly(binwidth = 0.2) +
geom_rug()
```
The distribution of errors is highly skewed: there are a few cases which have very high errors. These represent samples where the training data happened to miss the observations with the lowest or highest values of x. Let's take a look:
```{r}
filter(cv, rmse > 1.5) %>%
unnest(map(train, as.data.frame)) %>%
ggplot(aes(x, .id)) +
geom_point() +
xlim(0, 1)
```
All of the models that fit particularly poorly were fit to samples that missed either the first one or two or the last one or two observations. Because polynomials shoot off towards positive and negative infinity, they give very bad predictions for those values.
Now that we've given you a quick overview and intuition for these techniques, let's dive in more detail.
## Resamples
### Building blocks
Both the bootstrap and cross-validation are built on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.
These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory.
```{r}
x <- resample_bootstrap(as_tibble(mtcars))
class(x)
x
```
Most modelling functions call `as.data.frame()` on the `data` argument. This generates a resampled data frame. Because it's called automatically you can just pass the object.
```{r}
lm(mpg ~ wt, data = x)
```
If you get a strange error, it's probably because the modelling function doesn't do this, and you need to do it yourself. You'll also need to do it yourself if you want to `unnest()` the data so you can visualise it. If you want to just get the rows selected, you can use `as.integer()`.
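For example, a quick sketch using the bootstrap resample `x` from above:

```{r, eval = FALSE}
# Materialise the resample as a plain data frame when a function needs one
as.data.frame(x) %>% head()

# Or just look at which rows were selected
as.integer(x) %>% head()
```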
### Dataframe API
`bootstrap()` and `crossv_mc()` are built on top of these simpler primitives. They are designed to work naturally in a model exploration environment by returning data frames. Each row of the data frame represents a single sample. They return slightly different columns:
* `bootstrap()` returns a data frame with two columns:
```{r}
bootstrap(df, 3)
```
`strap` gives the bootstrap sample dataset, and `.id` assigns a
unique identifier to each model (this is often useful for plotting).
* `crossv_mc()` returns a data frame with three columns:
```{r}
crossv_mc(df, 3)
```
`train` contains the data that you should use to fit (train) the model,
and `test` contains the data you should use to validate the model. Together,
the test and train columns form an exclusive partition of the full dataset.
## Numeric summaries of model quality
When you start dealing with many models, it's helpful to have some rough way of comparing them so you can spend your time looking at the models that do the best job of capturing important features in the data.
One way to capture the quality of the model is to summarise the distribution of the residuals. For example, you could look at the quantiles of the absolute residuals. For this dataset, 25% of predictions are less than \$7,400 away, and 75% are less than \$25,800 away. That seems like quite a bit of error when predicting someone's income!
```{r}
heights <- as_tibble(readRDS("data/heights.RDS"))
h <- lm(income ~ height, data = heights)
h
qae(h, heights)
range(heights$income)
```
You might be familiar with the $R^2$. That's a single number summary that rescales the variance of the residuals to between 0 (very bad) and 1 (very good):
```{r}
rsquare(h, heights)
```
$R^2$ can be interpreted as the amount of variation in the data explained by the model. Here we're explaining 3% of the total variation - not a lot! But I don't think worrying about the relative amount of variation explained is that useful; instead I think you need to consider whether the absolute amount of variation explained is useful for your project.
It's called the $R^2$ because for simple models like this, it's just the square of the correlation between the variables:
```{r}
cor(heights$income, heights$height) ^ 2
```
The $R^2$ is an OK single number summary, but I prefer to think about the unscaled residuals because they're easier to interpret in the context of the original data. As you'll learn later, it's also a rather optimistic interpretation of the model. Because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
## Bootstrapping
## Cross-validation


@ -2,44 +2,41 @@
## Introduction
The goal of a model is to provide a simple low-dimensional summary of a dataset.
In the context of this book we're going to use models to partition data into patterns and residuals.
Strong patterns will hide subtler trends, so we'll use models to help peel back layers of structure as we explore a dataset.
However, before we can start using models on interesting, real, datasets, you need to understand the basics of how models work.
For that reason, this chapter of the book is unique because it uses only simulated datasets.
These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
There are two parts to a model:
1. First, you define a **family of models** that express a precise, but generic, pattern that you want to capture.
For example, the pattern might be a straight line, or a quadratic curve.
You will express the model family as an equation like `y = a_1 * x + a_2` or `y = a_1 * x ^ a_2`.
Here, `x` and `y` are known variables from your data, and `a_1` and `a_2` are parameters that can vary to capture different patterns.
2. Next, you generate a **fitted model** by finding the model from the family that is the closest to your data.
This takes the generic model family and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.
It's important to understand that a fitted model is just the closest model from a family of models.
That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true".
George Box puts this well in his famous aphorism:
> All models are wrong, but some are useful.
It's worth reading the fuller context of the quote:
> Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model.
> However, cunningly chosen parsimonious models often do provide remarkably useful approximations.
> For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.
>
> For such a model there is no need to ask the question "Is the model true?".
> If "truth" is to be the "whole truth" the answer must be "No".
> The only question of interest is "Is the model illuminating and useful?".
The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
### Prerequisites
@ -54,14 +51,22 @@ options(na.action = na.warn)
## A simple model
Let's take a look at the simulated dataset `sim1`, included with the modelr package.
It contains two continuous variables, `x` and `y`.
Let's plot them to see how they're related:
```{r}
ggplot(sim1, aes(x, y)) +
geom_point()
```
You can see a strong pattern in the data.
Let's use a model to capture that pattern and make it explicit.
It's our job to supply the basic form of the model.
In this case, the relationship looks linear, i.e. `y = a_0 + a_1 * x`.
Let's start by getting a feel for what models from that family look like by randomly generating a few and overlaying them on the data.
For this simple case, we can use `geom_abline()` which takes a slope and intercept as parameters.
Later on we'll learn more general techniques that work with any model.
```{r}
models <- tibble(
@ -74,9 +79,13 @@ ggplot(sim1, aes(x, y)) +
geom_point()
```
There are 250 models on this plot, but a lot are really bad!
We need to find the good models by making precise our intuition that a good model is "close" to the data.
We need a way to quantify the distance between the data and a model.
Then we can fit the model by finding the value of `a_0` and `a_1` that generate the model with the smallest distance from this data.
One easy place to start is to find the vertical distance between each point and the model, as in the following diagram.
(Note that I've shifted the x values slightly so you can see the individual distances.)
```{r, echo = FALSE}
dist1 <- sim1 %>%
@ -92,9 +101,10 @@ ggplot(dist1, aes(x1, y)) +
geom_linerange(aes(ymin = y, ymax = pred), colour = "#3366FF")
```
This distance is just the difference between the y value given by the model (the **prediction**), and the actual y value in the data (the **response**).
To compute this distance, we first turn our model family into an R function.
This takes the model parameters and the data as inputs, and gives values predicted by the model as output:
```{r}
model1 <- function(a, data) {
@ -103,9 +113,13 @@ model1 <- function(a, data) {
model1(c(7, 1.5), sim1)
```
Next, we need some way to compute an overall distance between the predicted and actual values.
In other words, the plot above shows 30 distances: how do we collapse that into a single number?
One common way to do this in statistics is to use the "root-mean-squared deviation".
We compute the difference between actual and predicted, square them, average them, and then take the square root.
This distance has lots of appealing mathematical properties, which we're not going to talk about here.
You'll just have to take my word for it!
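In symbols, writing $\hat{y}_i$ for the model's prediction and $y_i$ for the observed response, that's:

$$\text{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$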
```{r}
measure_distance <- function(mod, data) {
@ -115,7 +129,8 @@ measure_distance <- function(mod, data) {
measure_distance(c(7, 1.5), sim1)
```
Now we can use purrr to compute the distance for all the models defined above.
We need a helper function because our distance function expects the model as a numeric vector of length 2.
```{r}
sim1_dist <- function(a1, a2) {
@ -127,7 +142,8 @@ models <- models %>%
models
```
Next, let's overlay the 10 best models on to the data.
I've coloured the models by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) get the brightest colours.
```{r}
ggplot(sim1, aes(x, y)) +
@ -138,7 +154,9 @@ ggplot(sim1, aes(x, y)) +
)
```
We can also think about these models as observations, and visualising with a scatterplot of `a1` vs `a2`, again coloured by `-dist`.
We can no longer directly see how the model compares to the data, but we can see many models at once.
Again, I've highlighted the 10 best models, this time by drawing red circles underneath them.
```{r}
ggplot(models, aes(a1, a2)) +
@ -146,7 +164,8 @@ ggplot(models, aes(a1, a2)) +
geom_point(aes(colour = -dist))
```
Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points (this is called a grid search).
I picked the parameters of the grid roughly by looking at where the best models were in the plot above.
```{r}
grid <- expand.grid(
@ -172,7 +191,11 @@ ggplot(sim1, aes(x, y)) +
)
```
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model.
But there's a better way to tackle that problem: a numerical minimisation tool called Newton-Raphson search.
The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope.
You then ski down that slope a little way, and then repeat again and again, until you can't go any lower.
In R, we can do that with `optim()`:
```{r}
best <- optim(c(0, 0), measure_distance, data = sim1)
@ -183,24 +206,36 @@ ggplot(sim1, aes(x, y)) +
geom_abline(intercept = best$par[1], slope = best$par[2])
```
Don't worry too much about the details of how `optim()` works.
It's the intuition that's important here.
If you have a function that defines the distance between a model and a dataset, an algorithm that can minimise that distance by modifying the parameters of the model can find the best model.
The neat thing about this approach is that it will work for any family of models that you can write an equation for.
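For example, here's a minimal sketch of the same idea applied to a different, hypothetical model family, `y = a_1 * exp(a_2 * x)`; the `model2()` and `measure_distance2()` helpers are just illustrative names:

```{r}
# A sketch: the same optim() recipe, applied to an exponential model family.
model2 <- function(a, data) {
  a[1] * exp(a[2] * data$x)
}

measure_distance2 <- function(mod, data) {
  diff <- data$y - model2(mod, data)
  sqrt(mean(diff^2))
}

best2 <- optim(c(1, 0.1), measure_distance2, data = sim1)
best2$par
```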
There's one more approach that we can use for this model, because it's a special case of a broader family: linear models.
A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`.
So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`.
R has a tool specifically designed for fitting linear models called `lm()`.
`lm()` has a special way to specify the model family: formulas.
Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`.
We can fit the model and look at the output:
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```
These are exactly the same values we got with `optim()`!
Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models.
Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model in a single step, using a sophisticated algorithm.
This approach is both faster and guarantees that there is a global minimum.
### Exercises
1. One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term.
Fit a linear model to the simulated data below, and visualise the results.
Rerun a few times to generate different simulated datasets.
What do you notice about the model?
```{r}
sim1a <- tibble(
x = rep(1:10, each = 3),
@ -208,24 +243,21 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm()
)
```
2. One way to make linear models more robust is to use a different distance measure.
For example, instead of root-mean-squared distance, you could use mean-absolute distance:
```{r}
measure_distance <- function(mod, data) {
diff <- data$y - model1(mod, data)
mean(abs(diff))
}
```
Use `optim()` to fit this model to the simulated data above and compare it to the linear model.
3. One challenge with performing numerical optimisation is that it's only guaranteed to find one local optimum.
What's the problem with optimising a three parameter model like this?
```{r}
model1 <- function(a, data) {
a[1] + data$x * a[2] + a[3]
@ -234,13 +266,20 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm()
## Visualising models
For simple models, like the one above, you can figure out what pattern the model captures by carefully studying the model family and the fitted coefficients.
And if you ever take a statistics course on modelling, you're likely to spend a lot of time doing just that.
Here, however, we're going to take a different tack.
We're going to focus on understanding a model by looking at its predictions.
This has a big advantage: every type of predictive model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand any type of predictive model.
It's also useful to see what the model doesn't capture, the so-called residuals which are left after subtracting the predictions from the data.
Residuals are powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain.
### Predictions
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies.
The easiest way to do that is to use `modelr::data_grid()`.
Its first argument is a data frame, and for each subsequent argument it finds the unique values and then generates all combinations:
```{r}
grid <- sim1 %>%
@ -250,7 +289,9 @@ grid
(This will get more interesting when we start to add more variables to our model.)
Next we add predictions.
We'll use `modelr::add_predictions()` which takes a data frame and a model.
It adds the predictions from the model to a new column in the data frame:
```{r}
grid <- grid %>%
@ -260,7 +301,11 @@ grid
(You can also use this function to add predictions to your original dataset.)
Next, we plot the predictions.
You might wonder about all this extra work compared to just using `geom_abline()`.
But the advantage of this approach is that it will work with *any* model in R, from the simplest to the most complex.
You're only limited by your visualisation skills.
For more ideas about how to visualise more complex model types, you might try <http://vita.had.co.nz/papers/model-vis.html>.
```{r}
ggplot(sim1, aes(x)) +
@ -270,9 +315,13 @@ ggplot(sim1, aes(x)) +
### Residuals
The flip-side of predictions are **residuals**.
The predictions tell you the pattern that the model has captured, and the residuals tell you what the model has missed.
The residuals are just the distances between the observed and predicted values that we computed above.
We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`.
Note, however, that we use the original dataset, not a manufactured grid.
This is because to compute residuals we need actual y values.
```{r}
sim1 <- sim1 %>%
@ -280,16 +329,19 @@ sim1 <- sim1 %>%
sim1
```
There are a few different ways to understand what the residuals tell us about the model.
One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
```{r}
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
```
This helps you calibrate the quality of the model: how far away are the predictions from the observed values?
Note that the average of the residuals will always be 0.
You'll often want to recreate plots using the residuals instead of the original predictor.
You'll see a lot of that in the next chapter.
```{r}
ggplot(sim1, aes(x, resid)) +
@ -301,27 +353,31 @@ This looks like random noise, suggesting that our model has done a good job of c
### Exercises
1. Instead of using `lm()` to fit a straight line, you can use `loess()` to fit a smooth curve.
Repeat the process of model fitting, grid generation, predictions, and visualisation on `sim1` using `loess()` instead of `lm()`.
How does the result compare to `geom_smooth()`?
2. `add_predictions()` is paired with `gather_predictions()` and `spread_predictions()`.
How do these three functions differ?
3. What does `geom_ref_line()` do?
What package does it come from?
Why is displaying a reference line in plots showing residuals useful and important?
4. Why might you want to look at a frequency polygon of absolute residuals?
What are the pros and cons compared to looking at the raw residuals?
## Formulas and model families
You've seen formulas before when using `facet_wrap()` and `facet_grid()`.
In R, formulas provide a general way of getting "special behaviour".
Rather than evaluating the values of the variables right away, they capture them so they can be interpreted by the function.
The majority of modelling functions in R use a standard conversion from formulas to functions.
You've seen one simple conversion already: `y ~ x` is translated to `y = a_1 + a_2 * x`.
If you want to see what R actually does, you can use the `model_matrix()` function.
It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model, and the function is always `y = a_1 * out_1 + a_2 * out_2`.
For the simplest case of `y ~ x1` this shows us something interesting:
```{r}
df <- tribble(
@ -332,7 +388,9 @@ df <- tribble(
model_matrix(df, y ~ x1)
```
The way that R adds the intercept to the model is just by having a column that is full of ones.
By default, R will always add this column.
If you don't want that, you need to explicitly drop it with `-1`:
```{r}
model_matrix(df, y ~ x1 - 1)
@ -344,13 +402,19 @@ The model matrix grows in an unsurprising way when you add more variables to the
model_matrix(df, y ~ x1 + x2)
```
This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in *Symbolic Description of Factorial Models for Analysis of Variance*, by G. N. Wilkinson and C. E. Rogers <https://www.jstor.org/stable/2346786>.
It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.
The following sections expand on how this formula notation works for categorical variables, interactions, and transformation.
### Categorical variables
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical.
Imagine you have a formula like `y ~ sex`, where sex could either be male or female.
It doesn't make sense to convert that to a formula like `y = a_0 + a_1 * sex` because `sex` isn't a number - you can't multiply it!
Instead what R does is convert it to `y = a_0 + a_1 * sex_male` where `sex_male` is one if `sex` is male and zero otherwise:
```{r}
df <- tribble(
@ -362,9 +426,13 @@ df <- tribble(
model_matrix(df, response ~ sex)
```
You might wonder why R also doesn't create a `sexfemale` column.
The problem is that doing so would create a column that is perfectly predictable based on the other columns (i.e. `sexfemale = 1 - sexmale`).
Unfortunately the exact details of why this is a problem are beyond the scope of this book, but basically it creates a model family that is too flexible, and will have infinitely many models that are equally close to the data.
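If you're curious, here's a small sketch with a made-up data frame (`df_sex` is purely illustrative) of what goes wrong: one dummy column is a linear combination of the intercept and the other, so `lm()` can't estimate a unique coefficient for it and reports `NA`:

```{r}
# A sketch: forcing in both dummy columns makes one of them redundant.
df_sex <- tibble(
  sex_male   = c(1, 0, 1, 0),
  sex_female = c(0, 1, 0, 1),
  response   = c(2, 5, 1, 6)
)
lm(response ~ sex_male + sex_female, data = df_sex)
```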
Fortunately, however, if you focus on visualising predictions you don't need to worry about the exact parameterisation.
Let's look at some data and models to make that concrete.
Here's the `sim2` dataset from modelr:
```{r}
ggplot(sim2) +
@ -382,7 +450,8 @@ grid <- sim2 %>%
grid
```
Effectively, a model with a categorical `x` will predict the mean value for each category.
(Why? Because the mean minimises the root-mean-squared distance.) That's easy to see if we overlay the predictions on top of the original data:
```{r}
ggplot(sim2, aes(x)) +
@ -390,7 +459,8 @@ ggplot(sim2, aes(x)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
```
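As a quick sanity check, you can compare those predictions with the per-category means computed directly; they should match the red points above:

```{r}
# A sketch: the prediction for each level of x is just the mean of y
# within that level.
sim2 %>%
  group_by(x) %>%
  summarise(mean_y = mean(y))
```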
You can't make predictions about levels that you didn't observe.
Sometimes you'll do this by accident so it's good to recognise this error message:
```{r, error = TRUE}
tibble(x = "e") %>%
@ -399,7 +469,9 @@ tibble(x = "e") %>%
### Interactions (continuous and categorical)
What happens when you combine a continuous and a categorical variable?
`sim3` contains a categorical predictor and a continuous predictor.
We can visualise it with a simple plot:
```{r}
ggplot(sim3, aes(x1, y)) +
@ -413,19 +485,19 @@ mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
When you add variables with `+`, the model will estimate each effect independent of all the others.
It's possible to fit the so-called interaction by using `*`.
For example, `y ~ x1 * x2` is translated to `y = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2`.
Note that whenever you use `*`, both the interaction and the individual components are included in the model.
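If you want to see that expansion explicitly, here's a quick sketch using `model_matrix()` on `sim3`; it shows the main-effect columns plus the interaction columns:

```{r}
# A sketch: the * formula produces columns for each main effect plus columns
# for x1 crossed with each dummy column of x2.
model_matrix(sim3, y ~ x1 * x2) %>%
  head()
```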
To visualise these models we need two new tricks:
1. We have two predictors, so we need to give `data_grid()` both variables.
It finds all the unique values of `x1` and `x2` and then generates all combinations.
2. To generate predictions from both models simultaneously, we can use `gather_predictions()` which adds each prediction as a row.
The complement of `gather_predictions()` is `spread_predictions()` which adds each prediction to a new column.
Together this gives us:
```{r}
@ -444,9 +516,12 @@ ggplot(sim3, aes(x1, y, colour = x2)) +
facet_wrap(~ model)
```
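For comparison, here's a quick sketch of `spread_predictions()` on the same grid; it puts each model's predictions in its own column instead of stacking them in rows:

```{r}
# A sketch: one prediction column per model.
sim3 %>%
  data_grid(x1, x2) %>%
  spread_predictions(mod1, mod2) %>%
  head()
```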
Note that the model that uses `+` has the same slope for each line, but different intercepts.
The model that uses `*` has a different slope and intercept for each line.
Which model is better for this data?
We can take a look at the residuals.
Here I've facetted by both model and `x2` because it makes it easier to see the pattern within each group.
```{r}
sim3 <- sim3 %>%
@ -457,11 +532,16 @@ ggplot(sim3, aes(x1, resid, colour = x2)) +
facet_grid(model ~ x2)
```
There is little obvious pattern in the residuals for `mod2`.
The residuals for `mod1` show that the model has clearly missed some pattern in `b`, and, to a lesser but still visible extent, in `c` and `d`.
You might wonder if there's a precise way to tell which of `mod1` or `mod2` is better.
There is, but it requires a lot of mathematical background, and we don't really care.
Here, we're interested in a qualitative assessment of whether or not the model has captured the pattern that we're interested in.
### Interactions (two continuous)
Let's take a look at the equivalent model for two continuous variables.
Initially things proceed almost identically to the previous example:
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim4)
@ -476,21 +556,22 @@ grid <- sim4 %>%
grid
```
Note my use of `seq_range()` inside `data_grid()`.
Instead of using every unique value of `x`, I'm going to use a regularly spaced grid of five values between the minimum and maximum numbers.
It's probably not super important here, but it's a useful technique in general.
There are two other useful arguments to `seq_range()`:
- `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks nice to the human eye.
This is useful if you want to produce tables of output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
- `trim = 0.1` will trim off 10% of the tail values.
This is useful if the variables have a long tailed distribution and you want to focus on generating values near the center:
```{r}
x1 <- rcauchy(100)
seq_range(x1, n = 5)
@ -498,10 +579,9 @@ Note my use of `seq_range()` inside `data_grid()`. Instead of using every unique
seq_range(x1, n = 5, trim = 0.25)
seq_range(x1, n = 5, trim = 0.50)
```
- `expand = 0.1` is in some sense the opposite of `trim()`: it expands the range by 10%.
```{r}
x2 <- c(0, 1)
seq_range(x2, n = 5)
@ -510,7 +590,9 @@ Note my use of `seq_range()` inside `data_grid()`. Instead of using every unique
seq_range(x2, n = 5, expand = 0.50)
```
Next let's try and visualise that model.
We have two continuous predictors, so you can imagine the model like a 3d surface.
We could display that using `geom_tile()`:
```{r}
ggplot(grid, aes(x1, x2)) +
@ -518,7 +600,9 @@ ggplot(grid, aes(x1, x2)) +
facet_wrap(~ model)
```
That doesn't suggest that the models are very different!
But that's partly an illusion: our eyes and brains are not very good at accurately comparing shades of colour.
Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:
```{r, asp = 1/2}
ggplot(grid, aes(x1, pred, colour = x2, group = x2)) +
@ -529,15 +613,28 @@ ggplot(grid, aes(x2, pred, colour = x1, group = x1)) +
facet_wrap(~ model)
```
This shows you that interaction between two continuous variables works basically the same way as for a categorical and continuous variable.
An interaction says that there's not a fixed offset: you need to consider both values of `x1` and `x2` simultaneously in order to predict `y`.
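One rough way to see that is a small sketch that computes the slope of `x1` at two different values of `x2`, using the interaction model `mod2` fitted above:

```{r}
# A sketch: with an interaction, the slope of x1 changes with x2.
slope_x1 <- function(x2_value) {
  coef(mod2)[["x1"]] + coef(mod2)[["x1:x2"]] * x2_value
}
slope_x1(-1)
slope_x1(1)
```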
You can see that even with just two continuous variables, coming up with good visualisations is hard.
But that's reasonable: you shouldn't expect it will be easy to understand how three or more variables simultaneously interact!
But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time.
The model doesn't have to be perfect, it just has to help you reveal a little more about your data.
I spent some time looking at the residuals to see if I could figure out whether `mod2` did better than `mod1`.
I think it does, but it's pretty subtle.
You'll have a chance to work on it in the exercises.
### Transformations
You can also perform transformations inside the model formula.
For example, `log(y) ~ sqrt(x1) + x2` is transformed to `log(y) = a_1 + a_2 * sqrt(x1) + a_3 * x2`.
If your transformation involves `+`, `*`, `^`, or `-`, you'll need to wrap it in `I()` so R doesn't treat it like part of the model specification.
For example, `y ~ x + I(x ^ 2)` is translated to `y = a_1 + a_2 * x + a_3 * x^2`.
If you forget the `I()` and specify `y ~ x ^ 2 + x`, R will compute `y ~ x * x + x`.
`x * x` means the interaction of `x` with itself, which is the same as `x`.
R automatically drops redundant variables so `x + x` becomes `x`, meaning that `y ~ x ^ 2 + x` specifies the function `y = a_1 + a_2 * x`.
That's probably not what you intended!
Again, if you get confused about what your model is doing, you can always use `model_matrix()` to see exactly what equation `lm()` is fitting:
@ -552,13 +649,17 @@ model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```
Transformations are useful because you can use them to approximate non-linear functions.
If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials.
That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`.
Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
```{r}
model_matrix(df, y ~ poly(x, 2))
```
However there's one major problem with using `poly()`: outside the range of the data, polynomials rapidly shoot off to positive or negative infinity.
One safer alternative is to use the natural spline, `splines::ns()`.
```{r}
library(splines)
@ -596,33 +697,36 @@ ggplot(sim5, aes(x, y)) +
facet_wrap(~ model)
```
Notice that the extrapolation outside the range of the data is clearly bad.
This is the downside to approximating a function with a polynomial.
But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen.
You must rely on theory and science.
### Exercises
1. What happens if you repeat the analysis of `sim2` using a model without an intercept?
What happens to the model equation?
What happens to the predictions?
2. Use `model_matrix()` to explore the equations generated for the models I fit to `sim3` and `sim4`.
Why is `*` a good shorthand for interaction?
3. Using the basic principles, convert the formulas in the following two models into functions.
(Hint: start by converting the categorical variable into 0-1 variables.)
```{r, eval = FALSE}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
4. For `sim4`, which of `mod1` and `mod2` is better?
I think `mod2` does a slightly better job at removing patterns, but it's pretty subtle.
Can you come up with a plot to support my claim?
## Missing values
Missing values obviously cannot convey any information about the relationship between the variables, so modelling functions will drop any rows that contain missing values.
R's default behaviour is to silently drop them, but `options(na.action = na.warn)` (run in the prerequisites), makes sure you get a warning.
```{r}
df <- tribble(
@ -651,35 +755,31 @@ nobs(mod)
## Other model families
This chapter has focussed exclusively on the class of linear models, which assume a relationship of the form `y = a_1 * x1 + a_2 * x2 + ... + a_n * xn`.
Linear models additionally assume that the residuals have a normal distribution, which we haven't talked about.
There is a large set of model classes that extend the linear model in various interesting ways.
Some of them are:
- **Generalised linear models**, e.g. `stats::glm()`.
Linear models assume that the response is continuous and the error has a normal distribution.
Generalised linear models extend linear models to include non-continuous responses (e.g. binary data or counts).
They work by defining a distance metric based on the statistical idea of likelihood.
- **Generalised additive models**, e.g. `mgcv::gam()`, extend generalised linear models to incorporate arbitrary smooth functions.
That means you can write a formula like `y ~ s(x)` which becomes an equation like `y = f(x)` and let `gam()` estimate what that function is (subject to some smoothness constraints to make the problem tractable).
- **Penalised linear models**, e.g. `glmnet::glmnet()`, add a penalty term to the distance that penalises complex models (as defined by the distance between the parameter vector and the origin).
This tends to make models that generalise better to new datasets from the same population.
- **Robust linear models**, e.g. `MASS::rlm()`, tweak the distance to downweight points that are very far away.
This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.
- **Trees**, e.g. `rpart::rpart()`, attack the problem in a completely different way than linear models.
They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces.
    Trees aren't terribly effective by themselves, but they are very powerful when used in aggregate by models like **random forests** (e.g. `randomForest::randomForest()`) or **gradient boosting machines** (e.g. `xgboost::xgboost`).
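To give a flavour of how similar the interfaces are, here's a rough sketch of fitting a couple of these families; it uses `mtcars` purely as an illustrative dataset:

```{r, eval = FALSE}
# A sketch: the fitting interface mirrors lm() -- a formula plus a data frame.
# A logistic regression (a generalised linear model):
mod_logistic <- glm(am ~ wt, data = mtcars, family = binomial)

# A robust linear model:
mod_robust <- MASS::rlm(mpg ~ wt, data = mtcars)
```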
These models all work similarly from a programming perspective.
Once you've mastered linear models, you should find it easy to master the mechanics of these other model classes.
Being a skilled modeller is a mixture of some good general principles and having a big toolbox of techniques.
Now that you've learned some general tools and one useful class of models, you can go on and learn more classes from other sources.
View File
@ -2,26 +2,36 @@
## Introduction
In the previous chapter you learned how linear models work, and learned some basic tools for understanding what a model is telling you about your data.
The previous chapter focussed on simulated datasets.
This chapter will focus on real data, showing you how you can progressively build up a model to aid your understanding of the data.
We will take advantage of the fact that you can think about a model partitioning your data into pattern and residuals.
We'll find patterns with visualisation, then make them concrete and precise with a model.
We'll then repeat the process, but replace the old response variable with the residuals from the model.
The goal is to transition from implicit knowledge in the data and your head to explicit knowledge in a quantitative model.
This makes it easier to apply to new domains, and easier for others to use.
For very large and complex datasets this will be a lot of work.
There are certainly alternative approaches - a more machine learning approach is simply to focus on the predictive ability of the model.
These approaches tend to produce black boxes: the model does a really good job at generating predictions, but you don't know why.
This is a totally reasonable approach, but it does make it hard to apply your real world knowledge to the model.
That, in turn, makes it difficult to assess whether or not the model will continue to work in the long-term, as fundamentals change.
For most real models, I'd expect you to use some combination of this approach and a more classic automated approach.
It's a challenge to know when to stop.
You need to figure out when your model is good enough, and when additional investment is unlikely to pay off.
I particularly like this quote from reddit user Broseidon241:
> A long time ago in art class, my teacher told me "An artist needs to know when a piece is done. You can't tweak something into perfection - wrap it up. If you don't like it, do it over again. Otherwise begin something new".
> Later in life, I heard "A poor seamstress makes many mistakes. A good seamstress works hard to correct those mistakes. A great seamstress isn't afraid to throw out the garment and start over."
>
> -- Broseidon241, <https://www.reddit.com/r/datascience/comments/4irajq>
### Prerequisites
We'll use the same tools as in the previous chapter, but add in some real datasets: `diamonds` from ggplot2, and `flights` from nycflights13.
We'll also need lubridate in order to work with the date/times in `flights`.
```{r setup, message = FALSE}
library(tidyverse)
@ -46,17 +56,19 @@ Note that the worst diamond color is J (slightly yellow), and the worst clarity
### Price and carat
It looks like lower quality diamonds have higher prices because there is an important confounding variable: the weight (`carat`) of the diamond.
The weight of the diamond is the single most important factor for determining the price of the diamond, and lower quality diamonds tend to be larger.
```{r}
ggplot(diamonds, aes(carat, price)) +
geom_hex(bins = 50)
```
We can make it easier to see how the other attributes of a diamond affect its relative `price` by fitting a model to separate out the effect of `carat`.
But first, let's make a couple of tweaks to the diamonds dataset to make it easier to work with:
1. Focus on diamonds smaller than 2.5 carats (99.7% of the data)
2. Log-transform the carat and price variables.
```{r}
diamonds2 <- diamonds %>%
@ -71,13 +83,16 @@ ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins = 50)
```
The log-transformation is particularly useful here because it makes the pattern linear, and linear patterns are the easiest to work with.
Let's take the next step and remove that strong linear pattern.
We first make the pattern explicit by fitting a model:
```{r}
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)
```
Then we look at what the model tells us about the data.
Note that I back transform the predictions, undoing the log transformation, so I can overlay the predictions on the raw data:
```{r}
grid <- diamonds2 %>%
@ -91,7 +106,9 @@ ggplot(diamonds2, aes(carat, price)) +
geom_line(data = grid, colour = "red", size = 1)
```
That tells us something interesting about our data.
If we believe our model, then the large diamonds are much cheaper than expected.
This is probably because no diamond in this dataset costs more than \$19,000.
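A quick check of that claim:

```{r}
# A sketch: the most expensive diamond in the data we're modelling.
max(diamonds2$price)
```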
Now we can look at the residuals, which verifies that we've successfully removed the strong linear pattern:
@ -103,7 +120,7 @@ ggplot(diamonds2, aes(lcarat, lresid)) +
geom_hex(bins = 50)
```
Importantly, we can now re-do our motivating plots using those residuals instead of `price`.
```{r dev = "png"}
ggplot(diamonds2, aes(cut, lresid)) + geom_boxplot()
@ -111,17 +128,23 @@ ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
```
Now we see the relationship we expect: as the quality of the diamond increases, so too does its relative price.
To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on.
A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight.
$2^{-1}$ is 1/2, so points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
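Here's a quick sketch of that arithmetic, converting a few log2-scale residuals back to relative prices:

```{r}
# A sketch: residuals on the log2 scale translate to multiplicative effects
# on the original price scale.
tibble(lresid = c(-1, 0, 1)) %>%
  mutate(relative_price = 2 ^ lresid)
```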
### A more complicated model
If we wanted to, we could continue to build up our model, moving the effects we've observed into the model to make them explicit.
For example, we could include `color`, `cut`, and `clarity` into the model so that we also make explicit the effect of these three categorical variables:
```{r}
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
```
This model now includes four predictors, so it's getting harder to visualise.
Fortunately, they're currently all independent which means that we can plot them individually in four plots.
To make the process a little easier, we're going to use the `.model` argument to `data_grid`:
```{r}
grid <- diamonds2 %>%
@ -133,7 +156,8 @@ ggplot(grid, aes(cut, pred)) +
geom_point()
```
If the model needs variables that you haven't explicitly supplied, `data_grid()` will automatically fill them in with a "typical" value.
For continuous variables, it uses the median, and for categorical variables it uses the most common value (or values, if there's a tie).
```{r}
diamonds2 <- diamonds2 %>%
@ -143,7 +167,8 @@ ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
```
This plot indicates that there are some diamonds with quite large residuals - remember a residual of 2 indicates that the diamond is 4x the price that we expected.
It's often useful to look at unusual values individually:
```{r}
diamonds2 %>%
@ -154,27 +179,28 @@ diamonds2 %>%
arrange(price)
```
Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are errors in the data.
If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
### Exercises
1. In the plot of `lcarat` vs. `lprice`, there are some bright vertical strips.
What do they represent?
2. If `log(price) = a_0 + a_1 * log(carat)`, what does that say about the relationship between `price` and `carat`?
3. Extract the diamonds that have very high and very low residuals.
Is there anything unusual about these diamonds?
Are they particularly bad or good, or do you think these are pricing errors?
4. Does the final model, `mod_diamond2`, do a good job of predicting diamond prices?
Would you trust it to tell you how much to spend if you were buying a diamond?
## What affects the number of daily flights?
Let's work through a similar process for a dataset that seems even simpler at first glance: the number of flights that leave NYC per day.
This is a really small dataset --- only 365 rows and 2 columns --- and we're not going to end up with a fully realised model, but as you'll see, the steps along the way will help us better understand the data.
Let's get started by counting the number of flights per day and visualising it with ggplot2.
```{r}
daily <- flights %>%
@ -189,7 +215,8 @@ ggplot(daily, aes(date, n)) +
### Day of week
Understanding the long-term trend is challenging because there's a very strong day-of-week effect that dominates the subtler patterns.
Let's start by looking at the distribution of flight numbers by day-of-week:
```{r}
daily <- daily %>%
@ -198,9 +225,11 @@ ggplot(daily, aes(wday, n)) +
geom_boxplot()
```
There are fewer flights on weekends because most travel is for business.
The effect is particularly pronounced on Saturday: you might sometimes leave on Sunday for a Monday morning meeting, but it's very rare that you'd leave on Saturday as you'd much rather be at home with your family.
One way to remove this strong pattern is to use a model.
First, we fit the model, and display its predictions overlaid on the original data:
```{r}
mod <- lm(n ~ wday, data = daily)
@ -225,11 +254,11 @@ daily %>%
geom_line()
```
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week.
This plot is useful because now that we've removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
1. Our model seems to fail starting in June: you can still see a strong regular pattern that our model hasn't captured.
Drawing a plot with one line for each day of the week makes the cause easier to see:
```{r}
ggplot(daily, aes(date, resid, colour = wday)) +
@ -237,24 +266,21 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
geom_line()
```
Our model fails to accurately predict the number of flights on Saturday: during summer there are more flights than we expect, and during fall there are fewer.
We'll see how we can do better to capture this pattern in the next section.
2. There are some days with far fewer flights than expected:
```{r}
daily %>%
filter(resid < -100)
```
If you're familiar with American public holidays, you might spot New Year's day, July 4th, Thanksgiving and Christmas.
There are some others that don't seem to correspond to public holidays.
You'll work on those in one of the exercises.
3. There seems to be some smoother long term trend over the course of a year.
We can highlight that trend with `geom_smooth()`:
```{r}
@ -265,14 +291,14 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
geom_smooth(se = FALSE, span = 0.20)
```
There are fewer flights in January (and December), and more in summer (May-Sep).
We can't do much with this pattern quantitatively, because we only have a single year of data.
But we can use our domain knowledge to brainstorm potential explanations.
### Seasonal Saturday effect
Let's first tackle our failure to accurately predict the number of flights on Saturday.
A good place to start is to go back to the raw numbers, focussing on Saturdays:
```{r}
daily %>%
@ -285,9 +311,13 @@ daily %>%
(I've used both points and lines to make it clearer what is data and what is interpolation.)
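The filtering code is elided in the diff; a sketch of the kind of plot being described, assuming `wday` holds abbreviated day labels such as `"Sat"`:

```{r}
# Sketch: raw Saturday counts over the year, drawn with both points and lines
daily %>%
  filter(wday == "Sat") %>%
  ggplot(aes(date, n)) +
  geom_point() +
  geom_line() +
  scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")
```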
I suspect this pattern is caused by summer holidays: many people go on holiday in the summer, and people don't mind travelling on Saturdays for vacation.
Looking at this plot, we might guess that summer holidays are from early June to late August.
That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break in 2013 was Jun 26--Sep 9.
Why are there more Saturday flights in spring than fall?
I asked some American friends and they suggested that it's less common to plan family vacations during fall because of the big Thanksgiving and Christmas holidays.
We don't have the data to know for sure, but it seems like a plausible working hypothesis.
Let's create a "term" variable that roughly captures the three school terms, and check our work with a plot:
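The construction of the variable is elided here; one plausible sketch, assuming lubridate is available (the cut-off dates below are rough assumptions, not taken from the source):

```{r}
# Sketch: bucket each date into one of three school "terms"
term <- function(date) {
  cut(date,
    breaks = lubridate::ymd(20130101, 20130605, 20130825, 20140101),
    labels = c("spring", "summer", "fall")
  )
}

daily <- daily %>%
  mutate(term = term(date))
```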
@ -320,7 +350,8 @@ daily %>%
geom_boxplot()
```
It looks like there is significant variation across the terms, so fitting a separate day of week effect for each term is reasonable.
This improves our model, but not as much as we might hope:
```{r}
mod1 <- lm(n ~ wday, data = daily)
@ -345,7 +376,9 @@ ggplot(daily, aes(wday, n)) +
facet_wrap(~ term)
```
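The interaction model and the residual comparison are partly elided in the chunks above; roughly, the comparison might look like this, assuming modelr's `gather_residuals()` is available:

```{r}
# Sketch: compare the day-of-week-only model with a wday * term interaction
mod2 <- lm(n ~ wday * term, data = daily)

daily %>%
  gather_residuals(without_term = mod1, with_term = mod2) %>%
  ggplot(aes(date, resid, colour = model)) +
  geom_line(alpha = 0.75)
```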
Our model is finding the *mean* effect, but we have a lot of big outliers, so mean tends to be far away from the typical value.
We can alleviate this problem by using a model that is robust to the effect of outliers: `MASS::rlm()`.
This greatly reduces the impact of the outliers on our estimates, and gives a model that does a good job of removing the day of week pattern:
```{r, warn = FALSE}
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
@ -359,10 +392,10 @@ daily %>%
It's now much easier to see the long-term trend, and the positive and negative outliers.
### Computed variables
If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
For example, we could write:
```{r}
compute_vars <- function(data) {
@ -381,11 +414,17 @@ wday2 <- function(x) wday(x, label = TRUE)
mod3 <- lm(n ~ wday2(date) * term(date), data = daily)
```
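The body of `compute_vars()` is elided above; it might look roughly like this, assuming lubridate and the `term()` helper sketched earlier:

```{r}
# Sketch: compute both derived variables in one place
compute_vars <- function(data) {
  data %>%
    mutate(
      term = term(date),
      wday = lubridate::wday(date, label = TRUE)
    )
}
```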
Either approach is reasonable.
Making the transformed variable explicit is useful if you want to check your work, or use them in a visualisation.
But you can't easily use transformations (like splines) that return multiple columns.
Including the transformations in the model function makes life a little easier when you're working with many different datasets because the model is self contained.
### Time of year: an alternative approach
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model.
An alternative to using our knowledge explicitly in the model is to give the data more room to speak.
We could use a more flexible model and allow that to capture the pattern we're interested in.
A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
```{r}
library(splines)
@ -399,16 +438,16 @@ daily %>%
geom_point()
```
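The fit itself is largely elided by the diff; a sketch of what a robust spline model and its predictions might look like, assuming modelr is attached:

```{r}
# Sketch: robust fit with a natural spline over the year, then plot predictions
mod <- MASS::rlm(n ~ wday * splines::ns(date, 5), data = daily)

daily %>%
  data_grid(wday, date = seq_range(date, n = 13)) %>%
  add_predictions(mod) %>%
  ggplot(aes(date, pred, colour = wday)) +
  geom_line() +
  geom_point()
```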
We see a strong pattern in the numbers of Saturday flights.
This is reassuring, because we also saw that pattern in the raw data.
It's a good sign when you get the same signal from different approaches.
### Exercises
1. Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1.
(Hint: they all have the same explanation.) How would these days generalise to another year?
2. What do the three days with high positive residuals represent?
How would these days generalise to another year?
```{r}
@ -416,55 +455,41 @@ We see a strong pattern in the numbers of Saturday flights. This is reassuring,
slice_max(n = 3, resid)
```
3. Create a new variable that splits the `wday` variable into terms, but only for Saturdays, i.e. it should have `Thurs`, `Fri`, but `Sat-summer`, `Sat-spring`, `Sat-fall`.
How does this model compare with the model with every combination of `wday` and `term`?
4. Create a new `wday` variable that combines the day of week, term (for Saturdays), and public holidays.
What do the residuals of that model look like?
5. What happens if you fit a day of week effect that varies by month (i.e. `n ~ wday * month`)?
Why is this not very helpful?
6. What would you expect the model `n ~ wday + ns(date, 5)` to look like?
Knowing what you know about the data, why would you expect it to be not particularly effective?
7. We hypothesised that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday.
Explore that hypothesis by seeing how it breaks down based on distance and time: if it's true, you'd expect to see more Sunday evening flights to places that are far away.
8. It's a little frustrating that Sunday and Saturday are on separate ends of the plot.
Write a small function to set the levels of the factor so that the week starts on Monday.
## Learning more about models
We have only scratched the absolute surface of modelling, but you have hopefully gained some simple, but general-purpose tools that you can use to improve your own data analyses.
It's OK to start simple!
As you've seen, even very simple models can make a dramatic difference in your ability to tease out interactions between variables.
These modelling chapters are even more opinionated than the rest of the book.
I approach modelling from a somewhat different perspective to most others, and there is relatively little space devoted to it.
Modelling really deserves a book on its own, so I'd highly recommend that you read at least one of these three books:
- *Statistical Modeling: A Fresh Approach* by Danny Kaplan, <http://project-mosaic-books.com/?page_id=13>.
This book provides a gentle introduction to modelling, where you build your intuition, mathematical tools, and R skills in parallel.
The book replaces a traditional "introduction to statistics" course, providing a curriculum that is up-to-date and relevant to data science.
- *An Introduction to Statistical Learning* by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, <http://www-bcf.usc.edu/~gareth/ISL/> (available online for free).
This book presents a family of modern modelling techniques collectively known as statistical learning.
For an even deeper understanding of the math behind the models, read the classic *Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, <https://web.stanford.edu/~hastie/Papers/ESLII.pdf> (also available online for free).
- *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson, <http://appliedpredictivemodeling.com>.
This book is a companion to the **caret** package and provides practical tools for dealing with real-life predictive modelling challenges.
@ -6,34 +6,30 @@ In this chapter you're going to learn three powerful ideas that help you to work
1. Using many simple models to better understand complex datasets.
2. Using list-columns to store arbitrary data structures in a data frame.
For example, this will allow you to have a column that contains linear models.
3. Using the **broom** package, by David Robinson, to turn models into tidy data.
This is a powerful technique for working with large numbers of models because once you have tidy data, you can apply all of the techniques that you've learned about earlier in the book.
We'll start by diving into a motivating example using data about life expectancy around the world.
It's a small dataset but it illustrates how important modelling can be for improving your visualisations.
We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain.
We'll also see how model summaries can help us pick out outliers and unusual trends.
The following sections will dive into more detail about the individual techniques:
1. In [list-columns](#list-columns-1), you'll learn more about the list-column data structure, and why it's valid to put lists in data frames.
2. In [creating list-columns], you'll learn the three main ways in which you'll create list-columns.
3. In [simplifying list-columns] you'll learn how to convert list-columns back to regular atomic vectors (or sets of atomic vectors) so you can work with them more easily.
4. In [making tidy data with broom], you'll learn about the full set of tools provided by broom, and see how they can be applied to other types of data structure.
This chapter is somewhat aspirational: if this book is your first introduction to R, this chapter is likely to be a struggle.
It requires you to have deeply internalised ideas about modelling, data structures, and iteration.
So don't worry if you don't get it --- just put this chapter aside for a few months, and come back when you want to stretch your brain.
### Prerequisites
@ -46,16 +42,22 @@ library(tidyverse)
## gapminder
To motivate the power of many simple models, we're going to look into the "gapminder" data.
This data was popularised by Hans Rosling, a Swedish doctor and statistician.
If you've never heard of him, stop reading this chapter right now and go watch one of his videos!
He is a fantastic data presenter and illustrates how you can use data to present a compelling story.
A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.
The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP.
The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:
```{r}
library(gapminder)
gapminder
```
In this case study, we're going to focus on just three variables to answer the question "How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?".
A good place to start is with a plot:
```{r}
gapminder %>%
@ -63,9 +65,15 @@ gapminder %>%
geom_line(alpha = 1/3)
```
This is a small dataset: it only has \~1,700 observations and 3 variables.
But it's still hard to see what's going on!
Overall, it looks like life expectancy has been steadily improving.
However, if you look closely, you might notice some countries that don't follow this pattern.
How can we make those countries easier to see?
One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends.
We'll tease these factors apart by fitting a model with a linear trend.
The model captures steady growth over time, and the residuals will show what's left.
You already know how you'd do that if we had a single country:
@ -95,7 +103,12 @@ How can we easily fit that model to every country?
### Nested data
You could imagine copy and pasting that code multiple times; but you've already learned a better way!
Extract out the common code with a function and repeat using a map function from purrr.
This problem is structured a little differently to what you've seen before.
Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows.
To do that, we need a new data structure: the **nested data frame**.
To create a nested data frame we start with a grouped data frame, and "nest" it:
```{r}
by_country <- gapminder %>%
@ -107,19 +120,26 @@ by_country
(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)
This creates a data frame that has one row per group (per country), and a rather unusual column: `data`.
`data` is a list of data frames (or tibbles, to be precise).
This seems like a crazy idea: we have a data frame with a column that is a list of other data frames!
I'll explain shortly why I think this is a good idea.
The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects.
Unfortunately using `str()` is not recommended as it will often produce very long output.
But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).
```{r}
by_country$data[[1]]
```
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group.
Another way to think about a nested dataset is we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.
### List-columns
Now that we have our nested data frame, we're in a good position to fit some models.
We have a model-fitting function:
```{r}
country_model <- function(df) {
@ -127,15 +147,20 @@ country_model <- function(df) {
}
```
And we want to apply it to every data frame.
The data frames are in a list, so we can use `purrr::map()` to apply `country_model` to each element:
```{r}
models <- map(by_country$data, country_model)
```
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame.
Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea.
In the course of working with these countries, we are going to have lots of lists where we have one element per country.
So why not store them all together in one data frame?
In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame.
That's a job for `dplyr::mutate()`:
```{r}
by_country <- by_country %>%
@ -143,7 +168,8 @@ by_country <- by_country %>%
by_country
```
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange.
The semantics of the data frame takes care of that for you:
```{r}
by_country %>%
@ -152,11 +178,14 @@ by_country %>%
arrange(continent, country)
```
If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync.
If you forget, your code will continue to work, but it will give the wrong answer!
### Unnesting
Previously we computed the residuals of a single model with a single dataset.
Now we have 142 data frames and 142 models.
To compute the residuals, we need to call `add_residuals()` with each model-data pair:
```{r}
by_country <- by_country %>%
@ -166,7 +195,9 @@ by_country <- by_country %>%
by_country
```
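The `mutate()` call is elided in the diff; a sketch of the likely step, assuming the list-column of models is named `model` and modelr is attached:

```{r}
# Sketch: add a residuals data frame for each country with map2()
by_country <- by_country %>%
  mutate(resids = map2(data, model, add_residuals))
by_country
```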
But how can you plot a list of data frames?
Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame.
Previously we used `nest()` to turn a regular data frame into a nested data frame, and now we do the opposite with `unnest()`:
```{r}
resids <- unnest(by_country, resids)
@ -186,7 +217,7 @@ resids %>%
```
Facetting by continent is particularly revealing:
```{r}
resids %>%
ggplot(aes(year, resid, group = country)) +
@ -194,11 +225,18 @@ resids %>%
facet_wrap(~continent)
```
It looks like we've missed some mild patterns.
There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there.
We'll explore that more in the next section, attacking it from a slightly different angle.
### Model quality
Instead of looking at the residuals from the model, we could look at some general measurements of model quality.
You learned how to compute some specific measures in the previous chapter.
Here we'll show a different approach using the broom package.
The broom package provides a general set of functions to turn models into tidy data.
Here we'll use `broom::glance()` to extract some model quality metrics.
If we apply it to a model, we get a data frame with a single row:
```{r}
broom::glance(nz_mod)
@ -223,7 +261,9 @@ glance %>%
arrange(r.squared)
```
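The code that builds the per-country `glance` data frame is elided above; it is probably created along these lines (a sketch; the exact call in the source may differ):

```{r}
# Sketch: one row of glance() output per country
glance <- by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  select(country, continent, glance) %>%
  unnest(glance)
glance
```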
The worst models all appear to be in Africa.
Let's double check that with a plot.
Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:
```{r}
glance %>%
@ -247,24 +287,25 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa
### Exercises
1. A linear trend seems to be slightly too simple for the overall trend.
Can you do better with a quadratic polynomial?
How can you interpret the coefficients of the quadratic?
(Hint: you might want to transform `year` so that it has mean zero.)
2. Explore other methods for visualising the distribution of $R^2$ per continent.
You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.
3. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset.
It's possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`.
How?
## List-columns {#list-columns-1}
Now that you've seen a basic workflow for managing many models, let's dive back into some of the details.
In this section, we'll explore the list-column data structure in a little more detail.
It's only recently that I've really appreciated the idea of the list-column.
List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors.
A list is a vector, so it's always been legitimate to use a list as a column of a data frame.
However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:
```{r}
data.frame(x = list(1:3, 3:5))
@ -298,41 +339,42 @@ tribble(
)
```
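For contrast with `data.frame()`, here is a quick illustration (not taken from the source) showing that `tibble()` simply keeps the list as a list-column:

```{r}
# Illustrative only: tibble() stores the list in a single list-column
tibble(x = list(1:3, 3:5))
```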
List-columns are often most useful as intermediate data structures.
They're hard to work with directly, because most R functions work with atomic vectors or data frames, but the advantage of keeping related items together in a data frame is worth a little hassle.
Generally there are three parts of an effective list-column pipeline:
1. You create the list-column using one of `nest()`, `summarise()` + `list()`, or `mutate()` + a map function, as described in [Creating list-columns].
2. You create other intermediate list-columns by transforming existing list columns with `map()`, `map2()` or `pmap()`.
For example, in the case study above, we created a list-column of models by transforming a list-column of data frames.
3. You simplify the list-column back down to a data frame or atomic vector, as described in [Simplifying list-columns].
## Creating list-columns
Typically, you won't create list-columns with `tibble()`.
Instead, you'll create them from regular columns, using one of three methods:
1. With `tidyr::nest()` to convert a grouped data frame into a nested data frame where you have list-column of data frames.
2. With `mutate()` and vectorised functions that return a list.
3. With `summarise()` and summary functions that return multiple results.
Alternatively, you might create them from a named list, using `tibble::enframe()`.
Generally, when creating list-columns, you should make sure they're homogeneous: each element should contain the same type of thing.
There are no checks to make sure this is true, but if you use purrr and remember what you've learned about type-stable functions, you should find it happens naturally.
### With nesting
`nest()` creates a nested data frame, which is a data frame with a list-column of data frames.
In a nested data frame each row is a meta-observation: the other columns give variables that define the observation (like country and continent above), and the list-column of data frames gives the individual observations that make up the meta-observation.
There are two ways to use `nest()`.
So far you've seen how to use it with a grouped data frame.
When applied to a grouped data frame, `nest()` keeps the grouping columns as is, and bundles everything else into the list-column:
```{r}
gapminder %>%
@ -349,7 +391,9 @@ gapminder %>%
### From vectorised functions
Some useful functions take an atomic vector and return a list.
For example, in [strings] you learned about `stringr::str_split()` which takes a character vector and returns a list of character vectors.
If you use that inside mutate, you'll get a list-column:
```{r}
df <- tribble(
@ -372,7 +416,8 @@ df %>%
(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern).
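As a quick, made-up illustration of `separate_rows()` (the data frame and column names here are hypothetical):

```{r}
# Hypothetical example: one row per comma-separated value
tibble(id = 1:2, codes = c("a,b,c", "d,e")) %>%
  separate_rows(codes, sep = ",")
```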
Another example of this pattern is using the `map()`, `map2()`, `pmap()` from purrr.
For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:
```{r}
sim <- tribble(
@ -386,11 +431,13 @@ sim %>%
mutate(sims = invoke_map(f, params, n = 10))
```
Note that technically `sim` isn't homogeneous because it contains both double and integer vectors.
However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
### From multivalued summaries
One restriction of `summarise()` is that it only works with summary functions that return a single value.
That means that you can't use it with functions like `quantile()` that return a vector of arbitrary length:
```{r, error = TRUE}
mtcars %>%
@ -398,7 +445,8 @@ mtcars %>%
summarise(q = quantile(mpg))
```
You can however, wrap the result in a list!
This obeys the contract of `summarise()`, because each summary is now a list (a vector) of length 1.
```{r}
mtcars %>%
@ -418,7 +466,9 @@ mtcars %>%
### From a named list
Data frames with list-columns provide a solution to a common problem: what do you do if you want to iterate over both the contents of a list and its elements?
Instead of trying to jam everything into one object, it's often easier to make a data frame: one column can contain the elements, and one column can contain the list.
An easy way to create such a data frame from a list is `tibble::enframe()`.
```{r}
x <- list(
@ -444,14 +494,13 @@ df %>%
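The `enframe()` call itself is elided by the diff; as a sketch with invented list contents:

```{r}
# Sketch: a named list becomes a two-column tibble (name + value list-column)
x <- list(a = 1:5, b = 3:4, c = 5:6)
enframe(x)
```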
### Exercises
1. List all the functions that you can think of that take an atomic vector and return a list.
2. Brainstorm useful summary functions that, like `quantile()`, return multiple values.
3. What's missing in the following data frame?
How does `quantile()` return that missing piece?
Why isn't that helpful here?
```{r}
mtcars %>%
@ -460,7 +509,8 @@ df %>%
unnest(q)
```
4. What does this code do?
Why might it be useful?
```{r, eval = FALSE}
mtcars %>%
@ -470,19 +520,19 @@ df %>%
## Simplifying list-columns
To apply the techniques of data manipulation and visualisation you've learned in this book, you'll need to simplify the list-column back to a regular column (an atomic vector), or set of columns.
The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:
1. If you want a single value, use `mutate()` with `map_lgl()`, `map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.
2. If you want many values, use `unnest()` to convert list-columns back to regular columns, repeating the rows as many times as necessary.
These are described in more detail below.
### List to vector
If you can reduce your list column to an atomic vector then it will be a regular column.
For example, you can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have:
```{r}
df <- tribble(
@ -498,9 +548,12 @@ df %>% mutate(
)
```
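The `mutate()` shown here is truncated by the diff; assuming the list-column is called `x`, the summary columns are likely computed along these lines:

```{r}
# Sketch: record each element's type and length as regular atomic columns
df %>% mutate(
  type = map_chr(x, typeof),
  n = map_int(x, length)
)
```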
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering.
This is a useful technique if you have a heterogeneous list, and want to filter out the parts that aren't working for you.
Don't forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`.
This is useful for pulling apart nested lists into regular columns.
Use the `.null` argument to provide a value to use if the element is missing (instead of returning `NULL`):
```{r}
df <- tribble(
@ -516,7 +569,8 @@ df %>% mutate(
### Unnesting
`unnest()` works by repeating the regular columns once for each element of the list-column.
For example, in the following very simple case we repeat the first row four times (because the first element of `y` has length four), and the second row once:
```{r}
tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
@ -545,26 +599,24 @@ df2
df2 %>% unnest(c(y, z))
```
The same principle applies when unnesting list-columns of data frames.
You can unnest multiple list-cols as long as all the data frames in each row have the same number of rows.
### Exercises
1. Why might the `lengths()` function be useful for creating atomic vector columns from list-columns?
2. List the most common types of vector found in a data frame.
What makes lists different?
## Making tidy data with broom
The broom package provides three general tools for turning models into tidy data frames:
1. `broom::glance(model)` returns a row for each model.
Each column gives a model summary: either a measure of model quality, or complexity, or a combination of the two.
2. `broom::tidy(model)` returns a row for each coefficient in the model.
Each column gives information about the estimate or its variability.
3. `broom::augment(model, data)` returns a row for each row in `data`, adding extra values like residuals, and influence statistics.
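To make the three verbs concrete, here is a small, self-contained illustration (using `mtcars` purely as a stand-in model, not an example from the source):

```{r}
# Illustrative only: the three broom verbs on a simple linear model
fit <- lm(mpg ~ wt, data = mtcars)

broom::glance(fit)           # one row for the whole model
broom::tidy(fit)             # one row per coefficient
broom::augment(fit, mtcars)  # one row per observation, with residuals etc.
```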
@ -1,59 +1,69 @@
# (PART) Model {.unnumbered}
# Introduction {#model-intro}
Now that you are equipped with powerful programming tools we can finally return to modelling.
You'll use your new tools of data wrangling and programming, to fit many models and understand how they work.
The focus of this book is on exploration, not confirmation or formal inference.
But you'll learn a few basic tools that help you understand the variation within your models.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-model.png")
```
The goal of a model is to provide a simple low-dimensional summary of a dataset.
Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
Here we only cover "predictive" models, which, as the name suggests, generate predictions.
There is another type of model that we're not going to discuss: "data discovery" models.
These models don't make predictions, but instead help you discover interesting relationships within your data.
(These two categories of models are sometimes called supervised and unsupervised, but I don't think that terminology is particularly illuminating.)
This book is not going to give you a deep understanding of the mathematical theory that underlies models.
It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:
- In [model basics], you'll learn how models work mechanistically, focussing on the important family of linear models.
You'll learn general tools for gaining insight into what a predictive model tells you about your data, focussing on simple simulated datasets.
- In [model building], you'll learn how to use models to pull out known patterns in real data.
Once you have recognised an important pattern it's useful to make it explicit in a model, because then you can more easily see the subtler signals that remain.
- In [many models], you'll learn how to use many simple models to help understand complex datasets.
This is a powerful technique, but to access it you'll need to combine modelling and programming tools.
These topics are notable because of what they don't include: any tools for quantitatively assessing models.
That is deliberate: precisely quantifying a model requires a couple of big ideas that we just don't have the space to cover here.
For now, you'll rely on qualitative assessment and your natural scepticism.
In [Learning more about models], we'll point you to other resources where you can learn more.
## Hypothesis generation vs. hypothesis confirmation
In this book, we are going to use models as a tool for exploration, completing the trifecta of the tools for EDA that were introduced in Part 1.
This is not how models are usually taught, but as you will see, models are an important tool for exploration.
Traditionally, the focus of modelling is on inference, or for confirming that an hypothesis is true.
Doing this correctly is not complicated, but it is hard.
There is a pair of ideas that you must understand in order to do inference correctly:
1. Each observation can either be used for exploration or confirmation, not both.
2. You can use an observation as many times as you like for exploration, but you can only use it once for confirmation.
As soon as you use an observation twice, you've switched from confirmation to exploration.
This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis.
Otherwise you will be over optimistic.
There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.
If you are serious about doing a confirmatory analysis, one approach is to split your data into three pieces before you begin the analysis:
1. 60% of your data goes into a **training** (or exploration) set.
You're allowed to do anything you like with this data: visualise it and fit tons of models to it.
2. 20% goes into a **query** set.
You can use this data to compare models or visualisations by hand, but you're not allowed to use it as part of an automated process.
3. 20% is held back for a **test** set.
You can only use this data ONCE, to test your final model.
This partitioning allows you to explore the training data, occasionally generating candidate hypotheses that you check with the query set.
When you are confident you have the right model, you can check it once with the test data.
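For example, here is a minimal sketch of one way to create such a split; the object names are just illustrative, and it assumes your data lives in a data frame called `df`:

```{r, eval = FALSE}
set.seed(123)  # make the random split reproducible
group <- sample(
  c("train", "query", "test"),
  size = nrow(df),
  replace = TRUE,
  prob = c(0.6, 0.2, 0.2)  # roughly 60/20/20
)

train <- df[group == "train", ]
query <- df[group == "query", ]
test  <- df[group == "test", ]
```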
(Note that even when doing confirmatory modelling, you will still need to do EDA. If you don't do any EDA you will remain blind to the quality problems with your data.)
pipes.Rmd
## Introduction
Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
So far, you've been using them without knowing how they work, or what the alternatives are.
Now, in this chapter, it's time to explore the pipe in more detail.
You'll learn the alternatives to the pipe, when you shouldn't use the pipe, and some useful related tools.
### Prerequisites
The pipe, `%>%`, comes from the **magrittr** package by Stefan Milton Bache.
Packages in the tidyverse load `%>%` for you automatically, so you don't usually load magrittr explicitly.
Here, however, we're focussing on piping, and we aren't loading any other packages, so we will load it explicitly.
```{r setup, message = FALSE}
library(magrittr)
```
## Piping alternatives
The point of the pipe is to help you write code in a way that is easier to read and understand.
To see why the pipe is so useful, we're going to explore a number of ways of writing the same code.
Let's use code to tell a story about a little bunny named Foo Foo:
> Little bunny Foo Foo\
> Went hopping through the forest\
> Scooping up the field mice\
> And bopping them on the head
This is a popular children's poem that is accompanied by hand actions.
We'll start by defining an object to represent little bunny Foo Foo:

```{r, eval = FALSE}
foo_foo <- little_bunny()
```
And we'll use a function for each key verb: `hop()`, `scoop()`, and `bop()`.
Using this object and these verbs, there are (at least) four ways we could retell the story in code:
1. Save each intermediate step as a new object.
2. Overwrite the original object many times.
3. Compose functions.
4. Use the pipe.
We'll work through each approach, showing you the code and talking about the advantages and disadvantages.
### Intermediate steps

The simplest approach is to save each step as a new object:

```{r, eval = FALSE}
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
The main downside of this form is that it forces you to name each intermediate element.
If there are natural names, this is a good idea, and you should do it.
But many times, like in this example, there aren't natural names, and you add numeric suffixes to make the names unique.
That leads to two problems:
1. The code is cluttered with unimportant names.
2. You have to carefully increment the suffix on each line.
Whenever I write code like this, I invariably use the wrong number on one line and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
You may also worry that this form creates many copies of your data and takes up a lot of memory.
Surprisingly, that's not the case.
First, note that proactively worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before.
Second, R isn't stupid, and it will share columns across data frames, where possible.
Let's take a look at an actual data manipulation pipeline where we add a new column to `ggplot2::diamonds`:
```{r}
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>%
  dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
`pryr::object_size()` gives the memory occupied by all of its arguments.
The results seem counterintuitive at first:
- `diamonds` takes up 3.46 MB,
- `diamonds2` takes up 3.89 MB,
- `diamonds` and `diamonds2` together take up 3.89 MB!
How can that work?
Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data, so the two data frames have variables in common.
These variables will only get copied if you modify one of them.
In the following example, we modify a single value in `diamonds$carat`.
That means the `carat` variable can no longer be shared between the two data frames, and a copy must be made.
The size of each data frame is unchanged, but the collective size increases:
```{r}
diamonds$carat[1] <- NA
pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
(Note that we use `pryr::object_size()` here, not the built-in `object.size()`.
`object.size()` only takes a single object so it can't compute how data is shared across multiple objects.)
### Overwrite the original
Instead of creating intermediate objects at each step, we could overwrite the original object:

```{r, eval = FALSE}
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```
This is less typing (and less thinking), so you're less likely to make mistakes.
However, there are two problems:
1. Debugging is painful: if you make a mistake you'll need to re-run the complete pipeline from the beginning.
2. The repetition of the object being transformed (we've written `foo_foo` six times!) obscures what's changing on each line.
### Function composition
Another approach is to abandon assignment and just string the function calls together:

```{r, eval = FALSE}
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
```
Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the [Dagwood sandwich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
In short, this code is hard for a human to consume.
### Use the pipe
Finally, we can use the pipe:
```{r, eval = FALSE}
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
bop(on = head)
```
This is my favourite form, because it focusses on verbs, not nouns.
You can read this series of function compositions like it's a set of imperative actions.
Foo Foo hops, then scoops, then bops.
The downside, of course, is that you need to be familiar with the pipe.
If you've never seen `%>%` before, you'll have no idea what this code does.
Fortunately, most people pick up the idea very quickly, so when you share your code with others who aren't familiar with the pipe, you can easily teach them.
The pipe works by performing a "lexical transformation": behind the scenes, magrittr reassembles the code in the pipe to a form that works by overwriting an intermediate object.
When you run a pipe like the one above, magrittr does something like this:
```{r, eval = FALSE}
my_pipe <- function(.) {
  . <- hop(., through = forest)
  . <- scoop(., up = field_mice)
  bop(., on = head)
}
my_pipe(foo_foo)
```
This means that the pipe won't work for two classes of functions:
1. Functions that use the current environment.
For example, `assign()` will create a new variable with the given name in the current environment:
```{r}
assign("x", 10)
x
"x" %>% assign(100)
x
```
The use of `assign()` with the pipe does not work because it assigns the value to a temporary environment used by `%>%`.
If you do want to use `assign()` with the pipe, you must be explicit about the environment:
```{r}
env <- environment()
"x" %>% assign(100, envir = env)
x
```
Other functions with this problem include `get()` and `load()`.
2. Functions that use lazy evaluation.
In R, function arguments are only computed when the function uses them, not prior to calling the function.
The pipe computes each element in turn, so you can't rely on this behaviour.
One place that this is a problem is `tryCatch()`, which lets you capture and handle errors:
```{r, error = TRUE}
tryCatch(stop("!"), error = function(e) "An error")
stop("!") %>%
tryCatch(error = function(e) "An error")
```
There is a relatively wide class of functions with this behaviour, including `try()`, `suppressMessages()`, and `suppressWarnings()` in base R.
## When not to use the pipe
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem!
Pipes are most useful for rewriting a fairly short linear sequence of operations.
I think you should reach for another tool when:
- Your pipes are longer than (say) ten steps.
In that case, create intermediate objects with meaningful names.
That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent (as sketched below this list).
- You have multiple inputs or outputs.
If there isn't one primary object being transformed, but two or more objects being combined together, don't use the pipe.
- You are starting to think about a directed graph with a complex dependency structure.
Pipes are fundamentally linear and expressing complex relationships with them will typically yield confusing code.
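To make the first point concrete, here is a minimal sketch (using the built-in `mtcars` data purely for illustration) of trading one long pipe for named intermediates:

```{r}
# Each intermediate gets a meaningful name, so every step can be inspected on its own
four_cyl    <- subset(mtcars, cyl == 4)
with_kpl    <- transform(four_cyl, kpl = mpg * 0.425)
kpl_by_gear <- aggregate(kpl ~ gear, data = with_kpl, FUN = mean)
kpl_by_gear
```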
## Other tools from magrittr
All packages in the tidyverse automatically make `%>%` available for you, so you don't normally load magrittr explicitly.
However, there are some other useful tools inside magrittr that you might want to try out:
- When working with more complex pipes, it's sometimes useful to call a function for its side-effects.
Maybe you want to print out the current object, or plot it, or save it to disk.
Many times, such functions don't return anything, effectively terminating the pipe.
To work around this problem, you can use the "tee" pipe.
`%T>%` works like `%>%` except that it returns the left-hand side instead of the right-hand side.
It's called "tee" because it's like a literal T-shaped pipe.
```{r}
rnorm(100) %>%
matrix(ncol = 2) %>%
plot() %>%
str()
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot() %>%
str()
```
- If you're working with functions that don't have a data frame based API\
(i.e. you pass them individual vectors, not a data frame and expressions to be evaluated in the context of that data frame), you might find `%$%` useful.
It "explodes" out the variables in a data frame so that you can refer to them explicitly.
This is useful when working with many functions in base R:
```{r}
mtcars %$%
cor(disp, mpg)
```
- For assignment, magrittr provides the `%<>%` operator, which allows you to replace code like:
```{r, eval = FALSE}
mtcars <- mtcars %>%
transform(cyl = cyl * 2)
```
with
```{r, eval = FALSE}
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the name of the object twice) is fine in return for making assignment more explicit.
# (PART) Program {.unnumbered}
# Introduction {#program-intro}
In this part of the book, you'll improve your programming skills.
Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-program.png")
```
Programming produces code, and code is a tool of communication.
Obviously code tells the computer what you want it to do.
But it also communicates meaning to other humans.
Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative.
Even if you're not working with other people, you'll definitely be working with future-you!
Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did.
That means getting better at programming also involves getting better at communicating.
Over time, you want your code to become not just easier to write, but easier for others to read.
Writing code is similar in many ways to writing prose.
One parallel which I find particularly useful is that in both cases rewriting is the key to clarity.
The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times.
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done.
If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run.
(But the more you rewrite your functions the more likely your first attempt will be clear.)
In the following four chapters, you'll learn skills that will allow you to both tackle new programs and to solve existing problems with greater clarity and ease:
1. In [pipes], you will dive deep into the **pipe**, `%>%`, and learn more about how it works, what the alternatives are, and when not to use it.
2. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
Instead, in [functions], you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by [vectors].
You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.
4. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
You need tools for **iteration** that let you do similar things again and again.
These tools include for loops and functional programming, which you'll learn about in [iteration].
## Learning more
The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount.
Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills.
Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
To learn more you need to study R as a programming language, not just an interactive environment for data science.
We have written two books that will help you do so:
- [*Hands-On Programming with R*](https://amzn.com/1449359019), by Garrett Grolemund.
This is an introduction to R as a programming language and is a great place to start if R is your first programming language.
It covers similar material to these chapters, but with a different style and different motivating examples (based in the casino).
It's a useful complement if you find that these four chapters go by too quickly.
- [*Advanced R*](https://amzn.com/1466586966) by Hadley Wickham.
This dives into the details of R the programming language.
This is a great place to start if you have existing programming experience.
It's also a great next step once you've internalised the ideas in these chapters.
You can read it online at <http://adv-r.had.co.nz>.
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
BuildType: Website
MarkdownWrap: Sentence
MarkdownCanonical: Yes
SpellingDictionary: en_GB
## Introduction
It's rare that a data analysis involves only a single table of data.
Typically you have many tables of data, and you must combine them to answer the questions that you're interested in.
Collectively, multiple tables of data are called **relational data** because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables.
All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.
Sometimes both elements of a pair can be the same table!
This is needed if, for example, you have a table of people, and each person has a reference to their parents.
To work with relational data you need verbs that work with pairs of tables.
There are three families of verbs designed to work with relational data:
- **Mutating joins**, which add new variables to one data frame from matching observations in another.
- **Filtering joins**, which filter observations from one data frame based on whether or not they match an observation in the other table.
- **Set operations**, which treat observations as if they were set elements.
The most common place to find relational data is in a *relational* database management system (or RDBMS), a term that encompasses almost all modern databases.
If you've used a database before, you've almost certainly used SQL.
If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
### Prerequisites
We will explore relational data from nycflights13 using the two-table verbs from dplyr.

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```
## nycflights13 {#nycflights13-relational}
We will use the nycflights13 package to learn about relational data.
nycflights13 contains four tibbles: `airlines`, `airports`, `weather`, and `planes`, which are all related to the `flights` table that you used in [data transformation]:
- `airlines` lets you look up the full carrier name from its abbreviated code:
```{r}
airlines
```
- `airports` gives information about each airport, identified by the `faa` airport code:
```{r}
airports
```
- `planes` gives information about each plane, identified by its `tailnum`:
```{r}
planes
```
- `weather` gives the weather at each NYC airport for each hour:
```{r}
weather
```

One way to show the relationships between the different tables is with a drawing:

```{r, echo = FALSE}
knitr::include_graphics("diagrams/relational-nycflights.png")
```
This diagram is a little overwhelming, but it's simple compared to some you'll see in the wild!
The key to understanding diagrams like this is to remember each relation always concerns a pair of tables.
You don't need to understand the whole thing; you just need to understand the chain of relations between the tables that you are interested in.
For nycflights13:
- `flights` connects to `planes` via a single variable, `tailnum`.
- `flights` connects to `airlines` through the `carrier` variable.
- `flights` connects to `airports` in two ways: via the `origin` and `dest` variables.
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time).
### Exercises
1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination.
What variables would you need?
What tables would you need to combine?
2. I forgot to draw the relationship between `weather` and `airports`.
What is the relationship and how should it appear in the diagram?
3. `weather` only contains information for the origin (NYC) airports.
If it contained weather records for all airports in the USA, what additional relation would it define with `flights`?
4. We know that some days of the year are "special", and fewer people than usual fly on them.
How might you represent that data as a data frame?
What would be the primary keys of that table?
How would it connect to the existing tables?
## Keys
The variables used to connect each pair of tables are called **keys**.
A key is a variable (or set of variables) that uniquely identifies an observation.
In simple cases, a single variable is sufficient to identify an observation.
For example, each plane is uniquely identified by its `tailnum`.
In other cases, multiple variables may be needed.
For example, to identify an observation in `weather` you need five variables: `year`, `month`, `day`, `hour`, and `origin`.
There are two types of keys:
- A **primary key** uniquely identifies an observation in its own table.
For example, `planes$tailnum` is a primary key because it uniquely identifies each plane in the `planes` table.
- A **foreign key** uniquely identifies an observation in another table.
For example, `flights$tailnum` is a foreign key because it appears in the `flights` table where it matches each flight to a unique plane.
A variable can be both a primary key *and* a foreign key.
For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table.
Once you've identified the primary keys in your tables, it's good practice to verify that they do indeed uniquely identify each observation.
One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one:
```{r}
planes %>%
  count(tailnum) %>%
  filter(n > 1)

weather %>%
  count(year, month, day, hour, origin) %>%
filter(n > 1)
```
Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it.
For example, what's the primary key in the `flights` table?
You might think it would be the date plus the flight or tail number, but neither of those are unique:
```{r}
flights %>%
  count(year, month, day, flight) %>%
  filter(n > 1)

flights %>%
  count(year, month, day, tailnum) %>%
filter(n > 1)
```
When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight.
Unfortunately that is not the case!
If a table lacks a primary key, it's sometimes useful to add one with `mutate()` and `row_number()`.
That makes it easier to match observations if you've done some filtering and want to check back in with the original data.
This is called a **surrogate key**.
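For example, a minimal sketch of adding one to `flights` (the column name `flight_id` is just illustrative):

```{r}
flights %>%
  mutate(flight_id = row_number()) %>%
  select(flight_id, everything())
```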
A primary key and the corresponding foreign key in another table form a **relation**.
Relations are typically one-to-many.
For example, each flight has one plane, but each plane has many flights.
In other data, you'll occasionally see a 1-to-1 relationship.
You can think of this as a special case of 1-to-many.
You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.
For example, in this data there's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
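You can see that many-to-many structure in nycflights13 by treating `flights` as the linking table between airlines and airports; a quick sketch of the distinct airline--destination pairs:

```{r}
flights %>%
  select(carrier, dest) %>%
  distinct()
```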
### Exercises
1. Add a surrogate key to `flights`.
2. Identify the keys in the following datasets
1. `Lahman::Batting`
2. `babynames::babynames`
3. `nasaweather::atmos`
4. `fueleconomy::vehicles`
5. `ggplot2::diamonds`
(You might need to install some packages and read some documentation.)
3. Draw a diagram illustrating the connections between the `Batting`, `People`, and `Salaries` tables in the Lahman package.
Draw another diagram that shows the relationship between `People`, `Managers`, `AwardsManagers`.
How would you characterise the relationship between the `Batting`, `Pitching`, and `Fielding` tables?
## Mutating joins {#mutating-joins}
The first tool we'll look at for combining a pair of tables is the **mutating join**.
A mutating join allows you to combine variables from two tables.
It first matches observations by their keys, then copies across variables from one table to the other.
Like `mutate()`, the join functions add variables to the right, so if you have a lot of variables already, the new variables won't get printed out.
For these examples, we'll make it easier to see what's going on by creating a narrower dataset:
```{r}
flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
```
(Remember, when you're in RStudio, you can also use `View()` to avoid this problem.)
Imagine you want to add the full airline name to the `flights2` data.
You can combine the `airlines` and `flights2` data frames with `left_join()`:
```{r}
flights2 %>%
  select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
```
The result of joining airlines to flights2 is an additional variable: `name`.
This is why I call this type of join a mutating join.
In this case, you could have got to the same place using `mutate()` and R's base subsetting:
```{r}
flights2 %>%
  select(-origin, -dest) %>%
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
But this is hard to generalise when you need to match multiple variables, and takes close reading to figure out the overall intent.
The following sections explain, in detail, how mutating joins work.
You'll start by learning a useful visual representation of joins.
We'll then use that to explain the four mutating join functions: the inner join, and the three outer joins.
When working with real data, keys don't always uniquely identify observations, so next we'll talk about what happens when there isn't a unique match.
Finally, you'll learn how to tell dplyr which variables are the keys for a given join.
### Understanding joins
To help you learn how joins work, I'm going to use a visual representation:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-setup.png")
```
```{r}
x <- tribble(
~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
```
The coloured column represents the "key" variable: these are used to match the rows between the tables.
The grey column represents the "value" column that is carried along for the ride.
In these examples I'll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`.
The following diagram shows each potential match as an intersection of a pair of lines.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-setup2.png")
```
(If you look closely, you might notice that we've switched the order of the key and value columns in `x`. This is to emphasise that joins match based on the key; the value is just carried along for the ride.)
In an actual join, matches will be indicated with dots.
The number of dots = the number of matches = the number of rows in the output.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-inner.png")
```
### Inner join {#inner-join}
The simplest type of join is the **inner join**.
An inner join matches pairs of observations whenever their keys are equal:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-inner.png")
```
(To be precise, this is an inner **equijoin** because the keys are matched using the equality operator. Since most joins are equijoins we usually drop that specification.)
The output of an inner join is a new data frame that contains the key, the x values, and the y values.
We use `by` to tell dplyr which variable is the key:
```{r}
x %>%
inner_join(y, by = "key")
```
The most important property of an inner join is that unmatched rows are not included in the result.
This means that inner joins are usually not appropriate for use in analysis because it's too easy to lose observations.
### Outer joins {#outer-join}
An inner join keeps observations that appear in both tables.
An **outer join** keeps observations that appear in at least one of the tables.
There are three types of outer joins:
- A **left join** keeps all observations in `x`.
- A **right join** keeps all observations in `y`.
- A **full join** keeps all observations in `x` and `y`.
These joins work by adding an additional "virtual" observation to each table.
This observation has a key that always matches (if no other key matches), and a value filled with `NA`.
Graphically, that looks like:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-outer.png")
```
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn't a match.
The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
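For example, with the `x` and `y` tables defined above you can compare which rows each outer join keeps:

```{r}
x %>% left_join(y, by = "key")
x %>% right_join(y, by = "key")
x %>% full_join(y, by = "key")
```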
Another way to depict the different types of joins is with a Venn diagram:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-venn.png")
```
However, this is not a great representation.
It might jog your memory about which join preserves the observations in which table, but it suffers from a major limitation: a Venn diagram can't show what happens when keys don't uniquely identify an observation.
### Duplicate keys {#join-matches}
So far all the diagrams have assumed that the keys are unique.
But that's not always the case.
This section explains what happens when the keys are not unique.
There are two possibilities:
1. One table has duplicate keys.
This is useful when you want to add in additional information as there is typically a one-to-many relationship.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-one-to-many.png")
```
Note that I've put the key column in a slightly different position in the output.
This reflects that the key is a primary key in `y` and a foreign key in `x`.
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
```
2. Both tables have duplicate keys.
This is usually an error because in neither table do the keys uniquely identify an observation.
When you join duplicated keys, you get all possible combinations, the Cartesian product:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-many-to-many.png")
```
### Defining the key columns {#join-by}
So far, the pairs of tables have always been joined by a single variable, and that variable has the same name in both tables.
That constraint was encoded by `by = "key"`.
You can use other values for `by` to connect the tables in other ways:
- The default, `by = NULL`, uses all variables that appear in both tables, the so called **natural** join.
For example, the flights and weather tables match on their common variables: `year`, `month`, `day`, `hour` and `origin`.
```{r}
flights2 %>%
left_join(weather)
```
* A character vector, `by = "x"`. This is like a natural join, but uses only
some of the common variables. For example, `flights` and `planes` have
`year` variables, but they mean different things so we only want to join by
`tailnum`.
- A character vector, `by = "x"`.
This is like a natural join, but uses only some of the common variables.
For example, `flights` and `planes` have `year` variables, but they mean different things so we only want to join by `tailnum`.
```{r}
flights2 %>%
left_join(planes, by = "tailnum")
```
Note that the `year` variables (which appear in both input data frames, but are not constrained to be equal) are disambiguated in the output with a suffix.
- A named character vector: `by = c("a" = "b")`.
This will match variable `a` in table `x` to variable `b` in table `y`.
The variables from `x` will be used in the output.
For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (`lat` and `lon`) of each airport.
Each flight has an origin and destination `airport`, so we need to specify which one we want to join to:
```{r}
flights2 %>%
left_join(airports, c("dest" = "faa"))
flights2 %>%
left_join(airports, c("origin" = "faa"))
```
### Exercises
1. Compute the average delay by destination, then join on the `airports`
data frame so you can show the spatial distribution of delays. Here's an
easy way to draw a map of the United States:
1. Compute the average delay by destination, then join on the `airports` data frame so you can show the spatial distribution of delays.
Here's an easy way to draw a map of the United States:
```{r, eval = FALSE}
airports %>%
@ -395,21 +433,18 @@ So far, the pairs of tables have always been joined by a single variable, and th
coord_quickmap()
```
(Don't worry if you don't understand what `semi_join()` does --- you'll
learn about it next.)
(Don't worry if you don't understand what `semi_join()` does --- you'll learn about it next.)
You might want to use the `size` or `colour` of the points to display
the average delay for each airport.
You might want to use the `size` or `colour` of the points to display the average delay for each airport.
1. Add the location of the origin _and_ destination (i.e. the `lat` and `lon`)
to `flights`.
2. Add the location of the origin *and* destination (i.e. the `lat` and `lon`) to `flights`.
1. Is there a relationship between the age of a plane and its delays?
3. Is there a relationship between the age of a plane and its delays?
1. What weather conditions make it more likely to see a delay?
4. What weather conditions make it more likely to see a delay?
1. What happened on June 13 2013? Display the spatial pattern of delays,
and then use Google to cross-reference with the weather.
5. What happened on June 13 2013?
Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
```{r, eval = FALSE, include = FALSE}
worst <- filter(flights, !is.na(dep_time), month == 6, day == 13)
@ -428,36 +463,40 @@ So far, the pairs of tables have always been joined by a single variable, and th
`base::merge()` can perform all four types of mutating join:
dplyr | merge
-------------------|-------------------------------------------
`inner_join(x, y)` | `merge(x, y)`
`left_join(x, y)` | `merge(x, y, all.x = TRUE)`
`right_join(x, y)` | `merge(x, y, all.y = TRUE)`,
`full_join(x, y)` | `merge(x, y, all.x = TRUE, all.y = TRUE)`
| dplyr | merge |
|--------------------|-------------------------------------------|
| `inner_join(x, y)` | `merge(x, y)` |
| `left_join(x, y)` | `merge(x, y, all.x = TRUE)` |
| `right_join(x, y)` | `merge(x, y, all.y = TRUE)`  |
| `full_join(x, y)` | `merge(x, y, all.x = TRUE, all.y = TRUE)` |
The advantages of the specific dplyr verbs is that they more clearly convey the intent of your code: the difference between the joins is really important but concealed in the arguments of `merge()`. dplyr's joins are considerably faster and don't mess with the order of the rows.
The advantage of the specific dplyr verbs is that they more clearly convey the intent of your code: the difference between the joins is really important but concealed in the arguments of `merge()`.
dplyr's joins are considerably faster and don't mess with the order of the rows.
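For instance, a rough sketch of the equivalence, using the `flights2` and `airports` tables from above (the column order and row order of the two results will differ):

```{r, eval = FALSE}
# dplyr: the join type is explicit and the row order of flights2 is preserved
flights2 %>%
  left_join(airports, by = c("dest" = "faa"))

# base R equivalent: the same left join, but merge() sorts the result by the key
merge(flights2, airports, by.x = "dest", by.y = "faa", all.x = TRUE)
```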
SQL is the inspiration for dplyr's conventions, so the translation is straightforward:
dplyr | SQL
-----------------------------|-------------------------------------------
`inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)`
`left_join(x, y, by = "z")` | `SELECT * FROM x LEFT OUTER JOIN y USING (z)`
`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN y USING (z)`
`full_join(x, y, by = "z")` | `SELECT * FROM x FULL OUTER JOIN y USING (z)`
| dplyr | SQL |
|------------------------------|------------------------------------------------|
| `inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)` |
| `left_join(x, y, by = "z")` | `SELECT * FROM x LEFT OUTER JOIN y USING (z)` |
| `right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN y USING (z)` |
| `full_join(x, y, by = "z")` | `SELECT * FROM x FULL OUTER JOIN y USING (z)` |
Note that "INNER" and "OUTER" are optional, and often omitted.
Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`.
As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
## Filtering joins {#filtering-joins}
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
There are two types:
* `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
* `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
- `semi_join(x, y)` **keeps** all observations in `x` that have a match in `y`.
- `anti_join(x, y)` **drops** all observations in `x` that have a match in `y`.
Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine you've found the top ten most popular destinations:
Semi-joins are useful for matching filtered summary tables back to the original rows.
For example, imagine you've found the top ten most popular destinations:
```{r}
top_dest <- flights %>%
@ -466,14 +505,17 @@ top_dest <- flights %>%
top_dest
```
Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
Now you want to find each flight that went to one of those destinations.
You could construct a filter yourself:
```{r}
flights %>%
filter(dest %in% top_dest$dest)
```
But it's difficult to extend that approach to multiple variables. For example, imagine that you'd found the 10 days with highest average delays. How would you construct the filter statement that used `year`, `month`, and `day` to match it back to `flights`?
But it's difficult to extend that approach to multiple variables.
For example, imagine that you'd found the 10 days with highest average delays.
How would you construct the filter statement that used `year`, `month`, and `day` to match it back to `flights`?
Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in `x` that have a match in `y`:
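For instance, a minimal sketch using the `top_dest` table created above:

```{r, eval = FALSE}
# Keep only the flights whose dest appears in top_dest; no new columns are added
flights %>%
  semi_join(top_dest)
```

(With `by` omitted, the join uses the shared `dest` variable, just like the natural joins shown earlier.)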
@ -488,19 +530,22 @@ Graphically, a semi-join looks like this:
knitr::include_graphics("diagrams/join-semi.png")
```
Only the existence of a match is important; it doesn't matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do:
Only the existence of a match is important; it doesn't matter which observation is matched.
This means that filtering joins never duplicate rows like mutating joins do:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-semi-many.png")
```
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that _don't_ have a match:
The inverse of a semi-join is an anti-join.
An anti-join keeps the rows that *don't* have a match:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/join-anti.png")
```
Anti-joins are useful for diagnosing join mismatches. For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don't have a match in `planes`:
Anti-joins are useful for diagnosing join mismatches.
For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don't have a match in `planes`:
```{r}
flights %>%
@ -510,67 +555,62 @@ flights %>%
### Exercises
1. What does it mean for a flight to have a missing `tailnum`? What do the
tail numbers that don't have a matching record in `planes` have in common?
(Hint: one variable explains ~90% of the problems.)
1. What does it mean for a flight to have a missing `tailnum`?
What do the tail numbers that don't have a matching record in `planes` have in common?
(Hint: one variable explains \~90% of the problems.)
1. Filter flights to only show flights with planes that have flown at least 100
flights.
2. Filter flights to only show flights with planes that have flown at least 100 flights.
1. Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the
records for the most common models.
3. Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the records for the most common models.
1. Find the 48 hours (over the course of the whole year) that have the worst
delays. Cross-reference it with the `weather` data. Can you see any
patterns?
4. Find the 48 hours (over the course of the whole year) that have the worst delays.
Cross-reference it with the `weather` data.
Can you see any patterns?
1. What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?
5. What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?
What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?
1. You might expect that there's an implicit relationship between plane
and airline, because each plane is flown by a single airline. Confirm
or reject this hypothesis using the tools you've learned above.
6. You might expect that there's an implicit relationship between plane and airline, because each plane is flown by a single airline.
Confirm or reject this hypothesis using the tools you've learned above.
## Join problems
The data you've been working with in this chapter has been cleaned up so that you'll have as few problems as possible. Your own data is unlikely to be so nice, so there are a few things that you should do with your own data to make your joins go smoothly.
The data you've been working with in this chapter has been cleaned up so that you'll have as few problems as possible.
Your own data is unlikely to be so nice, so there are a few things that you should do with your own data to make your joins go smoothly.
1. Start by identifying the variables that form the primary key in each table.
You should usually do this based on your understanding of the data, not
empirically by looking for a combination of variables that give a
unique identifier. If you just look for variables without thinking about
what they mean, you might get (un)lucky and find a combination that's
unique in your current data but the relationship might not be true in
general.
You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier.
If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that's unique in your current data but the relationship might not be true in general.
For example, the altitude and longitude uniquely identify each airport,
but they are not good identifiers!
For example, the altitude and longitude uniquely identify each airport, but they are not good identifiers!
```{r}
airports %>% count(alt, lon) %>% filter(n > 1)
```
1. Check that none of the variables in the primary key are missing. If
a value is missing then it can't identify an observation!
2. Check that none of the variables in the primary key are missing.
If a value is missing then it can't identify an observation!
1. Check that your foreign keys match primary keys in another table. The
best way to do this is with an `anti_join()`. It's common for keys
not to match because of data entry errors. Fixing these is often a lot of
work.
3. Check that your foreign keys match primary keys in another table.
The best way to do this is with an `anti_join()` (see the sketch after this list).
It's common for keys not to match because of data entry errors.
Fixing these is often a lot of work.
If you do have missing keys, you'll need to be thoughtful about your
use of inner vs. outer joins, carefully considering whether or not you
want to drop rows that don't have a match.
If you do have missing keys, you'll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don't have a match.
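For instance, a sketch of the foreign-key check from point 3, reusing the `dest`/`faa` pairing from earlier in the chapter:

```{r, eval = FALSE}
# Destinations that appear in flights but have no matching record in airports
flights %>%
  anti_join(airports, by = c("dest" = "faa")) %>%
  count(dest, sort = TRUE)
```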
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly. If you have an inner join with duplicate keys in both tables, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
If you have an inner join with duplicate keys in both tables, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
## Set operations {#set-operations}
The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
The final type of two-table verb is the set operations.
Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces.
All these operations work with a complete row, comparing the values of every variable.
These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
* `intersect(x, y)`: return only observations in both `x` and `y`.
* `union(x, y)`: return unique observations in `x` and `y`.
* `setdiff(x, y)`: return observations in `x`, but not in `y`.
- `intersect(x, y)`: return only observations in both `x` and `y`.
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
Given this simple data:
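(The tables themselves aren't reproduced here, so the following is a sketch with two small, made-up tibbles, `df1` and `df2`.)

```{r, eval = FALSE}
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)

df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
```

With data like this, `intersect(df1, df2)` returns the single row the two tibbles share, `union(df1, df2)` returns the three distinct rows, and `setdiff(df1, df2)` returns the row that appears only in `df1`.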


@ -2,25 +2,27 @@
## Introduction
So far you've seen R Markdown used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with R Markdown. There are two ways to set the output of a document:
So far you've seen R Markdown used to produce HTML documents.
This chapter gives a brief overview of some of the many other types of output you can produce with R Markdown.
There are two ways to set the output of a document:
1. Permanently, by modifying the YAML header:
```yaml
1. Permanently, by modifying the YAML header:
``` {.yaml}
title: "Viridis Demo"
output: html_document
```
1. Transiently, by calling `rmarkdown::render()` by hand:
2. Transiently, by calling `rmarkdown::render()` by hand:
```{r eval = FALSE}
rmarkdown::render("diamond-sizes.Rmd", output_format = "word_document")
```
This is useful if you want to programmatically produce multiple types of
output.
RStudio's knit button renders a file to the first format listed in its `output` field. You can render to additional formats by clicking the dropdown menu beside the knit button.
This is useful if you want to programmatically produce multiple types of output.
RStudio's knit button renders a file to the first format listed in its `output` field.
You can render to additional formats by clicking the dropdown menu beside the knit button.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rmarkdown-knit.png")
@ -28,11 +30,16 @@ knitr::include_graphics("screenshots/rmarkdown-knit.png")
## Output options
Each output format is associated with an R function. You can either write `foo` or `pkg::foo`. If you omit `pkg`, the default is assumed to be rmarkdown. It's important to know the name of the function that makes the output because that's where you get help. For example, to figure out what parameters you can set with `html_document`, look at `?rmarkdown::html_document`.
Each output format is associated with an R function.
You can either write `foo` or `pkg::foo`.
If you omit `pkg`, the default is assumed to be rmarkdown.
It's important to know the name of the function that makes the output because that's where you get help.
For example, to figure out what parameters you can set with `html_document`, look at `?rmarkdown::html_document`.
To override the default parameter values, you need to use an expanded `output` field. For example, if you wanted to render an `html_document` with a floating table of contents, you'd use:
To override the default parameter values, you need to use an expanded `output` field.
For example, if you wanted to render an `html_document` with a floating table of contents, you'd use:
```yaml
``` {.yaml}
output:
html_document:
toc: true
@ -41,7 +48,7 @@ output:
You can even render to multiple outputs by supplying a list of formats:
```yaml
``` {.yaml}
output:
html_document:
toc: true
@ -53,24 +60,22 @@ Note the special syntax if you don't want to override any of the default options
## Documents
The previous chapter focused on the default `html_document` output. There are a number of basic variations on that theme, generating different types of documents:
The previous chapter focused on the default `html_document` output.
There are a number of basic variations on that theme, generating different types of documents:
* `pdf_document` makes a PDF with LaTeX (an open source document layout
system), which you'll need to install. RStudio will prompt you if you
don't already have it.
* `word_document` for Microsoft Word documents (`.docx`).
* `odt_document` for OpenDocument Text documents (`.odt`).
* `rtf_document` for Rich Text Format (`.rtf`) documents.
* `md_document` for a Markdown document. This isn't typically useful by
itself, but you might use it if, for example, your corporate CMS or
lab wiki uses markdown.
* `github_document`: this is a tailored version of `md_document`
designed for sharing on GitHub.
- `pdf_document` makes a PDF with LaTeX (an open source document layout system), which you'll need to install.
RStudio will prompt you if you don't already have it.
- `word_document` for Microsoft Word documents (`.docx`).
- `odt_document` for OpenDocument Text documents (`.odt`).
- `rtf_document` for Rich Text Format (`.rtf`) documents.
- `md_document` for a Markdown document.
This isn't typically useful by itself, but you might use it if, for example, your corporate CMS or lab wiki uses markdown.
- `github_document`: this is a tailored version of `md_document` designed for sharing on GitHub.
Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in the setup chunk:
@ -80,7 +85,7 @@ knitr::opts_chunk$set(echo = FALSE)
For `html_document`s another option is to make the code chunks hidden by default, but visible with a click:
```yaml
``` {.yaml}
output:
html_document:
code_folding: hide
@ -88,55 +93,66 @@ output:
## Notebooks
A notebook, `html_notebook`, is a variation on a `html_document`. The rendered outputs are very similar, but the purpose is different. A `html_document` is focused on communicating with decision makers, while a notebook is focused on collaborating with other data scientists. These different purposes lead to using the HTML output in different ways. Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code. That means you can use the `.nb.html` generated by the notebook in two ways:
A notebook, `html_notebook`, is a variation on a `html_document`.
The rendered outputs are very similar, but the purpose is different.
A `html_document` is focused on communicating with decision makers, while a notebook is focused on collaborating with other data scientists.
These different purposes lead to using the HTML output in different ways.
Both HTML outputs will contain the fully rendered output, but the notebook also contains the full source code.
That means you can use the `.nb.html` generated by the notebook in two ways:
1. You can view it in a web browser, and see the rendered output. Unlike
`html_document`, this rendering always includes an embedded copy of
the source code that generated it.
1. You can view it in a web browser, and see the rendered output.
Unlike `html_document`, this rendering always includes an embedded copy of the source code that generated it.
1. You can edit it in RStudio. When you open an `.nb.html` file, RStudio will
automatically recreate the `.Rmd` file that generated it. In the future, you
will also be able to include supporting files (e.g. `.csv` data files), which
will be automatically extracted when needed.
2. You can edit it in RStudio.
When you open an `.nb.html` file, RStudio will automatically recreate the `.Rmd` file that generated it.
In the future, you will also be able to include supporting files (e.g. `.csv` data files), which will be automatically extracted when needed.
Emailing `.nb.html` files is a simple way to share analyses with your colleagues. But things will get painful as soon as they want to make changes. If this starts to happen, it's a good time to learn Git and GitHub. Learning Git and GitHub is definitely painful at first, but the collaboration payoff is huge. As mentioned earlier, Git and GitHub are outside the scope of the book, but there's one tip that's useful if you're already using them: use both `html_notebook` and `github_document` outputs:
Emailing `.nb.html` files is a simple way to share analyses with your colleagues.
But things will get painful as soon as they want to make changes.
If this starts to happen, it's a good time to learn Git and GitHub.
Learning Git and GitHub is definitely painful at first, but the collaboration payoff is huge.
As mentioned earlier, Git and GitHub are outside the scope of the book, but there's one tip that's useful if you're already using them: use both `html_notebook` and `github_document` outputs:
```yaml
``` {.yaml}
output:
html_notebook: default
github_document: default
```
`html_notebook` gives you a local preview, and a file that you can share via email. `github_document` creates a minimal md file that you can check into git. You can easily see how the results of your analysis (not just the code) change over time, and GitHub will render it for you nicely online.
`html_notebook` gives you a local preview, and a file that you can share via email.
`github_document` creates a minimal md file that you can check into git.
You can easily see how the results of your analysis (not just the code) change over time, and GitHub will render it for you nicely online.
## Presentations
You can also use R Markdown to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each first (`#`) or second (`##`) level header. You can also insert a horizontal rule (`***`) to create a new slide without a header.
You can also use R Markdown to produce presentations.
You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time.
Presentations work by dividing your content into slides, with a new slide beginning at each first (`#`) or second (`##`) level header.
You can also insert a horizontal rule (`***`) to create a new slide without a header.
R Markdown comes with three presentation formats built-in:
1. `ioslides_presentation` - HTML presentation with ioslides
1. `slidy_presentation` - HTML presentation with W3C Slidy
2. `slidy_presentation` - HTML presentation with W3C Slidy
1. `beamer_presentation` - PDF presentation with LaTeX Beamer.
3. `beamer_presentation` - PDF presentation with LaTeX Beamer.
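For instance, a minimal sketch of an ioslides source file (the title and headers are placeholders):

```
---
title: "Example talk"
output: ioslides_presentation
---

# Getting started

## A slide with a header

- First point
- Second point

***

A slide without a header.
```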
Two other popular formats are provided by packages:
1. `revealjs::revealjs_presentation` - HTML presentation with reveal.js.
Requires the __revealjs__ package.
1. `revealjs::revealjs_presentation` - HTML presentation with reveal.js.
Requires the **revealjs** package.
1. __rmdshower__, <https://github.com/MangoTheCat/rmdshower>, provides a
wrapper around the __shower__, <https://github.com/shower/shower>,
presentation engine
2. **rmdshower**, <https://github.com/MangoTheCat/rmdshower>, provides a wrapper around the **shower**, <https://github.com/shower/shower>, presentation engine.
## Dashboards
Dashboards are a useful way to communicate large amounts of information visually and quickly. Flexdashboard makes it particularly easy to create dashboards using R Markdown and a convention for how the headers affect the layout:
Dashboards are a useful way to communicate large amounts of information visually and quickly.
Flexdashboard makes it particularly easy to create dashboards using R Markdown and a convention for how the headers affect the layout:
* Each level 1 header (`#`) begins a new page in the dashboard.
* Each level 2 header (`##`) begins a new column.
* Each level 3 header (`###`) begins a new row.
- Each level 1 header (`#`) begins a new page in the dashboard.
- Each level 2 header (`##`) begins a new column.
- Each level 3 header (`###`) begins a new row.
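For instance, a minimal sketch of a dashboard source that follows those conventions (the output format is `flexdashboard::flex_dashboard`; all titles are placeholders):

```
---
title: "Example dashboard"
output: flexdashboard::flex_dashboard
---

# Page 1

## Column 1

### Chart A

## Column 2

### Chart B
```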
For example, you can produce this dashboard:
@ -150,7 +166,8 @@ Using this code:
cat(readr::read_file("rmarkdown/dashboard.Rmd"))
```
Flexdashboard also provides simple tools for creating sidebars, tabsets, value boxes, and gauges. To learn more about flexdashboard visit <http://rmarkdown.rstudio.com/flexdashboard/>.
Flexdashboard also provides simple tools for creating sidebars, tabsets, value boxes, and gauges.
To learn more about flexdashboard visit <http://rmarkdown.rstudio.com/flexdashboard/>.
## Interactivity
@ -158,7 +175,10 @@ Any HTML format (document, notebook, presentation, or dashboard) can contain int
### htmlwidgets
HTML is an interactive format, and you can take advantage of that interactivity with __htmlwidgets__, R functions that produce interactive HTML visualisations. For example, take the __leaflet__ map below. If you're viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can't do that in a book, so rmarkdown automatically inserts a static screenshot for you.
HTML is an interactive format, and you can take advantage of that interactivity with **htmlwidgets**, R functions that produce interactive HTML visualisations.
For example, take the **leaflet** map below.
If you're viewing this page on the web, you can drag the map around, zoom in and out, etc.
You obviously can't do that in a book, so rmarkdown automatically inserts a static screenshot for you.
```{r}
library(leaflet)
@ -168,29 +188,31 @@ leaflet() %>%
addMarkers(174.764, -36.877, popup = "Maungawhau")
```
The great thing about htmlwidgets is that you don't need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don't need to worry about it.
The great thing about htmlwidgets is that you don't need to know anything about HTML or JavaScript to use them.
All the details are wrapped inside the package, so you don't need to worry about it.
There are many packages that provide htmlwidgets, including:
* __dygraphs__, <http://rstudio.github.io/dygraphs/>, for interactive time
series visualisations.
- **dygraphs**, <http://rstudio.github.io/dygraphs/>, for interactive time series visualisations.
* __DT__, <http://rstudio.github.io/DT/>, for interactive tables.
- **DT**, <http://rstudio.github.io/DT/>, for interactive tables.
* __threejs__, <https://github.com/bwlewis/rthreejs> for interactive 3d plots.
- **threejs**, <https://github.com/bwlewis/rthreejs> for interactive 3d plots.
* __DiagrammeR__, <http://rich-iannone.github.io/DiagrammeR/> for diagrams
(like flow charts and simple node-link diagrams).
- **DiagrammeR**, <http://rich-iannone.github.io/DiagrammeR/> for diagrams (like flow charts and simple node-link diagrams).
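For instance, a one-line sketch using the **DT** package listed above (`datatable()` applied to the built-in `mtcars` data is just an illustration):

```{r, eval = FALSE}
# An interactive, searchable, sortable table
DT::datatable(mtcars)
```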
To learn more about htmlwidgets and see a more complete list of packages that provide them visit <http://www.htmlwidgets.org/>.
### Shiny
htmlwidgets provide __client-side__ interactivity --- all the interactivity happens in the browser, independently of R. On one hand, that's great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use __shiny__, a package that allows you to create interactivity using R code, not JavaScript.
htmlwidgets provide **client-side** interactivity --- all the interactivity happens in the browser, independently of R.
On one hand, that's great because you can distribute the HTML file without any connection to R.
However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript.
An alternative approach is to use **shiny**, a package that allows you to create interactivity using R code, not JavaScript.
To call Shiny code from an R Markdown document, add `runtime: shiny` to the header:
```yaml
``` {.yaml}
title: "Shiny Web App"
output: html_document
runtime: shiny
@ -204,12 +226,18 @@ library(shiny)
textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("screenshots/rmarkdown-shiny.png")
```
You can then refer to the values with `input$name` and `input$age`, and the code that uses them will be automatically re-run whenever they change.
I can't show you a live shiny app here because shiny interactions occur on the __server-side__. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run shiny apps on your own computer, shiny automatically sets up a shiny server for you, but you need a public facing shiny server if you want to publish this sort of interactivity online. That's the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.
You can then refer to the values with `input$name` and `input$age`, and the code that uses them will be automatically re-run whenever they change.
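For instance, a minimal sketch of output that reacts to those inputs (the greeting is hypothetical, not part of the original example):

```{r, eval = FALSE}
# Re-rendered automatically whenever input$name or input$age changes
renderText({
  paste0("Hello ", input$name, "! You are ", input$age, " years old.")
})
```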
I can't show you a live shiny app here because shiny interactions occur on the **server-side**.
This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on.
This introduces a logistical issue: Shiny apps need a Shiny server to be run online.
When you run shiny apps on your own computer, shiny automatically sets up a shiny server for you, but you need a public facing shiny server if you want to publish this sort of interactivity online.
That's the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.
Learn more about Shiny at <http://shiny.rstudio.com/>.
@ -217,17 +245,18 @@ Learn more about Shiny at <http://shiny.rstudio.com/>.
With a little additional infrastructure you can use R Markdown to generate a complete website:
* Put your `.Rmd` files in a single directory. `index.Rmd` will become
the home page.
- Put your `.Rmd` files in a single directory.
`index.Rmd` will become the home page.
* Add a YAML file named `_site.yml` provides the navigation for the site.
- Add a YAML file named `_site.yml`, which provides the navigation for the site.
For example:
```{r echo = FALSE, comment = ""}
cat(readr::read_file("rmarkdown/example-site.yml"))
```
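As a sketch, a minimal `_site.yml` might look something like this (the site name and pages are placeholders):

``` {.yaml}
name: "my-website"
navbar:
  title: "My Website"
  left:
    - text: "Home"
      href: index.html
    - text: "About"
      href: about.html
```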
Execute `rmarkdown::render_site()` to build `_site`, a directory of files ready to deploy as a standalone static website, or if you use an RStudio Project for your website directory. RStudio will add a Build tab to the IDE that you can use to build and preview your site.
Execute `rmarkdown::render_site()` to build `_site`, a directory of files ready to deploy as a standalone static website.
If you use an RStudio Project for your website directory, RStudio will add a Build tab to the IDE that you can use to build and preview your site.
Read more at <http://rmarkdown.rstudio.com/rmarkdown_websites.html>.
@ -235,45 +264,30 @@ Read more at <http://rmarkdown.rstudio.com/rmarkdown_websites.html>.
Other packages provide even more output formats:
* The __bookdown__ package, <https://github.com/rstudio/bookdown>,
makes it easy to write books, like this one. To learn more, read
[_Authoring Books with R Markdown_](https://bookdown.org/yihui/bookdown/),
by Yihui Xie, which is, of course, written in bookdown. Visit
<http://www.bookdown.org> to see other bookdown books written by the
wider R community.
- The **bookdown** package, <https://github.com/rstudio/bookdown>, makes it easy to write books, like this one.
To learn more, read [*Authoring Books with R Markdown*](https://bookdown.org/yihui/bookdown/), by Yihui Xie, which is, of course, written in bookdown.
Visit <http://www.bookdown.org> to see other bookdown books written by the wider R community.
* The __prettydoc__ package, <https://github.com/yixuan/prettydoc/>,
provides lightweight document formats with a range of attractive
themes.
- The **prettydoc** package, <https://github.com/yixuan/prettydoc/>, provides lightweight document formats with a range of attractive themes.
* The __rticles__ package, <https://github.com/rstudio/rticles>, compiles a
selection of formats tailored for specific scientific journals.
- The **rticles** package, <https://github.com/rstudio/rticles>, compiles a selection of formats tailored for specific scientific journals.
See <http://rmarkdown.rstudio.com/formats.html> for a list of even more formats. You can also create your own by following the instructions at <http://rmarkdown.rstudio.com/developer_custom_formats.html>.
See <http://rmarkdown.rstudio.com/formats.html> for a list of even more formats.
You can also create your own by following the instructions at <http://rmarkdown.rstudio.com/developer_custom_formats.html>.
## Learning more
To learn more about effective communication in these different formats I recommend the following resources:
* To improve your presentation skills, I recommend
[_Presentation Patterns_](https://amzn.com/0321820800), by Neal Ford,
Matthew McCollough, and Nathaniel Schutta. It provides a set of effective
patterns (both low- and high-level) that you can apply to improve your
presentations.
* If you give academic talks, I recommend reading the [_Leek group guide
to giving talks_](https://github.com/jtleek/talkguide).
* I haven't taken it myself, but I've heard good things about Matt
McGarrity's online course on public speaking:
<https://www.coursera.org/learn/public-speaking>.
- To improve your presentation skills, I recommend [*Presentation Patterns*](https://amzn.com/0321820800), by Neal Ford, Matthew McCullough, and Nathaniel Schutta.
It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.
* If you are creating a lot of dashboards, make sure to read Stephen Few's
[Information Dashboard Design: The Effective Visual Communication
of Data](https://amzn.com/0596100167). It will help you create dashboards
that are truly useful, not just pretty to look at.
- If you give academic talks, I recommend reading the [*Leek group guide to giving talks*](https://github.com/jtleek/talkguide).
* Effectively communicating your ideas often benefits from some
knowledge of graphic design. [_The Non-Designer's Design
Book_](http://amzn.com/0133966151) is a great place to start.
- I haven't taken it myself, but I've heard good things about Matt McGarrity's online course on public speaking: <https://www.coursera.org/learn/public-speaking>.
- If you are creating a lot of dashboards, make sure to read Stephen Few's [Information Dashboard Design: The Effective Visual Communication of Data](https://amzn.com/0596100167).
It will help you create dashboards that are truly useful, not just pretty to look at.
- Effectively communicating your ideas often benefits from some knowledge of graphic design.
[*The Non-Designer's Design Book*](http://amzn.com/0133966151) is a great place to start.


@ -1,65 +1,58 @@
# R Markdown workflow
Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the _console_, then capture what works in the _script editor_. R Markdown brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you're happy, you move on and start a new chunk.
Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the *console*, then capture what works in the *script editor*.
R Markdown brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture.
You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter.
When you're happy, you move on and start a new chunk.
R Markdown is also important because it so tightly integrates prose and code. This makes it a great __analysis notebook__ because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:
R Markdown is also important because it so tightly integrates prose and code.
This makes it a great **analysis notebook** because it lets you develop code and record your thoughts.
An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences.
It:
* Records what you did and why you did it. Regardless of how great your
memory is, if you don't record what you do, there will come a time when
you have forgotten important details. Write them down so you don't forget!
- Records what you did and why you did it.
Regardless of how great your memory is, if you don't record what you do, there will come a time when you have forgotten important details.
Write them down so you don't forget!
* Supports rigorous thinking. You are more likely to come up with a strong
analysis if you record your thoughts as you go, and continue to reflect
on them. This also saves you time when you eventually write up your
analysis to share with others.
- Supports rigorous thinking.
You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them.
This also saves you time when you eventually write up your analysis to share with others.
* Helps others understand your work. It is rare to do data analysis by
yourself, and you'll often be working as part of a team. A lab notebook
helps you share not only what you've done, but why you did it with your
colleagues or lab mates.
- Helps others understand your work.
It is rare to do data analysis by yourself, and you'll often be working as part of a team.
A lab notebook helps you share not only what you've done, but why you did it with your colleagues or lab mates.
Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. I've drawn on my own experiences and Colin Purrington's advice on lab notebooks (<http://colinpurrington.com/tips/lab-notebooks>) to come up with the following tips:
Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks.
I've drawn on my own experiences and Colin Purrington's advice on lab notebooks (<http://colinpurrington.com/tips/lab-notebooks>) to come up with the following tips:
* Ensure each notebook has a descriptive title, an evocative filename, and a
first paragraph that briefly describes the aims of the analysis.
- Ensure each notebook has a descriptive title, an evocative filename, and a first paragraph that briefly describes the aims of the analysis.
* Use the YAML header date field to record the date you started working on the
notebook:
- Use the YAML header date field to record the date you started working on the notebook:
```yaml
``` {.yaml}
date: 2016-08-23
```
Use ISO8601 YYYY-MM-DD format so that's there no ambiguity. Use it
even if you don't normally write dates that way!
Use ISO8601 YYYY-MM-DD format so that there's no ambiguity.
Use it even if you don't normally write dates that way!
* If you spend a lot of time on an analysis idea and it turns out to be a
dead end, don't delete it! Write up a brief note about why it failed and
leave it in the notebook. That will help you avoid going down the same
dead end when you come back to the analysis in the future.
- If you spend a lot of time on an analysis idea and it turns out to be a dead end, don't delete it!
Write up a brief note about why it failed and leave it in the notebook.
That will help you avoid going down the same dead end when you come back to the analysis in the future.
* Generally, you're better off doing data entry outside of R. But if you
do need to record a small snippet of data, clearly lay it out using
`tibble::tribble()`.
- Generally, you're better off doing data entry outside of R.
But if you do need to record a small snippet of data, clearly lay it out using `tibble::tribble()` (see the sketch after this list).
* If you discover an error in a data file, never modify it directly, but
instead write code to correct the value. Explain why you made the fix.
- If you discover an error in a data file, never modify it directly, but instead write code to correct the value.
Explain why you made the fix.
* Before you finish for the day, make sure you can knit the notebook
(if you're using caching, make sure to clear the caches). That will
let you fix any problems while the code is still fresh in your mind.
- Before you finish for the day, make sure you can knit the notebook (if you're using caching, make sure to clear the caches).
That will let you fix any problems while the code is still fresh in your mind.
* If you want your code to be reproducible in the long-run (i.e. so you can
come back to run it next month or next year), you'll need to track the
versions of the packages that your code uses. A rigorous approach is to use
__packrat__, <http://rstudio.github.io/packrat/>, which stores packages
in your project directory, or __checkpoint__,
<https://github.com/RevolutionAnalytics/checkpoint>, which will reinstall
packages available on a specified date. A quick and dirty hack is to include
a chunk that runs `sessionInfo()` --- that won't let you easily recreate
your packages as they are today, but at least you'll know what they were.
- If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses.
A rigorous approach is to use **packrat**, <http://rstudio.github.io/packrat/>, which stores packages in your project directory, or **checkpoint**, <https://github.com/RevolutionAnalytics/checkpoint>, which will reinstall packages available on a specified date.
A quick and dirty hack is to include a chunk that runs `sessionInfo()` --- that won't let you easily recreate your packages as they are today, but at least you'll know what they were.
* You are going to create many, many, many analysis notebooks over the course
of your career. How are you going to organise them so you can find them
again in the future? I recommend storing them in individual projects,
and coming up with a good naming scheme.
- You are going to create many, many, many analysis notebooks over the course of your career.
How are you going to organise them so you can find them again in the future?
I recommend storing them in individual projects, and coming up with a good naming scheme.
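For instance, a sketch of the kind of small, hand-entered table mentioned in the `tibble::tribble()` tip above (the values are made up):

```{r, eval = FALSE}
library(tibble)

# tribble() lets you lay the values out so the columns line up visually
ratings <- tribble(
  ~film,           ~year, ~rating,
  "The Third Man",  1949,     9.0,
  "Metropolis",     1927,     8.5
)
```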


@ -2,33 +2,30 @@
## Introduction
R Markdown provides an unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.
R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary.
R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.
R Markdown files are designed to be used in three ways:
1. For communicating to decision makers, who want to focus on the conclusions,
not the code behind the analysis.
1. For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.
1. For collaborating with other data scientists (including future you!), who
are interested in both your conclusions, and how you reached them (i.e.
the code).
1. As an environment in which to _do_ data science, as a modern day lab
notebook where you can capture not only what you did, but also what you
were thinking.
2. For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).
R Markdown integrates a number of R packages and external tools. This means that help is, by-and-large, not available through `?`. Instead, as you work through this chapter, and use R Markdown in the future, keep these resources close to hand:
3. As an environment in which to *do* data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.
* R Markdown Cheat Sheet: _Help > Cheatsheets > R Markdown Cheat Sheet_,
R Markdown integrates a number of R packages and external tools.
This means that help is, by-and-large, not available through `?`.
Instead, as you work through this chapter, and use R Markdown in the future, keep these resources close to hand:
* R Markdown Reference Guide: _Help > Cheatsheets > R Markdown Reference
Guide_.
- R Markdown Cheat Sheet: *Help \> Cheatsheets \> R Markdown Cheat Sheet*,
- R Markdown Reference Guide: *Help \> Cheatsheets \> R Markdown Reference Guide*.
Both cheatsheets are also available at <https://rstudio.com/resources/cheatsheets/>.
### Prerequisites
You need the __rmarkdown__ package, but you don't need to explicitly install it or load it, as RStudio automatically does both when needed.
You need the **rmarkdown** package, but you don't need to explicitly install it or load it, as RStudio automatically does both when needed.
```{r setup, include = FALSE}
chunk <- "```"
@ -46,160 +43,175 @@ cat(htmltools::includeText("rmarkdown/diamond-sizes.Rmd"))
It contains three important types of content:
1. An (optional) __YAML header__ surrounded by `---`s.
1. __Chunks__ of R code surrounded by ```` ``` ````.
1. Text mixed with simple text formatting like `# heading` and `_italics_`.
1. An (optional) **YAML header** surrounded by `---`s.
2. **Chunks** of R code surrounded by ```` ``` ````.
3. Text mixed with simple text formatting like `# heading` and `_italics_`.
When you open an `.Rmd`, you get a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code:
When you open an `.Rmd`, you get a notebook interface where code and output are interleaved.
You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter.
RStudio executes the code and displays the results inline with the code:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("rmarkdown/diamond-sizes-notebook.png")
```
To produce a complete report containing all text, code, and results, click "Knit" or press Cmd/Ctrl + Shift + K. You can also do this programmatically with `rmarkdown::render("1-example.Rmd")`. This will display the report in the viewer pane, and create a self-contained HTML file that you can share with others.
To produce a complete report containing all text, code, and results, click "Knit" or press Cmd/Ctrl + Shift + K.
You can also do this programmatically with `rmarkdown::render("1-example.Rmd")`.
This will display the report in the viewer pane, and create a self-contained HTML file that you can share with others.
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("rmarkdown/diamond-sizes-report.png")
```
When you __knit__ the document, R Markdown sends the .Rmd file to __knitr__, http://yihui.name/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by __pandoc__, <http://pandoc.org/>, which is responsible for creating the finished file. The advantage of this two step workflow is that you can create a very wide range of output formats, as you'll learn about in [R markdown formats].
When you **knit** the document, R Markdown sends the .Rmd file to **knitr**, <http://yihui.name/knitr/>, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
The markdown file generated by knitr is then processed by **pandoc**, <http://pandoc.org/>, which is responsible for creating the finished file.
The advantage of this two step workflow is that you can create a very wide range of output formats, as you'll learn about in [R markdown formats].
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("images/RMarkdownFlow.png")
```
To get started with your own `.Rmd` file, select *File > New File > R Markdown...* in the menubar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of R Markdown work.
To get started with your own `.Rmd` file, select *File \> New File \> R Markdown...* in the menubar.
RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of R Markdown work.
The following sections dive into the three components of an R Markdown document in more details: the markdown text, the code chunks, and the YAML header.
### Exercises
1. Create a new notebook using _File > New File > R Notebook_. Read the
instructions. Practice running the chunks. Verify that you can modify
the code, re-run it, and see modified output.
1. Create a new R Markdown document with _File > New File > R Markdown..._
Knit it by clicking the appropriate button. Knit it by using the
appropriate keyboard short cut. Verify that you can modify the
input and see the output update.
1. Compare and contrast the R notebook and R markdown files you created
above. How are the outputs similar? How are they different? How are
the inputs similar? How are they different? What happens if you
copy the YAML header from one to the other?
1. Create a new notebook using *File \> New File \> R Notebook*.
Read the instructions.
Practice running the chunks.
Verify that you can modify the code, re-run it, and see modified output.
1. Create one new R Markdown document for each of the three built-in
formats: HTML, PDF and Word. Knit each of the three documents.
How does the output differ? How does the input differ? (You may need
to install LaTeX in order to build the PDF output --- RStudio will
prompt you if this is necessary.)
2. Create a new R Markdown document with *File \> New File \> R Markdown...* Knit it by clicking the appropriate button.
Knit it by using the appropriate keyboard shortcut.
Verify that you can modify the input and see the output update.
3. Compare and contrast the R notebook and R markdown files you created above.
How are the outputs similar?
How are they different?
How are the inputs similar?
How are they different?
What happens if you copy the YAML header from one to the other?
4. Create one new R Markdown document for each of the three built-in formats: HTML, PDF and Word.
Knit each of the three documents.
How does the output differ?
How does the input differ?
(You may need to install LaTeX in order to build the PDF output --- RStudio will prompt you if this is necessary.)
## Text formatting with Markdown
Prose in `.Rmd` files is written in Markdown, a lightweight set of conventions for formatting plain text files. Markdown is designed to be easy to read and easy to write. It is also very easy to learn. The guide below shows how to use Pandoc's Markdown, a slightly extended version of Markdown that R Markdown understands.
Prose in `.Rmd` files is written in Markdown, a lightweight set of conventions for formatting plain text files.
Markdown is designed to be easy to read and easy to write.
It is also very easy to learn.
The guide below shows how to use Pandoc's Markdown, a slightly extended version of Markdown that R Markdown understands.
```{r, echo = FALSE, comment = ""}
cat(readr::read_file("rmarkdown/markdown.Rmd"))
```
The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won't need to think about them. If you forget, you can get to a handy reference sheet with *Help > Markdown Quick Reference*.
The best way to learn these is simply to try them out.
It will take a few days, but soon they will become second nature, and you won't need to think about them.
If you forget, you can get to a handy reference sheet with *Help \> Markdown Quick Reference*.
### Exercises
1. Practice what you've learned by creating a brief CV. The title should be
your name, and you should include headings for (at least) education or
employment. Each of the sections should include a bulleted list of
jobs/degrees. Highlight the year in bold.
1. Using the R Markdown quick reference, figure out how to:
1. Practice what you've learned by creating a brief CV.
The title should be your name, and you should include headings for (at least) education or employment.
Each of the sections should include a bulleted list of jobs/degrees.
Highlight the year in bold.
2. Using the R Markdown quick reference, figure out how to:
1. Add a footnote.
1. Add a horizontal rule.
1. Add a block quote.
1. Copy and paste the contents of `diamond-sizes.Rmd` from
<https://github.com/hadley/r4ds/tree/master/rmarkdown> in to a local
R markdown document. Check that you can run it, then add text after the
frequency polygon that describes its most striking features.
2. Add a horizontal rule.
3. Add a block quote.
3. Copy and paste the contents of `diamond-sizes.Rmd` from <https://github.com/hadley/r4ds/tree/master/rmarkdown> into a local R Markdown document.
Check that you can run it, then add text after the frequency polygon that describes its most striking features.
## Code chunks
To run code inside an R Markdown document, you need to insert a chunk. There are three ways to do so:
To run code inside an R Markdown document, you need to insert a chunk.
There are three ways to do so:
1. The keyboard shortcut Cmd/Ctrl + Alt + I
1. The keyboard shortcut Cmd/Ctrl + Alt + I
1. The "Insert" button icon in the editor toolbar.
2. The "Insert" button icon in the editor toolbar.
1. By manually typing the chunk delimiters ` ```{r} ` and ` ``` `.
3. By manually typing the chunk delimiters ```` ```{r} ```` and ```` ``` ````.
Obviously, I'd recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!
Obviously, I'd recommend you learn the keyboard shortcut.
It will save you a lot of time in the long run!
You can continue to run the code using the keyboard shortcut that by now (I hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focussed around a single task.
You can continue to run the code using the keyboard shortcut that by now (I hope!) you know and love: Cmd/Ctrl + Enter.
However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk.
Think of a chunk like a function.
A chunk should be relatively self-contained, and focussed around a single task.
The following sections describe the chunk header which consists of ```` ```{r ````, followed by an optional chunk name, followed by comma separated options, followed by `}`. Next comes your R code and the chunk end is indicated by a final ```` ``` ````.
The following sections describe the chunk header which consists of ```` ```{r ````, followed by an optional chunk name, followed by comma separated options, followed by `}`.
Next comes your R code and the chunk end is indicated by a final ```` ``` ````.
### Chunk name
Chunks can be given an optional name: ```` ```{r by-name} ````.
This has three advantages:
1. You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:
```{r, echo = FALSE, out.width = "30%"}
knitr::include_graphics("screenshots/rmarkdown-chunk-nav.png")
```
2. Graphics produced by the chunks will have useful names that make them easier to use elsewhere.
More on that in [other important options].
3. You can set up networks of cached chunks to avoid re-performing expensive computations on every run.
More on that below.
There is one chunk name that imbues special behaviour: `setup`.
When you're in notebook mode, the chunk named `setup` will be run automatically once, before any other code is run.
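For example, the body of a chunk named `setup` often looks something like this (a sketch; the specific packages and options are just illustrative):

```{r, eval = FALSE}
# Typical contents of a `setup` chunk: load packages and set shared options once
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```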
### Chunk options
Chunk output can be customised with **options**, arguments supplied to the chunk header.
Knitr provides almost 60 options that you can use to customise your code chunks.
Here we'll cover the most important chunk options that you'll use frequently.
You can see the full list at <http://yihui.name/knitr/options/>.
The most important set of options controls whether your code block is executed and what results are inserted in the finished report:
- `eval = FALSE` prevents code from being evaluated.
(And obviously if the code is not run, no results will be generated).
This is useful for displaying example code, or for disabling a large block of code without commenting each line.
- `include = FALSE` runs the code, but doesn't show the code or results in the final document.
Use this for setup code that you don't want cluttering your report.
- `echo = FALSE` prevents code, but not the results, from appearing in the finished file.
Use this when writing reports aimed at people who don't want to see the underlying R code.
- `message = FALSE` or `warning = FALSE` prevents messages or warnings from appearing in the finished file.
- `results = 'hide'` hides printed output; `fig.show = 'hide'` hides plots.
- `error = TRUE` causes the render to continue even if code returns an error.
This is rarely something you'll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your `.Rmd`.
It's also useful if you're teaching R and want to deliberately include an error.
The default, `error = FALSE` causes knitting to fail if there is a single error in the document.
The following table summarises which types of output each option suppresses:
| Option | Run code | Show code | Output | Plots | Messages | Warnings |
|---------------------|----------|-----------|--------|-------|----------|----------|
| `eval = FALSE` | \- | | \- | \- | \- | \- |
| `include = FALSE` | | \- | \- | \- | \- | \- |
| `echo = FALSE` | | \- | | | | |
| `results = "hide"` | | | \- | | | |
| `fig.show = "hide"` | | | | \- | | |
| `message = FALSE` | | | | | \- | |
| `warning = FALSE` | | | | | | \- |
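If you ever want to check what a chunk option is currently set to, you can ask knitr directly; a quick sketch:

```{r, eval = FALSE}
# Inspect the current default values for a few chunk options
str(knitr::opts_chunk$get(c("eval", "echo", "include", "message", "warning")))
```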
### Table
By default, R Markdown prints data frames and matrices as you'd see them in the console:

```{r}
mtcars[1:5, ]
```
If you prefer that data be displayed with additional formatting you can use the `knitr::kable` function.
The code below generates Table \@ref(tab:kable).
```{r kable}
knitr::kable(
  mtcars[1:5, ],
  caption = "A knitr kable."
)
```
Read the documentation for `?knitr::kable` to see the other ways in which you can customise the table.
For even deeper customisation, consider the **xtable**, **stargazer**, **pander**, **tables**, and **ascii** packages.
Each provides a set of tools for returning formatted tables from R code.
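As a small illustration (the particular arguments and values here are just an assumption about what you might want), `knitr::kable()` can round numbers, rename columns, and add a caption in a single call:

```{r, eval = FALSE}
# A lightly customised table: rounded values, friendlier column names, a caption
knitr::kable(
  mtcars[1:5, 1:4],
  digits = 1,
  col.names = c("MPG", "Cylinders", "Displacement", "Horsepower"),
  caption = "A customised knitr kable."
)
```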
There is also a rich set of options for controlling how figures are embedded.
You'll learn about these in [saving your plots].
### Caching
Normally, each knit of a document starts from a completely clean slate.
This is great for reproducibility, because it ensures that you've captured every important computation in code.
However, it can be painful if you have some computations that take a long time.
The solution is `cache = TRUE`.
When set, this will save the output of the chunk to a specially named file on disk.
On subsequent runs, knitr will check to see if the code has changed, and if it hasn't, it will reuse the cached results.
The caching system must be used with care, because by default it is based on the code only, not its dependencies.
For example, here the `processed_data` chunk depends on the `raw_data` chunk:
`r chunk`{r raw_data}
rawdata <- readr::read_csv("a_very_large_file.csv")
`r chunk`
`r chunk`{r processed_data, cache = TRUE}
processed_data <- rawdata %>%
filter(!is.na(import_var)) %>%
mutate(new_variable = complicated_transformation(x, y, z))
`r chunk`
Caching the `processed_data` chunk means that it will get re-run if the dplyr pipeline is changed, but it won't get re-run if the `read_csv()` call changes.
You can avoid that problem with the `dependson` chunk option:
`r chunk`{r processed_data, cache = TRUE, dependson = "raw_data"}
processed_data <- rawdata %>%
  filter(!is.na(import_var)) %>%
mutate(new_variable = complicated_transformation(x, y, z))
`r chunk`
`dependson` should contain a character vector of *every* chunk that the cached chunk depends on.
Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.
Note that the chunks won't update if `a_very_large_file.csv` changes, because knitr caching only tracks changes within the `.Rmd` file.
If you want to also track changes to that file you can use the `cache.extra` option.
This is an arbitrary R expression that will invalidate the cache whenever it changes.
A good function to use is `file.info()`: it returns a bunch of information about the file including when it was last modified.
Then you can write:
`r chunk`{r raw_data, cache.extra = file.info("a_very_large_file.csv")}
rawdata <- readr::read_csv("a_very_large_file.csv")
`r chunk`
As your caching strategies get progressively more complicated, it's a good idea to regularly clear out all your caches with `knitr::clean_cache()`.
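For example, you might first list the stale cache files and then remove them (a sketch):

```{r, eval = FALSE}
# With clean = FALSE, knitr only reports which cache files would be removed
knitr::clean_cache(clean = FALSE)
# With clean = TRUE, it actually deletes them
knitr::clean_cache(clean = TRUE)
```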
I've used the advice of [David Robinson](https://twitter.com/drob/status/738786604731490304) to name these chunks: each chunk is named after the primary object that it creates.
This makes it easier to understand the `dependson` specification.
### Global options
As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them.
You can do this by calling `knitr::opts_chunk$set()` in a code chunk.
For example, when writing books and tutorials I set:
```{r, eval = FALSE}
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE
)
```
This uses my preferred comment formatting, and ensures that the code and output are kept closely entwined.
On the other hand, if you were preparing a report, you might set:
```{r eval = FALSE}
knitr::opts_chunk$set(
  echo = FALSE
)
```
That will hide the code by default, so only showing the chunks you deliberately choose to show (with `echo = TRUE`).
You might consider setting `message = FALSE` and `warning = FALSE`, but that would make it harder to debug problems because you wouldn't see any messages in the final document.
### Inline code
There is one other way to embed R code into an R Markdown document: directly into the text, with: `r inline()`.
This can be very useful if you mention properties of your data in the text.
For example, in the example document I used at the start of the chapter I had:
> We have data about `r inline('nrow(diamonds)')` diamonds.
> Only `r inline('nrow(diamonds) - nrow(smaller)')` are larger than 2.5 carats.
> The distribution of the remainder is shown below:
When the report is knit, the results of these computations are inserted into the text:
> We have data about 53940 diamonds.
> Only 126 are larger than 2.5 carats.
> The distribution of the remainder is shown below:
When inserting numbers into text, `format()` is your friend.
It allows you to set the number of `digits` so you don't print to a ridiculous degree of accuracy, and a `big.mark` to make numbers easier to read.
I'll often combine these into a helper function:
```{r}
comma <- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
comma(.12358124331)
```
### Exercises
1. Add a section that explores how diamond sizes vary by cut, colour, and clarity.
Assume you're writing a report for someone who doesn't know R, and instead of setting `echo = FALSE` on each chunk, set a global option.
2. Download `diamond-sizes.Rmd` from <https://github.com/hadley/r4ds/tree/master/rmarkdown>.
Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.
3. Modify `diamond-sizes.Rmd` to use `comma()` to produce nicely formatted output.
Also include the percentage of diamonds that are larger than 2.5 carats.
4. Set up a network of chunks where `d` depends on `c` and `b`, and both `b` and `c` depend on `a`.
Have each chunk print `lubridate::now()`, set `cache = TRUE`, then verify your understanding of caching.
## Troubleshooting
Troubleshooting R Markdown documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks.
The first thing you should always try is to recreate the problem in an interactive session.
Restart R, then "Run all chunks" (either from Code menu, under Run region), or with the keyboard shortcut Ctrl + Alt + R.
If you're lucky, that will recreate the problem, and you can figure out what's going on interactively.
If that doesn't help, there must be something different between your interactive environment and the R Markdown environment.
You're going to need to systematically explore the options.
The most common difference is the working directory: the working directory of an R Markdown document is the directory in which it lives.
Check that the working directory is what you expect by including `getwd()` in a chunk.
Next, brainstorm all the things that might cause the bug.
You'll need to systematically check that they're the same in your R session and your R Markdown session.
The easiest way to do that is to set `error = TRUE` on the chunk causing the problem, then use `print()` and `str()` to check that settings are as you expect.
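For example, a throwaway diagnostic chunk might print the pieces of state that most often differ between the two environments (a sketch; on the real chunk you'd also set `error = TRUE` in the header so the rest of the document still renders):

```{r, eval = FALSE}
# Compare these values against your interactive session
print(getwd())
print(.libPaths())
print(search())
```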
## YAML header
You can control many other "whole document" settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it's "yet another markup language", which is designed for representing hierarchical data in a way that's easy for humans to read and write. R Markdown uses it to control many details of the output. Here we'll discuss two: document parameters and bibliographies.
You can control many other "whole document" settings by tweaking the parameters of the YAML header.
You might wonder what YAML stands for: it's "yet another markup language", which is designed for representing hierarchical data in a way that's easy for humans to read and write.
R Markdown uses it to control many details of the output.
Here we'll discuss two: document parameters and bibliographies.
### Parameters
R Markdown documents can include one or more parameters whose values can be set when you render the report.
Parameters are useful when you want to re-render the same report with distinct values for various key inputs.
For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country.
To declare one or more parameters, use the `params` field.
This example uses a `my_class` parameter to determine which class of cars to display:
```{r, echo = FALSE, comment = ""}
cat(readr::read_file("rmarkdown/fuel-economy.Rmd"))
```
As you can see, parameters are available within the code chunks as a read-only list named `params`.
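For instance, a chunk inside the parameterised report might filter on the parameter like this (a sketch that assumes the `my_class` parameter shown above and that the tidyverse is loaded in that report):

```{r, eval = FALSE}
# `params` is created automatically when the document is rendered
class_data <- filter(mpg, class == params$my_class)
ggplot(class_data, aes(displ, hwy)) +
  geom_point()
```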
You can write atomic vectors directly into the YAML header.
You can also run arbitrary R expressions by prefacing the parameter value with `!r`.
This is a good way to specify date/time parameters.
``` {.yaml}
params:
start: !r lubridate::ymd("2015-01-01")
snapshot: !r lubridate::ymd_hms("2015-01-01 12:30:00")
```
In RStudio, you can click the "Knit with Parameters" option in the Knit dropdown menu to set parameters, render, and preview the report in a single user friendly step. You can customise the dialog by setting other options in the header. See <http://rmarkdown.rstudio.com/developer_parameterized_reports.html#parameter_user_interfaces> for more details.
In RStudio, you can click the "Knit with Parameters" option in the Knit dropdown menu to set parameters, render, and preview the report in a single user friendly step.
You can customise the dialog by setting other options in the header.
See <http://rmarkdown.rstudio.com/developer_parameterized_reports.html#parameter_user_interfaces> for more details.
Alternatively, if you need to produce many such parameterised reports, you can call `rmarkdown::render()` with a list of `params`:
```{r, eval = FALSE}
rmarkdown::render("fuel-economy.Rmd", params = list(my_class = "suv"))
```
This is particularly powerful in conjunction with `purrr::pwalk()`.
The following example creates a report for each value of `class` found in `mpg`.
First we create a data frame that has one row for each class, giving the `filename` of the report and the `params`:
```{r}
reports <- tibble(
  class = unique(mpg$class),
  filename = stringr::str_c("fuel-economy-", class, ".html"),
  params = map(class, ~ list(my_class = .))
)
reports
```

Then we match the column names to the argument names of `render()`, and use purrr's parallel walk to call `render()` once for each row:

```{r, eval = FALSE}
reports %>%
  select(output_file = filename, params) %>%
  purrr::pwalk(rmarkdown::render, input = "fuel-economy.Rmd")
```
### Bibliographies and Citations
Pandoc can automatically generate citations and a bibliography in a number of styles.
To use this feature, specify a bibliography file using the `bibliography` field in your file's header.
The field should contain a path from the directory that contains your .Rmd file to the bibliography file:
``` {.yaml}
bibliography: rmarkdown.bib
```
You can use many common bibliography formats including BibLaTeX, BibTeX, EndNote, and MEDLINE.
To create a citation within your .Rmd file, use a key composed of '\@' + the citation identifier from the bibliography file.
Then place the citation in square brackets.
Here are some examples:
``` {.markdown}
Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].

You can add arbitrary comments inside the square brackets:
Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].

Add a `-` before the citation to suppress the author's name:
Smith says blah [-@smith04].
```
When R Markdown renders your file, it will build and append a bibliography to the end of your document.
The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading.
As a result it is common practice to end your file with a section header for the bibliography, such as `# References` or `# Bibliography`.
You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the `csl` field:
``` {.yaml}
bibliography: rmarkdown.bib
csl: apa.csl
```
As with the bibliography field, your csl file should contain a path to the file.
Here I assume that the csl file is in the same directory as the .Rmd file.
A good place to find CSL style files for common bibliography styles is <http://github.com/citation-style-language/styles>.
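If you just need one of the common styles, you could download it straight from that repository into the directory containing your .Rmd (the exact raw URL layout is an assumption here):

```{r, eval = FALSE}
# Download the APA style next to the .Rmd (assumes the repository's raw file layout)
download.file(
  "https://raw.githubusercontent.com/citation-style-language/styles/master/apa.csl",
  destfile = "apa.csl"
)
```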
## Learning more
R Markdown is still relatively young, and is still growing rapidly.
The best place to stay on top of innovations is the official R Markdown website: <http://rmarkdown.rstudio.com>.
There are two important topics that we haven't covered here: collaboration, and the details of accurately communicating your ideas to other humans.
Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub.
We recommend two free resources that will teach you about Git:
1. "Happy Git with R": a user friendly introduction to Git and GitHub from
R users, by Jenny Bryan. The book is freely available online:
<http://happygitwithr.com>
1. The "Git and GitHub" chapter of _R Packages_, by Hadley. You can also
read it for free online: <http://r-pkgs.had.co.nz/git.html>.
1. "Happy Git with R": a user friendly introduction to Git and GitHub from R users, by Jenny Bryan.
The book is freely available online: <http://happygitwithr.com>
2. The "Git and GitHub" chapter of *R Packages*, by Hadley.
You can also read it for free online: <http://r-pkgs.had.co.nz/git.html>.
I have also not touched on what you should actually write in order to clearly communicate the results of your analysis.
To improve your writing, I highly recommend reading either [*Style: Lessons in Clarity and Grace*](https://amzn.com/0134080416) by Joseph M. Williams & Joseph Bizup, or [*The Sense of Structure: Writing from the Reader's Perspective*](https://amzn.com/0205296327) by George Gopen.
Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear.
(These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies).
George Gopen also has a number of short articles on writing at <https://www.georgegopen.com/the-litigation-articles.html>.
They are aimed at lawyers, but almost everything applies to data scientists too.
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier. In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them `data.frame`s.
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
Tibbles *are* data frames, but they tweak some older behaviours to make life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them `data.frame`s.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
### Prerequisites
In this chapter we'll explore the **tibble** package, part of the core tidyverse.
```{r setup, message = FALSE}
library(tidyverse)
```
## Creating tibbles
Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse.
Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble.
You can do that with `as_tibble()`:
```{r}
as_tibble(iris)
```
You can create a new tibble from individual vectors with `tibble()`.
`tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown below.
```{r}
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
```
If you're already familiar with `data.frame()`, note that `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
It's possible for a tibble to have column names that are not valid R variable names, aka **non-syntactic** names.
For example, they might not start with a letter, or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
```
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
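For example (a quick sketch reusing `tb` from above):

```{r, eval = FALSE}
# Backticks work the same way inside dplyr verbs
tb %>% select(`:)`, `2000`)
tb %>% rename(smile = `:)`)
```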
Another way to create a tibble is with `tribble()`, short for **tr**ansposed tibble.
`tribble()` is customised for data entry in code: column headings are defined by formulas (i.e. they start with `~`), and entries are separated by commas.
This makes it possible to lay out small amounts of data in easy to read form.
```{r}
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
```

There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```
Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames.
But sometimes you need more output than the default display.
There are a few options that can help.
First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display.
`width = Inf` will display all columns:
```{r, eval = FALSE}
nycflights13::flights %>%
  print(n = 10, width = Inf)
```
You can also control the default print behaviour by setting options:
- `options(tibble.print_max = n, tibble.print_min = m)`: if more than `n` rows, print only `m` rows.
Use `options(tibble.print_min = Inf)` to always show all rows.
- Use `options(tibble.width = Inf)` to always print all columns, regardless of the width of the screen.
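For example, you might set (the values here are just illustrative):

```{r, eval = FALSE}
# Show 15 rows when a tibble has more than 15, and never truncate columns
options(tibble.print_max = 15, tibble.print_min = 15)
options(tibble.width = Inf)
```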
You can see a complete list of options by looking at the package help with `package?tibble`.
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
```{r, eval = FALSE}
nycflights13::flights %>%
  View()
```
### Subsetting
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you need some new tools, `$` and `[[`.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
df[["x"]]

# Extract by position
df[[1]]
```

Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
## Interacting with older code
Some older functions don't work with tibbles.
If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a `data.frame`:
```{r}
class(as.data.frame(tb))
```
The main reason that some older functions don't work with tibble is the `[` function.
We don't use `[` much in this book because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting)).
With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector.
With tibbles, `[` always returns another tibble.
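A quick way to see the difference for yourself (the object names are just for illustration):

```{r, eval = FALSE}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, "x"])  # a bare integer vector
tb1 <- tibble(x = 1:3, y = 3:1)
class(tb1[, "x"])  # still a tibble
```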
## Exercises
1. How can you tell if an object is a tibble?
(Hint: try printing `mtcars`, which is a regular data frame).
2. Compare and contrast the following operations on a `data.frame` and equivalent tibble.
What is different?
Why might the default data frame behaviours cause you frustration?
```{r, eval = FALSE}
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
```
1. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`,
how can you extract the reference variable from a tibble?
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by:
1. Extracting the variable called `1`.
2. Plotting a scatterplot of `1` vs `2`.
3. Creating a new column called `3` which is `2` divided by `1`.
4. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
`1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
```
5. What does `tibble::enframe()` do?
When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble?
## Introduction
> "Happy families are all alike; every unhappy family is unhappy in its
> own way." -- Leo Tolstoy
> "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham
> "Tidy datasets are all alike, but every messy dataset is messy in its own way." ---- Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R, an organisation called **tidy data**.
Getting your data into this format requires some upfront work, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the **tidyr** package.
If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
### Prerequisites
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
library(tidyverse)
```
## Tidy data
You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways.
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way.
```{r}
table1
table2
table3
table4a # cases
table4b # population
```
These are all representations of the same underlying data, but they are not equally easy to use.
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
Figure \@ref(fig:tidy-structure) shows the rules visually.
```{r tidy-structure, echo = FALSE}
knitr::include_graphics("images/tidy-1.png")
```
These three rules are interrelated because it's impossible to only satisfy two of the three.
That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
2. Put each variable in a column.
In this example, only `table1` is tidy.
It's the only representation where each column is a variable.
Why ensure that your data is tidy?
There are two main advantages:
1. There's a general advantage to picking one consistent way of storing data.
If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.
2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine.
As you learned in Sections \@ref(mutate-funs) and \@ref(summarise-funs), most built-in R functions work with vectors of values.
That makes transforming tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with `table1`.
```{r fig.width = 5, fig.alt = "This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale."}
# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>%
  count(year, wt = cases)

# Visualise changes over time
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country, shape = country))
```
### Exercises
1. Using prose, describe how the variables and observations are organised in each of the sample tables.
2. Compute the `rate` for `table2`, and `table4a` + `table4b`.
You will need to perform four operations:
a. Extract the number of TB cases per country per year.
b. Extract the matching population per country per year.
c. Divide cases by population, and multiply by 10000.
d. Store back in the appropriate place.
Which representation is easiest to work with?
Which is hardest?
Why?
3. Recreate the plot showing change in cases over time using `table2` instead of `table1`.
What do you need to do first?
## Pivoting
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy.
Unfortunately, however, most data that you will encounter will be untidy.
There are two main reasons:
1. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a *lot* of time working with data.
2. Data is often organised to facilitate some use other than analysis.
For example, data is often organised to make entry as easy as possible.
This means for most real analyses, you'll need to do some tidying.
The first step is always to figure out what the variables and observations are.
Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
2. One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky!
To fix these problems, you'll need the two most important functions in tidyr: `pivot_longer()` and `pivot_wider()`.
### Longer
A common problem is a dataset where some of the column names are not names of variables, but *values* of a variable.
Suppose you have your data in the following format.
```{r}
table4a
```
However in `table4a` the column names `1999` and `2000` represent values of the `year` variable, the values in the `1999` and `2000` columns represent values of the `cases` variable, and each row represents two observations, not one.
To tidy a dataset like this, we need to **pivot** the offending columns into a new pair of variables.
To describe that operation we need three parameters:
- The set of columns whose names are values, not variables.
In this example, those are the columns `1999` and `2000`.
- The name of the variable to move the column names to: `year`.
- The name of the variable to move the column values to: `cases`.
Together those parameters generate the call to `pivot_longer()`:
```{r}
table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
)
```
The columns to pivot are specified with `dplyr::select()` style notation in the `cols` argument.
Here there are only two columns, so we list them individually.
Note that `1999` and `2000` are non-syntactic names (because they don't start with a letter) so we have to surround them in backticks.
To refresh your memory of the other ways to select columns, see Section \@ref(select).
`year` and `cases` do not exist in `table4a` so we put their names in quotes in `names_to` and `values_to` arguments, respectively.
In the final result, the pivoted columns are dropped, and we get new `year` and `cases` columns.
Otherwise, the relationships between the original variables are preserved.
Visually, this is shown in Figure \@ref(fig:tidy-pivot-longer).
```{r tidy-pivot-longer, echo = FALSE, out.width = "100%", fig.cap = "Pivoting `table4a` into a \"longer\", tidy form.", fig.alt = "Two panels, one with a longer and the other with a wider data frame. Arrows represent how values in the 1999 and 2000 columns of the wider data frame are pivoted to a column named cases in the longer data frame and how column names from the wider data frame (1999 and 2000) are pivoted into column names in the longer data frame."}
knitr::include_graphics("images/tidy-9.png")
```
There is still one issue though.
Take a peek at the type of the `year` variable.
We would expect `year` to be numeric (or specifically, we would expect it to be an integer), however it's showing up as a character.
This is because the values in the `year` variable came from column headings in `table4a`.
We can add a new step to our pipeline using `dplyr::mutate()` to parse this variable as an integer with `readr::parse_integer()`.
You can refer back to Section \@ref(parsing-a-vector) for functions for parsing other types of vectors.
```{r}
table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  ) %>%
  mutate(year = parse_integer(year))
```

Once we have our data in this longer format, we can create the visualisation that we made earlier:
```{r ref.label = "tidy-pivot-longer-plot-lines", fig.alt = "Number of cases over the years for each country.", fig.alt = "This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale."}
```
`pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns.
I don't believe it makes sense to describe a dataset as being in "long form".
Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.
We can use `pivot_longer()` to tidy `table4b` in a similar fashion.
The only difference is the variable stored in the cell values:
```{r}
table4b %>%
@ -243,7 +269,9 @@ left_join(tidy4a, tidy4b)
### Wider
`pivot_wider()` is the opposite of `pivot_longer()`.
You use it when an observation is scattered across multiple rows.
For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.
```{r}
table2
@ -257,11 +285,13 @@ table2 %>%
mutate(rate = cases / population)
```
This means we need a data frame with `cases` and `population` as separate columns, and in those columns, each cell will hold the values of the relevant `count`s.
Let's analyse the representation in a similar way to `pivot_longer()`.
This time, however, we only need two parameters:
- The column to take variable names from: `type`.
- The column to take values from: `count`.
We can use `pivot_wider()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-pivot-wider).
@ -304,15 +334,17 @@ table2 %>%
relocate(country, contains("1999"))
```
This representation is rarely useful for data analysis but it might be useful as the basis of a table for communication of results in a data analysis report.
To achieve this we need to add year information in column headings for `cases`, `population`, and `rate` as well as distribute the values that are currently under these three columns into six columns (two columns for each year we have data for).
This is represented in Figure \@ref(fig:tidy-pivot-even-wider).
```{r tidy-pivot-even-wider, echo = FALSE, out.width = "100%", fig.cap = "Pivoting `table2` into an even \"wider\" form. Arrows for `cases` and `rate` values are omitted for clarity.", fig.alt = "Two panels, one with a wider and the other with an even wider data frame. Arrows represent how population values for 1999 and 2000 that are stored in a single column in the wide data frame are spread across two columns in the data frame that is even wider. These new columns are called population_1999 and population_2000."}
knitr::include_graphics("images/tidy-19.png")
```
To do so, we'll take advantage of the fact that the pivot functions can operate on multiple columns at once.
The first three lines of the following code chunk are what we've already done in the previous step, and we add another `pivot_wider()` step to the pipeline where the values for the added columns come from `cases`, `population`, and `rate` and the column names are automatically suffixed with values from the `year` variable.
```{r}
table2 %>%
@ -324,18 +356,20 @@ table2 %>%
)
```
The last step for achieving our goal is to relocate columns in the resulting data frame so columns for 1999 data come before those for 2000.
We can use the `relocate()` function to move the 1999 columns ahead of the 2000 columns.
```{r ref.label = "tidy-pivot-even-wider-case-ratio"}
```
As you might have guessed from their names, `pivot_wider()` and `pivot_longer()` are complements.
`pivot_longer()` makes wide tables narrower and longer; `pivot_wider()` makes long tables shorter and wider.
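To make the complementarity concrete, here's a quick round trip (a sketch using `table4a`, which is loaded with tidyr): lengthening and then widening recovers the original layout.

```{r}
# Lengthen table4a, then widen it straight back to its original shape
table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases") %>%
  pivot_wider(names_from = year, values_from = cases)
```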
### Exercises
1. Why are `pivot_longer()` and `pivot_wider()` not perfectly symmetrical?\
Carefully consider the following example:
```{r, eval = FALSE}
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
@ -346,21 +380,22 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
```
(Hint: look at the variable types and think about column *names*.)
`pivot_longer()` has a `names_ptypes` argument, e.g. `names_ptypes = list(year = double())`.
What does it do?
2. Why does this code fail?
```{r, error = TRUE}
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
```
3. What would happen if you widen this table?
Why?
How could you add a new column to uniquely identify each value?
```{r}
people <- tribble(
@ -374,11 +409,10 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
)
```
4. The simple tibble below summarizes information on whether employees at a small company know how to drive and whether they prefer a position where they will need to drive daily for sales calls.
Tidy the table to get it into a format where each observation is an employee.
Do you need to make it wider or longer?
What are the variables?
```{r}
employees <- tribble(
@ -388,20 +422,16 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
)
```
5. One way of summarising the distribution of one categorical variable based on the levels of another is using `dplyr::count()`, e.g. the following gives the distribution of `drv` (type of drive train) for each level of `cyl` (number of cylinders) for cars in the `mpg` dataset.
```{r}
mpg %>%
count(cyl, drv)
```
A contingency table is another commonly used way of summarising this information.
Use one of the pivoting functions to construct the contingency table shown below based on the output above.
```{r echo = FALSE}
mpg %>%
count(cyl, drv) %>%
@ -410,17 +440,22 @@ As you might have guessed from their names, `pivot_wider()` and `pivot_longer()`
## Separating
So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`.
`table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
To fix this problem, we'll need the `separate()` function.
You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Take `table3`:
```{r}
table3
```
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables.
`separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
@ -431,7 +466,10 @@ table3 %>%
knitr::include_graphics("images/tidy-17.png")
```
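As a minimal sketch, the separation described above looks like this:

```{r}
# Split rate into two new columns, cases and population
table3 %>%
  separate(rate, into = c("cases", "population"))
```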
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`.
For example, we could rewrite the code above as:
```{r eval = FALSE}
table3 %>%
@ -440,7 +478,10 @@ table3 %>%
(Formally, `sep` is a regular expression, which you'll learn more about in Chapter \@ref(strings).)
Look carefully at the column types: you'll notice that `cases` and `population` are character columns.
This is the default behaviour in `separate()`: it leaves the type of the column as is.
Here, however, it's not very useful as those really are numbers.
We can ask `separate()` to try and convert to better types using `convert = TRUE`:
```{r}
table3 %>%
@ -449,16 +490,21 @@ table3 %>%
### Unite
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
We can use `unite()` to rejoin the `cases` and `population` columns that we created in the last example.
That data is saved as `tidyr::table1`.
`unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table1 %>%
unite(rate, cases, population)
```
In this case we also need to use the `sep` argument.
The default will place an underscore (`_`) between the values from different columns.
Here we want `"/"` instead:
```{r}
table1 %>%
@ -467,27 +513,26 @@ table1 %>%
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
```{r, eval = FALSE}
events <- tribble(
@ -496,18 +541,16 @@ table1 %>%
1 , 21,
1 , 22
)
events %>%
unite("date", month:day, sep = "-", remove = FALSE)
```
5. You can also pass a vector of integers to `sep`.
`separate()` will interpret the integers as positions to split at.
Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
Do this in two ways: using a positive and a negative value for `sep`.
```{r}
baker <- tribble(
~location,
@ -517,14 +560,14 @@ table1 %>%
)
baker
```
## Missing values
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
- **Explicitly**, i.e. flagged with `NA`.
- **Implicitly**, i.e. simply not present in the data.
Let's illustrate this idea with a very simple data set:
@ -538,15 +581,14 @@ stocks <- tibble(
There are two missing values in this dataset:
- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit values explicit.
For example, we can make the implicit missing value explicit by putting years in the columns:
```{r}
stocks %>%
@ -573,9 +615,11 @@ stocks %>%
complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
There's one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- tribble(
@ -587,7 +631,8 @@ treatment <- tribble(
)
```
You can fill in these missing values with `fill()`.
It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
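Here is a minimal sketch on a made-up tibble (`sales_log` is hypothetical, not part of the chapter's data):

```{r}
# Hypothetical data-entry log: person is only recorded when it changes
sales_log <- tribble(
  ~person, ~visit, ~sales,
  "Amira", 1,      12,
  NA,      2,      15,
  NA,      3,      11,
  "Bo",    1,      8
)

# fill() replaces each NA with the most recent non-missing value above it
sales_log %>%
  fill(person)
```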
```{r}
treatment %>%
@ -596,13 +641,15 @@ treatment %>%
### Exercises
1. Compare and contrast the `fill` arguments to `pivot_wider()` and `complete()`.
2. What does the direction argument to `fill()` do?
## Case study
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem.
The `tidyr::who` dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method.
The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available at <http://www.who.int/tb/country/data/download/en>.
There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:
@ -610,7 +657,10 @@ There's a wealth of epidemiological information in this dataset, but it's challe
who
```
This is a very typical real-life example dataset.
It contains redundant columns, odd variable names, and many missing values.
In short, the `who` dataset is messy, and we'll need to be methodical about how we tidy it.
With functions like `pivot_wider()` and `pivot_longer()` this generally means an iterative approach will work well -- aim to accomplish one goal at a time, run the function and examine the resulting data frame, then go back and set more arguments of the function as needed until the resulting data frame is exactly what you need.
The best place to start is to take a good look at the variable names and determine whether they are actually variables or if they contain information that should be captured as values in a new column.
@ -618,46 +668,43 @@ The best place to start is to take a good look at the variable names and determi
names(who)
```
- It looks like `country`, `iso2`, and `iso3` are three variables that redundantly specify the country.
- `year` is also a variable.
- The first three letters of the variables `new_sp_m014` through `newrel_f65` denote whether the column contains new or old cases of TB.
In this dataset, each column contains new cases, so we don't really need this information to be captured in a variable.
The remaining characters encode three variables in their names.
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy.
It tells us:
1. The next two or three letters describe the diagnosis of TB:
- `rel` stands for cases of relapse
- `ep` stands for cases of extrapulmonary TB
- `sn` stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- `sp` stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
2. The next letter gives the sex of TB patients.
The dataset groups cases by males (`m`) and females (`f`).
3. The remaining numbers give the age group.
The dataset groups cases into seven age groups:
- `014` = 0 -- 14 years old
- `1524` = 15 -- 24 years old
- `2534` = 25 -- 34 years old
- `3544` = 35 -- 44 years old
- `4554` = 45 -- 54 years old
- `5564` = 55 -- 64 years old
- `65` = 65 or older
We can break these variables up by specifying multiple column names in `names_to` and then providing `names_pattern` to specify how we want to break them up: a regular expression containing groups (defined by `()`), with each group placed in its own column.
You'll learn more about regular expressions in Chapter \@ref(strings), but the basic idea is that in a variable name like `new_sp_m014`, we want to capture `sp`, `m`, and `014` as separate groups, so we can think about this variable's name as `new_(sp)_(m)(014)`.
In constructing the appropriate regular expression we need to keep in mind a few messy features of these variable names:
- Some of the variables start with `new_` while some of them start with `new` without an underscore separating it from the diagnosis.
- The diagnoses and the age groups are indicated by varying numbers of characters (e.g. `sp` vs. `rel` and `014` vs. `4554`.)
The regular expression that will capture all of these inconsistencies and extract the three groups of information we need is `new_?(.*)_(.)(.*)`.
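Putting this together, a sketch of the `pivot_longer()` call the text describes (the new column names `diagnosis`, `gender`, `age`, and `cases` follow the discussion below):

```{r}
# Lengthen all the count columns at once, splitting each column name into
# diagnosis, gender, and age using the regular expression above
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "cases"
  )
```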
@ -671,7 +718,9 @@ who %>%
)
```
This looks pretty good for a first pass, but there are some improvements we can make.
First, we're seeing lots of `NA`s in the `cases` column.
We can drop these observations by setting `values_drop_na` to `TRUE`.
```{r}
who %>%
@ -684,7 +733,8 @@ who %>%
)
```
Second, `diagnosis` and `gender` are characters by default, however it's a good idea to convert them to factors since they are categorical variables with a known set of values.
We'll use the `parse_factor()` function from readr to make the conversion in a `mutate()` step we add to the pipeline.
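In isolation, `parse_factor()` works like this (a toy sketch on hand-typed vectors, using the diagnosis and sex codes from the data dictionary):

```{r}
# Toy vectors, not the who pipeline: strings become factors with known levels
parse_factor(c("sp", "rel", "ep", "sp"), levels = c("rel", "ep", "sn", "sp"))
parse_factor(c("m", "f", "f"), levels = c("f", "m"))
```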
```{r}
who %>%
@ -705,7 +755,8 @@ who %>%
)
```
Finally, we might want to recode the `age` variable with level names that are a bit easier to read and a bit more informative.
We'll do this within the `mutate()` step of our pipeline using `forcats::fct_recode()` that you'll learn more about in Chapter \@ref(factors).
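On its own, `fct_recode()` takes pairs of the form `new_name = "old_name"` (a toy sketch with three of the age codes; the new labels are illustrative choices):

```{r}
# Toy example: recode a few age codes into more readable labels
forcats::fct_recode(factor(c("014", "1524", "65")),
  "0-14"  = "014",
  "15-24" = "1524",
  "65+"   = "65"
)
```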
```{r}
who_tidy <- who %>%
@ -737,7 +788,8 @@ who_tidy <- who %>%
who_tidy
```
This tidy data frame allows us to explore the data with more ease than the original `who` dataset.
For example, we can easily filter for a particular type of TB for a given country and sum over the number of cases to see how case numbers for this type of TB have evolved over the years.
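A sketch of that kind of summary (the exact country label in `who` is not shown here, so the sketch matches on a substring instead):

```{r}
# Yearly totals of smear positive ("sp") cases for one country;
# str_detect() avoids having to guess the exact country label
who_tidy %>%
  filter(str_detect(country, "United States"), diagnosis == "sp") %>%
  group_by(year) %>%
  summarise(total_cases = sum(cases))
```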
```{r fig.alt = "A scatterplot of number of smear positive pulmonary TB cases in the US over the years, with year on the x-axis ranging from 1995 to 2013 and yearly total number of cases on the y-axis ranging from 3000 to 8000. The points on the scatterplot are overlaid with a smooth curve, which shows a strong, negative association between the two variables."}
who_tidy %>%
@ -750,32 +802,33 @@ who_tidy %>%
labs(title = "Number of smear positive pulmonary TB cases in the US")
```
### Exercises
1. In this case study I set `values_drop_na = TRUE` just to make it easier to check that we had the correct values.
Is this reasonable?
Think about how missing values are represented in this dataset.
Are there implicit missing values?
What's the difference between an `NA` and zero?
2. I claimed that `iso2` and `iso3` were redundant with `country`.
Confirm this claim and think about situations where we might want to keep this information in the data frame and when we might choose to discard the redundant columns.
3. For each country, year, and sex compute the total number of cases of TB.
Make an informative visualisation of the data.
## Non-tidy data
Before we continue on to other topics, it's worth talking briefly about non-tidy data.
Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data.
That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data.
There are two main reasons to use other data structures:
- Alternative representations may have substantial performance or space advantages.
- Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.
Either of these reasons means you'll need something other than a tibble (or data frame).
If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice.
But there are good reasons to use other structures; tidy data is not the only way.
If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data>.

File diff suppressed because it is too large

View File

@ -2,13 +2,20 @@
## Introduction
So far this book has focussed on tibbles and packages that work with them.
But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles.
If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles.
I think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
Vectors are particularly important as most of the functions you will write will work with vectors.
It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature.
I am working on a better approach, <https://github.com/hadley/lazyeval>, but it will not be ready in time for the publication of the book.
Even when complete, you'll still need to understand vectors, it'll just make it easier to write a user-friendly layer on top.
### Prerequisites
The focus of this chapter is on base R data structures, so it isn't essential to load any packages.
We will, however, use a handful of functions from the **purrr** package to avoid some inconsistencies in base R.
```{r setup, message = FALSE}
library(tidyverse)
@ -18,51 +25,58 @@ library(tidyverse)
There are two types of vectors:
1. **Atomic** vectors, of which there are six types: **logical**, **integer**, **double**, **character**, **complex**, and **raw**.
Integer and double vectors are collectively known as **numeric** vectors.
2. **Lists**, which are sometimes called recursive vectors because lists can contain other lists.
The chief difference between atomic vectors and lists is that atomic vectors are **homogeneous**, while lists can be **heterogeneous**.
There's one other related object: `NULL`.
`NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector).
`NULL` typically behaves like a vector of length 0.
Figure \@ref(fig:datatypes) summarises the interrelationships.
```{r datatypes, echo = FALSE, out.width = "50%", fig.cap = "The hierarchy of R's vector types"}
knitr::include_graphics("diagrams/data-structures-overview.png")
```
Every vector has two key properties:
1. Its **type**, which you can determine with `typeof()`.
```{r}
typeof(letters)
typeof(1:10)
```
2. Its **length**, which you can determine with `length()`.
```{r}
x <- list("a", "b", 1:10)
length(x)
```
Vectors can also contain arbitrary additional metadata in the form of attributes.
These attributes are used to create **augmented vectors** which build on additional behaviour.
There are three important types of augmented vector:
- Factors are built on top of integer vectors.
- Dates and date-times are built on top of numeric vectors.
- Data frames and tibbles are built on top of lists.
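A quick sketch of what those attributes look like for two augmented vectors:

```{r}
# A factor is an integer vector with levels and class attributes
x <- factor(c("a", "b", "a"))
typeof(x)
attributes(x)

# A date is a double with a class attribute
today <- Sys.Date()
typeof(today)
attributes(today)
```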
This chapter will introduce you to these important vectors from simplest to most complicated.
You'll start with atomic vectors, then build up to lists, and finish off with augmented vectors.
## Important types of atomic vector
The four most important types of atomic vector are logical, integer, double, and character.
Raw and complex are rarely used during a data analysis, so I won't discuss them here.
### Logical
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`.
Logical vectors are usually constructed with comparison operators, as described in [comparisons].
You can also create them by hand with `c()`:
```{r}
1:10 %% 3 == 0
@ -72,7 +86,9 @@ c(TRUE, TRUE, FALSE, NA)
### Numeric
Integer and double vectors are known collectively as numeric vectors.
In R, numbers are doubles by default.
To make an integer, place an `L` after the number:
```{r}
typeof(1)
@ -82,9 +98,9 @@ typeof(1L)
The distinction between integers and doubles is not usually important, but there are two important differences that you should be aware of:
1. Doubles are approximations.
Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory.
This means that you should consider all doubles to be approximations.
For example, what is the square of the square root of two?
```{r}
@ -93,33 +109,35 @@ The distinction between integers and doubles is not usually important, but there
x - 2
```
This behaviour is common when working with floating point numbers: most calculations include some approximation error.
Instead of comparing floating point numbers using `==`, you should use `dplyr::near()` which allows for some numerical tolerance (see the sketch after this list).
2. Integers have one special value: `NA`, while doubles have four: `NA`, `NaN`, `Inf` and `-Inf`.
All three special values `NaN`, `Inf` and `-Inf` can arise during division:
```{r}
c(-1, 0, 1) / 0
```
Avoid using `==` to check for these other special values.
Instead use the helper functions `is.finite()`, `is.infinite()`, and `is.nan()`:
| | 0 | Inf | NA | NaN |
|-----------------|-----|-----|-----|-----|
| `is.finite()` | x | | | |
| `is.infinite()` | | x | | |
| `is.na()` | | | x | x |
| `is.nan()` | | | | x |
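Here is the sketch referenced above: `near()` tolerates the approximation error, and the helpers pick out the special values from the table.

```{r}
# Doubles are approximations, so prefer near() over == for comparisons
x <- sqrt(2) ^ 2
x == 2
dplyr::near(x, 2)

# The helper functions from the table above
is.finite(c(0, Inf, NA, NaN))
is.infinite(c(0, Inf, NA, NaN))
is.nan(c(0, Inf, NA, NaN))
```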
### Character
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
You've already learned a lot about working with strings in [strings].
Here I wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
This means that each unique string is only stored in memory once, and every use of the string points to that representation.
This reduces the amount of memory needed by duplicated strings.
You can see this behaviour in practice with `pryr::object_size()`:
```{r}
x <- "This is a reasonably long string."
@ -129,7 +147,8 @@ y <- rep(x, 1000)
pryr::object_size(y)
```
`y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string.
A pointer is 8 bytes, so 1000 pointers to a 152 B string is 8 \* 1000 + 152 = 8.14 kB.
### Missing values
@ -142,59 +161,58 @@ NA_real_ # double
NA_character_ # character
```
Normally you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type using the implicit coercion rules described next.
However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can be specific when needed.
### Exercises
1. Describe the difference between `is.finite(x)` and `!is.infinite(x)`.
2. Read the source code for `dplyr::near()` (Hint: to see the source code, drop the `()`).
How does it work?
3. A logical vector can take 3 possible values.
How many possible values can an integer vector take?
How many possible values can a double take?
Use google to do some research.
4. Brainstorm at least four functions that allow you to convert a double to an integer.
How do they differ?
Be precise.
5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?
## Using atomic vectors
Now that you understand the different types of atomic vector, it's useful to review some of the important tools for working with them.
These include:
1. How to convert from one type to another, and when that happens automatically.
2. How to tell if an object is a specific type of vector.
3. What happens when you work with vectors of different lengths.
4. How to name the elements of a vector.
5. How to pull out elements of interest.
### Coercion
There are two ways to convert, or coerce, one type of vector to another:
1. Explicit coercion happens when you call a function like `as.logical()`, `as.integer()`, `as.double()`, or `as.character()`.
Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place.
For example, you may need to tweak your readr `col_types` specification.
2. Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector.
For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
Because explicit coercion is used relatively rarely, and is largely easy to understand, I'll focus on implicit coercion here.
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context.
In this case `TRUE` is converted to `1` and `FALSE` converted to `0`.
That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
```{r}
x <- sample(20, 100, replace = TRUE)
@ -211,7 +229,9 @@ if (length(x)) {
}
```
In this case, 0 is converted to `FALSE` and everything else is converted to `TRUE`.
I think this makes it harder to understand your code, and I don't recommend it.
Instead be explicit: `length(x) > 0`.
It's also important to understand what happens when you try and create a vector containing multiple types with `c()`: the most complex type always wins.
@ -221,35 +241,45 @@ typeof(c(1L, 1.5))
typeof(c(1.5, "a"))
```
An atomic vector can not have a mix of different types because the type is a property of the complete vector, not the individual elements.
If you need to mix multiple types in the same vector, you should use a list, which you'll learn about shortly.
### Test functions
Sometimes you want to do different things based on the type of vector.
One option is to use `typeof()`.
Another is to use a test function which returns a `TRUE` or `FALSE`.
Base R provides many functions like `is.vector()` and `is.atomic()`, but they often return surprising results.
Instead, it's safer to use the `is_*` functions provided by purrr, which are summarised in the table below.
| | lgl | int | dbl | chr | list |
|------------------|-----|-----|-----|-----|------|
| `is_logical()` | x | | | | |
| `is_integer()` | | x | | | |
| `is_double()` | | | x | | |
| `is_numeric()` | | x | x | | |
| `is_character()` | | | | x | |
| `is_atomic()` | x | x | x | x | |
| `is_list()` | | | | | x |
| `is_vector()` | x | x | x | x | x |
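A few of these in action (a quick sketch):

```{r}
# Each is_*() function returns a single TRUE or FALSE
is_double(1.5)
is_integer(1:3)
is_character(letters)
is_list(list(1, "a"))
is_atomic(list(1, "a"))
is_vector(list(1, "a"))
```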
### Scalars and recycling rules
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors.
This is called vector **recycling**, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
This is generally most useful when you are mixing vectors and "scalars".
I put scalars in quotes because R doesn't actually have scalars: instead, a single number is a vector of length 1.
Because there are no scalars, most built-in functions are **vectorised**, meaning that they will operate on a vector of numbers.
That's why, for example, this code works:
```{r}
sample(10) + 100
runif(10) > 0.5
```
In R, basic mathematical operations work with vectors.
That means that you should never need to perform explicit iteration when performing simple mathematical computations.
It's intuitive what should happen if you add two vectors of the same length, or a vector and a "scalar", but what happens if you add two vectors of different lengths?
@ -257,13 +287,16 @@ It's intuitive what should happen if you add two vectors of the same length, or
1:10 + 1:2
```
Here, R will expand the shortest vector to the same length as the longest, so-called recycling.
This is silent except when the length of the longer is not an integer multiple of the length of the shorter:
```{r}
1:10 + 1:3
```
While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems.
For this reason, the vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar.
If you do want to recycle, you'll need to do it yourself with `rep()`:
```{r, error = TRUE}
tibble(x = 1:4, y = 1:2)
@ -275,7 +308,8 @@ tibble(x = 1:4, y = rep(1:2, each = 2))
### Naming vectors
All types of vectors can be named.
You can name them during creation with `c()`:
```{r}
c(x = 1, y = 2, z = 4)
@ -291,112 +325,112 @@ Named vectors are most useful for subsetting, described next.
### Subsetting {#vector-subsetting}
So far we've used `dplyr::filter()` to filter the rows in a tibble.
`filter()` only works with tibbles, so we'll need a new tool for vectors: `[`.
`[` is the subsetting function, and is called like `x[a]`.
There are four types of things that you can subset a vector with:
1. A numeric vector containing only integers.
The integers must either be all positive, all negative, or zero.
Subsetting with positive integers keeps the elements at those positions:
```{r}
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
```
By repeating a position, you can actually make a longer output than input:
```{r}
x[c(1, 1, 5, 5, 5, 2)]
```
Negative values drop the elements at the specified positions:
```{r}
x[c(-1, -3, -5)]
```
It's an error to mix positive and negative values:
```{r, error = TRUE}
x[c(1, -1)]
```
The error message mentions subsetting with zero, which returns no values:
```{r}
x[0]
```
This is not useful very often, but it can be helpful if you want to create unusual data structures to test your functions with.
2. Subsetting with a logical vector keeps all values corresponding to a `TRUE` value.
This is most often useful in conjunction with the comparison functions.
```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
# All even (or missing!) values of x
x[x %% 2 == 0]
```
1. If you have a named vector, you can subset it with a character vector:
3. If you have a named vector, you can subset it with a character vector:
```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
```
Like with positive integers, you can also use a character vector to
duplicate individual entries.
1. The simplest type of subsetting is nothing, `x[]`, which returns the
complete `x`. This is not useful for subsetting vectors, but it is useful
when subsetting matrices (and other high dimensional structures) because
it lets you select all the rows or all the columns, by leaving that
index blank. For example, if `x` is 2d, `x[1, ]` selects the first row and
all the columns, and `x[, -1]` selects all rows and all columns except
the first.
To learn more about the applications of subsetting, reading the "Subsetting" chapter of _Advanced R_: <http://adv-r.had.co.nz/Subsetting.html#applications>.
Like with positive integers, you can also use a character vector to duplicate individual entries.
There is an important variation of `[` called `[[`. `[[` only ever extracts a single element, and always drops names. It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop. The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
4. The simplest type of subsetting is nothing, `x[]`, which returns the complete `x`.
This is not useful for subsetting vectors, but it is useful when subsetting matrices (and other high dimensional structures) because it lets you select all the rows or all the columns, by leaving that index blank.
For example, if `x` is 2d, `x[1, ]` selects the first row and all the columns, and `x[, -1]` selects all rows and all columns except the first.
To learn more about the applications of subsetting, read the "Subsetting" chapter of *Advanced R*: <http://adv-r.had.co.nz/Subsetting.html#applications>.
There is an important variation of `[` called `[[`.
`[[` only ever extracts a single element, and always drops names.
It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop.
The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
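Here is a minimal sketch of that difference on a named atomic vector (an extra illustration, not one of the book's own examples):

```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x[1]   # `[` returns a length-one vector and keeps the name
x[[1]] # `[[` returns just the value and drops the name
```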
### Exercises
1. What does `mean(is.na(x))` tell you about a vector `x`? What about
`sum(!is.finite(x))`?
1. What does `mean(is.na(x))` tell you about a vector `x`?
What about `sum(!is.finite(x))`?
1. Carefully read the documentation of `is.vector()`. What does it actually
test for? Why does `is.atomic()` not agree with the definition of
atomic vectors above?
1. Compare and contrast `setNames()` with `purrr::set_names()`.
2. Carefully read the documentation of `is.vector()`.
What does it actually test for?
Why does `is.atomic()` not agree with the definition of atomic vectors above?
1. Create functions that take a vector as input and returns:
1. The last value. Should you use `[` or `[[`?
3. Compare and contrast `setNames()` with `purrr::set_names()`.
1. The elements at even numbered positions.
1. Every element except the last value.
1. Only even numbers (and no missing values).
4. Create functions that take a vector as input and return:
1. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
1. The last value.
Should you use `[` or `[[`?
1. What happens when you subset with a positive integer that's bigger
than the length of the vector? What happens when you subset with a
name that doesn't exist?
2. The elements at even numbered positions.
3. Every element except the last value.
4. Only even numbers (and no missing values).
5. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
6. What happens when you subset with a positive integer that's bigger than the length of the vector?
What happens when you subset with a name that doesn't exist?
## Recursive vectors (lists) {#lists}
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with `list()`:
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
This makes them suitable for representing hierarchical or tree-like structures.
You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
@ -428,7 +462,8 @@ str(z)
### Visualising lists
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists.
For example, take these three lists:
```{r}
x1 <- list(c(1, 2), c(3, 4))
@ -444,14 +479,12 @@ knitr::include_graphics("diagrams/lists-structure.png")
There are three principles:
1. Lists have rounded corners. Atomic vectors have square corners.
1. Children are drawn inside their parent, and have a slightly darker
background to make it easier to see the hierarchy.
1. The orientation of the children (i.e. rows or columns) isn't important,
so I'll pick a row or column orientation to either save space or illustrate
an important property in the example.
1. Lists have rounded corners.
Atomic vectors have square corners.
2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
3. The orientation of the children (i.e. rows or columns) isn't important, so I'll pick a row or column orientation to either save space or illustrate an important property in the example.
### Subsetting
@ -461,33 +494,34 @@ There are three ways to subset a list, which I'll illustrate with a list named `
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
* `[` extracts a sub-list. The result will always be a list.
- `[` extracts a sub-list.
The result will always be a list.
```{r}
str(a[1:2])
str(a[4])
```
Like with vectors, you can subset with a logical, integer, or character
vector.
* `[[` extracts a single component from a list. It removes a level of
hierarchy from the list.
Like with vectors, you can subset with a logical, integer, or character vector.
- `[[` extracts a single component from a list.
It removes a level of hierarchy from the list.
```{r}
str(a[[1]])
str(a[[4]])
```
* `$` is a shorthand for extracting named elements of a list. It works
similarly to `[[` except that you don't need to use quotes.
- `$` is a shorthand for extracting named elements of a list.
It works similarly to `[[` except that you don't need to use quotes.
```{r}
a$a
a[["a"]]
```
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list. Compare the code and output above with the visual representation in Figure \@ref(fig:lists-subsetting).
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list.
Compare the code and output above with the visual representation in Figure \@ref(fig:lists-subsetting).
```{r lists-subsetting, echo = FALSE, out.width = "75%", fig.cap = "Subsetting a list, visually."}
knitr::include_graphics("diagrams/lists-subsetting.png")
@ -495,29 +529,31 @@ knitr::include_graphics("diagrams/lists-subsetting.png")
### Lists of condiments
The difference between `[` and `[[` is very important, but it's easy to get confused. To help you remember, let me show you an unusual pepper shaker.
The difference between `[` and `[[` is very important, but it's easy to get confused.
To help you remember, let me show you an unusual pepper shaker.
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
```
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-1.jpg")
```
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
`x[2]` would look the same, but would contain the second packet.
`x[1:2]` would be a pepper shaker containing two pepper packets.
`x[[1]]` is:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-2.jpg")
```
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-3.jpg")
```
@ -526,14 +562,15 @@ knitr::include_graphics("images/pepper-3.jpg")
1. Draw the following lists as nested sets:
1. `list(a, b, list(c, d), list(e, f))`
1. `list(list(list(list(list(list(a))))))`
2. `list(list(list(list(list(list(a))))))`
1. What happens if you subset a tibble as if you're subsetting a list?
2. What happens if you subset a tibble as if you're subsetting a list?
What are the key differences between a list and a tibble?
## Attributes
Any vector can contain arbitrary additional metadata through its __attributes__. You can think of attributes as named list of vectors that can be attached to any object.
Any vector can contain arbitrary additional metadata through its **attributes**.
You can think of attributes as a named list of vectors that can be attached to any object.
You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
@ -546,11 +583,14 @@ attributes(x)
There are three very important attributes that are used to implement fundamental parts of R:
1. __Names__ are used to name the elements of a vector.
1. __Dimensions__ (dims, for short) make a vector behave like a matrix or array.
1. __Class__ is used to implement the S3 object oriented system.
1. **Names** are used to name the elements of a vector.
2. **Dimensions** (dims, for short) make a vector behave like a matrix or array.
3. **Class** is used to implement the S3 object oriented system.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book.
It remains to describe the class, which controls how **generic functions** work.
Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input.
A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in *Advanced R* at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
Here's what a typical generic function looks like:
@ -558,7 +598,9 @@ Here's what a typical generic function looks like:
as.Date
```
The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, a function, based on the class of the first argument. (All methods are functions; not all functions are methods). You can list all the methods for a generic with `methods()`:
The call to "UseMethod" means that this is a generic function, and it will call a specific **method**, a function, based on the class of the first argument.
(All methods are functions; not all functions are methods).
You can list all the methods for a generic with `methods()`:
```{r}
methods("as.Date")
@ -573,22 +615,27 @@ getS3method("as.Date", "default")
getS3method("as.Date", "numeric")
```
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console. Other important generics are the subsetting functions `[`, `[[`, and `$`.
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
Other important generics are the subsetting functions `[`, `[[`, and `$`.
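As a toy sketch of how this dispatch works (the `greeting` class and its method below are invented for illustration), printing an object with a custom class calls the matching method:

```{r}
# print() dispatches on the class attribute, so this method runs for "greeting" objects
print.greeting <- function(x, ...) {
  cat("Hello, ", x$who, "!\n", sep = "")
  invisible(x)
}

g <- structure(list(who = "world"), class = "greeting")
g
```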
## Augmented vectors
Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
Atomic vectors and lists are the building blocks for other important vector types like factors and dates.
I call these **augmented vectors**, because they are vectors with additional **attributes**, including class.
Because augmented vectors have a class, they behave differently to the atomic vector on which they are built.
In this book, we make use of four important augmented vectors:
* Factors
* Dates
* Date-times
* Tibbles
- Factors
- Dates
- Date-times
- Tibbles
These are described below.
### Factors
Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:
Factors are designed to represent categorical data that can take a fixed set of possible values.
Factors are built on top of integers, and have a levels attribute:
```{r}
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
@ -608,7 +655,8 @@ typeof(x)
attributes(x)
```
Date-times are numeric vectors with class `POSIXct` that represent the number of seconds since 1 January 1970. (In case you were wondering, "POSIXct" stands for "Portable Operating System Interface", calendar time.)
Date-times are numeric vectors with class `POSIXct` that represent the number of seconds since 1 January 1970.
(In case you were wondering, "POSIXct" stands for "Portable Operating System Interface", calendar time.)
```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
@ -618,7 +666,8 @@ typeof(x)
attributes(x)
```
The `tzone` attribute is optional. It controls how the time is printed, not what absolute time it refers to.
The `tzone` attribute is optional.
It controls how the time is printed, not what absolute time it refers to.
```{r}
attr(x, "tzone") <- "US/Pacific"
@ -628,7 +677,8 @@ attr(x, "tzone") <- "US/Eastern"
x
```
There is another type of date-times called POSIXlt. These are built on top of named lists:
There is another type of date-times called POSIXlt.
These are built on top of named lists:
```{r}
y <- as.POSIXlt(x)
@ -636,7 +686,10 @@ typeof(y)
attributes(y)
```
POSIXlts are rare inside the tidyverse. They do crop up in base R, because they are needed to extract specific components of a date, like the year or month. Since lubridate provides helpers for you to do this instead, you don't need them. POSIXct's are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date time `lubridate::as_date_time()`.
POSIXlts are rare inside the tidyverse.
They do crop up in base R, because they are needed to extract specific components of a date, like the year or month.
Since lubridate provides helpers for you to do this instead, you don't need them.
POSIXct's are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date-time with `lubridate::as_datetime()`.
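For example (a small sketch using lubridate helpers), you can pull out components and convert back to POSIXct without touching the POSIXlt's internal list:

```{r}
y2 <- as.POSIXlt(lubridate::ymd_hm("1970-01-01 01:00"))
lubridate::year(y2)         # extract a component with a lubridate helper
lubridate::as_datetime(y2)  # convert back to a POSIXct date-time
```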
### Tibbles
@ -648,7 +701,8 @@ typeof(tb)
attributes(tb)
```
The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length. All functions that work with tibbles enforce this constraint.
The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length.
All functions that work with tibbles enforce this constraint.
Traditional data.frames have a very similar structure:
@ -658,16 +712,17 @@ typeof(df)
attributes(df)
```
The main difference is the class. The class of tibble includes "data.frame" which means tibbles inherit the regular data frame behaviour by default.
The main difference is the class.
The class of a tibble includes "data.frame", which means tibbles inherit the regular data frame behaviour by default.
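A quick way to see this (a minimal check, not one of the book's own examples):

```{r}
# A tibble's class vector ends with "data.frame"; a plain data frame has only that class
class(tibble::tibble(x = 1:3))
class(data.frame(x = 1:3))
```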
### Exercises
1. What does `hms::hms(3600)` return? How does it print? What primitive
type is the augmented vector built on top of? What attributes does it
use?
1. Try and make a tibble that has columns with different lengths. What
happens?
1. What does `hms::hms(3600)` return?
How does it print?
What primitive type is the augmented vector built on top of?
What attributes does it use?
1. Based on the definition above, is it ok to have a list as a
column of a tibble?
2. Try and make a tibble that has columns with different lengths.
What happens?
3. Based on the definition above, is it OK to have a list as a column of a tibble?
@ -1,12 +1,16 @@
# Workflow: basics
You now have some experience running R code. We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
You now have some experience running R code.
We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration!
Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain.
But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.
## Coding basics
Let's review some basics we've so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:
Let's review some basics we've so far omitted in the interests of getting you plotting as quickly as possible.
You can use R as a calculator:
```{r}
1 / 200 * 30
@ -20,7 +24,7 @@ You can create new objects with `<-`:
x <- 3 * 4
```
All R statements where you create objects, __assignment__ statements, have the same form:
All R statements where you create objects, **assignment** statements, have the same form:
```{r eval = FALSE}
object_name <- value
@ -28,11 +32,17 @@ object_name <- value
When reading that code say "object name gets value" in your head.
You will make lots of assignments and `<-` is a pain to type. Don't be lazy and use `=`: it will work, but it will cause confusion later. Instead, use RStudio's keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds `<-` with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
You will make lots of assignments and `<-` is a pain to type.
Don't be lazy and use `=`: it will work, but it will cause confusion later.
Instead, use RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automagically surrounds `<-` with spaces, which is a good code formatting practice.
Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
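A tiny before-and-after sketch of that advice (both lines run; only the readability differs):

```{r}
x<-3*4      # legal, but hard on the eyes
x <- 3 * 4  # the same computation, with breathing room
```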
## What's in a name?
Object names must start with a letter, and can only contain letters, numbers, `_` and `.`. You want your object names to be descriptive, so you'll need a convention for multiple words. We recommend __snake_case__ where you separate lowercase words with `_`.
Object names must start with a letter, and can only contain letters, numbers, `_` and `.`.
You want your object names to be descriptive, so you'll need a convention for multiple words.
We recommend **snake_case** where you separate lowercase words with `_`.
```{r, eval = FALSE}
i_use_snake_case
@ -57,7 +67,13 @@ this_is_a_really_long_name <- 2.5
To inspect this object, try out RStudio's completion facility: type "this", press TAB, add characters until you have a unique prefix, then press return.
Ooops, you made a mistake! `this_is_a_really_long_name` should have value 3.5 not 2.5. Use another keyboard shortcut to help you fix it. Type "this" then press Cmd/Ctrl + ↑. That will list all the commands you've typed that start those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.
Ooops, you made a mistake!
`this_is_a_really_long_name` should have value 3.5 not 2.5.
Use another keyboard shortcut to help you fix it.
Type "this" then press Cmd/Ctrl + ↑.
That will list all the commands you've typed that start with those letters.
Use the arrow keys to navigate, then press enter to retype the command.
Change 2.5 to 3.5 and rerun.
Make yet another assignment:
@ -74,7 +90,9 @@ R_rocks
#> Error: object 'R_rocks' not found
```
There's an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters.
There's an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions.
Typos matter.
Case matters.
## Calling functions
@ -84,9 +102,16 @@ R has a large collection of built-in functions that are called like this:
function_name(arg1 = val1, arg2 = val2, ...)
```
Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio. Type `se` and hit TAB. A popup shows you possible completions. Specify `seq()` by typing more (a "q") to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.
Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio.
Type `se` and hit TAB.
A popup shows you possible completions.
Specify `seq()` by typing more (a "q") to disambiguate, or by using ↑/↓ arrows to select.
Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose.
If you want more help, press F1 to get all the details in the help tab in the lower right pane.
Press TAB once more when you've selected the function you want. RStudio will add matching opening (`(`) and closing (`)`) parentheses for you. Type the arguments `1, 10` and hit return.
Press TAB once more when you've selected the function you want.
RStudio will add matching opening (`(`) and closing (`)`) parentheses for you.
Type the arguments `1, 10` and hit return.
```{r}
seq(1, 10)
@ -98,16 +123,18 @@ Type this code and notice you get similar assistance with the paired quotation m
x <- "hello world"
```
Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it's still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character "+":
Quotation marks and parentheses must always come in a pair.
RStudio does its best to help you, but it's still possible to mess up and end up with a mismatch.
If this happens, R will show you the continuation character "+":
```
> x <- "hello
+
```
The `+` tells you that R is waiting for more input; it doesn't think you're done yet. Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.
The `+` tells you that R is waiting for more input; it doesn't think you're done yet.
Usually that means you've forgotten either a `"` or a `)`.
Either add the missing pair, or press ESCAPE to abort the expression and try again.
If you make an assignment, you don't get to see the value. You're then tempted to immediately double-check the result:
If you make an assignment, you don't get to see the value.
You're then tempted to immediately double-check the result:
```{r}
y <- seq(1, 10, length.out = 5)
@ -136,23 +163,22 @@ Here you can see all of the objects that you've created.
my_variable <- 10
my_varıable
```
Look carefully! (This may seem like an exercise in pointlessness, but
training your brain to notice even the tiniest difference will pay off
when programming.)
1. Tweak each of the following R commands so that they run correctly:
Look carefully!
(This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)
2. Tweak each of the following R commands so that they run correctly:
```{r, eval = FALSE}
library(tidyverse)
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
```
1. Press Alt + Shift + K. What happens? How can you get to the same place
using the menus?
3. Press Alt + Shift + K.
What happens?
How can you get to the same place using the menus?
@ -1,17 +1,22 @@
# Workflow: projects
One day you will need to quit R, go do something else and return to your analysis the next day. One day you will be working on multiple analyses simultaneously that all use R and you want to keep them separate. One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world. To handle these real life situations, you need to make two decisions:
One day you will need to quit R, go do something else and return to your analysis the next day.
One day you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your
lasting record of what happened?
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
1. Where does your analysis "live"?
2. Where does your analysis "live"?
## What is real?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real". However, in the long run, you'll be much better off if you consider your R scripts as "real".
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".
With your R scripts (and your data files), you can recreate the environment. It's much harder to recreate your R scripts from your environment! You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
To foster this behaviour, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
@ -19,18 +24,22 @@ To foster this behaviour, I highly recommend that you instruct RStudio not to pr
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time. But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
This will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart R.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.
## Where does your analysis live?
R has a powerful notion of the __working directory__. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:
R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-wd.png")
@ -43,9 +52,11 @@ getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory. But you're six chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
__I do not recommend it__, but you can also set the working directory from within R:
**I do not recommend it**, but you can also set the working directory from within R:
```{r eval = FALSE}
setwd("/path/to/my/CoolProject")
@ -55,33 +66,29 @@ But you should never do this because there's a better way; a way that also puts
## Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows. There are three chief ways in which they differ:
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:
1. The most important difference is how you separate the components of the
path. Mac and Linux uses slashes (e.g. `plots/diamonds.pdf`) and Windows
uses backslashes (e.g. `plots\diamonds.pdf`). R can work with either type
(no matter what platform you're currently using), but unfortunately,
backslashes mean something special to R, and to get a single backslash
in the path, you need to type two backslashes! That makes life frustrating,
so I recommend always using the Linux/Mac style with forward slashes.
1. The most important difference is how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes (see the short sketch after this list).
1. Absolute paths (i.e. paths that point to the same place regardless of
your working directory) look different. In Windows they start with a drive
letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in
Mac/Linux they start with a slash "/" (e.g. `/users/hadley`). You should
__never__ use absolute paths in your scripts, because they hinder sharing:
no one else will have exactly the same directory configuration as you.
2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
1. The last minor difference is the place that `~` points to. `~` is a
convenient shortcut to your home directory. Windows doesn't really have
the notion of a home directory, so it instead points to your documents
directory.
3. The last minor difference is the place that `~` points to.
`~` is a convenient shortcut to your home directory.
Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
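Here is the short sketch promised in the first item above: the same relative path written both ways as R strings (note the doubled backslash needed inside the string):

```{r}
writeLines("plots/diamonds.pdf")    # Mac/Linux style, works on every platform
writeLines("plots\\diamonds.pdf")   # Windows style: "\\" in the string is one backslash
```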
## RStudio projects
R experts keep all the files associated with a project together --- input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via __projects__.
R experts keep all the files associated with a project together --- input data, R scripts, analytical results, figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book. Click File > New Project, then:
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-project-1.png")
@ -89,18 +96,22 @@ knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds` and think carefully about which _subdirectory_ you put the project in. If you don't store it somewhere sensible, it will be hard to find it in the future!
Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book. Check that the "home" directory of your project is the current working directory:
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
```{r eval = FALSE}
getwd()
#> [1] /Users/hadley/Documents/r4ds/r4ds
```
Whenever you refer to a file with a relative path, R will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R". Next, run the complete script which will save a PDF and CSV file into your project directory. Don't worry about the details, you'll learn them later in the book.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script which will save a PDF and CSV file into your project directory.
Don't worry about the details, you'll learn them later in the book.
```{r toy-line, eval = FALSE}
library(tidyverse)
@ -112,23 +123,29 @@ ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")
```
Quit RStudio. Inspect the folder associated with your project --- notice the `.Rproj` file. Double-click that file to re-open the project. Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open. Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
Quit RStudio.
Inspect the folder associated with your project --- notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but _also the script that created it_ (`diamonds.R`). This is huge win! One day you will want to remake a figure or just understand where it came from. If you rigorously save figures to files __with R code__ and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
## Summary
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
* Create an RStudio project for each data analysis project.
- Create an RStudio project for each data analysis project.
* Keep data files there; we'll talk about loading them into R in
[data import].
- Keep data files there; we'll talk about loading them into R in [data import].
* Keep scripts there; edit them, run them in bits or as a whole.
- Keep scripts there; edit them, run them in bits or as a whole.
* Save your outputs (plots and cleaned data) there.
- Save your outputs (plots and cleaned data) there.
* Only ever use relative paths, not absolute paths.
- Only ever use relative paths, not absolute paths.
Everything you need is in one place, and cleanly separated from all the other projects that you are working on.
@ -1,16 +1,29 @@
# Workflow: scripts
So far you've been using the console to run code. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes. To give yourself more room to work, it's a great idea to use the script editor. Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you'll see four panes:
So far you've been using the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/rstudio-editor.png")
```
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it's a good idea to save your scripts regularly and to back them up.
The script editor is a great place to put code you care about.
Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open.
Nevertheless, it's a good idea to save your scripts regularly and to back them up.
## Running code
The script editor is also a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorise one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`. It will also move the cursor to the next statement (beginning with `not_cancelled %>%`). That makes it easy to run your complete script by repeatedly pressing Cmd/Ctrl + Enter.
The script editor is also a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations.
The key to using the script editor effectively is to memorise one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
This executes the current R expression in the console.
For example, take the code below.
If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
It will also move the cursor to the next statement (beginning with `not_cancelled %>%`).
That makes it easy to run your complete script by repeatedly pressing Cmd/Ctrl + Enter.
```{r, eval = FALSE}
library(dplyr)
@ -24,11 +37,16 @@ not_cancelled %>%
summarise(mean = mean(dep_delay))
```
Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S. Doing this regularly is a great way to check that you've captured all the important parts of your code in the script.
Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to check that you've captured all the important parts of your code in the script.
I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install. Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share. It's very antisocial to change settings on someone else's computer!
I recommend that you always start your script with the packages that you need.
That way, if you share your code with others, they can easily see what packages they need to install.
Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share.
It's very antisocial to change settings on someone else's computer!
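For instance, the top of a shared script might look like this (a sketch; the particular packages are just placeholders):

```{r, eval = FALSE}
# Packages this script needs -- collaborators can see at a glance what to install
library(tidyverse)
library(nycflights13)

# Deliberately NOT included:
# install.packages(...)  # would change someone else's library
# setwd(...)             # would assume someone else's directory layout
```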
When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.
When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts.
Over time, sending code to the console in this way will become so natural that you won't even think about it.
## RStudio diagnostics
@ -52,10 +70,8 @@ knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
## Exercises
1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips>
and find one tip that looks interesting. Practice using it!
1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips> and find one tip that looks interesting.
Practice using it!
1. What other common mistakes will RStudio diagnostics report? Read
<https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics> to
find out.
2. What other common mistakes will RStudio diagnostics report?
Read <https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics> to find out.
@ -1,8 +1,10 @@
# (PART) Wrangle {-}
# (PART) Wrangle {.unnumbered}
# Introduction {#wrangle-intro}
In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling. Data wrangling is very important: without it you can't work with your own data! There are three main parts to data wrangling:
In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-wrangle.png")
@ -10,30 +12,23 @@ knitr::include_graphics("diagrams/data-science-wrangle.png")
This part of the book proceeds as follows:
* In [tibbles], you'll learn about the variant of the data frame that we use
in this book: the __tibble__. You'll learn what makes them different
from regular data frames, and how you can construct them "by hand".
- In [tibbles], you'll learn about the variant of the data frame that we use in this book: the **tibble**.
You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
* In [data import], you'll learn how to get your data from disk and into R.
We'll focus on plain-text rectangular formats, but will give you pointers
to packages that help with other types of data.
- In [data import], you'll learn how to get your data from disk and into R.
We'll focus on plain-text rectangular formats, but will give you pointers to packages that help with other types of data.
* In [tidy data], you'll learn about tidy data, a consistent way of storing
your data that makes transformation, visualisation, and modelling easier.
You'll learn the underlying principles, and how to get your data into a
tidy form.
- In [tidy data], you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
Data wrangling also encompasses data transformation, which you've already learned a little about. Now we'll focus on new skills for three specific types of data you will frequently encounter in practice:
Data wrangling also encompasses data transformation, which you've already learned a little about.
Now we'll focus on new skills for three specific types of data you will frequently encounter in practice:
* [Relational data] will give you tools for working with multiple
interrelated datasets.
* [Strings] will introduce regular expressions, a powerful tool for
manipulating strings.
- [Relational data] will give you tools for working with multiple interrelated datasets.
* [Factors] are how R stores categorical data. They are used when a variable
has a fixed set of possible values, or when you want to use a non-alphabetical
ordering of a string.
* [Dates and times] will give you the key tools for working with
dates and date-times.
- [Strings] will introduce regular expressions, a powerful tool for manipulating strings.
- [Factors] are how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- [Dates and times] will give you the key tools for working with dates and date-times.