Integrate suggestions from @behrman

This commit is contained in:
hadley 2016-10-04 14:21:04 -05:00
parent 2ae5eeb1b5
commit 4e49cfacc3
12 changed files with 64 additions and 89 deletions

View File

@ -336,9 +336,28 @@ ggplot(mpg, aes(displ, hwy)) +
### Replacing a scale
Instead of just tweaking the details a little, you can instead replace the scale altogether. We'll focus on colour scales because there are many options, and they're the scales you're mostly likely to want to change. The same principles apply to the other aesthetics. All colour scales have two variants: `scale_colour_x()` and `scale_fill_x()` for the `colour` and `fill` aesthetics respectively (the colour scales are available in both UK and US spellings).
Instead of just tweaking the details a little, you can replace the scale altogether. There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.
The default categorical scale picks colours that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices][diamond-prices] it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r, fig.align = "default", out.width = "50%"}
ggplot(diamonds, aes(carat, price)) +
geom_bin2d()
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d()
```
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.
```{r}
ggplot(diamonds, aes(carat, price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
```
Another scale that is frequently customised is colour. The default categorical scale picks colours that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales, which have been hand-tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
```{r, fig.align = "default", out.width = "50%"}
ggplot(mpg, aes(displ, hwy)) +
@ -394,6 +413,8 @@ ggplot(df, aes(x, y)) +
coord_fixed()
```
Note that all colour scales come in two varieties: `scale_colour_x()` and `scale_fill_x()` for the `colour` and `fill` aesthetics respectively (the colour scales are available in both UK and US spellings).
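As a quick illustration, here's a minimal sketch of the two variants side by side (it reuses the `mpg` data and the ColorBrewer scales discussed above):

```{r, eval = FALSE}
# colour aesthetic -> a scale_colour_*() function
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

# fill aesthetic -> the matching scale_fill_*() function
ggplot(mpg, aes(drv, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")
```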
### Exercises
1. Why doesn't the following code override the default scale?
@ -530,7 +551,7 @@ I only ever use three of the five options:
* I control the output size with `out.width` and set it to a percentage
of the line width. I default to `out.width = "70%"`
and `fig.align = 'center'`. That give plots room to breathe, without taking
and `fig.align = "center"`. That gives plots room to breathe, without taking
up too much space.
* To put multiple plots in a single row I set the `out.width` to
@ -554,7 +575,7 @@ plot
plot
```
If you want to make sure the font size is consistent across all your figures, whenever you set `out.width`, you'll also need to adjust `fig.width` to maintain the same ratio with your default `out.width`. For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you set `out.width = "50%"` you'll need to set `fig.width` to 4.2 (6 * 0.5 / 0.7).
If you want to make sure the font size is consistent across all your figures, whenever you set `out.width`, you'll also need to adjust `fig.width` to maintain the same ratio with your default `out.width`. For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you set `out.width = "50%"` you'll need to set `fig.width` to 4.3 (6 * 0.5 / 0.7).
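The arithmetic is just the ratio of the two output widths applied to the default figure width; a quick sketch (assuming the defaults above):

```{r}
# default: fig.width = 6 rendered at out.width = 0.7 (i.e. 70%)
# at out.width = "50%", scale fig.width by 0.5 / 0.7
6 * 0.5 / 0.7
```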
### Other important options

View File

@ -239,10 +239,6 @@ gss_cat %>%
1. Why did moving "Not applicable" to the front of the levels move it to the
bottom of the plot?
1. Recreate the display of marital status by age, using `geom_area()` instead
of `geom_line()`. What do you need to change to the plot? How might you
tweak the levels?
## Modifying factor levels
More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take `gss_cat$partyid`:
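As a hedged sketch of what that looks like (assuming forcats is loaded via the tidyverse, and picking two levels that appear in `partyid`):

```{r, eval = FALSE}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak"   = "Not str republican"
  )) %>%
  count(partyid)
```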
@ -325,7 +321,4 @@ gss_cat %>%
1. How have the proportions of people identifying as Democrat, Republican, and
Independent changed over time?
1. Display the joint distribution of the `relig` and `denom` variables in
a single plot.
1. How could you collapse `rincome` into a small set of categories?

View File

@ -176,7 +176,7 @@ df %>% transpose() %>% str()
### Exercises
1. Challenge: read all the csv files in a directory. Which ones failed
1. Challenge: read all the CSV files in a directory. Which ones failed
and why?
```{r, eval = FALSE}

View File

@ -89,7 +89,7 @@ Another option that commonly needs tweaking is `na`: this specifies the value (o
read_csv("a,b,c\n1,2,.", na = ".")
```
This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
This is all you need to know to read ~75% of CSV files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
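Both functions share `read_csv()`'s interface; here's a minimal, self-contained sketch (the inline strings are made-up data):

```{r, eval = FALSE}
# tab-separated values: same arguments, different delimiter
read_tsv("a\tb\tc\n1\t2\t3")

# fixed-width files: describe the columns with fwf_widths()
read_fwf("one  2\ntwo  4", fwf_widths(c(5, 1), c("x", "y")))
```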
### Compared to base R
@ -118,7 +118,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
1. What are the most important arguments to `read_fwf()`?
1. Sometimes strings in a csv file contain commas. To prevent them from
1. Sometimes strings in a CSV file contain commas. To prevent them from
causing problems, they need to be surrounded by a quoting character, like
`"` or `'`. By convention, `read_csv()` assumes that the quoting
character will be `"`, and if you want to change it you'll need to
@ -129,7 +129,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
"x,y\n1,'a,b'"
```
1. Identify what is wrong with each of the following inline csv files.
1. Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
```{r, eval = FALSE}
@ -459,7 +459,7 @@ These defaults don't always work for larger files. There are two basic problems:
vector, whereas you probably want to parse it as something more
specific.
readr contains a challenging csv that illustrates both of these problems:
readr contains a challenging CSV that illustrates both of these problems:
```{r}
challenge <- read_csv(readr_example("challenge.csv"))
@ -593,7 +593,7 @@ write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
```
This makes csvs a little unreliable for caching interim results---you need to recreate the column specification every time you load in. There are two alternatives:
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load the data in. There are two alternatives:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base
functions `readRDS()` and `saveRDS()`. These store data in R's custom

View File

@ -328,7 +328,7 @@ I mention while loops only briefly, because I hardly ever use them. They're most
### Exercises
1. Imagine you have a directory full of csv files that you want to read in.
1. Imagine you have a directory full of CSV files that you want to read in.
You have their paths in a vector,
`files <- dir("data/", pattern = "\\.csv$", full.names = TRUE)`, and now
want to read each one with `read_csv()`. Write the for loop that will

View File

@ -192,7 +192,7 @@ sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model by (effectively) inverting a matrix. This approach is both faster, and guarantees that there is a global minimum.
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model in a single step, using a sophisticated algorithm. This approach is both faster and guarantees that there is a global minimum.
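To make that structure slightly more concrete, here's an illustrative sketch of the normal-equations solution (not literally what `lm()` does; it uses a more numerically stable QR decomposition):

```{r, eval = FALSE}
# assuming sim1 and sim1_mod from above
X <- model.matrix(y ~ x, data = sim1)
solve(t(X) %*% X, t(X) %*% sim1$y)  # same values as coef(sim1_mod)
```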
### Exercises
@ -591,52 +591,7 @@ ggplot(sim5, aes(x, y)) +
facet_wrap(~ model)
```
Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory or science.
### Interpolation vs. extrapolation
So far, when we've visualised the predictions from a model, we've been careful to overlay them on the data. This is important because it helps make sure we're not extrapolating the model to data that is far away from what we've observed.
However, as you start working with large datasets overlaying all the data will become increasing challenging. Particularly as you'll often use a model to simplify away some of the complexities of the raw data! To make your life easier, you might be tempted to just display the predictions. This is dangerous because you might have accidentally generated predictions that are very far away from your data.
As a compromise, you can use the convenient `similarityweight()` function from the __condvis__ package by Mark O'Connell. You give it your prediction grid and your dataset, and it computes the similarlity between each observation in your original dataset and the prediction grid. You can then display only the points that are close to your predictions. If no points are close, you know that your prediction grid is dangerous!
```{r}
sim6 <- tibble(
x1 = rbeta(1000, 2, 5),
x2 = rbeta(1000, 2, 5),
y = 2 * x1 * x2 + x1 + 2 + rnorm(length(x1))
)
mod <- lm(y ~ x1 * x2, data = sim6)
grid <- sim6 %>%
data_grid(
x1 = seq_range(x1, 10),
x2 = c(0, 0.5, 1, 1.5)
) %>%
add_predictions(mod)
add_similarity <- function(data, grid, ..., .similarity = "sim") {
common <- intersect(names(data), names(grid))
message("Using ", paste(common, collapse = ", "))
sim_m <- condvis::similarityweight(grid, data[common], ...)
sim <- apply(sim_m, 2, max)
data[[.similarity]] <- sim
data
}
sim6 %>%
add_similarity(grid) %>%
filter(sim > 0) %>%
ggplot(aes(x1, y, group = x2)) +
geom_point(aes(alpha = sim)) +
geom_line(data = grid, aes(y = pred)) +
scale_alpha(limit = c(0, NA))
```
Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.
### Exercises

View File

@ -32,7 +32,7 @@ library(nycflights13)
library(lubridate)
```
## Why are low quality diamonds more expensive?
## Why are low quality diamonds more expensive? {#diamond-prices}
In previous chapters we've seen a surprising relationship between the quality of diamonds and their price: low quality diamonds (poor cuts, bad colours, and inferior clarity) have higher prices.
@ -405,7 +405,7 @@ We see a strong pattern in the numbers of Saturday flights. This is reassuring,
### Exercises
1. Use your Google sleuthing skills to brainstorm why there were fewer than
expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the
expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the
same explanation.) How would these days generalise to another year?
1. What do the three days with high positive residuals represent?

View File

@ -194,7 +194,7 @@ resids %>%
facet_wrap(~continent)
```
It looks like we've missed some mild quadratic pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
It looks like we've missed some mild pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
### Model quality
@ -433,7 +433,6 @@ df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
```
### Exercises

View File

@ -186,7 +186,7 @@ To learn more about htmlwidgets and see a more complete list of packages that pr
### Shiny
htmlwidgets provide __client-side__ interactivity --- all the interactive happens in the browser, independently of R. On one hand, that's great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use __shiny__, a package that allows you to create interactivity using R code, not JavaScript.
htmlwidgets provide __client-side__ interactivity --- all the interactivity happens in the browser, independently of R. On one hand, that's great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use __shiny__, a package that allows you to create interactivity using R code, not JavaScript.
To call Shiny code from an R Markdown document, add `runtime: shiny` to the header:
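A minimal sketch of such a header (the title and output format are placeholders):

```yaml
---
title: "My document"
output: html_document
runtime: shiny
---
```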
@ -264,7 +264,7 @@ To learn more about effective communication in these different formats I recomme
* If you give academic talks, I recommend reading the [_Leek group guide
to giving talks_](https://github.com/jtleek/talkguide).
* I haven't taken it personally, but I've heard good things about Matt
* I haven't taken it myself, but I've heard good things about Matt
McGarrity's online course on public speaking:
<https://www.coursera.org/learn/public-speaking>.

View File

@ -20,11 +20,12 @@ R Markdown files are designed to be used in three ways:
R Markdown integrates a number of R packages and external tools. This means that help is, by and large, not available through `?`. Instead, as you work through this chapter, and use R Markdown in the future, keep these resources close to hand:
* R Markdown Cheat Sheet: _Help > Cheatsheets > R Markdown Cheat Sheet_,
or from <http://rstudio.com/cheatsheets>.
* R Markdown Reference Guide: _Help > Cheatsheets > R Markdown Reference
Guide_.
Both cheatsheets are also available at <http://rstudio.com/cheatsheets>.
### Prerequisites
You need the __rmarkdown__ package, but you don't need to explicitly install it or load it, as RStudio automatically does both when needed.
@ -32,6 +33,7 @@ You need the __rmarkdown__ package, but you don't need to explicitly install it
```{r setup, include = FALSE}
chunk <- "```"
inline <- function(x = "") paste0("`` `r ", x, "` ``")
library(tidyverse)
```
## R Markdown basics
@ -46,7 +48,7 @@ It contains three important types of content:
1. An (optional) __YAML header__ surrounded by `---`s.
1. __Chunks__ of R code surrounded by ```` ``` ````.
1. Text mixed with simple text formatting like `##` and .
1. Text mixed with simple text formatting like `# heading` and `_italics_`.
When you open an `.Rmd`, you get a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code:
@ -115,16 +117,16 @@ The best way to learn these is simply to try them out. It will take a few days,
1. Add a horizontal rule.
1. Add a block quote.
1. Download `diamond-sizes.Rmd` from
<https://github.com/hadley/r4ds/tree/master/rmarkdown>. Check that you
can run it, then add text after the frequency polygon that describes its
most striking features.
1. Copy and paste the contents of `diamond-sizes.Rmd` from
<https://github.com/hadley/r4ds/tree/master/rmarkdown> into a local
R Markdown document. Check that you can run it, then add text after the
frequency polygon that describes its most striking features.
## Code chunks
To run code inside an R Markdown document, you need to insert a chunk. There are three ways to do so:
1. The keyboard shortcut Cmd/Ctrl + Alt + I.
1. The keyboard shortcut Cmd/Ctrl + Alt + I
1. The "Insert" button icon in the editor toolbar.
@ -187,7 +189,7 @@ The most important set of options controls if your code block is executed and wh
and want to deliberately include an error. The default, `error = FALSE`, causes
knitting to fail if there is a single error in the document.
The following table summarises the options:
The following table summarises which types of output each option suppresses:
Option | Run code | Show code | Output | Plots | Messages | Warnings
-------------------|----------|-----------|--------|-------|----------|---------
@ -342,7 +344,7 @@ cat(readr::read_file("rmarkdown/fuel-economy.Rmd"))
As you can see, parameters are available within the code chunks as a read-only list named `params`.
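For instance, inside the report body a chunk can use a parameter directly (a sketch, assuming the `my_class` parameter that `fuel-economy.Rmd` takes):

```{r, eval = FALSE}
mpg %>%
  filter(class == params$my_class)
```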
You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with `!R`. This is a good way to specify date/time parameters.
You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with `!r`. This is a good way to specify date/time parameters.
```yaml
params:
@ -358,15 +360,20 @@ Alternatively, if you need to produce many such parameterised reports, you can ca
rmarkdown::render("fuel-economy.Rmd", params = list(my_class = "suv"))
```
This is particularly powerful in conjunction with `purrr:pwalk()`. The following example creates a report for each value of `class` found in `mpg`.
This is particularly powerful in conjunction with `purrr::pwalk()`. The following example creates a report for each value of `class` found in `mpg`. First we create a data frame that has one row for each class, giving the `filename` of the report and the `params` it should be given:
```{r, eval = FALSE}
```{r}
reports <- tibble(
class = unique(mpg$class),
filename = stringr::str_c("fuel-economy-", class, ".html"),
params = purrr::map(class, ~ list(my_class = .))
)
reports
```
Then we match the column names to the argument names of `render()`, and use purrr's **parallel** walk to call `render()` once for each row:
```{r, eval = FALSE}
reports %>%
select(output_file = filename, params) %>%
purrr::pwalk(rmarkdown::render, input = "fuel-economy.Rmd")

View File

@ -70,8 +70,8 @@ Tibbles have a refined print method that shows only the first 10 rows, and all t
```{r}
tibble(
a = lubridate::now() + runif(1e3) * 60,
b = lubridate::today() + runif(1e3),
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
@ -90,7 +90,7 @@ nycflights13::flights %>%
You can also control the default print behaviour by setting options:
* `options(tibble.print_max = n, tibble.print_min = m)`: if more than `n`
rows, print only `n` rows. Use `options(dplyr.print_max = Inf)` to always
rows, print only `m` rows. Use `options(dplyr.print_min = Inf)` to always
show all rows.
* Use `options(tibble.width = Inf)` to always print all columns, regardless
@ -158,7 +158,9 @@ The main reason that some older functions don't work with tibble is the `[` func
df[, c("abc", "xyz")]
```
1. Practice referring to non-syntactic names by:
1. Practice referring to non-syntactic names in the following data frame by:
1. Extracting the variable called `1`.
1. Plotting a scatterplot of `1` vs `2`.
@ -166,8 +168,6 @@ The main reason that some older functions don't work with tibble is the `[` func
1. Renaming the columns to `one`, `two` and `three`.
1. Extracting the variable called `1`.
```{r}
annoying <- tibble(
`1` = 1:10,

View File

@ -100,7 +100,7 @@ getwd()
Whenever you refer to a file with a relative path it will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R". Next, run the complete script which will save a pdf and csv file into your project directory. Don't worry about the details, you'll learn them later in the book.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R". Next, run the complete script which will save a PDF and CSV file into your project directory. Don't worry about the details, you'll learn them later in the book.
```{r toy-line, eval = FALSE}
library(tidyverse)